Findler: Filesystem Iteration with Persistable State
Findler is a Ruby library for iterating over a filtered set of files under a given path. It is designed for use with concurrent workers and very large filesystem hierarchies.
Usage
f = Findler.new "/Users/mrm"
f.add_extensions [".jpg", ".jpeg"]
iterator = f.iterator
iterator.next_file
# => "/Users/mrm/Photos/img_1000.jpg"
Cross-process continuations
This should smell an awful lot like hike, except for that last bit.
Findler::Iterator instances can be "paused" and "resumed" with Marshal.
The entire state of the iteration for the filesystem is returned, which can then
be pushed onto any durable storage, like ActiveRecord or Redis, or just a local file:
File.open('iterator.state', 'wb') { |f| Marshal.dump(iterator, f) }
To resume iteration:
iterator2 = File.open('iterator.state', 'rb') { |f| Marshal.load(f) }
iterator2.next_file
# => "/Users/mrm/Photos/img_1001.jpg"
Synchronizing the serialized iterator state between processes is up to you. The load, next_file, and dump should all be done while holding an iteration mutex of some sort.
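When all workers run on the same host, one possible "iteration mutex" is an exclusive advisory lock taken with File#flock on a separate lock file. The sketch below is only an illustration under that assumption: with_locked_iterator and the .lock file naming are hypothetical, and StubIterator is a stand-in for Findler::Iterator so that the snippet runs without the gem.

```ruby
require 'tmpdir'

# Hypothetical stand-in for Findler::Iterator, used only so this sketch runs
# without the gem. Any Marshal-able object with #next_file would do.
StubIterator = Struct.new(:files, :position) do
  def next_file
    file = files[position]
    self.position += 1 if file
    file
  end
end

# Hypothetical helper: load the iterator, yield it, and write it back, all
# while holding an exclusive advisory lock on a separate lock file.
def with_locked_iterator(state_path)
  File.open("#{state_path}.lock", File::RDWR | File::CREAT, 0o644) do |lock|
    lock.flock(File::LOCK_EX) # blocks until no other process holds the lock
    iterator = File.open(state_path, 'rb') { |io| Marshal.load(io) }
    result = yield iterator
    File.open(state_path, 'wb') { |io| Marshal.dump(iterator, io) }
    result
  end # closing the lock file releases the lock
end

# Seed a state file, then take two files as two separate "workers" would.
state_path = File.join(Dir.mktmpdir, 'iterator.state')
seed = StubIterator.new(%w[/photos/a.jpg /photos/b.jpg], 0)
File.open(state_path, 'wb') { |io| Marshal.dump(seed, io) }

first  = with_locked_iterator(state_path) { |iter| iter.next_file }
second = with_locked_iterator(state_path) { |iter| iter.next_file }
```

Because flock is advisory and local to one machine, workers spread across hosts would need a different lock, for example one provided by the same store that holds the serialized state.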
Filtering and ordering
Filters provide custom exclusion and ordering criteria, so you don't have to do that logic in the code that consumes from your iterator.
Filters can't be procs or lambdas because those aren't safely serializable.
What you provide to add_filter is a symbolized name of a class method
on Findler::Filters:
f = Findler.new(".")
f.add_filter :order_by_name
Note that the last filter added will be last to order the children, so it will be the "primary" sort criterion. Note also that the ordering is only done in the context of a given directory.
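The "last filter added is primary" rule can be illustrated without Findler at all. The sketch below assumes each ordering filter behaves like a stable sort applied in the order the filters were added. SketchFilters and its methods are hypothetical names, not part of the library. Array#sort_by is not guaranteed to be stable, so the sketch carries the original index as a tie-breaker.

```ruby
require 'pathname'

# Hypothetical module, standing in for Findler::Filters in this sketch.
module SketchFilters
  # Array#sort_by is not guaranteed to be stable, so the original index is
  # used as a tie-breaker to preserve whatever order the input already had.
  def self.stable_sort_by(items)
    items.each_with_index.sort_by { |item, index| [yield(item), index] }.map(&:first)
  end

  def self.by_name(children)
    stable_sort_by(children) { |path| path.basename.to_s }
  end

  def self.by_extension(children)
    stable_sort_by(children) { |path| path.extname }
  end
end

children = %w[/pics/b.png /pics/a.png /pics/c.jpg /pics/a.jpg].map { |s| Pathname.new(s) }

# Filters are applied in the order they were added: name first, extension last.
added = [:by_name, :by_extension]
ordered = added.reduce(children) { |acc, name| SketchFilters.public_send(name, acc) }
names = ordered.map { |path| path.basename.to_s }
# names => ["a.jpg", "c.jpg", "a.png", "b.png"]
```

The extension filter ran last, so files are grouped by extension first, and the earlier name ordering only survives as the tie-breaker within each group.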
Implementing your own filter
Filter methods receive an array of Pathname instances. Those pathnames will:
- have the same parent
- not have been enumerated by next_file() already
- satisfy the settings given to the parent Findler instance, like include_hidden and added patterns.
The returned values from the class method will be the final set of elements (both files
and directories) that Findler will return from next_file().
Example
To find files that have valid EXIF headers, using the most excellent exiftoolr gem, you'd do this:
require 'findler'
require 'exiftoolr'
# Monkey-patch Filters to add our custom filter:
class Findler::Filters
  def self.exif_only(children)
    child_files = children.select { |ea| ea.file? }
    child_dirs = children.select { |ea| ea.directory? }
    e = Exiftoolr.new(child_files)
    e.files_with_results + child_dirs
  end
end
f = Findler.new "/Users/mrm"
f.add_extensions %w(.jpg .jpeg .cr2 .nef)
f.case_insensitive!
f.add_filter :exif_only
Filter implementation notes
- The array of Pathname instances can be assumed to be absolute.
- Only child files that satisfy the extension and pattern filters will be seen by the filter class method.
- If a directory doesn't have any relevant files, the filter method will be called multiple times for a given call to next_file().
- If you want to be notified when new directories are walked into, and you want to do a bulk operation within that directory, this gives you that hook. Just remember to return the children array at the end of your filter method.
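A filter used purely as a per-directory hook might look like the sketch below. It is an illustration, not part of Findler: tally_bytes and BYTES_BY_DIRECTORY are hypothetical names, and the stub class definition exists only so the snippet runs without the gem installed. The filter performs its bulk operation and then returns the children array unchanged.

```ruby
require 'pathname'
require 'tmpdir'

# Stub so this sketch runs without the gem; with `require 'findler'`
# the real Findler::Filters class would be reopened instead.
unless defined?(Findler)
  class Findler
    class Filters; end
  end
end

class Findler::Filters
  # Hypothetical bulk operation: total bytes seen per directory.
  BYTES_BY_DIRECTORY = Hash.new(0)

  # Hypothetical filter name; it observes the children and changes nothing.
  def self.tally_bytes(children)
    children.select(&:file?).each do |path|
      BYTES_BY_DIRECTORY[path.dirname.to_s] += path.size
    end
    children # hand the array back untouched so iteration proceeds as usual
  end
end

# Exercise the filter directly against a throwaway directory.
dir = Dir.mktmpdir
File.write(File.join(dir, 'a.jpg'), 'abc')
File.write(File.join(dir, 'b.jpg'), 'de')

children = Pathname.new(dir).children
returned = Findler::Filters.tally_bytes(children)
total    = Findler::Filters::BYTES_BY_DIRECTORY[dir]
```

With the real gem, such a filter would be registered by name, as in f.add_filter :tally_bytes. State collected this way lives in the worker process and is not part of the marshalled iterator.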
Why can't add_filter take a proc?
Because procs and lambdas aren't Marshalable, and I didn't want to use something scary like ruby2ruby and eval.
Changelog
0.0.7
- Use a non-inherited set per iterator, rather than a global bloom filter
- Removed the ability to "rescan" due to the weight of the bloom filter in marshalling when traversing an enormous tree.
- Fixed marshal documentation and tests to support ruby 1.9+
0.0.6
- add_filters takes an array, not a glob
- added tests for order_by_mtime filters
0.0.5
- add_patterns and add_extensions take an array, not a glob
0.0.4
- Added custom filters for next_file()
- Added singular aliases for add_extension and add_pattern
0.0.3
- Switch to minitest
- Gemfile packaging fix
0.0.2
- Added scalable Bloom filter so Iterator#rescan is possible
0.0.1
- Support for Marshaling and simple filters