Deprecation Notice: Airbnb has merged a PR into upstream that updates the gnista dep, rendering this fork obsolete. I'd recommend using their official version again.
Hammerspace
Hash-like interface to persistent, concurrent, off-heap storage
Notice: This is a forked release of hammerspace, which has been released to RubyGems as hammerspace-fork
. It is upstream hammerspace plus my PR to correct the gnista dependency
What is Hammerspace?
Hammerspace ... is a fan-envisioned extradimensional, instantly accessible storage area in fiction, which is used to explain how animated, comic, and game characters can produce objects out of thin air.
This gem provides persistent, concurrently-accessible off-heap storage of strings with a familiar hash-like interface. It is optimized for bulk writes and random reads.
Motivation
Applications often use data that never changes or changes very infrequently. In many cases, some latency is acceptable when accessing this data. For example, a user's profile may be loaded from a web service, a database, or an external shared cache like memcache. In other cases, latency is much more sensitive. For example, translations may be used many times and incurring even a ~2ms delay to access them from an external cache would be prohibitively slow.
To work around the performance issue, this type of data is often loaded into the application at startup. Unfortunately, this means the data is stored on the heap, where the garbage collector must scan over the objects on every run (at least in the case of Ruby MRI). Further, for application servers that utilize multiple processes, each process has its own copy of the data which is an inefficient use of memory.
Hammerspace solves these problems by moving the data off the heap onto disk. Leveraging libraries and data structures optimized for bulk writes and random reads allows an acceptable level of performance to be maintained. Because the data is persistent, it does not need to be reloaded from an external cache or service on application startup unless the data has changed.
Unfortunately, these low-level libraries don't always support concurrent
writers. Hammerspace adds concurrency control to allow multiple processes to
update and read from a single shared copy of the data safely. Finally,
hammerspace's interface is designed to mimic Ruby's Hash
to make integrating
with existing applications simple and straightforward. Different low-level
libraries can be used by implementing a new backend that uses the library.
(Currently, only Sparkey is supported.)
Backends only need to implement a small set of methods ([]
, []=
, close
,
delete
, each
, uid
), but can override the default implementation of other
methods if the underlying library supports more efficient implementations.
Installation
Requirements
- Gnista, Ruby bindings for Sparkey
- Sparkey, constant key/value storage library
- Snappy, compression/decompression library (unused, but required to compile Sparkey)
- A filesystem that supports
flock(2)
and unlinking files/directories with outstanding file descriptors (ext3/4 will do just fine)
Installation
Add the following line to your Gemfile:
gem 'hammerspace'
Then run:
bundle
Vagrant
To make development easier, the source tree contains a Vagrantfile and a small cookbook to install all the prerequisites. The vagrant environment also serves as a consistent environment to run the test suite.
To use it, make sure you have vagrant installed, then:
vagrant up
vagrant ssh
bundle exec rspec
Usage
Getting Started
For the most part, hammerspace acts like a Ruby hash. But since it's a hash that persists on disk, you have to tell it where to store the files. The enclosing directory and any parent directories are created if they don't already exist.
h = Hammerspace.new("/tmp/hammerspace")
h["cartoons"] = "mallets"
h["games"] = "inventory"
h["rubyists"] = "data"
h.size #=> 3
h["cartoons"] #=> "mallets"
h.map { |k,v| "#{k.capitalize} use hammerspace to store #{v}." }
h.close
You should call close
on the hammerspace object when you're done with it.
This flushes any pending writes to disk and closes any open file handles.
Options
The constructor takes a hash of options as an optional second argument.
Currently the only option supported is :backend
which specifies which backend
class to use. Since there is only one backend supported at this time, there is
currently no reason to pass this argument.
h = Hammerspace.new("/tmp/hammerspace", {:backend => Hammerspace::Backend::Sparkey})
Default Values
The constructor takes a default value as an optional third argument. This
functions the same as Ruby's Hash
, except with Hash
it is the first
argument.
h = Hammerspace.new("/tmp/hammerspace", {}, "default")
h["foo"] = "bar"
h["foo"] #=> "bar"
h["new"] #=> "default"
h.close
The constructor also takes a block to specify a default Proc, which works the
same way as Ruby's Hash
. As with Hash
, it is the block's responsibility to
store the value in the hash if required.
h = Hammerspace.new("/tmp/hammerspace") { |hash, key| hash[key] = "#{key} (default)" }
h["new"] #=> "new (default)"
h.has_key?("new") #=> true
h.close
Supported Data Types
Only string keys and values are supported.
h = Hammerspace.new("/tmp/hammerspace")
h[1] = "foo" #=> TypeError
h["fixnum"] = 8 #=> TypeError
h["nil"] = nil #=> TypeError
h.close
Ruby hashes store references to objects, but hammerspace stores raw bytes. A
new Ruby String
object is created from those bytes when a key is accessed.
value = "bar"
hash = {"foo" => value}
hash["foo"] == value #=> true
hash["foo"].equal?(value) #=> true
hammerspace = Hammerspace.new("/tmp/hammerspace")
hammerspace["foo"] = value
hammerspace["foo"] == value #=> true
hammerspace["foo"].equal?(value) #=> false
hammerspace.close
Since every access results in a new String
object, mutating values doesn't
work unless you create an explicit reference to the string.
h = Hammerspace.new("/tmp/hammerspace")
h["foo"] = "bar"
# This doesn't work like Ruby's Hash because every access creates a new object
h["foo"].upcase!
h["foo"] #=> "bar"
# An explicit reference is required
value = h["foo"]
value.upcase!
value #=> "BAR"
# Another access, another a new object
h["foo"] #=> "bar"
h.close
This also imples that strings "lose" their encoding when retrieved from hammerspace.
value = "bar"
value.encoding #=> #<Encoding:UTF-8>
h = Hammerspace.new("/tmp/hammerspace")
h["foo"] = value
h["foo"].encoding #=> #<Encoding:ASCII-8BIT>
h.close
If you require strings in UTF-8, make sure strings are encoded as UTF-8 when storing the key, then force the encoding to be UTF-8 when accessing the key.
h[key] = value.encode('utf-8')
value = h[key].force_encoding('utf-8')
Persistence
Hammerspace objects are backed by files on disk, so even a new object may already have data in it.
h = Hammerspace.new("/tmp/hammerspace")
h["foo"] = "bar"
h.close
h = Hammerspace.new("/tmp/hammerspace")
h["foo"] #=> "bar"
h.close
Calling clear
deletes the data files on disk. The parent directory is not
removed, nor is it guaranteed to be empty. Some files containing metadata may
still be present, e.g., lock files.
Concurrency
Multiple concurrent readers are supported. Readers are isolated from writers, i.e., reads are consistent to the time that the reader was opened. Note that the reader opens its files lazily on first read, not when the hammerspace object is created.
h = Hammerspace.new("/tmp/hammerspace")
h["foo"] = "bar"
h.close
reader1 = Hammerspace.new("/tmp/hammerspace")
reader1["foo"] #=> "bar"
writer = Hammerspace.new("/tmp/hammerspace")
writer["foo"] = "updated"
writer.close
# Still "bar" because reader1 opened its files before the write
reader1["foo"] #=> "bar"
# Updated key is visible because reader2 opened its files after the write
reader2 = Hammerspace.new("/tmp/hammerspace")
reader2["foo"] #=> "updated"
reader2.close
reader1.close
A new hammerspace object does not necessarily need to be created. Calling
close
will close the files, then the reader will open them lazily again on
the next read.
h = Hammerspace.new("/tmp/hammerspace")
h["foo"] = "bar"
h.close
reader = Hammerspace.new("/tmp/hammerspace")
reader["foo"] #=> "bar"
writer = Hammerspace.new("/tmp/hammerspace")
writer["foo"] = "updated"
writer.close
reader["foo"] #=> "bar"
# Close files now, re-open lazily on next read
reader.close
reader["foo"] #=> "updated"
reader.close
If no hammerspace files exist on disk yet, the reader will fail to open the files. It will try again on next read.
reader = Hammerspace.new("/tmp/hammerspace")
reader.has_key?("foo") #=> false
writer = Hammerspace.new("/tmp/hammerspace")
writer["foo"] = "bar"
writer.close
# Files are opened here
reader.has_key?("foo") #=> true
reader.close
You can call uid
to get a unique id that identifies the version of the files
being read. uid
will be nil
if no hammerspace files exist on disk yet.
reader = Hammerspace.new("/tmp/hammerspace")
reader.uid #=> nil
writer = Hammerspace.new("/tmp/hammerspace")
writer["foo"] = "bar"
writer.close
reader.close
reader.uid #=> "24913_53943df0-e784-4873-ade6-d1cccc848a70"
# The uid changes on every write, even if the content is the same, i.e., it's
# an identifier, not a checksum
writer["foo"] = "bar"
writer.close
reader.close
reader.uid #=> "24913_9371024e-8c80-477b-8558-7c292bfcbfc1"
reader.close
Multiple concurrent writers are also supported. When a writer flushes its changes it will overwrite any previous versions of the hammerspace.
In practice, this works because hammerspace is designed to hold data that is bulk-loaded from some authoritative external source. Rather than block writers to enforce consistency, it is simpler to allow writers to concurrently attempt to load the data. The last writer to finish loading the data and flush its writes will have its data persisted.
writer1 = Hammerspace.new("/tmp/hammerspace")
writer1["color"] = "red"
# Can start while writer1 is still open
writer2 = Hammerspace.new("/tmp/hammerspace")
writer2["color"] = "blue"
writer2["fruit"] = "banana"
writer2.close
# Reads at this point see writer2's data
reader1 = Hammerspace.new("/tmp/hammerspace")
reader1["color"] #=> "blue"
reader1["fruit"] #=> "banana"
reader1.close
# Replaces writer2's data
writer1.close
# Reads at this point see writer1's data; note that "fruit" key is absent
reader2 = Hammerspace.new("/tmp/hammerspace")
reader2["color"] #=> "red"
reader2["fruit"] #=> nil
reader2.close
Flushing Writes
Flushing a write incurs some overhead to build the on-disk hash structures that
allows fast lookup later. To avoid the overhead of rebuilding the hash after
every write, most write operations do not implicitly flush. Writes can be
flushed explicitly by calling close
.
Delaying flushing of writes has the side effect of allowing "transactions" -- all unflushed writes are private to the hammerspace object doing the writing.
One exception is the clear
method which deletes the files on disk. If a
reader attempts to open the files immediately after they are deleted, it will
perceive the hammerspace to be empty.
h = Hammerspace.new("/tmp/hammerspace")
h["yesterday"] = "foo"
h["today"] = "bar"
h.close
reader1 = Hammerspace.new("/tmp/hammerspace")
reader1.keys #=> ["yesterday", "today"]
reader1.close
# Writer wants to remove everything except "today"
writer = Hammerspace.new("/tmp/hammerspace")
writer.clear
# Effect of clear is immediately visible to readers
reader2 = Hammerspace.new("/tmp/hammerspace")
reader2.keys #=> []
reader2.close
writer["today"] = "bar"
writer.close
reader3 = Hammerspace.new("/tmp/hammerspace")
reader3.keys #=> ["today"]
reader3.close
If you want to replace the existing data with new data without flushing in
between (i.e., in a "transaction"), use replace
instead.
h = Hammerspace.new("/tmp/hammerspace")
h["yesterday"] = "foo"
h["today"] = "bar"
h.close
reader1 = Hammerspace.new("/tmp/hammerspace")
reader1.keys #=> ["yesterday", "today"]
reader1.close
# Writer wants to remove everything except "today"
writer = Hammerspace.new("/tmp/hammerspace")
writer.replace({"today" => "bar"})
# Old keys still present because writer has not flushed yet
reader2 = Hammerspace.new("/tmp/hammerspace")
reader2.keys #=> ["yesterday", "today"]
reader2.close
writer.close
reader3 = Hammerspace.new("/tmp/hammerspace")
reader3.keys #=> ["today"]
reader3.close
Interleaving Reads and Writes
To ensure writes are available to subsequent reads, every read operation implicitly flushes any previous writes.
h = Hammerspace.new("/tmp/hammerspace")
h["foo"] = "bar"
# Implicitly flushes write (builds on-disk hash for fast lookup), then opens
# newly written on-disk hash for reading
h["foo"] #=> "bar"
h.close
While batch reads or writes are relatively fast, interleaved reads and writes are slow because the hash is rebuilt very often.
# One flush, fast
h = Hammerspace.new("/tmp/hammerspace")
h["a"] = "100"
h["b"] = "200"
h["c"] = "300"
h["a"] #=> "100"
h["b"] #=> "200"
h["c"] #=> "300"
h.close
# Three flushes, slow
h = Hammerspace.new("/tmp/hammerspace")
h["a"] = "100"
h["a"] #=> "100"
h["b"] = "200"
h["b"] #=> "200"
h["c"] = "300"
h["c"] #=> "300"
h.close
To avoid this overhead, and to ensure consistency during iteration, the each
method opens its own private reader for the duration of the iteration. This is
also true for any method that uses each
, including all methods provided by
Enumerable
.
h = Hammerspace.new("/tmp/hammerspace")
h["a"] = "100"
h["b"] = "200"
h["c"] = "300"
# Flushes the above writes, then opens a private reader for the each call
h.each do |key, value|
# Writes are done in bulk without flushing in between
h[key] = value[0]
end
# Flushes the above writes, then opens the reader
h.to_hash #=> {"a"=>"1", "b"=>"2", "c"=>"3"}
h.close
Unsupported Methods
Besides the incompatibilities with Ruby's Hash
discussed above, there are
some Hash
methods that are not supported.
- Methods that return a copy of the hash:
invert
,merge
,reject
,select
-
rehash
is not needed, since hammerspace only supports string keys, and keys are effectivelydup
d -
delete
does not return the value deleted, and it does not support block usage -
hash
andto_s
are not overriden, so the behavior is that ofObject#hash
andObject#to_s
-
compare_by_identity
,compare_by_identity?
-
pretty_print
,pretty_print_cycle