Multisert
Multisert is a buffer that handles bundling up INSERTs, which increases runtime performance.
Installation
Add this line to your application's Gemfile:
gem 'multisert'
And then execute:
$ bundle
Or install it yourself as:
$ gem install multisert
Usage
Let's start with a table:
CREATE TABLE IF NOT EXISTS some_database.some_table (
field_1 int default null,
field_2 int default null,
field_3 int default null,
field_4 int default null
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now let's say we want to insert 1,000,000 records after running the
current iterator through some_magical_calculation
into our table from above.
Let's assume that some_magical_calculation
takes a single integer input and
returns an array of 4 values.
(0..1_000_000).each do |i|
res = some_magical_calculation(i)
dbclient.query %[
INSERT INTO some_database.some_table (field_1, field_2, field_3, field_4)
VALUES (#{res[0]}, #{res[1]}, #{res[2]}, #{res[3]})]
end
This works, but we can improve it's speed by bundling up inserts using
Multisert
:
buffer = Multisert.new connection: dbclient,
database: 'some_database',
table: 'some_table',
fields: ['field_1', 'field_2', 'field_3', 'field_4']
buffer.with_buffering do |b|
(0..1_000_000).each do |i|
res = some_magical_calculation(i)
b << res
end
end
We start by creating a new Multisert instance, providing the database
connection, database and table, and fields as attributes. Next, we leverage
#with_buffering
to wrap our sample iteration. Within the block, we shovel the
results from some_magical_calculation
into the Multisert instance, which then
handles all the heavy lifting in terms of writing to the database.
As an aside, #with_buffering
is handling the following under the hood:
(0..1_000_000).each do |i|
res = some_magical_calculation(i)
buffer << res
end
buffer.write!
As we iterate through, the Multisert instance will build up the records and
then write itself to the specified database table when it hits an internal
count (default is 10_000 entries, but this can be adjusted via the
max_buffer_count
attribute). The buffer.write!
at the end ensures that
any pending entries are written to the database table that were not
automatically taken care of by the auto-write that will kick in during the
iteration.
Insert Strategies
Multisert defaults to using INSERT INTO
on #write!
, but you can set the
insert strategy to REPLACE INTO
or INSERT IGNORE
:
buffer = Multisert.new
#=> would use INSERT INTO on #write! by default
buffer.insert_strategy = :replace
#=> would now use REPLACE INTO on #write!
buffer.insert_strategy = :ignore
#=> would now use INSERT IGNORE on #write!
Performance
Individual vs Buffer
The gem has a quick performance test built in that can be run via:
$ ruby ./performance/multisert_performance_test
We ran the performance test (with some modification to iterate the test 5 times) and receive the following output:
$ ruby ./performance/multisert_performance_test
# test 1:
# insert w/o buffer took 53.37s to insert 100000 entries
# multisert w/ buffer of 10000 took 1.77s to insert 100000 entries
#
# test 2:
# insert w/o buffer took 53.22s to insert 100000 entries
# multisert w/ buffer of 10000 took 1.84s to insert 100000 entries
#
# test 3:
# insert w/o buffer took 54.42s to insert 100000 entries
# multisert w/ buffer of 10000 took 1.9s to insert 100000 entries
#
# test 4:
# insert w/o buffer took 53.38s to insert 100000 entries
# multisert w/ buffer of 10000 took 1.81s to insert 100000 entries
#
# test 5:
# insert w/o buffer took 53.52s to insert 100000 entries
# multisert w/ buffer of 10000 took 1.78s to insert 100000 entries
As we can see, ~30x performance increase.
The performance test was run on a computer with the following specs:
Model Name: MacBook Air
Model Identifier: MacBookAir4,2
Processor Name: Intel Core i5
Processor Speed: 1.7 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 3 MB
Memory: 4 GB
All data was written to a mysql instance on localhost.
Buffer Sizes
Let's take a look at how buffer size comes into play.
We ran 3 separate and independent tests on the same computer as above. Additionally, also note that a buffer size of 0 and 1 are basically identical.
If we look at using a buffer size ranging from 0 - 10, we see the following performance:
If we take a step back and look at buffer sizes ranging from 0 - 100, we see the following performance:
Finally, if we look at buffer sizes ranging from 0 - 1,000 and 0 - 10,000 we see the following performance (spoiler alert: not much difference, just more data points!):
As can be seen, we see vastly improved performance as we increment our buffer from 0 - 100, but then level off thereafter.
FAQ
Packet Too Large / Connection Lost Errors
You may run into the "Packet Too Large" error when attempting to run a multisert. This can comeback as this error explicitly or as a "Connection Lost" error, depending on your mysql client.
To learn more, read the documentation.
If you need to you can adjust the buffer size by setting max_buffer_count
attribute. Generally, 10,000 to 100,000 is a pretty good starting range.
Does it work with Dates?
Yes, just pass in a Date instance and it will be converted to a mysql friendly format under the hood ("%Y-%m-%d"). If you need a special format, convert the date to a string that is in the form you want before passing it into Multisert.
Does it work with Times?
Yes, just pass in a Time instance and it will be converted to a mysql friendly format under the hood ("%Y-%m-%d %H:%M:%S"). If you need a special format, convert the time to a string that is in the form you want before passing it into Multisert.
Contributing
- Fork it
- Create your feature branch (
git checkout -b my-new-feature
) - Commit your changes (
git commit -am 'Added some feature'
) - Push to the branch (
git push origin my-new-feature
) - Create new Pull Request
License
Copyright 2013 Jeff Iacono
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.