Milemarker -- track (and probably log) progress in batch jobs
Never again write code of the form:
log.info "Finished 1_000 in #{secs} seconds at a rate of #{total.to_f / secs}"
Usage
require 'milemarker'
require 'logger'
input_file = "records.ndj"
# Create a new milemarker. Default batch_size is 1_000
milemarker = Milemarker.new(name: "load #{input_file}", batch_size: 2_000_000)
logger = Logger.new(STDERR)
milemarker.logger = logger
File.foreach(input_file) do |line|
  do_whatever_needs_doing(line)
  milemarker.increment_and_log_batch_line
end
milemarker.log_final_line # if logging is set up
# Identical to the above, but do the logging "by hand"
File.foreach(input_file) do |line|
  do_whatever_needs_doing(line)
  milemarker.increment_and_on_batch { logger.info milemarker.batch_line }
end
logger.info milemarker.final_line
# Sample output
# ...
# I, [2021-11-02T01:51:06.959137 #11710] INFO -- : load records.ndj 8_000_000. This batch 2_000_000 in 26.2s (76_469 r/s). Overall 72_705 r/s.
# I, [2021-11-02T01:51:36.992831 #11710] INFO -- : load records.ndj 10_000_000. This batch 2_000_000 in 30.0s (66_591 r/s). Overall 71_394 r/s.
# ...
# I, [2021-11-02T02:01:56.702196 #11710] INFO -- : load records.ndj FINISHED. 27_138_118 total records in 00h 12m 39s. Overall 35_718 r/s.
Basic usage
Most programs will probably use milemarker via #increment_and_log_batch_line (or its counterpart, #increment_and_on_batch {|milemarker| ... }). As the name suggests, this will:
- increment the batch counter
- if the batch counter >= the batch size:
  - run the provided block (or write the log line)
  - reset count/time/etc. for the next batch
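The steps above can be sketched with a tiny stand-in class. This is only an illustration of the increment-then-fire logic, not the gem's implementation; the class name TinyMarker and its internals are hypothetical.

```ruby
# Hypothetical stand-in for illustration only; not milemarker's own code.
class TinyMarker
  attr_reader :count, :batch_size

  def initialize(batch_size: 1_000)
    @batch_size = batch_size
    @count = 0        # total increments so far
    @batch_count = 0  # increments since the last batch boundary
  end

  # Bump both counters; once a full batch has accumulated,
  # yield to the caller's block and reset the per-batch counter.
  def increment_and_on_batch
    @count += 1
    @batch_count += 1
    return unless @batch_count >= @batch_size

    yield self
    @batch_count = 0
  end
end

marker = TinyMarker.new(batch_size: 3)
fired_at = []
10.times { marker.increment_and_on_batch { |m| fired_at << m.count } }
p fired_at # the block fired at counts 3, 6 and 9; the tenth call sits in a partial batch
```

Note that the last, partial batch never triggers the block, which is why a final line (or a final flush) after the loop is usually needed.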
Some examples:
# Logging, as above
milemarker = Milemarker.new(batch_size: 1000, name: 'Load myfile')
milemarker.increment_and_on_batch { logger.info milemarker.batch_line }
# Alert when things seem to take too long
# (way_too_long is a threshold in seconds that you define)
milemarker.increment_and_on_batch do |milemarker|
  secs = milemarker.last_batch_seconds
  if secs > way_too_long
    logger.error "Whoa: #{secs} is too long for a batch of #{milemarker.batch_size}"
  end
end
# #on_batch and #increment_and_on_batch can be used to do real (i.e.,
# non-logging) work after every `batch_size` calls, too
queue = []
my_stuff.each do |doc|
  queue << do_something_to(doc)
  milemarker.increment_and_on_batch do |milemarker|
    write_to_datastore(queue)
    queue = []
    logger.info milemarker.batch_line
  end
end
# Flush whatever is left over from the final, partial batch
write_to_datastore(queue) unless queue.empty?
#incr and #on_batch(&blk) are also available separately if you need to be more explicit and less atomic.
All the components that make up a batch_line (e.g., the records/second as a nice string) are available to roll your own batch line. See the API documentation for details.
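As a rough idea of what rolling your own line could look like, here is a sketch that builds a line in the style of the sample output from raw numbers. It deliberately avoids guessing at the gem's accessor names: underscored and my_batch_line are hypothetical helpers, and in real code you would feed them values taken from the milemarker object as described in the API documentation.

```ruby
# Hypothetical helpers; neither is part of milemarker.

# Format a non-negative integer with underscores between groups of
# three digits, in the style of the sample output (27_138_118).
def underscored(number)
  number.to_i.to_s.reverse.scan(/\d{1,3}/).join('_').reverse
end

# Build a batch line from plain numbers supplied by the caller.
def my_batch_line(name:, total:, batch_count:, batch_seconds:)
  rate = batch_seconds.positive? ? (batch_count / batch_seconds.to_f).round : 0
  format('%s %s. This batch %s in %.1fs (%s r/s).',
         name, underscored(total), underscored(batch_count),
         batch_seconds, underscored(rate))
end

puts my_batch_line(name: 'load records.ndj', total: 8_000_000,
                   batch_count: 2_000_000, batch_seconds: 26.2)
```

The rate printed here is computed from the rounded 26.2 seconds, so it will not match the sample output digit for digit.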
Incorporating a logger into milemarker
For standard logging cases, you can also pass in a logger, or let milemarker create one for its own use based on an IO-like object you provide:
logger = Logger.new(STDERR)
milemarker = Milemarker.new(name: 'my_process', batch_size: 10_000, logger: logger)
# same thing
milemarker = Milemarker.new(name: 'my_process', batch_size: 10_000)
milemarker.logger = logger
# same thing again
milemarker = Milemarker.new(name: 'my_process', batch_size: 10_000)
milemarker.create_logger!(STDERR)
File.foreach(input_file) do |line|
  do_whatever_needs_doing(line)
  milemarker.increment_and_log_batch_line
end
milemarker.log_final_line
# All the logging methods take an optional :level argument
milemarker.log_final_line(level: :debug)
Structured logging with Milemarker::Structured
Milemarker::Structured will return hashes for #batch_line and #final_line (aliased to #batch_data and #final_data, respectively) and pass those hashes along to whatever logger you provide. #create_logger! for this subclass will create a logger that provides JSON lines instead of text, too.
Presumably, if you pass in your own logger you'll use something like semantic_logger or ougai.
milemarker = Milemarker::Structured.new(name: 'my_process', batch_size: 10_000)
milemarker.create_logger!(STDERR)
File.foreach(input_file) do |line|
  do_whatever_needs_doing(line)
  milemarker.increment_and_log_batch_line
end
# Usually one line; broken up for readability
# {"name":"my_process","batch_count":10_000,"batch_seconds":97.502088,
# "batch_rate":1.035875252230496,"total_count":100,"total_seconds":97.502094,
# "total_rate":1.0358751884856956,"level":"INFO","time":"2021-11-06 17:32:21 -0400"}
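If you would rather stay with the standard library than bring in semantic_logger or ougai, a plain Logger can be made to emit JSON lines with a custom formatter. The sketch below shows one way to do that; json_lines_logger is a hypothetical helper, and the exact fields and time format that #create_logger! produces may differ from what this prints.

```ruby
require 'json'
require 'logger'
require 'stringio'

# Hypothetical helper: a stdlib Logger that writes one JSON object per line.
# Hash messages are used as the payload; anything else goes under "message".
def json_lines_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, msg|
    payload = msg.is_a?(Hash) ? msg : { message: msg.to_s }
    payload.merge(level: severity, time: time.to_s).to_json + "\n"
  end
  logger
end

out = StringIO.new
logger = json_lines_logger(out)
# A hash like the ones Milemarker::Structured is described as returning
logger.info({ name: 'my_process', batch_count: 10_000, batch_seconds: 97.5 })
puts out.string
```

Passing the hash in parentheses keeps Ruby from treating it as keyword arguments or a block.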
Threadsafety
A call to milemarker.threadsafify! will wrap increment_and_on_batch (and increment_and_log_batch_line) to be a threadsafe atomic operation at the cost of some performance.
milemarker.threadsafify!
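To see why the increment and the batch check need to happen as one atomic step, here is a small sketch that guards both with a Mutex. It illustrates the general idea only and says nothing about how the gem does it internally; LockedCounter is a hypothetical class.

```ruby
# Hypothetical illustration; not milemarker's own code.
class LockedCounter
  attr_reader :count

  def initialize(batch_size:)
    @batch_size = batch_size
    @count = 0
    @mutex = Mutex.new
  end

  # The whole increment-check-yield sequence runs under one lock, so no two
  # threads can see the same count or fire the same batch twice.
  def increment_and_on_batch
    @mutex.synchronize do
      @count += 1
      yield @count if (@count % @batch_size).zero?
    end
  end
end

counter = LockedCounter.new(batch_size: 100)
batches = []
threads = 8.times.map do
  Thread.new { 1_000.times { counter.increment_and_on_batch { |n| batches << n } } }
end
threads.each(&:join)
puts "#{counter.count} increments, #{batches.size} batches"
```

Because the block runs while the lock is held, other threads wait for it to finish; that waiting is the performance cost mentioned above.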
Turning off logging
If the logger is set to nil, no logging will occur.
# Turn off logging
milemarker.logger = nil
You could also configure your logger to ignore lower-severity messages:
milemarker.logger.level = :error
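The effect of raising the level can be checked with the standard library alone. This sketch uses a plain Logger writing to a StringIO; logged_at is a hypothetical helper, and the message strings are made up for the demonstration.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: return what a logger set to `level` actually
# writes when given one INFO call and one ERROR call.
def logged_at(level)
  out = StringIO.new
  logger = Logger.new(out)
  logger.level = level
  logger.info 'batch finished'   # dropped when level is above INFO
  logger.error 'batch too slow'  # written at :info and at :error
  out.string
end

puts logged_at(:info)   # both messages
puts logged_at(:error)  # only the ERROR message
```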
Accuracy
Note that milemarker isn't designed for real benchmarking. The assumption is that whatever work your code is actually doing will drown out any inefficiencies in the milemarker code, and milemarker numbers can be used to suss out where weird things are happening.
Installation
Add this line to your application's Gemfile:
gem 'milemarker'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install milemarker
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/billdueber/milemarker.
License
The gem is available as open source under the terms of the MIT License.