Project

rsgrep

0.0
No commit activity in last 3 years
No release in over 3 years
Pure Ruby implementation of the sorted grep command.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

 Project Readme

Rsgrep

This is a pure Ruby implementation with the same goal as the small but amazing sorted grep program written by Stephen C. Losen.

It is designed for use on large, lexicographically sorted files. It allows you to search for lines that begin with a certain pattern (searching for anything at a position anywhere other than the start of a line isn't possible using a binary search).

Installation

Add this line to your application's Gemfile:

gem 'rsgrep'

And then execute:

$ bundle

Or install it yourself as:

$ gem install rsgrep

Usage

The gem monkey patches into the File class. It can be used in the following two ways:

require 'rsgrep'

puts File.sgrep("key pattern", "path/to/file.txt")
#=> array of all lines that start with "key pattern", empty array for no
#   matches.

# or ...

f = File.open("path/to/file.txt")
puts f.sgrep("key pattern")
#=> array of all lines that start with "key pattern", empty array for no
#   matches.

f.close

You can pass both of these functions an options hash. Here are some examples of the options you can pass:

require 'rsgrep'

f = File.open("path/to/file.txt")

# Case insensitive search
f.sgrep("PaTTern", :insensitive => true)

f.close

NOTE: There are a lot of caveat involved in getting this to work properly. For example, you cannot do a case insensitive search on a file that is not sorted in a case insensitive fashion. The results will not be what you expect.

This will be true of almost all options you pass to rsgrep. You will get the best results on a file that uses alphanumeric characters and only uses one casing (upper or lower, doesn't matter which).

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Added some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

A note on the specs...

Because writing specs for this required having a very large file to scan, I had to choose a very large file that was freely available. For obvious reasons, I cannot put the file into this repository but you can download it from here. It's the 0th file of the 3grams dataset in English.

Direct link: http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-3gram-20090715-0.csv.zip

It's about 440mb compressed, 3gb uncompressed. You will need to uncompress it into the spec/data directory in order to run the specs successfully.

This file is a bit of a bad example though, to be honest. I'm only using it at the moment so that the specs give a good idea of how long it takes to scan through such a large file. The reason that this is not a good file to use is because it isn't sorted in a way the rsgrep knows how to process yet. Its handling of capital letters and punctuation are a bit confusing and I haven't yet been able to find a consistent and clean way of scanning it.