Lucene.rb
Lucene.rb is a JRuby wrapper for the Lucene document database.
- Lucene (lucene.apache.org/java/docs/index.html) for querying and indexing.
Status
This library was once included in the neo4j.rb gem (version <= 0.4.6). The neo4j wrapper has since replaced it with the Java neo4j-lucene component that it includes. The main reason for the split was that neo4j now uses two-phase commit to keep the databases in sync.
TODO
- Upgrade to a newer Lucene 3.x release
- Check the thread synchronization and locking: is this needed?
Installation
Install JRuby
The easiest way to install JRuby is with RVM; see rvm.beginrescueend.com. Otherwise, see kenai.com/projects/jruby/pages/GettingStarted#Installing_JRuby
Lucene.rb
Lucene provides:
- Flexible queries: phrases, wildcards, compound boolean expressions, etc.
- Field-specific queries, e.g. title, artist, album
- Sorting
- Ranked searching
The Lucene index is updated after the transaction commits. A query cannot find a document that was created inside the same transaction in which the query is performed.
Lucene Document
In Lucene everything is a Document. A document can represent anything textual: a Word document, a DVD (the textual metadata only), or a Neo4j.rb node. A document is like a record or row in a relational database.
The following example shows how a document can be created with the << operator on the Lucene::Index class and found with the Lucene::Index#find method.
Example of how to write a document and find it:
require 'lucene'
include Lucene
# the var/myindex parameter is either a path where the index is stored or
# just a key if the index is kept in memory (see below)
index = Index.new('var/myindex')
# add one document (a document is like a record or row in a relational database)
index << {:id=>'1', :name=>'foo'}
# write to the index file
index.commit
# find a document with name foo
# hits is an enumerable collection of documents
hits = index.find{name == 'foo'}
# show the id of the first document (document 0) found
# (the document contains all stored fields - see below)
hits[0][:id] # => '1'
Notice that you have to call the commit method in order to update the index (both disk and in memory indexes). Performing several update and delete operations before a commit will give much better performance than committing after each operation.
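The commit behaviour described above can be pictured with a toy model. What follows is a minimal pure-Ruby sketch, not Lucene.rb's implementation: ToyIndex and all of its methods are hypothetical names, used only to illustrate why documents become searchable after commit and why several writes followed by one commit need only one flush.

```ruby
# Toy model of buffered writes (hypothetical class, not part of Lucene.rb).
class ToyIndex
  attr_reader :pending, :commit_count

  def initialize
    @committed = []    # documents visible to find
    @pending = []      # documents added since the last commit
    @commit_count = 0  # number of flushes performed
  end

  # Buffer a document; it is not searchable yet.
  def <<(doc)
    @pending << doc
    self
  end

  # Make all buffered documents searchable in one step.
  def commit
    @committed.concat(@pending)
    @pending = []
    @commit_count += 1
  end

  # Search committed documents only.
  def find(criteria)
    @committed.select { |doc| criteria.all? { |key, value| doc[key] == value } }
  end
end

index = ToyIndex.new
index << { :id => '1', :name => 'foo' }
index << { :id => '2', :name => 'bar' }
before = index.find(:name => 'foo') # nothing has been committed yet
index.commit                        # one flush covers both documents
after = index.find(:name => 'foo')
```

In this model `before` is empty and `after` holds the document with id '1', while only a single commit was needed for both writes.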
Keep indexing on disk
By default Lucene.rb keeps indexes in memory. That means that when the application restarts the index is gone and you have to reindex everything.
To store indexes on file:
Lucene::Config[:store_on_file] = true
Lucene::Config[:storage_path] = '/home/neo/lucene-db'
When a new index is created, its location is Lucene::Config[:storage_path] followed by the index path. Example:
Lucene::Config[:store_on_file] = true
Lucene::Config[:storage_path] = '/home/neo/lucene-db'
index = Index.new('/foo/lucene')
The example above will store the index at /home/neo/lucene-db/foo/lucene.
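The way the two paths combine can be sketched with Ruby's standard File.join. This is only an illustration of the resulting location; the helper name index_location is an assumption and says nothing about how Lucene.rb builds the path internally.

```ruby
# Hypothetical helper (not part of Lucene.rb): combine the configured
# storage path with the path given to Index.new.
def index_location(storage_path, index_path)
  # File.join collapses the duplicate separator at the join point
  File.join(storage_path, index_path)
end

location = index_location('/home/neo/lucene-db', '/foo/lucene')
puts location
```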
Indexing several values with the same key
Let's say a person can have several phone numbers. How do we index that?
index << {:id=>'1', :name=>'adam', :phone => ['987-654', '1234-5678']}
Id field
Every document must have one id field. If none is specified, the id field defaults to :id with type String. A different id field can be specified with the id_field property of field_infos on the index:
index = Index.new('some/path/to/the/index')
index.field_infos.id_field = :my_id
To change the type of my_id from String to a different type, see below.
Conversion of types
Lucene.rb can handle type conversion for you (the Java Lucene library stores all fields as Strings). For example, if you want the id field to be a Fixnum:
require 'lucene'
include Lucene
index = Index.new('var/myindex') # store the index at dir: var/myindex
index.field_infos[:id][:type] = Fixnum
index << {:id=>1, :name=>'foo'} # notice 1 is not a string now
index.commit
# find that document, hits is an enumerable collection of documents
hits = index.find(:name => 'foo')
# show the id of the first document (document 0) found
# (the document contains all stored fields - see below)
hits[0][:id] # => 1
If the field_info type parameter is not set then it has a default value of String.
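The idea of storing everything as a string and converting back on read can be sketched in plain Ruby. The code below is an assumption-laden illustration, not Lucene.rb's conversion code: CONVERTERS, to_stored and from_stored are hypothetical names, and Integer stands in for Fixnum, which newer Ruby versions no longer define.

```ruby
# Hypothetical converter table; not Lucene.rb internals.
CONVERTERS = {
  String  => lambda { |s| s },
  Integer => lambda { |s| Integer(s) },
  Float   => lambda { |s| Float(s) }
}

# Turn every value into a String, the way it would be stored.
def to_stored(doc)
  doc.each_with_object({}) { |(key, value), out| out[key] = value.to_s }
end

# Convert stored strings back; fields without a declared type stay Strings.
def from_stored(stored, types)
  stored.each_with_object({}) do |(key, value), out|
    out[key] = CONVERTERS.fetch(types.fetch(key, String)).call(value)
  end
end

types = { :id => Integer }
stored = to_stored(:id => 1, :name => 'foo')
doc = from_stored(stored, types)
```

Here `stored` holds only strings, and `doc` gets its :id back as an integer because a type was declared for that field.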
Storage of fields
By default only the id field is stored. That means that in the example above the :name field will not be included in the document.
Example:
hits = index.find('name' => 'foo')
doc = hits[0]
doc[:id] # => 1
doc[:name] # => nil
Set the field info :store to true if you want a field to be stored in the index (otherwise it will only be searchable).
Example:
require 'lucene'
include Lucene
index = Index.new('var/myindex') # store the index at dir: var/myindex
index.field_infos[:id][:type] = Fixnum
index.field_infos[:name][:store] = true # store this field
index << {:id=>1, :name=>'foo'} # notice 1 is not a string now
index.commit
# find that document, hits is an enumerable collection of documents
hits = index.find('name' => 'foo')
# let's say hits contains only one document, so we can use hits[0] for that one
# that document contains all stored fields
hits[0][:id] # => 1
hits[0][:name] # => 'foo'
Setting field infos
As shown above, you can set field infos like this:
index.field_infos[:id][:type] = Fixnum
Or you can set several properties like this:
index.field_infos[:id] = {:type => Fixnum, :store => true}
Tokenized
Field infos can be used to specify whether a field should be tokenized. If this value is not set, the entire content of the field is treated as a single term.
Example:
index.field_infos[:text][:tokenized] = true
If not specified, the default is false.
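The difference between a tokenized and an untokenized field can be pictured with a toy function. This is a deliberately simplified sketch: index_terms is a hypothetical name, and real Lucene analyzers do considerably more than lower-casing and splitting on whitespace.

```ruby
# Toy illustration only; not how Lucene analyzers actually work.
def index_terms(value, tokenized)
  if tokenized
    value.downcase.split(/\s+/) # one term per word
  else
    [value]                     # the whole value is a single term
  end
end

untokenized = index_terms('Hello Lucene World', false)
tokenized = index_terms('Hello Lucene World', true)
```

With the untokenized form only the exact full value matches, while the tokenized form can be matched by any of its individual words.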
Analyzer
Field infos can also be used to set which analyzer should be used. If none is specified, the default analyzer, org.apache.lucene.analysis.standard.StandardAnalyzer (:standard), is used.
index.field_infos[:code][:tokenized] = false
index.field_infos[:code][:analyzer] = :standard
The following analyzers are supported:
- :standard (default) - org.apache.lucene.analysis.standard.StandardAnalyzer
- :keyword - org.apache.lucene.analysis.KeywordAnalyzer
- :simple - org.apache.lucene.analysis.SimpleAnalyzer
- :whitespace - org.apache.lucene.analysis.WhitespaceAnalyzer
- :stop - org.apache.lucene.analysis.StopAnalyzer
For more info, check the Lucene documentation, lucene.apache.org/java/docs/
Simple Queries
Lucene.rb supports searching on several fields. Example:
# finds all documents having both name 'foo' and age 42
hits = index.find('name' => 'foo', :age=>42)
Range queries:
# finds all documents having both name 'foo' and age between 3 and 30
hits = index.find('name' => 'foo', :age=>3..30)
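To relate these hash queries to Lucene's own query syntax, the sketch below turns a hash into an equivalent query string, using the field:value, AND and inclusive [a TO b] forms of the Lucene query parser. The function to_lucene_query is a hypothetical name for illustration; it is not claimed that Lucene.rb builds its queries this way.

```ruby
# Hypothetical translation of a hash query into Lucene query parser syntax.
def to_lucene_query(criteria)
  clauses = criteria.map do |field, value|
    if value.is_a?(Range)
      # inclusive range; exclusive Ruby ranges are not handled in this sketch
      "#{field}:[#{value.begin} TO #{value.end}]"
    else
      "#{field}:#{value}"
    end
  end
  clauses.join(' AND ') # all fields must match
end

query = to_lucene_query('name' => 'foo', :age => (3..30))
puts query
```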
Lucene Queries
If the query is a string, it is treated as a Lucene query.
hits = index.find('name:foo')
For more information see: lucene.apache.org/java/2_4_0/queryparsersyntax.html
Advanced Queries (DSL)
The queries above can also be written in the Lucene.rb DSL:
hits = index.find { (name == 'andreas') & (foo == 'bar')}
Expressions with OR (|) are also supported. Example:
# find all documents with name 'andreas' or age between 30 and 40
hits = index.find { (name == 'andreas') | (age == 30..40)}
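A block DSL of this kind is commonly built in Ruby with instance_eval and method_missing. The following is a minimal sketch of that general technique under stated assumptions: Expr, FieldRef and QueryBuilder are hypothetical classes, not Lucene.rb's internals, and the output is a plain Lucene query string. The range is parenthesized in the sketch because `..` binds more loosely than `==` in Ruby.

```ruby
# Hypothetical mini-DSL; none of these classes are part of Lucene.rb.
class Expr
  attr_reader :text

  def initialize(text)
    @text = text
  end

  # Combine two expressions with AND
  def &(other)
    Expr.new("(#{text} AND #{other.text})")
  end

  # Combine two expressions with OR
  def |(other)
    Expr.new("(#{text} OR #{other.text})")
  end
end

class FieldRef
  def initialize(name)
    @name = name
  end

  # field == value builds a clause instead of comparing objects
  def ==(value)
    if value.is_a?(Range)
      Expr.new("#{@name}:[#{value.begin} TO #{value.end}]")
    else
      Expr.new("#{@name}:#{value}")
    end
  end
end

# BasicObject keeps the receiver nearly free of methods, so almost any
# bare word inside the block falls through to method_missing.
class QueryBuilder < BasicObject
  def self.build(&block)
    new.instance_eval(&block).text
  end

  def method_missing(name, *args)
    ::FieldRef.new(name)
  end
end

and_query = QueryBuilder.build { (name == 'andreas') & (foo == 'bar') }
or_query = QueryBuilder.build { (name == 'andreas') | (age == (30..40)) }
puts and_query
puts or_query
```

The design choice here is that bare words such as name and age are not variables but method calls on the builder, which is what lets the block read like a query.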
Sorting
Sorting is specified with the :sort_by parameter. Example:
hits = index.find(:name => 'foo', :sort_by=>:category)
To sort by several fields:
hits = index.find(:name => 'foo', :sort_by=>[:category, :country])
Example of specifying the sort order:
hits = index.find(:name => 'foo', :sort_by=>[Desc[:category, :country], Asc[:city]])
Thread-safety
The Lucene::Index is thread safe. It guarantees that an index is not updated from two threads at the same time.
Lucene Transactions
Use Lucene::Transaction to do atomic commits. When using a transaction you do not need to call the Index#commit method.
Example:
index = Index.new('var/index/foo')
Transaction.run do |t|
  index << {:id=>42, :name=>'andreas'}
  t.failure # rollback
end
result = index.find('name' => 'andreas')
result.size.should == 0
Documents that have not yet been committed are available through the uncommited property of the index.
Example:
index = Index.new('var/index/foo')
index.uncommited #=> [document1, document2]
Note that even though it looks like a new Index instance is created, index.uncommited may return a non-empty array. This is because Index.new acts as a singleton: it returns the existing instance instead of creating a new one.
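The singleton behaviour of Index.new can be illustrated with a small pure-Ruby class that overrides new. This is a sketch of the general pattern only: PathIndex is a hypothetical class, and keying the cached instances by path is an assumption made for the illustration, not a statement about Lucene.rb's code.

```ruby
# Hypothetical class showing a cached-instance "new"; not Lucene.rb's code.
class PathIndex
  @instances = {}

  class << self
    # Return the cached instance for this path, creating it on first use.
    def new(path)
      @instances[path] ||= super(path)
    end
  end

  attr_reader :path, :pending

  def initialize(path)
    @path = path
    @pending = [] # documents added but not yet committed
  end

  def <<(doc)
    @pending << doc
    self
  end
end

first = PathIndex.new('var/index/foo')
first << { :id => 42, :name => 'andreas' }
second = PathIndex.new('var/index/foo') # looks new, but is the same object
puts second.pending.size
```

Because `second` is the same object as `first`, it already carries the pending document added through `first`.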