Harvest DOR object metadata by the item or collection, plus code framework to write Solr docs to index

Harvestdor::Indexer

A gem to harvest metadata from DOR, plus the skeleton code to index it and write it to Solr.

Installation

Add this line to your application's Gemfile:

gem 'harvestdor-indexer'

And then execute:

$ bundle

Or install it yourself as:

$ gem install harvestdor-indexer

Usage

You must override the index method and provide configuration options. Writing a script to run the indexer is also recommended; see the example under "Run it".

Configuration / Set up

Create a yml config file for the collection you are indexing into Solr.

See spec/config/ap.yml for an example. You will want to copy that file and change the following settings:

  • whitelist
  • dor fetcher service_url
  • solr url
  • harvestdor log_dir, log_name
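
As a rough illustration of those settings, the sketch below builds a config hash in Ruby and dumps it as YAML. The key nesting, druids, URLs, and log names are all assumptions that only mirror the setting names listed above; spec/config/ap.yml remains the reference for the keys the gem actually reads.

```ruby
require 'yaml'

# Hypothetical layout: the nesting only mirrors the setting names in the list
# above. Check spec/config/ap.yml for the real key names before relying on it.
config = {
  'whitelist'   => ['druid:aa111bb2222', 'druid:cc333dd4444'],            # placeholder druids
  'dor_fetcher' => { 'service_url' => 'http://dor-fetcher.example.org' }, # placeholder URL
  'solr'        => { 'url' => 'http://solr.example.org/solr/my_coll' },   # placeholder URL
  'harvestdor'  => { 'log_dir' => 'logs', 'log_name' => 'my_coll.log' }   # placeholder names
}

# Serialise to YAML text that could be saved as config/my_coll.yml.
config_yaml = config.to_yaml
puts config_yaml
```

Generating the file this way is optional; copying ap.yml and editing it by hand works just as well.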

Whitelist

The whitelist is how you specify which objects to index. The whitelist can be

  • an Array of druids inline in the config yml file
  • a filename containing a list of druids (one per line)
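
The two whitelist forms can be normalised to a single Array of druids. The helper below is a minimal sketch of that idea; load_whitelist is a hypothetical name and is not part of the gem, which may handle the setting differently.

```ruby
# Hypothetical helper, not part of harvestdor-indexer: turn a whitelist
# setting into an Array of druid strings.
def load_whitelist(setting)
  # Inline form: an Array of druids taken straight from the yml file.
  return setting.map { |druid| druid.to_s.strip }.reject(&:empty?) if setting.is_a?(Array)

  # File form: a filename containing one druid per line; blank lines are skipped.
  File.readlines(setting.to_s, chomp: true).map(&:strip).reject(&:empty?)
end
```

Skipping blank lines is a choice made for this sketch, so that a trailing newline in the file does not produce an empty druid.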

If a druid, according to the object's identityMetadata on its purl page, is for a

  • collection record: then we process all the item druids in that collection (as if they were included individually in the whitelist)
  • non-collection record: then we process the druid as an individual item
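
The expansion rule above can be sketched as follows. This is not the gem's implementation: expand_whitelist and its two callables are hypothetical stand-ins for the identityMetadata check and the collection-membership lookup, and the druids are toy data.

```ruby
# Hypothetical sketch of the rule above. `collection` answers "is this druid a
# collection record?" and `members` returns the item druids in a collection.
def expand_whitelist(druids, collection:, members:)
  expanded = druids.flat_map do |druid|
    # A collection is replaced by its items; anything else stands for itself.
    collection.call(druid) ? members.call(druid) : [druid]
  end
  expanded.uniq # de-duplication is an assumption of this sketch
end

# Toy data: one collection holding two items, plus one standalone item.
collections = { 'druid:cc333dd4444' => ['druid:ee555ff6666', 'druid:gg777hh8888'] }

to_index = expand_whitelist(
  ['druid:aa111bb2222', 'druid:cc333dd4444'],
  collection: ->(druid) { collections.key?(druid) },
  members:    ->(druid) { collections.fetch(druid) }
)
puts to_index
```

Passing the lookups in as callables keeps the sketch independent of any particular service client.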

Override the Harvestdor::Indexer.index method

In your code, override this method from the Harvestdor::Indexer class

# create Solr doc for the druid and add it to Solr
#  NOTE: don't forget to send commit to Solr, either once at end (already in harvest_and_index), or for each add, or ...
def index resource

  benchmark "Indexing #{resource.druid}" do
    logger.debug "About to index #{resource.druid}"
    doc_hash = {}
    doc_hash[:id] = resource.druid

    # you might add things from Indexer level class here
    #  (e.g. things that are the same across all documents in the harvest)
    solr.add doc_hash
    # TODO: provide call to code to update DOR object's workflow datastream??
  end
end

Run it

Run bundle install first if you have not already.

You may want to write a script to run the code. Your script might look like this:

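#!/usr/bin/env ruby
# make the project root and lib directory loadable
$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..'))
$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
require 'rubygems'
begin
  require 'your_indexer'
rescue LoadError
  # fall back to the bundled gems if the first require fails
  require 'bundler/setup'
  require 'your_indexer'
end

# the last command-line argument is the path to the collection config yml file
config_yml_path = ARGV.pop
if config_yml_path.nil?
  puts "** You must provide the full path to a collection config yml file **"
  exit
end

# assumption: opts was left undefined in the original example; an empty hash means no extra options
opts = {}
indexer = Harvestdor::Indexer.new(config_yml_path, opts)
indexer.harvest_and_index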

Then you run the script like so:

 $ ./bin/indexer config/(your coll).yml

Run the script from a deployed instance, since that box is already set up to talk to the DOR Fetcher service and the SUL Solr indexes.

Contributing