No commit activity in last 3 years
No release in over 3 years
MARCXML package for the metacrunch ETL toolkit.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Runtime

>= 2.11
 Project Readme

metacrunch-marcxml

Gem Version Code Climate Test Coverage CircleCI

This is the official MARCXML package for the metacrunch ETL toolkit. It allows you to access MARCXML data by a simple and powerful Ruby API.

There is no runtime dependency to metacrunch, so it's fine to use this gem in any Ruby application where you need to access MARCXML data.

Note: Before 2.0 this package was named metacrunch-mab2 and has been renamed to metacrunch-marcxml. Due to this renaming and further development there are breaking API changes if you upgrade from 1.x to 2.x.

Installation

Include the gem in your Gemfile

gem "metacrunch-marcxml", "~> 4.0.0"

and run $ bundle install to install it.

Or install it manually

$ gem install metacrunch-marcxml

Usage example

Note: For working examples on how to use this package in a metacrunch job check out our demo repository.

Loading the library

require "metacrunch/marcxml"

Parsing a MARCXML file

# Load a MARCXML file (from a remote location in this example).
require "open-uri"
marcxml = URI.open("http://d-nb.info/982392028/about/marcxml"){|io| io.read}

# Now parse the file
record = Metacrunch::Marcxml.parse(marcxml)
# .. or 
record = Metacrunch::Marcxml[marcxml]
# .. or
record = Metacrunch::Marcxml(marcxml)

If the MARCXML contains a collection of records you can set collecton_mode: true to get an array of all records. If set to false (the default) you get the first record.

# Parse a record collection
marcxml = %{
    <collection>
        <record>...</record>
        <record>...</record>
        <record>...</record>
    </collection>
}
records      = Metacrunch::Marcxml.parse(marcxml, collection_mode: true)
first_record = Metacrunch::Marcxml.parse(marcxml, collection_mode: false)

Reading the leader

leader = record.leader
# => #<Metacrunch::Marcxml::Record::Leader:0x00007f93741ddc10 ...>

leader.value
# => "00000pam a2200000 c 4500"

Reading control fields

controlfield = record.controlfield("005")
# => #<Metacrunch::Marcxml::Document::Controlfield:0x007fd4c5120ec0 ...>

tag = controlfield.tag
# => "005"

value = controlfield.value
# => "20130926112144.0"

Reading data fields / sub fields

# Find fields matching tag=100 and ind1=1 (author)
datafield_set = record.datafields("100", ind1: "1")
# => #<Metacrunch::Marcxml::Document::DatafieldSet:0x007fd4c4ce4b40 ...>

first_author = datafield_set.first # set is an Enumerable
# => #<Metacrunch::Marcxml::Document::Datafield:0x007fd4c5129480 ...>

# Get the sub fields matching code=a (author name)
subfield_set = first_author.subfields("a")
# => #<Metacrunch::Marcxml::Document::SubfieldSet:0x007fd4c4c779f0 ...>

# because sub field "a" is not repeatable we know there can only be one match
first_author_subfield = subfield_set.first # subfield_set is an Enumerable
# => #<Metacrunch::Marcxml::Document::Subfield:0x007fd4c5129660 ...>

first_author_name = first_author_subfield.value
# => "Orwell, George"

# ... this can be a one liner
first_author_name = record.datafields(100, ind1: "1").subfields("a").values.first

Direct value access using a query string

FEATURE IS EXPERIMENTAL AND SUBJECT TO CHANGE

Access fields as described above is flexible but very verbose. Most of the time you know your data and you are interested in a simple and direct way to access the field values.

For this case we provide a way to query field values using a simple query string.

# Get the value of control field "005"
record["005"]
# => "20130926112144.0"

# Get the first value of data field tag=100, ind1=1, sub field code=a
record["1001*a"].first
# => "Orwell, George"

The query string syntax is simple. Each query string starts with three letters for the tag. If the tag starts with 00 it is considered a query for a control field value. Otherwise it is considered a data field / sub field query. In that case the next two characters are used to match ind1 and ind2. The default value is * which matches every indicator value. #, - and _ are interpreted as blank. The last characters are used to match the code of the sub fields. To query for more than one sub field code you may separate them using commas.

By default sub field values are flattened. To get an array of matching sub field values for each matching data field set flatten_subfields: false.

By default only the values are returned. That means you can't see from which sub field the value was extracted. If you query for just one sub field this information is not needed. But if you query for more than one sub field it may be useful. To get the sub field value and the sub field code to can set values_as_hash: true. This works for control fields as well.

Examples

record["1001#a"] 
# => [["Orwell, George"]]

record["1001#a", flatten_subfields: true] 
# => ["Orwell, George"]

record["020**a,c"]
# => [
#      ["9783548267456", "kart. : EUR 6.00 (DE), EUR 6.20 (AT), sfr 11.00"],
#      ["3548267459", "kart. : EUR 6.00 (DE), EUR 6.20 (AT), sfr 11.00"]
#    ]

record["264#1a,b,c", flatten_subfields: true]
# => ["Berlin", "Ullstein", "2007"]

record["264#1a,b,c", flatten_subfields: true, values_as_hash: true]
# => [{"a"=>"Berlin"}, {"b"=>"Ullstein"}, {"c"=>"2007"}]

License

metacrunch-marcxml is available at github under MIT license.