RIALTO-ETL
RIALTO-ETL is a set of ETL tools for RIALTO, Stanford Libraries' research intelligence project
Dependencies
- Ruby >= 2.5.0
Usage
Pipeline to harvest organizations from Stanford Profiles API into RIALTO Core
exe/extract call StanfordOrganizations > organizations.json
exe/transform call StanfordOrganizations -i organizations.json > organizations.sparql
exe/load call Sparql -i organizations.sparql
Pipeline to harvest researchers from Stanford Profiles API into RIALTO Core
Notes:
- The extract step takes about 20 min as it has to make ~796 requests to get the full 1.6GB of data.
- The transform step depends on
organizations.json
from organizations pipeline. - The transform step takes about 35 minutes on a single thread
- The load step takes about 7 hours on a single thread
exe/extract call StanfordResearchers > researchers.ndj
exe/transform call StanfordPeople -i researchers.ndj > researchers.sparql
exe/load call Sparql -i researchers.sparql
Composite ETL
The composite ETL tools allow you to streamline operations by running extracts, transforms, and loads on batches of data. These tools are available for grants and publications, currently.
Pipeline to harvest grants from Stanford SeRA API into RIALTO Core
Notes:
- The transform step depends on
researchers.ndj
from the researcher pipeline - Extracting and transforming will be sped up by setting a batch size with
-s
. - The load step can be skipped with the
--skip-load
flag. - The extract and transform steps will be skipped if the files already exist. Use the
--force
/-f
flag to overwrite files. - If you need to run the transform and load on already extracted grant data, you can run them independently via
exe/transform call StanfordGrants -i my_extracted_grant_file.json > my_transformed_grant_file.sparql
andexe/load call Sparql -i my_transformed_grant_file.sparql
.
exe/transform call StanfordPeopleList -i researchers.ndj > researchers.csv
exe/grants load -s 3 -i researchers.csv
See the output of exe/grants help load
to see more of the available CLI options
Pipeline to harvest publications from Web of Science API into RIALTO Core
Notes:
- The extract step can be skipped with the
--skip-extract
flag, in which case cached files in the input directory (--input-directory
/-d
flag) will be used for transformation and loading. - The load step can be skipped with the
--skip-load
flag. - The extract and transform steps will be skipped if the files already exist. Use the
--force
/-f
flag to overwrite files. - If you need to run the transform and load on already extracted publication data, you can run them independently via
exe/transform call WebOfScience -i my_extracted_publication_file.ndj > my_transformed_publication_file.sparql
andexe/load call Sparql -i my_transformed_publication_file.sparql
.
exe/publications load -d ../rialto-sample-data/publications -o data/transformed_publications
See the output of exe/publications help load
to see more of the available CLI options
Authentication
If you are using the StanfordResearchers
or StanfordOrganizations
extract methods, you will first need to obtain a token for the CAP API and set the Settings.cap.api_key
value to this token. To set this value, either set an environment variable named SETTINGS__CAP__API_KEY
or add the value for this to config/settings.local.yml
(which is ignored under version control and should never be checked in), like so:
cap:
api_key: 'foobar'
Similarly, if you are using the SPARQL writer, then you need to set SETTINGS__SPARQL_WRITER__API_KEY
or:
sparql_writer:
api_key: 'key' # SPARQL Proxy API key
Tokens are stored in shared_configs.
Run the extract process
Run exe/extract
to run a named extractor and print output to STDOUT:
$ exe/extract call StanfordResearchers
{"count":10,"firstPage":true,"lastPage":false,"page":1,"totalCount":29089,"totalPages":2909,"values":[{"administrativeAppointments":[...
List registered extract processes
Run exe/extract list
to print out the list of callable extractors.
Transform
Run exe/transform
to run a named transformer, based on Traject, on a named input file and print output to STDOUT:
$ exe/transform call StanfordOrganizationsToVivo -i stanford_organizations.json
{"@id":"http://authorities.stanford.edu/orgs#vice-provost-for-undergraduate-education/stanford-introductory-studies/freshman-and-sophomore-programs","@type":"http://vivoweb.org/ontology/core#Division","rdfs:label":"Freshman and Sophomore Programs","vivo:abbreviation":["FFQH"]}
Run exe/transform list
to print out the list of callable transformers.
Load
Run exe/load
to run a named extractor and print output to STDOUT:
$ exe/load call Sparql -i whatever.sparql
...
Configuration
RIALTO-ETL uses the config gem to manage configuration, allowing for flexible variation of configs between environments and hosts. By default, the gem assumes it is running in the 'production'
environment and will look for its configurations per the config gem documentation. To explicitly set the environment to test
or development
, set an environment variable named ENV
.
Help
$ exe/extract help
Commands:
extract call NAME # Call named extractor (`extract list` to see available names)
extract help [COMMAND] # Describe subcommands or one specific subcommand
extract list # List callable extractors
$ exe/transform help
Commands:
transform call NAME # Call named transformer (`transform list` to see available names)
transform help [COMMAND] # Describe subcommands or one specific subcommand
transform list # List callable transformers
$ exe/load help
Commands:
load call NAME -i, --input-file=FILENAME # Call named loader (` list` to see available names)
load help [COMMAND] # Describe available commands or one specific command
load list # List callable loaders
Documentation
Development
After checking out the repo, run bin/setup
to install dependencies. Then, run rake spec
to run the tests. You can also run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
, which will create a git tag for the version, push git commits and tags, and push the .gem
file to rubygems.org.
Sample Data
The sample data we use to work with Rialto::Etl is contained in a private GitHub repository
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/sul-dlss/rialto-etl.