HDFS agent for triglav, data-driven workflow tool.

Triglav::Agent::Hdfs

Triglav Agent for Hdfs

Requirements

  • JRuby >= 9.1.5.0
  • Java >= 1.8.0_45

Prerequisites

  • The HDFS path to be monitored must be created or modified atomically. For example, use either of the following strategies:
    • Create a temporary directory, copy the files into it, then move it to the target path
    • Create a marker file such as _SUCCESS after copying is done, and monitor the _SUCCESS file
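
The two strategies above can be sketched in Ruby. This is only a local-filesystem analogue, not code from this gem: publish_atomically is a hypothetical helper, and it combines both strategies for illustration (staging directory plus _SUCCESS marker). On a real cluster the final step would be an HDFS rename (for example `hdfs dfs -mv`) rather than File.rename.

```ruby
require 'fileutils'
require 'tmpdir'

# Publish files so that a watcher never sees a half-written directory.
# Hypothetical helper; local-filesystem analogue of the HDFS procedure.
def publish_atomically(target_dir, files)
  parent = File.dirname(target_dir)
  FileUtils.mkdir_p(parent)
  # Stage in a sibling directory so the final rename stays on one filesystem.
  staging = Dir.mktmpdir('.staging-', parent)
  files.each do |name, content|
    File.write(File.join(staging, name), content)
  end
  # Marker file for watchers that monitor _SUCCESS instead of the directory.
  FileUtils.touch(File.join(staging, '_SUCCESS'))
  # A single rename makes the complete directory visible at once.
  # It raises if target_dir already exists and is not empty.
  File.rename(staging, target_dir)
  target_dir
end
```

Staging next to the target, rather than in the system temporary directory, keeps the rename within one filesystem, which is what makes it a single step.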

Installation

Add this line to your application's Gemfile:

gem 'triglav-agent-hdfs'

And then execute:

$ bundle

Or install it yourself as:

$ gem install triglav-agent-hdfs

CLI

Usage: triglav-agent-hdfs [options]
    -c, --config VALUE               Config file (default: config.yml)
    -s, --status VALUE               Status storage file (default: status.yml)
    -t, --token VALUE                Triglav access token storage file (default: token.yml)
        --dotenv                     Load environment variables from .env file (default: false)
    -h, --help                       help
        --log VALUE                  Log path (default: STDOUT)
        --log-level VALUE            Log level (default: info)

Run as:

TRIGLAV_ENV=development bundle exec triglav-agent-hdfs --dotenv -c config.yml

Configuration

Prepare config.yml, using example/config.yml as a template.

You can use ERB templating in the config file. You may load environment variables from a .env file with the --dotenv option.
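
As a rough illustration of how an ERB-templated YAML config can pick up environment variables, consider the sketch below. The key layout, the variable names TRIGLAV_URL and MONITOR_INTERVAL, the default URL, and the load_config helper are all assumptions made for the example; see example/config.yml for the real schema.

```ruby
require 'erb'
require 'yaml'

# Hypothetical config text; keys and defaults are placeholders.
CONFIG_TEMPLATE = <<~YAML
  triglav:
    url: <%= ENV.fetch('TRIGLAV_URL', 'http://localhost:7800') %>
  hdfs:
    monitor_interval: <%= ENV.fetch('MONITOR_INTERVAL', '60') %>
YAML

# Render the ERB tags first, then parse the result as YAML.
def load_config(template)
  rendered = ERB.new(template).result
  YAML.safe_load(rendered)
end

config = load_config(CONFIG_TEMPLATE)
puts config['hdfs']['monitor_interval']
```

Because ERB runs before YAML parsing, values taken from the environment are typed by YAML afterwards, so a numeric string becomes an integer.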

serverengine section

You can specify any serverengine options in this section.

triglav section

Specify the triglav API URL and the credentials used to authenticate.

The obtained access token is stored in the token storage file (--token option).

hdfs section

This section is specific to triglav-agent-hdfs.

  • monitor_interval: The interval, in seconds, at which HDFS paths are checked (number, default: 60)
  • connection_info: Key-value pairs of HDFS connection info, where each key is a regular expression matching resource URIs and each value is the connection information to use for them
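
The lookup implied by connection_info might look like the following sketch: the first key whose regular expression matches the resource URI wins. The constant CONNECTION_INFO, the helper connection_info_for, and the option name config_files with its paths are hypothetical placeholders, not the gem's actual internals.

```ruby
# Hypothetical connection_info: keys are regular expression sources,
# values are placeholder connection settings.
CONNECTION_INFO = {
  '^hdfs://prod-ns/'  => { 'config_files' => ['/etc/hadoop/prod/core-site.xml'] },
  '^hdfs://stage-ns/' => { 'config_files' => ['/etc/hadoop/stage/core-site.xml'] }
}.freeze

# Return the settings of the first pattern matching the URI, or nil.
def connection_info_for(resource_uri, connection_info = CONNECTION_INFO)
  _pattern, info = connection_info.find do |pattern, _info|
    resource_uri =~ Regexp.new(pattern)
  end
  info
end

info = connection_info_for('hdfs://prod-ns/logs/2017-01-15')
puts info['config_files'].first
```

Anchoring each pattern with ^ avoids accidental matches in the middle of a URI.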

Specification of Resource URI

A resource URI must be of the form:

hdfs://{namespace}/{path}

Path accepts strftime format such as %Y-%m-%d.
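
To show what the strftime expansion yields, here is a minimal Ruby sketch. The helper expand_resource_uri and the namespace ns are made up for the example, and which time the agent actually substitutes is not stated here.

```ruby
# Expand strftime directives in a resource URI for a given time.
# Hypothetical helper; a literal percent sign must be written as %%.
def expand_resource_uri(uri_template, time)
  time.strftime(uri_template)
end

puts expand_resource_uri('hdfs://ns/logs/%Y-%m-%d', Time.utc(2017, 1, 15))
# => hdfs://ns/logs/2017-01-15
```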

How it behaves

  1. Authenticate with triglav
    • Store the access token in the token storage file
    • Read the token from the token storage file on subsequent runs
    • Refresh the access token if it has expired
  2. Repeat the following every monitor_interval seconds:
    • Obtain from triglav the list of resources matching the specified prefixes (the keys of connection_info).
    • Connect to HDFS with the connection info appropriate for each resource URI, and find paths which are newer than the last check.
    • Store the check information in the status storage file for the next check.
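
The monitoring pass described above can be sketched roughly as follows. This is not the agent's implementation: check_once is a hypothetical function, list_paths stands in for the HDFS query and is assumed to return a hash of path to modification time in milliseconds, and the status file layout (one timestamp per resource URI) is an assumption.

```ruby
require 'yaml'

# One monitoring pass: report paths modified since the last check and
# persist the newest modification time seen per resource URI.
def check_once(resource_uris, status_file, list_paths)
  status = File.exist?(status_file) ? (YAML.load_file(status_file) || {}) : {}
  events = []
  resource_uris.each do |uri|
    last = status.fetch(uri, 0)
    # Keep only the paths modified after the previously recorded time.
    newer = list_paths.call(uri).select { |_path, mtime| mtime > last }
    next if newer.empty?
    newer.each_key { |path| events << { 'resource_uri' => uri, 'path' => path } }
    status[uri] = newer.values.max
  end
  # Save the state so the next pass only reports newer paths.
  File.write(status_file, YAML.dump(status))
  events
end
```

A caller would run check_once in a loop and sleep for monitor_interval seconds between passes, sending the returned events to triglav.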

Development

Prepare

bundle
bundle exec rake vendor_jars
./prepare.sh

Edit the .env file or the config.yml file directly.

Start

Start up triglav api on localhost.

Run triglav-agent-hdfs as:

TRIGLAV_ENV=development bundle exec triglav-agent-hdfs --dotenv --debug -c example/config.yml

In debug mode (the --debug option), the agent ignores the last_modification_time value in the status file.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/triglav-agent-hdfs/triglav-agent-hdfs. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.

ToDo

  • prepare mocks of both triglav and hdfs for tests