# Azure Data Lake Store Output Logstash Plugin
This is an Azure Data Lake Store output plugin for Logstash.

This plugin uses the official Microsoft Data Lake Store Java SDK with its custom AzureDataLakeFilesystem (ADL) protocol, which Microsoft claims is more efficient than WebHDFS.

It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want.
## Documentation

Configuration example:
```
input {
    ...
}
filter {
    ...
}
output {
    adls {
        adls_fqdn => "XXXXXXXXXXX.azuredatalakestore.net"                                  # (required)
        adls_token_endpoint => "https://login.microsoftonline.com/XXXXXXXXXX/oauth2/token" # (required)
        adls_client_id => "00000000-0000-0000-0000-000000000000"                           # (required)
        adls_client_key => "XXXXXXXXXXXXXXXXXXXXXX"                                        # (required)
        path => "/logstash/%{+YYYY}/%{+MM}/%{+dd}/logstash-%{+HH}-%{[@metadata][cid]}.log" # (required)
        test_path => "testfile"                      # (optional, default: "testfile")
        line_separator => "\n"                       # (optional, default: "\n")
        created_files_permission => 755              # (optional, default: 755)
        adls_token_expire_security_margin => 300     # (optional, default: 300)
        single_file_per_thread => true               # (optional, default: true)
        retry_interval => 0.5                        # (optional, default: 0.5)
        max_retry_interval => 10                     # (optional, default: 10)
        retry_times => 3                             # (optional, default: 3)
        exit_if_retries_exceeded => false            # (optional, default: false)
        codec => "json"                              # (optional, default: default codec defined by Logstash)
    }
}
```
Configuration fields:

| Setting | Required | Default | Description |
|---|---|---|---|
| `adls_fqdn` | yes | | Azure DLS FQDN |
| `adls_token_endpoint` | yes | | Azure OAuth endpoint |
| `adls_client_id` | yes | | Azure DLS client ID |
| `adls_client_key` | yes | | Azure DLS client key |
| `path` | yes | | The path to the file to write to. Event fields can be used here, as well as date fields in the Joda time format, e.g.: `/logstash/%{+YYYY-MM-dd}/logstash-%{+HH}-%{[@metadata][cid]}.log` |
| `test_path` | no | testfile | Path to test access on ADLS. This is used upon startup to ensure you have write access. |
| `line_separator` | no | `\n` | Line separator for written events |
| `created_files_permission` | no | 755 | File permission for created files |
| `adls_token_expire_security_margin` | no | 300 | The security margin (in seconds) subtracted from the token's expiry value to calculate when the token should be renewed (i.e. if the OAuth token expires in 1 hour, it will be renewed after "1 hour - adls_token_expire_security_margin") |
| `single_file_per_thread` | no | true | Avoid appending to the same file from multiple threads. This solves some problems with multiple Logstash output threads and locked file leases in ADLS. If this option is set to true, `%{[@metadata][cid]}` needs to be used in the `path` setting. `%{[@metadata][cid]}` (cid -> ConcurrentID) is generated from a random value computed when the Logstash instance starts plus a per-thread id. This setting is used to deal with ADLS 0x83090a16 errors (see Configuration notes) |
| `retry_interval` | no | 1 | How long (in seconds) to wait between retries in case of an error. This value is a coefficient, not an absolute value: the wait time is `retry_interval * tries_counter`, so if `retry_interval` is 1, the first retry waits 1 second, the second 2 seconds, and so on |
| `max_retry_interval` | no | 10 | Max retry interval. The actual wait time (in seconds) is `min(retry_interval * tries_counter, max_retry_interval)` |
| `retry_times` | no | 3 | How many times to retry. If `retry_times` is exceeded, an error is logged and the event is discarded. (Set to -1 for unlimited retries) |
| `exit_if_retries_exceeded` | no | false | If enabled, Logstash will exit if retries are exceeded, to avoid losing events |
| `codec` | no | Logstash default codec | The codec used to serialize the event (e.g. csv, json, line). If you do not define one, Logstash will use its default. Please refer to the Logstash documentation |
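The retry behaviour documented above can be sketched as follows. This is an illustration of the documented formula only, not the plugin's actual implementation; the function name is ours:

```ruby
# Sketch of the documented backoff: the wait time grows linearly with the
# retry counter and is capped by max_retry_interval.
def backoff_wait(retry_interval, max_retry_interval, tries_counter)
  [retry_interval * tries_counter, max_retry_interval].min
end

# With retry_interval = 1 and max_retry_interval = 10, the first retry
# waits 1 second, the second 2 seconds, and so on, capped at 10 seconds.
```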
Concurrency and batching:

This plugin relies only on Logstash's own concurrency/batching facilities and can be configured through Logstash's `pipeline.workers` and `pipeline.batch.size` settings. Also, the concurrency mode of this plugin is set to shared to maximize concurrency.
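For example, those two pipeline settings could be tuned in `logstash.yml` (the values below are illustrative, not recommendations):

```yaml
# logstash.yml -- illustrative values; tune to your workload
pipeline.workers: 4
pipeline.batch.size: 125
```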
Configuration notes:

- If single_file_per_thread is enabled (and it is by default) and you're using more than one worker thread, you'll need to add `%{[@metadata][cid]}` to your file path. This concatenates a ConcurrentID value to your path to avoid remote concurrency problems in ADLS, since ADLS apparently locks the file for writing.
- If you still see errors in your log like "APPEND failed with error 0x83090a16 (Internal server error.)", you should lower your concurrency and/or batching settings to avoid them. We try to mitigate the problem with a backoff-and-retry strategy (the retry_interval and retry_times settings) but, as far as we know, there's nothing this plugin can do to avoid the problem entirely; it's an ADLS issue. Arguably, ADLS could queue write requests internally instead of writing them directly to the filesystem, avoiding the file write locks.
- However, unless you have very high concurrency and/or a large batch size, these errors shouldn't be a problem.
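The per-thread ConcurrentID described above can be pictured roughly like this. This is an illustrative sketch only (the constant, helper name, and format are ours; the plugin's real generation may differ), showing how a random per-instance value plus a per-thread id yields distinct file paths:

```ruby
require 'securerandom'

# Illustrative: a random value computed once when the Logstash instance starts.
INSTANCE_ID = SecureRandom.hex(4)

# Illustrative helper: appends a per-thread id to the instance value,
# so each output thread interpolates a distinct %{[@metadata][cid]}
# and therefore appends to its own file, avoiding ADLS lease conflicts.
def concurrent_id(thread_index)
  "#{INSTANCE_ID}-#{thread_index}"
end
```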
## Developing

### 1. Plugin Development and Testing

To install, build, and test, just run the following command from the project's folder. If you want to deep-dive, read the next sections.

```sh
ci/build.sh
```
#### Build using Docker

- If you want to use Docker to build, you can use the Windows or Linux variants.
- Execute the following commands in your project folder.
- The main advantage of this method is that the container preserves its state until it is removed, making subsequent `bundle install` commands much faster.
Windows (PowerShell):

```powershell
# Start the container (it will stop once you exit it)
"docker run -it -v $($pwd -replace "\\", "/")/:/project/ -w /project/ --name jruby jruby:latest bash" | Invoke-Expression
# Start it again (it should stay up now)
docker start jruby
# Enter the container shell and execute the usual build commands
docker exec -it jruby bash
```
Linux:

```sh
# Start the container (it will stop once you exit it)
docker run -it -v $pwd:/project -w /project --name jruby jruby:latest bash
# Start it again (it should stay up now)
docker start jruby
# Enter the container shell and execute the usual build commands
docker exec -it jruby bash
```
#### Build

- To get started, you'll need JRuby with the Bundler and Rake gems installed.
- Install dependencies:

```sh
bundle install
```

- Install Java dependencies from Maven:

```sh
rake vendor
```

- Build your plugin gem:

```sh
gem build logstash-output-adls.gemspec
```
#### Test

- Update your dependencies:

```sh
bundle install
```

- Run unit tests:

```sh
bundle exec rspec
```

- To run the integration tests, you'll need Docker available in your test environment. The tests depend on a specific Kafka image on Docker Hub called spotify/kafka; you will need internet connectivity to pull in this image if it does not already exist locally.

```sh
bundle exec rspec --tag integration
```
### 2. Running your unpublished Plugin in Logstash
- Edit the Logstash `Gemfile` and add the local plugin path, for example:

```ruby
gem "logstash-output-adls", :path => "/your/local/logstash-output-adls"
```

- Install the plugin:

```sh
# Logstash 2.3 and higher
bin/logstash-plugin install --no-verify

# Prior to Logstash 2.3
bin/plugin install --no-verify
```

- Run Logstash with your plugin:

```sh
bin/logstash -e 'input { stdin { } } output { adls { } }'
```

At this point any modifications to the plugin code will be applied to this local Logstash setup. After modifying the plugin, simply rerun Logstash.
#### Run in an installed Logstash

You can use the same method described in 2.1 to run your plugin in an installed Logstash by editing its `Gemfile` and pointing the `:path` to your local plugin development directory, or you can build the gem and install it using:

- Build your plugin gem:

```sh
gem build logstash-output-adls.gemspec
```

- Install the plugin from the Logstash home:

```sh
# Logstash 2.3 and higher
bin/logstash-plugin install /your/local/plugin/logstash-output-adls.gem
```

- Start Logstash and proceed to test the plugin.