Estratto
Estratto is an easy-to-handle parser based on a YAML templating engine. It gives developers and non-developers a simple interface for extracting data from fixed-width files.
Motivation
In many scenarios, data processing is a crucial step in integrating with partner systems or storing data. But writing code to parse and import data from these text formats is boring and leads to duplicated code across projects. This project was born to help developers reduce the time spent on this task, or even to delegate it entirely to another team.
Installation
Add this line to your application's Gemfile:
gem 'estratto'
And then execute:
$ bundle
Or install it yourself as:
$ gem install estratto
Usage
Estratto takes two inputs: the data file to parse and a matching YAML layout.
Example of a default call for parsing:
Estratto::Document.process(file: 'path/to/data.txt', layout: 'path/to/layout.yml')
Layout specifications
Fixed-width files are often painful for humans to read, and the layout manual usually comes in a "very useful" PDF or spreadsheet format.
Here, we'll try to make things fun again, or at least less painful. 😂
The base layout for YAML file is:
layout:
  name: 'jojo stand users'
  multi-register: true
  prefix: 0..1
  registers:
    - register: '01'
      fields:
        - name: name
          range: 2..45
          type: String
        - name: stand
          range: 46..75
          type: String
The output will be an array of hashes that mirrors your columns:
[
  {
    name: 'Jotaro Kujo',
    stand: 'Star Platinum'
  },
  {
    name: 'Giorno Giovanna',
    stand: 'Golden Experience Requiem'
  },
  {
    name: 'Jobin Higashikata',
    stand: 'Speed King'
  }
]
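To make the mapping from layout to output concrete, here is a minimal sketch in plain Ruby. It is not Estratto's implementation: the helpers `to_range` and `parse_line` are hypothetical, the nesting of `registers` under `layout` is an assumption, and values are stripped only to keep the result readable.

```ruby
require 'yaml'

# Layout taken from the example above (nesting is an assumption of this sketch).
layout = YAML.safe_load(<<~LAYOUT_YAML).fetch('layout')
  layout:
    name: 'jojo stand users'
    multi-register: true
    prefix: 0..1
    registers:
      - register: '01'
        fields:
          - name: name
            range: 2..45
            type: String
          - name: stand
            range: 46..75
            type: String
LAYOUT_YAML

# Hypothetical helper: turns the text "2..45" into the Ruby Range 2..45.
def to_range(text)
  first, last = text.to_s.split('..').map(&:to_i)
  (first..last)
end

# Hypothetical helper: picks the register by its prefix, then slices each field.
def parse_line(line, layout)
  prefix = line[to_range(layout['prefix'])]
  register = layout['registers'].find { |r| r['register'] == prefix }
  return nil if register.nil?

  register['fields'].each_with_object({}) do |field, row|
    # Stripped here only for readability; see the strip format for Strings.
    row[field['name'].to_sym] = line[to_range(field['range'])].to_s.strip
  end
end

# One fixed-width line: 2-char prefix, 44-char name, 30-char stand.
line = '01' + 'Jotaro Kujo'.ljust(44) + 'Star Platinum'.ljust(30)
row = parse_line(line, layout)
p row
```

Lines whose prefix matches no register return nil in this sketch; how Estratto itself treats such lines is not covered here.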
The structure follows this strict pattern:
layout:
  (base configuration)
  registers:
    (layouts)
Currently, Estratto supports these types of fixed-width layouts:
- Batch prefix based registers
- Mono layout based registers (in development)
UTF-8 Conversion
Estratto uses the CharlockHolmes gem to detect the file content encoding and convert it to UTF-8. This approach prevents invalid characters from appearing in the output.
CharlockHolmes uses ICU for charset detection, so you need libicu in your environment.
Linux
RedHat, CentOS, Fedora:
yum install libicu-devel
Debian based:
apt-get install libicu-dev
Homebrew
brew install icu4c
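As an illustration of the conversion step only, the sketch below uses Ruby's standard `String#encode` with a hard-coded source encoding. This is a swapped-in stand-in: Estratto relies on CharlockHolmes to detect the encoding, while here ISO-8859-1 is simply assumed.

```ruby
# Bytes for "Sao Paulo" with a-tilde, as they would appear in an ISO-8859-1 file
# (0xE3 is "a with tilde" in that encoding).
raw = "S\xE3o Paulo".b

# Assumption: the source encoding is known. Estratto detects it instead.
latin1 = raw.force_encoding(Encoding::ISO_8859_1)
utf8 = latin1.encode(Encoding::UTF_8)

puts utf8.encoding        # the string is now tagged as UTF-8
puts utf8.valid_encoding? # and its bytes form valid UTF-8
```

Reading the same bytes as UTF-8 without conversion would leave an invalid byte sequence in the string, which is the kind of problem the detection step is meant to avoid.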
Type Coercion
Estratto supports type coercion in the layout file, with some extras called formats.
Data types supported by Estratto:
- String
- Integer
- Float
- DateTime
- Date
The default data type for a field is String, used when no type is set for it in the register's field list.
A register's field list always follows this base structure:
fields:
  - name: name
    range: 2..12
    type: String
    formats:
      strip: true
- name identifies the field; this value becomes the symbol key in the parsed hash.
- range is where the data sits in the line (the first index is 0).
- type is the data type the value is coerced to.
- formats receives type-specific configuration. Here we can format Strings and adjust the precision of unformatted Float data.
Formats
Formats are the tool for dealing with the "surprises" this type of file can bring: very long string fields padded with blank space, DateTime values with odd formatting, or Float values without a decimal point even though the manual describes them as "Decimal(15, 2)".
String
strip
Works like the standard Ruby String#strip method.
strip: true
Output example:
#raw_data
'Hierophant Green '
# with strip clause
'Hierophant Green'
Integer
A simple converter for integer values. Useful when you need to deal with ids.
Currently there are no formats for Integer. :)
#raw_data
'000123'
# coerced
123
#raw_data
'123'
# coerced
123
#raw_data
'a'
# coerced
0
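The three results above match what Ruby's lenient `String#to_i` returns. The sketch below uses a hypothetical `coerce_integer` helper to reproduce them; it is an assumption that Estratto behaves the same way internally.

```ruby
# Hypothetical helper mirroring the Integer examples above.
# String#to_i ignores leading zeros and returns 0 for non-numeric text.
def coerce_integer(raw)
  raw.to_i
end

results = ['000123', '123', 'a'].map { |raw| coerce_integer(raw) }
p results
```

The stricter `Kernel#Integer` would raise an ArgumentError for 'a' instead of returning 0, so the lenient conversion is the one that matches the examples.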
Float
Float is one of the most important types here. Fixed-width files often deliver numeric values in a format that is far from obvious.
precision
precision: <integer>
Examples:
precision: 2
#raw data
'12345'
# with precision
123.45
precision: 3
#raw data
'12345'
# with precision
12.345
comma_format
comma_format: <boolean>
Examples:
comma_format: true
#raw data
'123,45'
# with comma formats
123.45
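Both Float formats can be sketched in a few lines of plain Ruby. The helper `coerce_float` and its keyword arguments are hypothetical names chosen to mirror the layout keys; this is an illustration under those assumptions, not Estratto's code.

```ruby
# Hypothetical helper mirroring the precision and comma_format examples.
def coerce_float(raw, precision: nil, comma_format: false)
  # comma_format: treat the comma as the decimal separator.
  text = comma_format ? raw.tr(',', '.') : raw
  return text.to_f if precision.nil?

  # precision: the last N digits of the raw integer are the decimal part.
  text.to_i.fdiv(10**precision)
end

two_places   = coerce_float('12345', precision: 2)
three_places = coerce_float('12345', precision: 3)
with_comma   = coerce_float('123,45', comma_format: true)
p [two_places, three_places, with_comma]
```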
DateTime and Date
DateTime and Date share the same format attributes. The difference is that one returns a DateTime value, while the other always returns a Date.
format
format: <ruby strptime format pattern>
Examples
format: '%Y%m%d'
#raw data
'20180101'
# with format
#<DateTime: 2018-01-01T00:00:00+00:00 ...>
format: '%d/%m/%Y'
#raw data
'01/01/2018'
# with format
#<DateTime: 2018-01-01T00:00:00+00:00 ...>
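Since the format value is a Ruby strptime pattern, the examples above can be reproduced with the standard `date` library. The sketch shows the same raw values parsed as both DateTime and Date.

```ruby
require 'date'

# Same raw values and patterns as in the examples above.
compact = DateTime.strptime('20180101', '%Y%m%d')
slashed = DateTime.strptime('01/01/2018', '%d/%m/%Y')

# With the Date type, the same pattern yields a Date without a time part.
date_only = Date.strptime('01/01/2018', '%d/%m/%Y')

puts compact.iso8601
puts slashed.iso8601
puts date_only.iso8601
```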
General Formats Properties
Sometimes we need to deal with general patterns in third-party files, such as missing information or unexpected export formats.
Allow Empty
The allow_empty property was designed to deal with unexpected data exported by third parties. For example, a DateTime field with the %Y%m%d format may arrive on some lines as blank spaces or as 00000000.
When allow_empty is set on a field, such values are returned as nil.
Tip: allow_empty can be omitted when you do not need this safeguard.
Example
fields:
  - name: birthdate
    range: 2..10
    type: DateTime
    formats:
      allow_empty: true
      format: '%d/%m/%Y'
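The behaviour described for allow_empty can be sketched as a guard placed before the date parsing. The helpers `empty_value?` and `coerce_datetime` are hypothetical, and treating "only blanks" and "only zeros" as empty is this sketch's reading of the description above, not a statement about Estratto's internals.

```ruby
require 'date'

# Hypothetical check: blank-only or zero-only raw values count as empty.
def empty_value?(raw)
  raw.strip.empty? || raw.match?(/\A0+\z/)
end

# Hypothetical helper: returns nil for empty values when allow_empty is set.
def coerce_datetime(raw, format:, allow_empty: false)
  return nil if allow_empty && empty_value?(raw)

  DateTime.strptime(raw, format)
end

zeros  = coerce_datetime('00000000', format: '%Y%m%d', allow_empty: true)
blanks = coerce_datetime('        ', format: '%Y%m%d', allow_empty: true)
valid  = coerce_datetime('20180101', format: '%Y%m%d', allow_empty: true)
p [zeros, blanks, valid]
```

Without the guard, these placeholder values would be handed to strptime, which cannot turn them into a valid date.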
Tests
Simply run rake spec.
Development
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/Rynaro/estratto.
License
The gem is available as open source under the terms of the MIT License.