Project

csv2avro

0.01
No commit activity in last 3 years
No release in over 3 years
Convert CSV files to Avro like a boss.
 Dependencies

Development

~> 0.5
< 2.0, >= 1.9
~> 0.10
~> 10.0
~> 3.2

Runtime

~> 1.7
~> 0.2
 Project Readme

CSV2Avro

Convert CSV files to Avro like a boss.

Installation

$ gem install csv2avro

Or, if you prefer to live on the edge, clone this repository and build it from source.

Usage

Basic

$ csv2avro --schema ./spec/support/schema.avsc ./spec/support/data.csv

This processes the data.csv file and creates a data.avro file, plus a data.bad file containing a report of the bad rows.

You can override the bad rows report file location with the --bad-rows [BAD_ROWS] option.
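
As a minimal sketch of that override, the snippet below sends the bad rows report to a custom path. The rejected.bad file name is a placeholder chosen for illustration, and the guard around the call is only there so the snippet is safe to paste when the gem or the sample file is missing.

```shell
# Paths reuse the sample files from the basic example above;
# rejected.bad is a hypothetical report name, not a csv2avro default.
schema=./spec/support/schema.avsc
input=./spec/support/data.csv
bad_rows=./spec/support/rejected.bad

# Run the conversion only when the tool and the input file are present.
if command -v csv2avro >/dev/null 2>&1 && [ -f "$input" ]; then
  csv2avro --schema "$schema" --bad-rows "$bad_rows" "$input"
else
  echo "skipping: csv2avro or $input is not available" >&2
fi
```

Any bad rows should then be reported in rejected.bad instead of the default data.bad.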

Streaming

$ cat ./spec/support/data.csv | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad > ./spec/support/data.avro

This processes the input stream and writes the Avro data to the output stream. When working with streams, you must specify the --bad-rows location.

Advanced features

AWS S3 storage

$ aws s3 cp s3://csv-bucket/transactions.csv - | csv2avro --schema ./transactions.avsc --bad-rows ./transactions.bad | aws s3 cp - s3://avro-bucket/transactions.avro

This streams the file from AWS S3, converts the data, and pushes the result back to S3. For more information, please check the AWS CLI documentation.

Convert compressed files

$ gunzip -c ./spec/support/data.csv.gz | csv2avro --schema ./spec/support/schema.avsc --bad-rows ./spec/support/data.bad > ./spec/support/data.avro

This uncompresses the file and converts it to Avro, leaving the original file intact.

More

For a full list of available options, run csv2avro --help:

$ csv2avro --help
Version 1.3.0 of CSV2Avro
Usage: csv2avro [options] [file]
    -s, --schema SCHEMA              A file containing the Avro schema. This value is required.
    -b, --bad-rows [BAD_ROWS]        The output location of the bad rows report file.
    -d, --delimiter [DELIMITER]      Field delimiter. If none specified, then comma is used as the delimiter.
    -l, --line-ending [LINE_ENDING]  Line ending character used as row separator in CSV parsing
    -a [ARRAY_DELIMITER],            Array field delimiter. If none specified, then comma is used as the delimiter.
        --array-delimiter
    -D, --write-defaults             Write default values.
    -c, --stdout                     Output will go to the standard output stream, leaving files intact.
    -h, --help                       Prints help
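
The options can be combined. The sketch below assumes a semicolon-separated file whose array fields are pipe-delimited; the products.csv and products.avsc names are hypothetical, and only the flags themselves are taken from the help text above.

```shell
# Hypothetical input: products.csv uses ';' between fields and '|' inside
# array fields. These file names are placeholders for illustration.
schema=./products.avsc
input=./products.csv

# Run the conversion only when the tool and the input file are present.
if command -v csv2avro >/dev/null 2>&1 && [ -f "$input" ]; then
  csv2avro --schema "$schema" \
           --delimiter ';' \
           --array-delimiter '|' \
           --write-defaults \
           "$input"
else
  echo "skipping: csv2avro or $input is not available" >&2
fi
```

Quoting the delimiters keeps the shell from interpreting ; and | as its own operators.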

Contributing

  1. Fork it ( https://github.com/sspinc/csv2avro/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request