Auto-detecting CSV parser
A Ruby gem to read CSVs with auto-detection of encoding and column separator. Just let people provide a CSV and don't require them to think about format details if it can be figured out automatically (while providing a way to set it in case auto-detection fails).
Character set detection is done by either rchardet, uchardet or charlock_holmes.
Installation
Run
gem install rchardet
gem install acsv
or, when using Ruby on Rails, put this in your Gemfile
gem 'rchardet'
gem 'acsv'
and run bundle install
.
Usage
You can use this exactly as the regular CSV
module. Just make sure to load a character-detection library before you require 'acsv'
. Then
use ACSV::CSV
wherever you would have used CSV
.
For example:
require 'rchardet'
require 'acsv'
ACSV::CSV.foreach("spec/files/test_02_semicolon_utf16.csv", headers: true) do |row|
puts row[1] # => '1234'
end
When running this with Ruby's standard CSV, you'll see the error "invalid byte sequence in UTF-8".
Other methods like read
and open
are also supported. When passing strings,
e.g. with new
or parse
, only the separator is auto-detected.
Options
Instead of rchardet, use can also use
uchardet or
charlock_holmes.
Just load them before loading acsv. When multiple are loaded, the first one that
returns an encoding above the confidence level (see below) is used. You can also
specify which method to use by passing the method
option to one of the
ACSV::CSV
methods. Possible values are uchardet
, rchardet
or charlock_holmes
.
Available methods are also available from ACSV::Detect.encoding_methods
.
Character encoding detection also returns a confidence level (between 0 and 1).
By default, each method has its own confidence level which matches its performance,
but you can override it by passing the confidence
option.
Lower-level
This gem also provides some lower-level methods for encoding and separator detection:
require 'rchardet'
require 'acsv'
data = File.read("spec/files/test_02_semicolon_iso8859.csv")
encoding = ACSV::Detect.encoding(data)
puts encoding # => 'ISO-8859-1'
data.force_encoding(encoding)
separator = ACSV::Detect.separator(data)
puts separator # => ';'
Please see the documentation for ACSV::Detect
for more information.
Copyright
Copyright © 2014 wvengen, released under GPLv3+ (see LICENSE.md for details).