SpreadsheetImport
SpreadsheetImport gem allows importing data from csv, xls, xls, xlsx and ods file to database. SpreadsheetImport contains 3 part (Reader, Processor, Importer). Each part depends on seprate gems and you can replace any part without effecting other part.
-
Reader: Reads data from spreadsheet file. Following class are used for this functionality.
- SpreadsheetImport::BaseReader: Base class of every Reader class.
- SpreadsheetImport::SimpleReader: Inherits from base reader. Provides
each_row(required_columns)
method. This class depends on simple-spreadsheet gem https://github.com/zenkay/simple-spreadsheet.
-
Processor: Processes the data read by Reader. SpreadsheetImport gem defines one processor
- SpreadsheetImport::BaseProcessor: BaseProcessor contains various hooks to process and validate data provided by Reader.
-
Importer: Responsible for importing data recieved from data processor to database. SpreadsheetImport gem provides following importor.
- SpreadsheetImport::BaseImporter: Base class of all importer
- SpreadsheetImport::ActiveRecordImporter::BaseImporter: Supports import using activerecord. Depends on activerecord gem.
- SpreadsheetImport::ActiveRecordImporter::BulkImporter: Supports import using activerecord-import gem(https://github.com/zdennis/activerecord-import)
Installation
Add this line to your application's Gemfile:
gem 'spreadsheet_import'
gem 'simple-spreadsheet' # if want to use SpreadsheetImport::SimpleReader
gem 'activerecord' # if want to use any impoter in SpreadsheetImport::ActiveRecordImporter
gem 'activerecord-import' # if want to use SpreadsheetImport::ActiveRecordImporter::BulkImporter
And then execute:
$ bundle
Usage
SpreadsheetImport.import(file_url, mapping, model, options)
-
file_url: url of csv file
-
mapping: database column name to spreadsheet column number mapping hash.
Example:
{city: 1, tax_rate: 3}
city and tax_rate are column name in database. 1, 3 are column number in spreadsheet file.
-
model: model class for which spreadsheet needs to be imported
-
options: Hash of options
- reader: Custom reader class object
- data_processor: Custom data processor class object
- importer: Custom importer class object
Details
-
SpreadsheetImport::BaseReader:
Class inheriting from BaseReader should define
each_row(required_columns)
method. Method should accept column position as array and yield value of those columns as array for each spreadsheet row. It is upto you to implement this functionality anyway you want. BaseReader also accepts start_row(row from where reader should start reading) and end_row(row at which reader should end reading) as option while intiantiating reader. So you can also handle start_row and end_row option ineach_row
method. -
SpreadsheetImport::SimpleReader:
If you want to use this reader add simple-spreadshee gem to your project. If you dont want to use simple spreadsheet gem or have any alternative gem to read spreadsheet file then you can create your own reader class by inheriting from BaseReader. You can call all simple-spreadsheet gem method on SimpleReader object.
Example:
reader = SimpleReader('file/abc.xls', start_row: 2, end_row: 30) reader.cell(1, 2) # this is simple-spreadsheet gem method
Note: SpreadsheetImport::SimpleReader includes Enumerable module. So you can use enumerable functions like each, select etc
-
Spreadsheetimport::BaseProcessor:
If you want to process data before importing it to db then create a new processor class which inherits from BaseProcessor. BaseProcessor constructor accepts following parameters
-
reader: Reader class object
-
mapping: database column name to spreadsheet column number mapping hash
-
options: Following options are supported row_processor, row_validator and only_extract_valid_rows.
- row_processor: Accepts a class which would be responsible for processing rows comming from
each_row
method of reader. Row Processor class should defineprocess
method.process
method will called with unprocessed_row and current instance of data processor as arguments. unprocessed_row is in following format
{db_column_name1: value_from_spreadsheet, db_column_name1: value_from_spreadsheet}
process
method should return hash in below format.{db_column_name1: processed_value, db_column_name1: processed_value}
There is alternative to process data other then providing row_processor option. You can inherit from BaseProcessor and add
process_row_before_import
method to your class. This method will recieve unprocessed_row and you should return processed row from it.- row_validator: Accepts a class which is responsible for validating a processed row. Row validator class should define
validate
method. This method is called with processed row returned fromprocess
method of row processor orprocess_row_before_import
method and current instance of data processor as second argument.validate
method should return true or false. There is alternative to validate data other then providing row_validatir option. You can inherit from BaseProcessor and addvalid_row_for_import?
method to your class. This method will recieve processed_row and you should return return true or false. - only_extract_valid_rows: If true
spreadsheet_rows
method will yield on rows that are valid otherwise it will yield each row. Row is valid or not is decided either by row_validator classvalidate
method or byvalid_row_for_import?
method.
- row_processor: Accepts a class which would be responsible for processing rows comming from
Example:
Consider below csv file
city,county,tax rate, Palm Desert,Some county, 0.2, Marine Corps*,Some county 2,0.9,
# Custome data processor class TaxFileProcessor < SpreadsheetImport::BaseProcessor def process_row_before_import(row) city = row[:city] if city.present? city = city.gsub(/[^a-zA-Z0-9\s]/,'').strip # remove special character and remove spaces from start and end end { city: city, tax_rate: row[:tax_rate].to_f * 100 } end def valid_row_for_import?(row) row[:city].present? end end reader = SimpleReader('file/abc.csv, start_row: 2) data_processor = TaxFileProcessor.new(reader, {city: 1, tax_rate: 3}, only_extract_valid_rows: true) data_processor.spreadsheet_rows {|row| puts row} # o/p # { city: 'Palm Desert', tax_rate: 20.0 } # { city: 'Marine Corps', tax_rate: 90.0 }
Note: SpreadsheetImport::BaseProcessor includes Enumerable module. So you can use enumerable functions like each, select etc
-
-
SpreadsheetImport::BaseImporter: It does not implement any functionality. It just specifies some hooks that its subclass should implement. SpreadsheetImport::BaseImporter constructor accepts model and options as argument. It accepts only one option i.e data_processor object. It uses data processor
spreadsheet_rows
method which yields processed_row and valid(true or false) to call either handle_valid_row or handle_invalid_row method. By default handle_valid_row callscreate_or_update_record(row)
method.handle_invalid_row
is blank and you can define it in you subclass if you want to handle_invalid_rows. This may be helpful in case you want to log invalid rows from csv somewhere. -
SpreadsheetImport::ActiveRecordImporter::BaseImporter: Constructor accepts following arguments
- model: ActiveRecord model class to which we want to import data
- options
-
data_processor: Data processor object
-
skip_validations: Skips validation if true. false by default
-
skip_callbacks: Skips callback if true. false by default
-
unique_by_attributes: accepts list of attribute which should be unique. If its not unique then new record won't be inserted. Decision to whether update existing record is made by update_existing_record option. For finding already existing record where query is fired which does case sensitive comparision between attribute from spreedsheet and db. If you want to add some custom implementation then create your own importer which inherits from SpreadsheetImport::ActiveRecordImporter::BaseImporter and define
find_duplicate_for_unique_by_attributes(data)
method. find_duplicate_for_unique_by_attributes will be called with data(processed_row). It should return ActiveRecord::Relation object or model object or array of model object. Currently it has following implementationscoped_model.where(data.slice(*unique_by_attributes))
.Note: Always use
scoped_model
method to call any ActiveRecord method -
update_existing_record: If true will update the existing record. true by default if unique_by_attributes is present.
-
scoped_unique: class method or scope name as symbol. This scope is applied for finding record which are already present in database.
-
It defines
create_or_update_record
method called byhandle_valid_row
. -
SpreadsheetImport::ActiveRecordImporter::BulkImporter: This importer utilizes activerecord-import gem. Its faster than SpreadsheetImport::ActiveRecordImporter::BaseImporter. It is recomended for large spreedsheet files. ActiveRecord callbacks will not be called when BulkImporter is used. It inherits from SpreadsheetImport::ActiveRecordImporter::BaseImporter, so its constructor supports same argument. Passing skip_callbacks as true or false doesn't have any effect and callbacks will always be skipped. This is a penalty that you have to pay for performance :). It also support addition option of batch_size. It is the number of records that would be inserted at once. Default batch size is 100.
Development
After checking out the repo, run bin/setup
to install dependencies. Then, run rake spec
to run the tests. You can also run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
, which will create a git tag for the version, push git commits and tags, and push the .gem
file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/spreadsheet_import. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
License
The gem is available as open source under the terms of the MIT License.