ToARFF
Table of Contents
- About
- What is an ARFF File
- Installation
- Usage
- Convert from an SQLite Database
- Contributing
- License
About
ToARFF is a Ruby library that converts SQLite database files to ARFF (Attribute-Relation File Format) files, the format used to specify datasets for WEKA, a machine learning and data mining tool.
What is an ARFF File
This wiki describes it well:
"An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software."
Note: Converting from an SQLite database generates one ARFF file per table. See this Stack Overflow post.
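To make the format concrete, here is a minimal sketch that assembles an ARFF document by hand from a relation name, a set of attributes, and some rows. The `build_arff` helper is hypothetical and is not part of to-arff. It only illustrates the three parts of the format (`@RELATION`, `@ATTRIBUTE`, `@DATA`) and does not escape quotes embedded in values.

```ruby
# Hypothetical helper (not part of to-arff): build a small ARFF string by hand.
def build_arff(relation, attributes, rows)
  lines = ["@RELATION #{relation}", ""]
  # One @ATTRIBUTE line per column: name followed by its ARFF type.
  attributes.each { |name, type| lines << "@ATTRIBUTE #{name} #{type}" }
  lines << "" << "@DATA"
  rows.each do |row|
    # Quote string values, leave numbers bare (no escaping of embedded quotes).
    lines << row.map { |v| v.is_a?(String) ? "\"#{v}\"" : v.to_s }.join(",")
  end
  lines.join("\n")
end

arff = build_arff(
  "albums",
  { "AlbumId" => "NUMERIC", "Title" => "STRING" },
  [[1, "Balls to the Wall"], [2, "Restless and Wild"]]
)
puts arff
```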
Installation
Add this line to your application's Gemfile:
gem 'to-arff'
And then execute:
$ bundle
Or install it yourself as:
$ gem install to-arff
Usage
Convert from an SQLite Database
By Specifying Column Types (Recommended)
Use the convert() method and specify the column/attribute types as JSON (or as a nested hash).
require 'to-arff'
# Get the db file from https://github.com/dhrubomoy/to-arff/blob/master/spec/sample_db_files/sample2.db
sample = ToARFF::SQLiteDB.new "/path/to/sample2.db"
# Attribute names and types must be valid and specified as either json or nested hash
# eg. { "table1": {"column11": "NUMERIC",
# "column12": "STRING"
# },
# "table2": {"column21": "class {Iris-setosa,Iris-versicolor,Iris-virginica}",
# "column22": "DATE \"yyyy-MM-dd HH:mm:ss\""
# }
# }
# OR { "table1" => {"column11"=>"NUMERIC",
# "column12"=>"STRING"
# },
# "table2" => {"column21"=>"class {Iris-setosa,Iris-versicolor,Iris-virginica}",
# "column22"=>"DATE \"yyyy-MM-dd HH:mm:ss\""
# }
# }
sample_column_types_param_json = {
"employees": {
"EmployeeId": "NUMERIC",
"LastName": "STRING",
"City": "STRING",
"HireDate": "DATE \"yyyy-MM-dd HH:mm:ss\""
},
"albums": {
"Albumid": "NUMERIC",
"Title": "STRING"
}
}
sample_column_types_param_hash = { "employees" => {"EmployeeId"=>"NUMERIC",
"LastName"=>"STRING",
"City"=>"STRING",
"HireDate"=>"DATE \"yyyy-MM-dd HH:mm:ss\""
},
"albums" => { "Albumid"=>"NUMERIC",
"Title"=>"STRING"
}
}
puts sample.convert column_types: sample_column_types_param_json
#OR
puts sample.convert column_types: sample_column_types_param_hash
Both calls produce a string similar to the following:
@RELATION employees
@ATTRIBUTE EmployeeId NUMERIC
@ATTRIBUTE LastName STRING
@ATTRIBUTE City STRING
@ATTRIBUTE HireDate DATE "yyyy-MM-dd HH:mm:ss"
@DATA
1,"Adams","Edmonton","2002-08-14 00:00:00"
2,"Edwards","Calgary","2002-05-01 00:00:00"
3,"Peacock","Calgary","2002-04-01 00:00:00"
...and so on...
@RELATION albums
@ATTRIBUTE Albumid NUMERIC
@ATTRIBUTE Title STRING
@DATA
1,"For Those About To Rock We Salute You"
2,"Balls to the Wall"
3,"Restless and Wild"
...and so on...
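If the column types live in an actual JSON document (for example a config file), they can be parsed into the string-keyed nested hash form shown above. The sketch below uses only Ruby's standard `json` library. The JSON contents are illustrative, and passing the parsed hash to `convert` is left as a comment because it relies on the string-keyed hash form described in this README rather than on anything verified here.

```ruby
require 'json'

# JSON text as it might be stored in a file (contents are illustrative).
column_types_json = <<~'JSON'
  {
    "employees": {
      "EmployeeId": "NUMERIC",
      "HireDate": "DATE \"yyyy-MM-dd HH:mm:ss\""
    },
    "albums": { "Albumid": "NUMERIC", "Title": "STRING" }
  }
JSON

# JSON.parse returns string keys, the same shape as the "=>" hash form.
column_types = JSON.parse(column_types_json)
puts column_types["employees"]["HireDate"]

# Assumed usage, following the examples in this README (not run here):
# sample = ToARFF::SQLiteDB.new "/path/to/sample2.db"
# puts sample.convert column_types: column_types
```

Note that a Ruby literal written as `{ "employees": { ... } }` creates symbol keys, while `JSON.parse` creates string keys. The README states that both forms are accepted.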
By Specifying Column Names
require 'to-arff'
sample = ToARFF::SQLiteDB.new "/path/to/sample_sqlite.db"
# Column names must be specified like this:
# { "table1" => ["column11", "column12",...],
# "table2" => ["column21", "column22",...]
# }
# OR
# { "table1": ["column11", "column12",...],
# "table2": ["column21", "column22",...]
# }
sample_columns_json = { "albums": ["AlbumId", "Title", "ArtistId"],
"employees": ["EmployeeId", "LastName", "FirstName", "HireDate"]
}
sample_columns_hash = { "albums" => ["AlbumId", "Title", "ArtistId"],
"employees" => ["EmployeeId", "LastName", "FirstName", "HireDate"]
}
puts sample.convert columns: sample_columns_json
puts sample.convert columns: sample_columns_hash
Both the json and hash forms of the columns: parameter return a string similar to the following:
@RELATION albums
@ATTRIBUTE AlbumId NUMERIC
@ATTRIBUTE Title STRING
@ATTRIBUTE ArtistId NUMERIC
@DATA
1,"For Those About To Rock We Salute You",1
2,"Balls to the Wall",2
...and so on...
@RELATION employees
@ATTRIBUTE EmployeeId NUMERIC
@ATTRIBUTE LastName STRING
@ATTRIBUTE FirstName STRING
@ATTRIBUTE HireDate STRING
@DATA
1,"Adams","Andrew","2002-08-14 00:00:00"
2,"Edwards","Nancy","2002-05-01 00:00:00"
...and so on...
As you can see, the "HireDate" attribute did not get the correct datatype. It should be DATE "yyyy-MM-dd HH:mm:ss", not STRING.
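One way to spot columns like this is to scan the generated ARFF header for attributes typed STRING and then decide which of them need an explicit type via column_types:. The following is a minimal sketch. `string_attributes` is a hypothetical helper, not part of to-arff, and it assumes attribute names contain no spaces.

```ruby
# Hypothetical helper (not part of to-arff):
# return { relation => [names of attributes typed STRING] } for an ARFF string.
def string_attributes(arff)
  result = Hash.new { |h, k| h[k] = [] }
  relation = nil
  arff.each_line do |line|
    case line.strip
    when /\A@RELATION\s+(\S+)/i
      relation = $1
    when /\A@ATTRIBUTE\s+(\S+)\s+STRING\z/i
      result[relation] << $1 if relation
    end
  end
  result
end

# A shortened copy of the kind of output shown above.
arff = <<~ARFF
  @RELATION employees
  @ATTRIBUTE EmployeeId NUMERIC
  @ATTRIBUTE LastName STRING
  @ATTRIBUTE HireDate STRING
  @DATA
  1,"Adams","2002-08-14 00:00:00"
ARFF

p string_attributes(arff)
```

Here both LastName and HireDate are reported. Only HireDate needs an override, so the final decision still requires knowing the data.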
You can also call convert with only table names, or with no arguments, but the generated datatypes may not be correct:
require 'to-arff'
sample = ToARFF::SQLiteDB.new "/path/to/sample_sqlite.db"
sample.convert tables: ["albums","employees"]
# OR
sample.convert
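The examples above print a single string that contains one @RELATION block per table. Assuming the output has that shape, a sketch like the following splits it into one chunk per table so that each can be saved to its own .arff file. `split_relations` is a hypothetical helper, not part of to-arff, and it assumes relation names contain no spaces.

```ruby
# Hypothetical helper (not part of to-arff):
# split a multi-relation ARFF string into { relation_name => arff_text }.
def split_relations(arff)
  # The lookahead keeps each "@RELATION" line attached to the block it starts.
  blocks = arff.split(/^(?=@RELATION\s)/i).reject { |b| b.strip.empty? }
  blocks.each_with_object({}) do |block, out|
    name = block[/\A@RELATION\s+(\S+)/i, 1]
    next unless name # skip any text that precedes the first relation
    out[name] = block.strip + "\n"
  end
end

# Stand-in for the string returned by convert (shortened).
combined = <<~ARFF
  @RELATION employees
  @ATTRIBUTE EmployeeId NUMERIC
  @DATA
  1

  @RELATION albums
  @ATTRIBUTE Albumid NUMERIC
  @DATA
  1
ARFF

parts = split_relations(combined)
parts.each do |name, text|
  # Replace this line with File.write("#{name}.arff", text) to save each table.
  puts "#{name}.arff: #{text.lines.size} lines"
end
```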
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/dhrubomoy/to-arff. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
- Fork it ( https://github.com/dhrubomoy/to-arff/fork )
- Create a branch ( git checkout -b my-new-feature )
- Make changes. Add test cases for your changes
- Run rspec spec/ and make sure all the tests pass
- Commit your changes ( git commit -am 'Add some feature' )
- Push to the branch ( git push origin my-new-feature )
- Create a new Pull Request
License
The gem is available as open source under the terms of the MIT License.