KMDB
The km-db
gem should be useful to KissMetrics (KM) users.
Its aim is to efficiently process data obtained with KM's "Data Export" feature.
Its main feature is to import dumps directly from S3 into a SQL database, optimized for typical queries (in particular, partitioned along the time dimension).
Once imported, you can run complex queries against your visit history, for instance run multivariate analysis.
Beware though, KM data can be huge, and processing it is taxing!
Installing
If you want to run "just" KM-DB, you might want to just use the app.
Otherwise, add this to your Gemfile if you're using Bundler:
gem 'km-db'
Configuration
KMDB is configured through environment variables. We recommend storing this
settings in a .env
file if running locally, and using foreman to start
KMDB commands with the environment set.
Preparing your database
KMDB requires a MySQL database (to store events, properties, etc) and a Redis store running (to store batch jobs and cache data).
Set the following:
-
DATABASE_URL
(required), e.g.mysql2://km_db_test@localhost/km_db_test
-
KMDB_REDIS_URL
[localhost], e.g.redis://localhost/14
Then run:
$ kmdb-flush
to prepare your database.
Optimizing your database
If your dataset is large (over 1 million events), KMDB can partition your database, i.e. transparently split large tables into smaller buckets of continuous time periods.
Set the following:
-
KMDB_MIN_DATE
(required), e.g. '2014-01-01' -
KMDB_MAX_DATE
(required), e.g. '2016-01-01' -
KMDB_DAYS_PER_PARTITION
(required), e.g. '7'
Then run:
$ kmdb-partition
Notes:
- MySQL only supports up to 1024 partitions.
- You shoud aim for less than 1 million events per partitions for performance.
- You should run this before importing data, but it's possible to re-run it.
The
MIN_DATE
will be ignored, and partitions will be added up to the newMAX_DATE
(if larger).
Importing data
KMDB will fetch JSON files form the S3 bucket where you instructed KissMetrics to back up your data, parse them, and store information in the database.
It does so using resque for high parallelism of the import process; in our experience, it's perfectly possible to import 100GB of data in a few hours.
Set the following:
-
RESQUE_WORKERS
(1), number of worker nodes. -
KMDB_MIN_REVISION
(optional, default 1), first KissMetrics revision file you want to import. -
KMDB_REVISION_LOOKAHEAD
(10), how many revision files to check after the last known one -
KMDB_BATCH_SIZE
(100), how many events to process per batch (advisory, may be higher as an entire second's worth of events will always be processed in one batch to preserve ordering). -
AWS_BUCKET
(required), the name of the S3 bucket where the data is stored. -
AWS_ACCESS_KEY_ID
(required). -
AWS_SECRET_ACCESS_KEY
(required).
Ignoring some users
You may want to ignore all events and properties for certain users, for instance the administrative users of your site (or employees).
Simply add their identities to the ignored_users
table before import.
Whitelisting events
It's typical to have some noisy and/or shorter-lived events sent to KissMetrics, e.g. for testing purposes or for temporary monitoring.
Should you only want to import certain events, add their names to the
whitelisted_events
table before starting import.
If the table is left empty, all events will be imported.
Dealiasing users
When KissMetrics finds a way to tie two user identities as being a single actual user, it stores an "aliasing" event. KMDB de-aliases users automatically during import, and will store all events and properties against a single user identity (one that's numeric if any, otherwise the lexicographically lowest).
Using imported data
Using SQL directly
KMDB tries to stay close to the KissMetrics data, leaving you to interpret it.
As such, the main tables are unsurprisingly events
and properties
.
Here's a summary of the data model:
events
has one row for each imported event:
events |
---|
id |
t |
n |
user_id |
properties
has one row for each property ever set on events or users
properties |
---|
id |
t |
key |
value |
user_id |
event_id |
events.n
and properties.key
reference the id
column of the keys
table;
this is done for performance reasons (event and property names are only stored
once):
keys |
---|
id |
string |
KMDB also keeps the original user identities around in users
, although you'll
probably never need them:
users |
---|
id |
name |
as well as all aliasing events:
aliases |
---|
id |
name1 |
name2 |
Using ActiveRecord
The KMDB
module exposes four ActiveRecord
classes:
Event
, Property
, User
are the main domain objects.
Key
is used to intern strings (event and property names) for performance.
Please consult the source of these models for details.