MovieDB
MovieDB is a multi-threaded Ruby wrapper for performing advanced statistical computation and high-level data analysis on movie data from IMDb. Its goal is to help producers, directors, and writers make data-informed business decisions that improve return on investment.
Technology
- SciRuby is used for all statistical and scientific computations.
- Redis is used to store all data.
- IMDb and TMDb are the sources for all film data.
- BoxOfficeMojo is where we will scrape data on upcoming films.
- Celluloid is used to build the fault-tolerant concurrent programs. Note: on MRI (YARV), threads do not run Ruby code in parallel, because the interpreter has a Global Interpreter Lock (GIL). Fortunately, you can use JRuby or Rubinius, since they don't have a GIL and support real parallel threading.
Requirements
ruby-2.2.2 or higher
jruby-9.0.0.0
Installation
Redis Installation
This tutorial doesn't cover Redis installation. You will find that information at: http://redis.io/topics/quickstart
movieDB is available through RubyGems and can be installed via a Gemfile.
gem 'movieDB'
And then execute:
$ bundle install
Or install it yourself as:
gem install movieDB
Console - loading the libraries
$ irb
Require the gem
require 'movieDB'
Initialize MovieDB (multi-thread setup)
m = MovieDB::Movie.pool(size: 2)
Step Process
Fetching and analysing movie data using movieDB is a simple two-step process.
First, fetch the data from IMDb.
Next, run the statistic of your choice.
That's it!
Part 1 - Fetch Data from IMDb
There are 3 ways to find IMDb ids.
- Search IMDb id via API
- Search IMDb id via Website
- Generate random IMDb ids
Search IMDb id via API
You can read the documentation for the IMDb API to see everything you can do with this gem.
i = Imdb::Search.new("Star Trek")
i.movies.size #=> 97
This will return 97 objects related to 'Star Trek'.
To collect all the IMDb ids:
ids = i.movies.collect(&:id).uniq
#=> ["0796366", "0060028", "0079945" ...]
Search IMDb id via Website
To find IMDb id for specific movies, you must go to:
http://www.imdb.com
Search for your movie of choice. IMDb then redirects you to the movie's page.
The URL of that page includes the IMDb id.
http://www.imdb.com/title/tt0369610/
0369610 is the IMDb id.
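If you have many such URLs, the id can be pulled out with a regular expression. The following is a minimal sketch; `imdb_id_from_url` is a hypothetical helper, not part of movieDB, and it assumes the URL follows the `/title/tt<digits>/` pattern shown above.

```ruby
# Hypothetical helper (not part of movieDB): extract the numeric id from an IMDb title URL.
def imdb_id_from_url(url)
  # Capture the digits that follow "/title/tt"; returns nil when the URL has no such segment.
  url[%r{/title/tt(\d+)}, 1]
end

imdb_id_from_url("http://www.imdb.com/title/tt0369610/") # => "0369610"
imdb_id_from_url("http://www.imdb.com/")                 # => nil
```

The returned string keeps its leading zero, which matters because the ids are passed to `fetch` as strings.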
Generate random IMDb ids (multi-thread setup)
You can also fetch IMDb ids at random. This approach will probably run you into some problems; see the Disclaimer.
r = Random.new
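39.times do
  # Zero-pad the random number to the 7-digit IMDb id format
  m.async.fetch(sprintf('%07d', r.rand(300000)))
  # Pause between requests to stay under the rate limit
  sleep(4)
end
# Give the background fetches time to finish
sleep(10)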
Note: IMDb has a rate limit of 40 requests every 10 seconds, and requests are limited by IP address, not API key. If you exceed the limit, you will receive a 429 HTTP status with a 'Retry-After' header. As soon as your cool-down period expires, you are free to continue making requests.
Also, movieDB will raise a NameError if the randomly generated IMDb id is invalid.
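To stay under a limit of 40 requests per 10 seconds without a fixed `sleep`, you could track request times in a sliding window. The sketch below is only an illustration: `WindowLimiter` is a hypothetical class, not part of movieDB, and the caller supplies the current time so the logic can be checked without waiting.

```ruby
# Hypothetical sliding-window limiter (not part of movieDB).
class WindowLimiter
  def initialize(max_requests: 40, window: 10.0)
    @max_requests = max_requests
    @window = window
    @timestamps = []
  end

  # Returns the number of seconds to wait before a request may be sent at time `now`.
  # A return value of 0.0 means "send now", and the request is recorded as sent.
  def wait_time(now)
    # Forget requests that have left the window.
    @timestamps.reject! { |t| now - t >= @window }
    if @timestamps.size < @max_requests
      @timestamps << now
      0.0
    else
      # The oldest request still in the window decides when a slot frees up.
      @timestamps.first + @window - now
    end
  end
end

limiter = WindowLimiter.new(max_requests: 40, window: 10.0)
waits = 41.times.map { limiter.wait_time(0.0) }
waits.first # => 0.0
waits.last  # => 10.0
```

In real use you would pass a monotonic clock reading, sleep for the returned duration, and then ask again before calling `fetch`.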
Get Movie Data
m.async.fetch("0369610", "3079380", "0478970")
Calling m.async tells Celluloid that you want the given method to be called asynchronously. Rather than waiting for the responses from IMDb and TMDb, the caller sends a message to the concurrent object asking it to invoke the method, and then proceeds without waiting for a response. The concurrent object that receives the message processes the method call in the background.
Asynchronous calls will never raise an exception in the caller, even if an exception occurs while the receiver is processing them.
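The fire-and-forget behaviour can be illustrated with a plain-Ruby analogy built on `Thread` and `Queue` from the standard library. This is not how Celluloid is implemented, and the job contents are made up for the example; it only shows that a failure inside the background worker does not surface as a raise in the caller.

```ruby
require 'thread'

# Plain-Ruby analogy of an asynchronous mailbox; not Celluloid's implementation.
inbox   = Queue.new
results = Queue.new

worker = Thread.new do
  while (job = inbox.pop) != :stop
    begin
      results << [:ok, job.call]
    rescue StandardError => e
      # The failure stays inside the worker; the caller never sees a raise.
      results << [:error, e.class.name]
    end
  end
end

# "Async" sends: each push returns immediately, whatever the job does later.
inbox << -> { "fetched 0369610" }
inbox << -> { raise NameError, "invalid id" }
inbox << :stop

worker.join
outcomes = [results.pop, results.pop]
# outcomes == [[:ok, "fetched 0369610"], [:error, "NameError"]]
```

Because the caller never sees the exception, any error handling or logging has to happen on the receiving side.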
Redis - caching objects
By default, any movie fetched from IMDb is stored in Redis and has an expiration time of 1800 seconds (30 minutes).
You can change this expiration time with the expire option.
m.async.fetch("0369610", "3079380", expire: 86400)
Here, I set the expiration time to 86400 seconds, which is equivalent to 24 hours.
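Since the expiration is given in seconds, a small conversion helper can make the intent clearer. This is a sketch only; `hours_to_seconds` and `SECONDS_PER_HOUR` are hypothetical names, not part of movieDB.

```ruby
# Hypothetical helper (not part of movieDB): convert hours to the seconds that `expire:` expects.
SECONDS_PER_HOUR = 60 * 60

def hours_to_seconds(hours)
  (hours * SECONDS_PER_HOUR).to_i
end

hours_to_seconds(0.5) # => 1800, the default expiration
hours_to_seconds(24)  # => 86400
```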
Part 2 - Run the statistic
Below, we've collected 3 specific IMDb ids to analyze.
- Ant-Man - 0369610
- Jurassic World - 3079380
- Spy - 0478970
Finding the Mean value.
m.mean
Below is the result generated.
mean
ant-man 576.8444444444444
jurassic_world 512.5111111111111
spy 369.73333333333335
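For reference, the arithmetic mean is the sum of a movie's numeric field values divided by the number of fields. The sketch below assumes that is what `m.mean` computes per movie. It uses only three of the 45 fields (title, director and company, with values taken from the worksheet output in this README), so its results will not match the figures above.

```ruby
# Arithmetic mean of a list of numbers; assumed (not confirmed) to be what m.mean computes per movie.
def mean(values)
  values.inject(0.0, :+) / values.size
end

# Three sample fields per movie, taken from the worksheet output in this README.
ant_man        = { title: 7,  director: 15, company: 14 }
jurassic_world = { title: 14, director: 19, company: 18 }

mean(ant_man.values)        # => 12.0
mean(jurassic_world.values) # => 17.0
```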
Below are more statistical methods you can invoke on your objects.
Feel free to try them out.
- std
- sum
- count
- max
- min
- product
- standardize
- describe
- covariance
- correlation
Layout and Template
movieDB allows you to view all your data fields in a worksheet-style layout.
m.worksheet
A total of 45 fields are printed out, but we've truncated the result here for ease of reading.
ant-man jurassic_w spy
production 177 128 40
belongs_to 0 151 0
plot_synop 9083 0 9629
company 14 18 21
title 7 14 3
filming_lo 267 1037 530
cast_chara 4094 5894 1001
trailer_ur 0 46 45
cast_membe 2833 3452 939
votes 5 6 5
adult 5 5 5
also_known 928 1601 1195
director 15 19 13
plot_summa 373 298 311
countries 7 16 7
... ... ... ...
Filters
When performing statistics on an object, movieDB by default processes all fields.
However, you can choose which fields are processed using the following filters:
- only
- except
'only' analyzes the fields you provide.
'except' is the inverse of 'only'. It analyzes all the fields you did not provide.
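The two filters behave like selecting or rejecting keys of a hash. The following sketch shows that idea in plain Ruby; `filter_fields` is a hypothetical helper, not movieDB's implementation, and the field values are made up for illustration.

```ruby
# Hypothetical helper (not movieDB's implementation): keep or drop fields by name.
def filter_fields(fields, only: nil, except: nil)
  return fields.select { |name, _| only.include?(name) } if only
  return fields.reject { |name, _| except.include?(name) } if except
  fields
end

# Made-up field values for illustration.
movie = { budget: 130, revenue: 519, length: 117, title: 7 }

filter_fields(movie, only: [:budget, :revenue]) # => {:budget=>130, :revenue=>519}
filter_fields(movie, except: [:title])          # => {:budget=>130, :revenue=>519, :length=>117}
```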
m.standardize only: [:budget, :revenue, :length, :vote_average]
Processes only budget, revenue, length and vote_average values.
ant-man jurassic_w spy
budget 1.49999999 -0.3616594 1.49999999
revenue -0.5000006 1.49304559 -0.5000013
length -0.4999988 -0.5656929 -0.4999976
vote_avera -0.5000005 -0.5656931 -0.5000010
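Standardizing usually means converting values to z-scores: subtract the mean, then divide by the standard deviation. The sketch below assumes that definition and uses the sample standard deviation (dividing by n - 1). Whether movieDB uses the same convention is an assumption, and the example does not try to reproduce the table above.

```ruby
# Z-score standardization; assumes the sample standard deviation (n - 1 denominator).
def standardize(values)
  n = values.size
  mean = values.inject(0.0, :+) / n
  variance = values.inject(0.0) { |acc, v| acc + (v - mean)**2 } / (n - 1)
  std = Math.sqrt(variance)
  # Each result says how many standard deviations a value lies from the mean.
  values.map { |v| (v - mean) / std }
end

standardize([2.0, 4.0, 6.0]) # => [-1.0, 0.0, 1.0]
```

If all the values are identical, the standard deviation is zero and the results are NaN, so such a field carries no information once standardized.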
Commands
movieDB comes with commands to help you query or manipulate stored objects in Redis.
- HGETALL key Get all the fields and values in a hash of the movie
m.hgetall(["0369610"])
# => {"production_companies"=>"[{\"name\"=>\"Universal Studios\", \"id\"=>13},...}
- HKEYS key Get all the fields in a hash of the movie
m.hkeys
# => ["production_companies", "belongs_to_collection", "plot_synopsis", "company", "title",...]
- HVALS key Get all the values in a hash of the movie
m.hvals
# => ["[{\"name\"=>\"Universal Studios\", \"id\"=>13}, {\"name\"=>\"Amblin Entertainment\",...]
- ALL_IDS key Get the ids of all movies
m.all_ids
# => ["0369610", "3079380"...]
- TTL key Get the remaining time to live of a movie
m.ttl("0369610")
# => 120
- DELETE key Delete a single movie object stored in Redis
m.del("0369610")
# => ["3079380"...]
- DELETE_ALL key Delete all movie objects stored in Redis
m.delete_all
# => []
Contact me
If you'd like to collaborate, please feel free to fork the source code on GitHub.
You can also contact me at albertmck@gmail.com
Disclaimer
This software is provided “as is” and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose. Neither I, nor any developer who contributed to this project, accept any kind of liability for your use of this library.
IMDb does not permit use of its data by third parties without its consent.
Using this library for anything other than limited personal use may result in an IP ban from the IMDb website.