csvget (also, jsonget)¶ ↑
What’s new in 0.3.0¶ ↑
-
Added command-line option:
--filter=RUBY_CODE RUBY_CODE will be eval'd in context of @row.is_a?(FasterCSV::Row) # Example: ./bin/csvget --require=chronic --require=time --filter "@row['time']=Chronic.parse(@row['time']).iso8601" # ...
-
Added CHANGELOG
Dependencies¶ ↑
-
github.com/fizx/parsley/tree/master and its dependencies.
-
Rubygems
Running on EC2¶ ↑
> git clone git://github.com/fizx/csvget-ec2-recipe.git > cd csvget-ec2-recipe > ./boot.rb > ssh YOUR_INSTANCE
Local Installation¶ ↑
-
Install the dependencies.
-
> gem sources -a gems.github.com
-
> sudo gem install fizx-csvget
Example Usage¶ ↑
> cat hn.let { "headlines":[{ "title": ".title a", "link": ".title a @href", "comments": "match(.subtext a:nth-child(3), '\\d+')", "user": ".subtext a:nth-child(2)", "score": "match(.subtext span, '\\d+')", "time": "match(.subtext, '\\d+\\s+\\w+\\s+ago')" }] } > csvget --directory-prefix=./data -A "/x" -w 5 --parselet=hn.let http://news.ycombinator.com/ > head data/headlines.csv comments,title,time,link,score,user 4,Simpson's paradox: why mistrust seemingly simple statistics,2 hours ago,http://en.wikipedia.org/wiki/Simpson%27s_paradox,41,waldrews 67,America's unjust sex laws,2 hours ago,http://www.economist.com/opinion/displaystory.cfm?story_id=14165460,59,MikeCapone 23,Buy somebody lunch,3 hours ago,http://www.whattofix.com/blog/archives/2009/08/buy-somebody-lu.php,58,DanielBMarkham 10,A design pattern is an artifact of a missing feature in your chosen language,3 hours ago,http://www.snell-pym.org.uk/archives/2008/12/29/design-patterns/,31,bensummers 4,API changes in Snow Leopard,1 hour ago,http://developer.apple.com/mac/library/releasenotes/MacOSX/WhatsNewInOSX/Articles/MacOSX10_6.html#//apple_ref/doc/uid/TP40008898-SW1,14,pieter 16,How to run a linux based home web server,3 hours ago,http://stevehanov.ca/blog/index.php?id=73,28,RiderOfGiraffes 1,"OpenCL ""Hello World""",1 hour ago,"http://developer.apple.com/mac/library/documentation/Performance/Conceptual/OpenCL_MacProgGuide/Example:Hello,World/Example:Hello,World.html",8,pieter 15,US Senate bill allows White House to disconnect private computers from Internet,4 hours ago,http://news.cnet.com/8301-13578_3-10320096-38.html,35,drewr 1,Strategy: Solve Only 80 Percent of the Problem,47 minutes ago,http://highscalability.com/strategy-solve-only-80-percent-problem,6,alrex021 > csvget -h Usage: ./bin/csvget [options] SEED_URL [SEED_URL2 ...] --parselet=JSON_FILE JSON_FILE is a parselet. -w, --wait=SECONDS wait SECONDS between retrievals. -P, --directory-prefix=PREFIX save files to PREFIX/... -U, --user-agent=AGENT identify as AGENT instead of RWget/VERSION. -A, --accept-pattern=RUBY_REGEX URLs must match RUBY_REGEX to be saved to the queue. --time-limit=AMOUNT Crawler will stop after this AMOUNT of time has passed. -R, --reject-pattern=RUBY_REGEX URLs must NOT match RUBY_REGEX to be saved to the queue. --require=RUBY_SCRIPT Will execute 'require RUBY_SCRIPT' --limit-rate=RATE limit download rate to RATE. --http-proxy=URL Proxies via URL --proxy-user=USER Sets proxy user to USER --proxy-password=PASSWORD Sets proxy password to PASSWORD --fetch-class=RUBY_CLASS Must implement fetch(uri, user_agent_string) #=> [final_redirected_url, file_object] --store-class=RUBY_CLASS Must implement put(key_string, temp_file) --dupes-class=RUBY_CLASS Must implement dupe?(uri) --queue-class=RUBY_CLASS Must implement put(key_string, depth_int) and get() #=> [key_string, depth_int] --links-class=RUBY_CLASS Must implement urls(base_uri, temp_file) #=> [uri, ...] -S, --sitemap=URL URL of a sitemap to crawl (will ignore inter-page links) -V, --version -Q, --quota=NUMBER set retrieval quota to NUMBER. --max-redirect=NUM maximum redirections allowed per page. -H, --span-hosts go to foreign hosts when recursive --connect-timeout=SECS set the connect timeout to SECS. -T, --timeout=SECS set all timeout values to SECONDS. -l, --level=NUMBER maximum recursion depth (inf or 0 for infinite). --[no-]timestampize Prepend the timestamp of when the crawl started to the directory structure. --incremental-from=PREVIOUS Build upon the indexing already saved in PREVIOUS. --protocol-directories use protocol name in directories. --no-host-directories don't create host directories. -v, --[no-]verbose Run verbosely -h, --help Show this message
Copyright¶ ↑
Copyright © 2009 Kyle Maxwell. See LICENSE for details (MIT).