Website Cloner
Website Cloner is a Ruby gem for creating local copies of websites, including all assets and linked pages. It's designed to be easy to use while still offering options to customize the crawl.
Features
- Downloads the main page and all linked pages up to a specified limit
- Stores images, CSS, JavaScript, and other assets locally
- Updates references to point to local assets (see the sketch after this list)
- Maintains directory structure for pages
- Provides colored logging for better visibility
- Supports authenticated access through session cookies
- Handles relative and absolute URLs correctly
- Organizes downloaded files into appropriate directories (css, js, assets)
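To make the reference-rewriting feature concrete, here is a minimal sketch of the general technique using Nokogiri (one of the gem's dependencies). This is an illustration, not the gem's actual implementation; the rewrite_references helper and the selector-to-directory mapping are assumptions based on the directory layout described above.
require 'nokogiri'
require 'uri'
# Illustrative only: rewrite <img>, <script>, and stylesheet references so
# they point at local copies. The css/, js/, and assets/ directory names
# follow the layout described in the feature list; the rest is assumed.
def rewrite_references(html, base_url)
  doc = Nokogiri::HTML(html)
  { 'img'                     => ['src', 'assets'],
    'script'                  => ['src', 'js'],
    'link[rel="stylesheet"]'  => ['href', 'css'] }.each do |selector, (attr, dir)|
    doc.css(selector).each do |node|
      next unless node[attr]
      remote = URI.join(base_url, node[attr])  # resolves relative and absolute URLs
      node[attr] = "./#{dir}/#{File.basename(remote.path)}"
    end
  end
  doc.to_html
end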
Installation
Add this line to your application's Gemfile:
gem 'website_cloner'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install website_cloner
Dependencies: nokogiri v1.15, httparty v0.21, openssl v3.0
Usage
Command Line Interface
The Website Cloner can be used from the command line:
website-cloner https://example.com -m 10 -s "user_session=encoded_cookie_string"
This example clones https://example.com, downloads up to 10 pages, and stores the result in ./example.com (the default output directory, derived from the domain).
Options:
- -m, --max-pages PAGES: Maximum number of pages to clone (default: 20)
- -s, --session-cookie COOKIE: Session cookie for authenticated access
- -h, --help: Prints help information
Default Output Directory: If no output_directory is provided, the cloned website will be stored in a folder named after the domain of the URL (e.g., ./example.com for https://example.com).
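The default directory name follows directly from the URL's host. Here is a minimal sketch of that derivation using Ruby's standard uri library; the default_output_dir helper is hypothetical, shown only to mirror the documented behavior, and is not part of the gem's API.
require 'uri'
# Hypothetical helper: derive the default output directory from the URL,
# matching the documented behavior (./example.com for https://example.com).
def default_output_dir(url)
  "./#{URI.parse(url).host}"
end
default_output_dir('https://example.com')  # => "./example.com"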
In Ruby Scripts
You can also use Website Cloner in your Ruby scripts:
require 'website_cloner'
url = "https://example.com"
output_dir = "./cloned_site"
options = {
max_pages: 50,
session_cookie: "session_id=abc123; user_token=xyz789"
}
WebsiteCloner.clone(url, output_dir, **options)
Configuration
Website Cloner uses sensible defaults, but you can configure it to suit your needs:
- max_pages: Controls the maximum number of pages to clone (default: 20)
- session_cookie: Allows authenticated access to websites that require login (see the sketch after this list)
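Session cookies are typically sent as a Cookie header on each HTTP request. Since the gem depends on httparty, the mechanism is likely along these lines; this is an assumption about internals, not the gem's actual code.
require 'httparty'
# Assumed illustration: pass the session cookie as a Cookie header so the
# server treats each request as part of an authenticated session.
cookie = 'session_id=abc123; user_token=xyz789'
response = HTTParty.get('https://example.com/private', headers: { 'Cookie' => cookie })
puts response.code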
Logging
Website Cloner provides colored logging for better visibility. Log messages are output to the console and include information about the cloning process, any errors encountered, and the final status of the operation.
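The exact log format is internal to the gem, but colored console logging in Ruby is commonly done with ANSI escape codes and the standard Logger. A minimal sketch of the general approach; the severity-to-color mapping here is an assumption, not the gem's actual palette.
require 'logger'
# ANSI color codes keyed by severity; "\e[0m" resets the terminal color.
COLORS = { 'INFO' => "\e[32m", 'WARN' => "\e[33m", 'ERROR' => "\e[31m" }
logger = Logger.new($stdout)
logger.formatter = proc do |severity, _time, _progname, msg|
  color = COLORS.fetch(severity, '')
  "#{color}[#{severity}] #{msg}\e[0m\n"
end
logger.info('Cloning https://example.com (max 20 pages)')
logger.error('Failed to download /missing.css')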
Best Practices
- Always respect the website's robots.txt file and terms of service.
- Be mindful of the load you're putting on the target server. Consider adding delays between requests for busy sites (see the sketch after this list).
- When using session cookies, ensure you have permission to access and clone the authenticated content.
- Be cautious with cloned data, especially if it contains sensitive information from authenticated pages.
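One simple way to add delays, as suggested above, is to sleep between requests in your own scripts. A minimal sketch using plain HTTParty requests; whether the gem throttles internally is not documented here, so this is a pattern to apply yourself when needed.
require 'httparty'
urls = ['https://example.com/', 'https://example.com/about']
# Fetch each page, pausing between requests to keep server load low.
urls.each do |url|
  response = HTTParty.get(url)
  puts "#{response.code} #{url}"
  sleep 1.5  # polite delay; tune to the target site's capacity
end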
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/bhavyansh001/website_cloner. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
License
Website Cloner is released under the MIT License.