html2rss is a Ruby gem that generates RSS 2.0 feeds from websites automatically, and as a fallback via feed config.
With the feed config, you provide a URL to scrape and CSS selectors for extracting information (like title, URL, etc.). The gem builds the RSS feed accordingly. Extractors and chainable post processors make information extraction, processing, and sanitizing a breeze. The gem also supports scraping JSON responses and setting HTTP request headers.
Looking for a ready-to-use app to serve generated feeds via HTTP? Check out html2rss-web!
Support the development by sponsoring this project on GitHub. Thank you! 💓
Installation
Install | gem install html2rss |
---|---|
Usage | html2rss help |
You can also install it as a dependency in your Ruby project:
🤩 Like it? | Star it! ⭐️ |
---|---|
Add this line to your Gemfile: |
gem 'html2rss' |
Then execute: | bundle |
In your code: | require 'html2rss' |
Generating a feed on the CLI
using automatic generation
html2rss offers an automatic RSS generation feature. Try it with:
html2rss auto https://unmatchedstyle.com/
creating a feed config file and using it
If the results are not to your satisfaction, you can create a feed config file.
Create a file called my_config_file.yml with this sample content:
channel:
  url: https://unmatchedstyle.com
selectors:
  items:
    selector: "article[id^='post-']"
  title:
    selector: h2
  link:
    selector: a
    extractor: href
  description:
    selector: ".post-content"
    post_process:
      - name: sanitize_html
Build the feed from this config with: html2rss feed ./my_config_file.yml
Generating a feed with Ruby
Here's a minimal working example using Ruby:
require 'html2rss'

rss =
  Html2rss.feed(
    channel: { url: 'https://stackoverflow.com/questions' },
    selectors: {
      items: { selector: '#hot-network-questions > ul > li' },
      title: { selector: 'a' },
      link: { selector: 'a', extractor: 'href' }
    }
  )

puts rss
The feed config and its options
A feed config consists of a channel and a selectors hash. The contents of both hashes are explained below.
Good to know:
- You'll find extensive example feed configs at spec/*.test.yml.
- See html2rss-configs for ready-made feed configs!
- If you've created feed configs, you're invited to send a PR to html2rss-configs to make your config available to the public.
Alright, let's move on.
The channel
attribute | required | type | default | remark |
---|---|---|---|---|
url | required | String | | |
title | optional | String | auto-generated | |
description | optional | String | auto-generated | |
ttl | optional | Integer | 360 | TTL in minutes |
time_zone | optional | String | 'UTC' | TimeZone name |
language | optional | String | 'en' | Language code |
author | optional | String | | Format: email (Name) |
headers | optional | Hash | {} | Set HTTP request headers. See notes below. |
json | optional | Boolean | false | Handle JSON response. See notes below. |
Dynamic parameters in channel attributes
Sometimes there are structurally similar pages with different URLs. In such cases, you can add dynamic parameters to the channel's attributes.
Example of a dynamic id parameter in the channel URLs:
channel:
  url: "http://domainname.tld/whatever/%<id>s.html"
Command line usage example:
bundle exec html2rss feed the_feed_config.yml id=42
Programmatic usage example:
config = Html2rss::Config.new({ channel: { url: 'http://domainname.tld/whatever/%<id>s.html' } }, {}, { id: 42 })
Html2rss.feed(config)
For more complex formatting options, see the documentation of Ruby's sprintf method.
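The `%<id>s` placeholder uses the named-reference syntax of Ruby's `format`/`sprintf`. As a minimal sketch of the substitution itself, using only plain Ruby (the variable names `url_template` and `params` are illustrative and not part of html2rss):

```ruby
# Illustrative only: plain Kernel#format, not html2rss internals.
url_template = 'http://domainname.tld/whatever/%<id>s.html'
params = { id: 42 }

# Named references such as %<id>s are filled from the hash.
url = format(url_template, params)
puts url

# sprintf flags also work with named references, e.g. zero-padding to 5 digits.
padded = format('http://domainname.tld/whatever/%<id>05d.html', params)
puts padded
```

A missing key raises a `KeyError`, so every placeholder in the URL needs a matching parameter.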
The selectors
First, you must give an items selector hash, which contains a CSS selector. The selector selects a collection of HTML tags from which the RSS feed items are built. Except for the items selector, all other keys are scoped to each item of the collection.
To build a valid RSS 2.0 item, you need at least a title or a description. You can have both.
Having an items and a title selector is enough to build a simple feed.
Your selectors hash can contain arbitrary named selectors, but only a few will make it into the RSS feed (due to the RSS 2.0 specification):
RSS 2.0 tag | name in html2rss | remark |
---|---|---|
title | title | |
description | description | Supports HTML. |
link | link | A URL. |
author | author | |
category | categories | See notes below. |
guid | guid | Default title/description. See notes below. |
enclosure | enclosure | See notes below. |
pubDate | updated | An instance of Time. |
comments | comments | A URL. |
source | | Not yet supported. |
The selector hash
Every named selector in your selectors hash can have these attributes:
name | value |
---|---|
selector | The CSS selector to select the tag with the information. |
extractor | Name of the extractor. See notes below. |
post_process | A hash or array of hashes. See notes below. |
Using extractors
Extractors help with extracting the information from the selected HTML tag.
- The default extractor is text, which returns the tag's inner text.
- The html extractor returns the tag's outer HTML.
- The href extractor returns a URL from the tag's href attribute and corrects relative ones to absolute ones.
- The attribute extractor returns the value of that tag's attribute.
- The static extractor returns the configured static value (it doesn't extract anything).
- See file list of extractors.
Extractors might need extra attributes on the selector hash. 👉 Read their docs for usage examples.
Html2rss.feed(
channel: {}, selectors: { link: { selector: 'a', extractor: 'href' } }
)
channel:
  # ... omitted
selectors:
  # ... omitted
  link:
    selector: "a"
    extractor: "href"
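To illustrate what "corrects relative ones to absolute ones" means for the href extractor, here is a small sketch using Ruby's standard `URI.join`. The helper name `absolute_href` and the example URLs are hypothetical; the gem's own implementation may differ.

```ruby
require 'uri'

# Hypothetical helper: resolves an href value against the channel URL.
def absolute_href(channel_url, href)
  URI.join(channel_url, href).to_s
end

base = 'https://example.com/blog/'

root_relative = absolute_href(base, '/posts/1')       # starts at the host root
path_relative = absolute_href(base, 'posts/2')        # relative to the base path
already_absolute = absolute_href(base, 'https://other.example/x')

puts root_relative
puts path_relative
puts already_absolute
```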
Using post processors
Extracted information can be further manipulated with post processors.
name | description |
---|---|
gsub | Allows global substitution operations on Strings (Regexp or simple pattern). |
html_to_markdown | Converts HTML to Markdown, using reverse_markdown. |
markdown_to_html | Converts Markdown to HTML, using kramdown. |
parse_time | Parses a String containing a time in a time zone. |
parse_uri | Parses a String as URL. |
sanitize_html | Strips unsafe and unneeded HTML and adds security-related attributes. |
substring | Cuts a part off of a String, starting at a position. |
template | Based on a template, it creates a new String filled with other selectors' values. |
⚠️ Always make use of the sanitize_html post processor for HTML content. Never trust the internet! ⚠️
Chaining post processors
Pass an array to post_process to chain the post processors.
channel:
  # ... omitted
selectors:
  # ... omitted
  price:
    selector: '.price'
  description:
    selector: '.section'
    post_process:
      - name: template
        string: |
          # %{self}
          Price: %{price}
      - name: markdown_to_html
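The `%{self}` and `%{price}` references in the template look like Ruby's named format references. Assuming the template step behaves like `Kernel#format` (an assumption, not confirmed here), the first stage of the chain above could be sketched as follows. The sample values are made up.

```ruby
# Made-up sample values, as they might be extracted for one item.
values = { self: 'Weekly special', price: '9.99 EUR' }

template = "# %{self}\nPrice: %{price}"

# Kernel#format fills %{name} references from the hash.
markdown = format(template, values)
puts markdown
```

The resulting Markdown string would then be handed to the next post processor in the array, here markdown_to_html.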
Post processor gsub
The post processor gsub makes use of Ruby's gsub method.
key | type | required | note |
---|---|---|---|
pattern | String | yes | Can be Regexp or String. |
replacement | String | yes | Can be a backreference. |
Html2rss.feed(
channel: {},
selectors: {
title: { selector: 'a', post_process: [{ name: 'gsub', pattern: 'foo', replacement: 'bar' }] }
}
)
channel:
  # ... omitted
selectors:
  # ... omitted
  title:
    selector: "a"
    post_process:
      - name: "gsub"
        pattern: "foo"
        replacement: "bar"
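Since this post processor builds on Ruby's `String#gsub`, the two pattern styles can be tried in plain Ruby first. The sample strings below are made up; how html2rss turns a String from the config into a Regexp is not shown here.

```ruby
# Plain String pattern: every literal occurrence is replaced.
simple = 'foo and foo'.gsub('foo', 'bar')
puts simple

# Regexp pattern with backreferences (\1, \2, \3) in the replacement.
date = '2024-05-17'.gsub(/(\d{4})-(\d{2})-(\d{2})/, '\3.\2.\1')
puts date
```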
Adding <category> tags to an item
The categories selector takes an array of selector names. Each value of those selectors will become a <category> on the RSS item.
Html2rss.feed(
channel: {},
selectors: {
genre: {
# ... omitted
selector: '.genre'
},
branch: { selector: '.branch' },
categories: %i[genre branch]
}
)
channel:
  # ... omitted
selectors:
  # ... omitted
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
Custom item GUID
By default, html2rss generates a GUID from the title or description.
If this does not work well, you can choose other attributes from which the GUID is built. The principle is the same as for the categories: pass an array of selector names.
In all cases, the GUID is a SHA1 digest string.
Html2rss.feed(
channel: {},
selectors: {
title: {
# ... omitted
selector: 'h1'
},
link: { selector: 'a', extractor: 'href' },
guid: %i[link]
}
)
channel:
  # ... omitted
selectors:
  # ... omitted
  title:
    selector: "h1"
    link:
      selector: "a"
      extractor: "href"
    guid:
      - link
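As a rough sketch of how a GUID can be derived from chosen selector values, the following uses Ruby's standard `Digest::SHA1`. The helper `guid_for` and the way the values are joined are assumptions for illustration; html2rss may combine the values differently.

```ruby
require 'digest/sha1'

# Hypothetical helper: hashes the values of the chosen selectors.
# Joining the values without a separator is an assumption.
def guid_for(item, keys)
  Digest::SHA1.hexdigest(keys.map { |key| item.fetch(key) }.join)
end

# Made-up item values.
item = { title: 'Hello', link: 'https://example.com/hello' }

guid = guid_for(item, %i[link])
puts guid
```

Choosing a stable value such as the link keeps the GUID unchanged when an item's title is edited later.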
Adding an <enclosure> tag to an item
An enclosure can be any file, e.g. an image, audio or video file (think podcast).
The enclosure selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.
Since html2rss does no further inspection of the enclosure, its support comes with trade-offs:
- The content-type is guessed from the file extension of the URL.
- If the content-type guessing fails, it will default to application/octet-stream.
- The content-length will always be undetermined and therefore stated as 0 bytes.
Read the RSS 2.0 spec for further information on enclosing content.
Html2rss.feed(
channel: {},
selectors: {
enclosure: { selector: 'audio', extractor: 'attribute', attribute: 'src' }
}
)
channel:
  # ... omitted
selectors:
  # ... omitted
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
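The trade-offs listed above (type guessed from the extension, fallback to application/octet-stream, length stated as 0) can be sketched in plain Ruby. The lookup table and the helper names `guess_content_type` and `enclosure_attributes` are hypothetical; html2rss may rely on a fuller MIME database.

```ruby
require 'uri'

# Small illustrative lookup table, not the gem's actual mapping.
CONTENT_TYPES = {
  '.mp3' => 'audio/mpeg',
  '.mp4' => 'video/mp4',
  '.jpg' => 'image/jpeg',
  '.png' => 'image/png'
}.freeze

# Guesses the content type from the file extension of the URL path.
def guess_content_type(url)
  extension = File.extname(URI(url).path).downcase
  CONTENT_TYPES.fetch(extension, 'application/octet-stream')
end

# The length is unknown without fetching the file, so it is stated as 0.
def enclosure_attributes(url)
  { url: url, type: guess_content_type(url), length: 0 }
end

podcast = enclosure_attributes('https://example.com/episodes/1.mp3?ref=feed')
unknown = enclosure_attributes('https://example.com/download')

puts podcast.inspect
puts unknown.inspect
```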
Scraping and handling JSON responses
By default, html2rss assumes the URL responds with HTML. However, it can also handle JSON responses. The JSON response must be an Array or Hash.
key | required | default | note |
---|---|---|---|
json | optional | false | If set to true, the response is parsed as JSON. |
jsonpath | optional | $ | Use JSONPath syntax to select nodes of interest. |
Html2rss.feed(
channel: { url: 'http://domainname.tld/whatever.json', json: true },
selectors: { title: { selector: 'foo' } }
)
channel:
  url: "http://domainname.tld/whatever.json"
  json: true
selectors:
  title:
    selector: "foo"
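To make the "Array or Hash" requirement concrete, here is a minimal sketch using Ruby's standard `json` library. The helper `parse_feed_json` is hypothetical and only mirrors the stated rule; it does not evaluate JSONPath and is not how html2rss processes the response internally.

```ruby
require 'json'

# Hypothetical check: accept only a top-level Array or Hash.
def parse_feed_json(body)
  data = JSON.parse(body)
  return data if data.is_a?(Array) || data.is_a?(Hash)

  raise ArgumentError, 'expected a JSON Array or Hash at the top level'
end

list = parse_feed_json('[{"foo": "First"}, {"foo": "Second"}]')
object = parse_feed_json('{"data": [{"foo": "First"}]}')

puts list.map { |entry| entry['foo'] }.inspect
puts object.keys.inspect
```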
Set any HTTP header in the request
To set HTTP request headers, you can add them to the channel's headers hash. This is useful for APIs that require an Authorization header.
channel:
  url: "https://example.com/api/resource"
  headers:
    Authorization: "Bearer YOUR_TOKEN"
selectors:
  # ... omitted
Or for setting a User-Agent:
channel:
  url: "https://example.com"
  headers:
    User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
selectors:
  # ... omitted
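For orientation, this is what such a headers hash amounts to at the HTTP level, sketched with Ruby's standard `net/http`. The request is only built, not sent. Whether html2rss itself uses `net/http` or another HTTP client is not stated here, so treat this as an illustration of the headers only.

```ruby
require 'net/http'
require 'uri'

# Same shape as the channel's headers hash in the config above.
headers = { 'Authorization' => 'Bearer YOUR_TOKEN' }
uri = URI('https://example.com/api/resource')

# Build the GET request with the extra headers; nothing is sent here.
request = Net::HTTP::Get.new(uri, headers)

puts request.path
puts request['Authorization']
```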
Usage with a YAML config file
This step is not required to work with this gem. If you're using html2rss-web and want to create your private feed configs, keep on reading!
First, create a YAML file, e.g. feeds.yml. This file will contain your global config and multiple feed configs under the key feeds.
Example:
headers:
  "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
feeds:
  myfeed:
    channel:
    selectors:
  myotherfeed:
    channel:
    selectors:
Your feed configs go below feeds. Everything else is part of the global config.
Find a full example of a feeds.yml at spec/fixtures/feeds.test.yml.
Now you can build your feeds like this:
require 'html2rss'
myfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myfeed')
myotherfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myotherfeed')
html2rss feed feeds.yml myfeed
html2rss feed feeds.yml myotherfeed
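The split between the feeds key and the global config can be pictured with Ruby's standard `yaml` library. This sketch only shows the file layout; the sample values are made up and it does not reproduce how `Html2rss.feed_from_yaml_config` reads the file.

```ruby
require 'yaml'

# Made-up, shortened feeds.yml content.
source = <<~CONFIG
  headers:
    "User-Agent": "ExampleAgent/1.0"
  feeds:
    myfeed:
      channel:
        url: "https://example.com"
      selectors:
        items:
          selector: "article"
CONFIG

config = YAML.safe_load(source)

# Feed configs live below the feeds key ...
feeds = config.fetch('feeds')
# ... and everything else counts as global config.
global_config = config.reject { |key, _value| key == 'feeds' }

puts feeds.keys.inspect
puts global_config.keys.inspect
```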
Display the RSS feed nicely in a web browser
To display RSS feeds nicely in a web browser, you can:
- add a plain old CSS stylesheet, or
- use XSLT (eXtensible Stylesheet Language Transformations).
A web browser will apply these stylesheets and show the contents as described.
In a CSS stylesheet, you'd use element selectors to apply styles.
If you want to do more, then you need to create an XSLT. XSLT allows you to use an HTML template and to freely design the information of the RSS, including using JavaScript and external resources.
You can add as many stylesheets and types as you like. Just add them to your global configuration.
config = Html2rss::Config.new(
{ channel: {}, selectors: {} }, # omitted
{
stylesheets: [
{
href: '/relative/base/path/to/style.xsl',
media: :all,
type: 'text/xsl'
},
{
href: 'http://example.com/rss.css',
media: :all,
type: 'text/css'
}
]
}
)
Html2rss.feed(config)
stylesheets:
  - href: "/relative/base/path/to/style.xsl"
    media: "all"
    type: "text/xsl"
  - href: "http://example.com/rss.css"
    media: "all"
    type: "text/css"
feeds:
  # ... omitted
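Browsers discover stylesheets for XML documents through `xml-stylesheet` processing instructions at the top of the document. The sketch below shows how entries like the ones above map to such instructions. The helper `stylesheet_instruction` and the paths are hypothetical, and the exact markup html2rss emits is an assumption.

```ruby
# Made-up stylesheet entries in the same shape as the config above.
stylesheets = [
  { href: '/rss.xsl', media: 'all', type: 'text/xsl' },
  { href: '/rss.css', media: 'all', type: 'text/css' }
]

# Hypothetical helper: renders one xml-stylesheet processing instruction.
def stylesheet_instruction(sheet)
  format('<?xml-stylesheet href="%<href>s" type="%<type>s" media="%<media>s"?>', sheet)
end

instructions = stylesheets.map { |sheet| stylesheet_instruction(sheet) }
puts instructions
```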
Recommended further readings:
- How to format RSS with CSS on lifewire.com
- XSLT: Extensible Stylesheet Language Transformations on MDN
- The XSLT used by html2rss-web
Gotchas and tips & tricks
- Check that the channel URL does not redirect to a mobile page with a different markup structure.
- Do not rely on your web browser's developer console. html2rss does not execute JavaScript.
- Fiddling with curl and pup to find the selectors seems efficient (curl URL | pup).
- CSS selectors are versatile. Here's an overview.
Contributing
Find ideas for what to contribute in:
- https://github.com/orgs/html2rss/discussions
- the issues tracker: https://github.com/html2rss/html2rss/issues
Development Helpers
- bin/setup: installs dependencies and sets up the development environment.
- bin/guard: automatically runs rspec, rubocop and reek when a file changes.
- for a modern Ruby development experience: install ruby-lsp and integrate it into your IDE (e.g. Ruby in Visual Studio Code).
How to submit changes
- Fork this repo ( https://github.com/html2rss/html2rss/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Implement and commit your changes (git commit -am 'feat: add XYZ')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request using the GitHub web UI