Project

hscrubber

0.0
Repository is archived
No commit activity in last 3 years
No release in over 3 years
hscrubber is an HTML scrubber based on an HTML reha filter
 Dependencies

Development

>= 1.0.0
~> 2.0.1

Runtime

>= 0.8.4
>= 0.1
 Project Readme

HScrubber

HScrubber is an HTML reha engine: it filters an input stream according to a reha template written as a YAML document.

Reha

Description of reha filter

A reha is written as a YAML document. The HTML tags allowed in the output stream are listed at the top level of the document. The next level lists the allowed attributes of each tag, along with the rule keys that control the tag and its content. The keys and their meanings are as follows:

  • '_' the content of the tag is cleaned up (emptied) if it matches the specified rule;

  • '-' the tag is removed if its content matches the specified rule;

  • '^' the content of the tag is added to the content of the parent tag if it matches the specified rule, or if no rule is defined;

  • '%' sets the order of the attributes in the output. The attributes are listed separated by commas.

The keys are listed in the order of priority in which they are analysed. Each key must be preceded by the '@' symbol.
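
To make the two levels concrete, the sketch below splits one tag's entry into its allowed attributes and its rule keys, then applies the '%' ordering to a set of attributes. It is a minimal illustration under stated assumptions, not HScrubber's own code: the entry is a hand-written Ruby hash standing in for the parsed YAML, and `split_entry` and `order_attributes` are hypothetical helper names.

```ruby
# Hand-written stand-in for one tag's entry in a reha (here: 'font').
# Keys starting with '@' are rule keys; all other keys name allowed attributes.
FONT_ENTRY = {
  'face' => nil,
  'size' => nil,
  '@%'   => 'size,face',
  '@-'   => '^\s+$',
  '@_'   => '^[.,:!?#]+$'
}.freeze

# Hypothetical helper: separate allowed attribute names from rule keys.
def split_entry(entry)
  rules, attributes = entry.partition { |key, _| key.start_with?('@') }
  { attributes: attributes.map(&:first), rules: rules.to_h }
end

# Hypothetical helper: keep only the allowed attributes and sort them by the
# comma-separated '%' list. Attributes missing from the list keep their
# relative order and go last.
def order_attributes(attrs, allowed, order_spec)
  order = order_spec.to_s.split(',').map(&:strip)
  kept = attrs.select { |name, _| allowed.include?(name) }
  kept.each_with_index
      .sort_by { |(name, _), index| [order.index(name) || order.size, index] }
      .map(&:first)
      .to_h
end

parts = split_entry(FONT_ENTRY)
p parts[:attributes]
p order_attributes({ 'face' => 'Arial', 'size' => '5', 'color' => 'blue' },
                   parts[:attributes], parts[:rules]['@%'])
```

Here 'color' is dropped because it is not an allowed attribute, and the remaining attributes come out as 'size' followed by 'face'.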

Sample

Sample reha template is described as follows:

---
html:
body:
p:
i:
  @-: ^[.,:;!?\s]*$
font:
  face:
  size:
  @%: size,face
  @-: ^\s+$
  @_: ^[.,:!?#]+$
span:
  @^:
  @-: ^[.,:;!?\s]*$

Descriptions:

The 'i' tag has no allowed attributes, so any attributes are removed from the output stream. If the tag's content matches the remove rule, the tag is omitted from the output;

<i id="i_id">Text</i> -> <i>Text</i>
<font>Text<i>?</i></font> -> <font>Text</font>

The allowed attributes for the 'font' tag are 'face' and 'size'. If the tag's content matches the remove rule, the tag is omitted from the output; if it matches the cleanup rule, the content is purged. The attributes are output in the order 'size', then 'face';

<font size="5" color="blue">Text</font> -> <font size="5">Text</font>
<i>Text<font>  </font></i> -> <i>Text</i>
<i>Text<font>??</font></i> -> <i>Text<font></font></i>
<font face="Arial" size="5">Text</font> -> <font size="5" face="Arial">Text</font>

The 'span' tag has no allowed attributes, so any attributes are removed from the output stream. If the tag's content matches the remove rule, the tag is omitted from the output entirely. Otherwise, its content is added to the parent tag.

<span id="span_id">Text</span> -> <span>Text</span>
<i>Text<span>?</span></i> -> <i>Text</i>
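
The remove and cleanup decisions in the 'i' and 'font' examples can be reproduced by matching each tag's content against the sample rules. The sketch below is an illustration only: it assumes the rule values are ordinary Ruby regular expressions, it checks cleanup before remove (following the order in which the keys are listed), and it does not model the '^' key. `RULES` and `decide` are hypothetical names, not part of HScrubber.

```ruby
# Rules copied from the sample reha, compiled as Ruby regular expressions
# (an assumption: the README does not name the regex dialect).
RULES = {
  'i'    => { remove: /^[.,:;!?\s]*$/ },
  'font' => { remove: /^\s+$/, cleanup: /^[.,:!?#]+$/ }
}.freeze

# Hypothetical helper: report what would happen to a tag with this content.
# Returns :cleanup, :remove, or :keep.
def decide(tag, content)
  rules = RULES.fetch(tag, {})
  return :cleanup if rules[:cleanup]&.match?(content)
  return :remove  if rules[:remove]&.match?(content)

  :keep
end

[['i', 'Text'], ['i', '?'], ['font', '  '], ['font', '??']].each do |tag, content|
  puts "#{tag} #{content.inspect}: #{decide(tag, content)}"
end
```

With these rules, an 'i' tag containing 'Text' is kept and one containing '?' is removed, while a 'font' tag containing only whitespace is removed and one containing '??' has its content cleaned up, matching the examples.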

Usage

There are two ways to use the package in Ruby applications.

Using the class instance method

Create an instance of the class, passing a reha to its initializer. The reha must be given as a String or an IO object. Then filter the HTML:

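require 'hscrubber' # assumes the gem is loaded under its gem name

рѣха = IO.read('.рѣха.yml.sample')
hs = HScrubber.new(рѣха)

html = IO.read('sample.html').gsub(/\r/, '')
new_html = hs.scrub_html(html)

# Print the scrubbed result rather than the original input
puts new_html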

Using the class method

You can also filter an HTML document without creating a class instance:

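require 'hscrubber' # assumes the gem is loaded under its gem name

рѣха = IO.read('.рѣха.yml.sample')
html = IO.read('sample.html').gsub(/\r/, '')
new_html = HScrubber.scrub_html(html, рѣха)

# Print the scrubbed result rather than the original input
puts new_html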

Copyright

Copyright (c) 2011 Malo Skrylevo. See LICENSE for details.