Using Hpricot to Scrub HTML

[UPDATE 2007-01-10] I’ve updated the scrubber; see Hpricot Scrub for more. [/UPDATE]

I went looking for a Ruby replacement for Perl’s HTML::Scrubber for a gig and came up blank. Can it really be possible that nobody is doing anything more than blindly stripping tags?

I had seen Hpricot and thought I needed to find a reason to use it. Well, here it is: I monkey patched a couple of methods into Hpricot and off I went.

Here are the Hpricot bits.

module Hpricot
  class Elements
    def strip
      each { |x| x.strip }
    end

    def strip_attributes(safe=[], patterns={})
      each { |x| x.strip_attributes(safe, patterns) }
    end
  end

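  class Elem
    # Replace this element with its own contents: the tag goes away,
    # whatever was inside it stays.
    def strip
      parent.replace_child self, Hpricot.make(inner_html) unless
        parent.nil?
    end

    # Remove every attribute unless its name is listed in safe AND its
    # value matches the pattern registered for it in patterns (keyed by
    # symbol). With no pattern, being listed in safe is enough.
    def strip_attributes(safe=[], patterns={})
      attributes.each { |atr|
          pat = patterns[atr[0].to_sym] || ''
          remove_attribute(atr[0]) unless safe.include?(atr[0]) &&
            atr[1].match(pat)
      } unless attributes.nil?
    end
  end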
end

Just that bit gets me to the point where I can do things like this:

require 'hpricot'
require 'open-uri'  # lets open() fetch a URL

doc = Hpricot(open('http://slashdot.org/').read)

# remove all anchors leaving behind the text inside.
(doc/:a).strip

# strip all attributes except for src from all images
(doc/:img).strip_attributes(['src'])
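The keep-or-remove rule inside strip_attributes is easy to poke at without Hpricot at all. Here is a rough sketch on a plain hash of attribute names and values. The filter_attributes helper and the sample attributes are made up for illustration; they are not part of Hpricot or of the patch above.

```ruby
# Hypothetical helper mirroring the rule in strip_attributes: an attribute
# survives only if its name is in safe AND its value matches the pattern
# registered for that name. No pattern means any value is accepted.
def filter_attributes(attrs, safe = [], patterns = {})
  attrs.select do |name, value|
    pattern = patterns[name.to_sym] || //
    safe.include?(name) && value.to_s.match?(pattern)
  end
end

# Made-up attributes, as they might appear on an <img> tag.
attrs = {
  'src'     => 'http://example.com/cat.png',
  'onclick' => 'steal()',
  'style'   => 'color: red'
}

# Only src is whitelisted, so onclick and style are dropped.
only_src = filter_attributes(attrs, ['src'])

# src is whitelisted but must also start with https:, which it does not,
# so nothing survives.
https_only = filter_attributes(attrs, ['src'], { src: /\Ahttps:/ })

p only_src
p https_only
```

The point of the second call is that the whitelist and the pattern are both required: being on the safe list is not enough if a pattern for that attribute exists and the value fails it.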

Then I made a scrubber that passes the array and hash to those methods to handle the dirty work. It looks like this, though I’m also using Tidy, so mine is a little different.

class HtmlScrubber
  @@config = YAML.load_file(
    "#{RAILS_ROOT}/config/html_scrubber.yml") unless
      defined?(@@config)

  def self.scrub(markup)
    doc = Hpricot(markup || '', :xhtml_strict => true)
    raise 'No markup specified' if doc.nil?
    @@config[:nuke_tags].each { |tag| (doc/tag).remove }
    @@config[:allow_tags].each { |tag|
      (doc/tag).strip_attributes(@@config[:allow_attributes],
        @@config[:attribute_patterns]) }
    doc.traverse_all_element {|e|
      e.strip unless @@config[:allow_tags].include?(e.name)
    }
    doc.inner_html
  end
end
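The difference between nuking a tag and stripping it is the heart of the scrubber, so here is a toy model of it in plain Ruby. This is only a sketch: the Node struct, scrub_node and render are invented for the example and have nothing to do with Hpricot’s real data model, and attribute handling is left out.

```ruby
# Toy stand-in for a parsed document: a node is either a String (text)
# or a Node with a tag name and a list of children.
Node = Struct.new(:name, :children)

# Returns the list of nodes that should replace node after scrubbing.
def scrub_node(node, nuke_tags, allow_tags)
  return [node] if node.is_a?(String)         # text always survives
  return [] if nuke_tags.include?(node.name)  # nuked: tag and contents go

  kids = node.children.flat_map { |c| scrub_node(c, nuke_tags, allow_tags) }
  if allow_tags.include?(node.name)
    [Node.new(node.name, kids)]               # allowed: keep the tag
  else
    kids                                      # stripped: keep only contents
  end
end

# Serialize the toy tree back to markup.
def render(nodes)
  nodes.map { |n|
    n.is_a?(String) ? n : "<#{n.name}>#{render(n.children)}</#{n.name}>"
  }.join
end

tree = [
  Node.new('p', [
    'Hello ',
    Node.new('font', [Node.new('b', ['world'])]),
    Node.new('script', ['alert(1)'])
  ])
]

clean = tree.flat_map { |n| scrub_node(n, %w[script], %w[p b]) }
puts render(clean)  # prints <p>Hello <b>world</b></p>
```

The font tag is not on either list, so it disappears but its bold text stays; the script tag is on the nuke list, so it goes away contents and all.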

Here is a zip of the code and a sample config: html_scrubber.zip
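For a feel of what the scrubber expects, here is a guess at a minimal config with the four keys the class reads. The tag lists and patterns are assumptions for illustration, not the contents of the file in the zip. The keys are written as symbols because the scrubber looks them up that way, and the patterns are plain strings, which String#match converts to regular expressions.

```ruby
require 'yaml'

# Hypothetical html_scrubber.yml contents (illustrative values only).
SAMPLE_CONFIG = <<~YAML
  :nuke_tags:
    - script
    - style
  :allow_tags:
    - a
    - img
    - p
  :allow_attributes:
    - href
    - src
  :attribute_patterns:
    :href: "^https?://"
    :src: "^https?://"
YAML

# Recent Rubies refuse to load symbols from YAML unless told to, so the
# Symbol class is permitted explicitly here.
config = YAML.safe_load(SAMPLE_CONFIG, permitted_classes: [Symbol])

p config[:nuke_tags]
p config[:attribute_patterns][:href]
```

With a config like this, script and style elements would be removed outright, only a, img and p tags would be kept, and href and src would survive only when they point at an http or https URL.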