
Parsing HTML with Python

tyler@linux.com published on 03 May 2008

A while ago I jotted down seven or so ideas that I thought would make good blog posts. "Markup parsers in Python" is next on the list, so I might as well spill the beans on how incredibly easy it is to process (X)HTML with Python and a little built-in class called HTMLParser.

There have been a few occasions when I needed a quick (and dirty) way to perform transforms on some chunk of HTML, or merely to "search and replace" parts of it. While it might be cleaner to do something with XSLT or the like, those tools don't even begin to match the speed of development of an HTMLParser-based class in Python.

Getting Started
One major thing to keep in mind when working with HTMLParser in Python 2, especially if you're newer to Python, is that it is what's referred to as an "old-style" class, meaning subclassing it is a bit different from subclassing a "new-style" class. Since HTMLParser is an old-style class, super() won't work with it: any time you want to call a method defined on the superclass, you need to write HTMLParser.HTMLParser.superMethod(self, arg) instead of super(SubHTMLParser, self).superMethod(arg).
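To make the difference concrete, here is a minimal sketch of a subclass that calls up to HTMLParser explicitly. TagCounter is a made-up name for illustration, and the try/except import is an assumption so that the snippet also runs on Python 3, where the module lives at html.parser (and where super() would work as well).

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 location, as used in this post


class TagCounter(HTMLParser):
    """Hypothetical example: counts start tags, calling the superclass explicitly."""

    def __init__(self):
        # Explicit superclass call; works for old-style and new-style classes alike.
        HTMLParser.__init__(self)
        self.count = 0

    def reset(self):
        # Same pattern for any overridden method: name the superclass, pass self.
        HTMLParser.reset(self)
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1


counter = TagCounter()
counter.feed('<p><b>x</b></p>')
print(counter.count)
```

The explicit form costs a few extra characters, but it behaves the same on both Python versions.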


Creating the HTML parser
For the purposes of this example, I want something simple, so we're just going to take a block of markup and "tweak" all the <a> tags within it to be "sad" (where "sad" means they'll be bold, blue, and blinky). The actual code to do so is only about 50 lines long and is as follows:

import HTMLParser

class SadHTML(HTMLParser.HTMLParser):
    '''A simple HTML transform-class based upon HTMLParser. All links shall be bold, blue and blinky :('''

    def __init__(self, *args, **kwargs):
        # Old-style class: call the superclass __init__ directly
        HTMLParser.HTMLParser.__init__(self)
        self.stack = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag.lower() == 'a':
            # Wrap the link in <blink> and append the "sad" styling
            self.stack.append(self.__html_start_tag('blink', None))
            attrs['style'] = '%s%s' % (attrs.get('style', ''), 'color: blue; font-weight: bold;')
        self.stack.append(self.__html_start_tag(tag, attrs))

    def handle_endtag(self, tag):
        self.stack.append(self.__html_end_tag(tag))
        if tag.lower() == 'a':
            self.stack.append(self.__html_end_tag('blink'))

    def handle_startendtag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; __html_attrs expects a dict
        self.stack.append(self.__html_startend_tag(tag, dict(attrs)))

    def handle_data(self, data):
        self.stack.append(data)

    def __html_start_tag(self, tag, attrs):
        return '<%s%s>' % (tag, self.__html_attrs(attrs))

    def __html_startend_tag(self, tag, attrs):
        return '<%s%s/>' % (tag, self.__html_attrs(attrs))

    def __html_end_tag(self, tag):
        return '</%s>' % (tag)

    def __html_attrs(self, attrs):
        _attrs = ''
        if attrs:
            _attrs = ' %s' % (' '.join([('%s="%s"' % (k,v)) for k,v in attrs.iteritems()]))
        return _attrs

    @classmethod
    def depreshun(cls, markup):
        # Parse the markup and glue the collected pieces back together
        _p = cls()
        _p.feed(markup)
        _p.close()
        return ''.join(_p.stack)


The actual ins and outs of the parser are very simple; markup like <a href="#">Hello</a><br/> would trigger the following handler calls, in order:
  • handle_starttag('a', [('href', '#')])
  • handle_data('Hello')
  • handle_endtag('a')
  • handle_startendtag('br', [])
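The call sequence above can be checked with a small recording parser. This is only a sketch: EventLogger and its events list are hypothetical names, and the try/except import is an assumption so that it also runs on Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 location, as used in this post


class EventLogger(HTMLParser):
    """Hypothetical helper: records every handler call instead of transforming anything."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('handle_starttag', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('handle_endtag', tag))

    def handle_startendtag(self, tag, attrs):
        # Overridden so <br/> is logged once, not as a start tag plus an end tag
        self.events.append(('handle_startendtag', tag, attrs))

    def handle_data(self, data):
        self.events.append(('handle_data', data))


logger = EventLogger()
logger.feed('<a href="#">Hello</a><br/>')
logger.close()
for event in logger.events:
    print(event)
# ('handle_starttag', 'a', [('href', '#')])
# ('handle_data', 'Hello')
# ('handle_endtag', 'a')
# ('handle_startendtag', 'br', [])
```

Note that if handle_startendtag is not overridden, the default implementation calls handle_starttag followed by handle_endtag.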


Since HTMLParser just gives you element tag names and their attributes, SadHTML simply builds a list of strings out of the data passed to it by the superclass and, when everything is finished, ties the list back together with ''.join(list_of_tags).
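The collect-then-join idea can be sketched in a condensed form that also runs on Python 3. This is not the original class: SadHTMLSketch, its pieces list, and the _attrs helper are hypothetical names, and keeping attributes as a list of pairs (rather than a dict) is a choice made here so that attribute order stays predictable.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 location, as used in this post


class SadHTMLSketch(HTMLParser):
    """Hypothetical condensed variant: collect output pieces, join them at the end."""

    STYLE = 'color: blue; font-weight: bold;'

    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    @staticmethod
    def _attrs(attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        out = []
        for name, value in attrs:
            out.append(' %s' % name if value is None else ' %s="%s"' % (name, value))
        return ''.join(out)

    def handle_starttag(self, tag, attrs):
        if tag == 'a':  # HTMLParser already lowercases tag names
            style = dict(attrs).get('style') or ''
            attrs = [(k, v) for k, v in attrs if k != 'style']
            attrs.append(('style', style + self.STYLE))
            self.pieces.append('<blink>')
        self.pieces.append('<%s%s>' % (tag, self._attrs(attrs)))

    def handle_endtag(self, tag):
        self.pieces.append('</%s>' % tag)
        if tag == 'a':
            self.pieces.append('</blink>')

    def handle_startendtag(self, tag, attrs):
        self.pieces.append('<%s%s/>' % (tag, self._attrs(attrs)))

    def handle_data(self, data):
        self.pieces.append(data)

    @classmethod
    def depreshun(cls, markup):
        parser = cls()
        parser.feed(markup)
        parser.close()
        return ''.join(parser.pieces)


print(SadHTMLSketch.depreshun('<a href="#">Hello</a><br/>'))
# <blink><a href="#" style="color: blue; font-weight: bold;">Hello</a></blink><br/>
```

Like the class above, this sketch only overrides the tag and data handlers, so comments and doctype declarations are dropped, and attribute values are written back without any escaping.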
Executing the SadHTML.depreshun method on the contents of my last blog post makes a good example. Part of the post was:
An informal poll at the Slide offices this past week yielded these interesting results: at Slide.com, nearly 100% of white people seem to like "Stuff White People Like".


After running it through "SadHTML", the following markup is generated instead:
An informal poll at the Slide offices this past week yielded these interesting results: at Slide.com, nearly 100% of white people seem to like "Stuff White People Like".


If you're curious as to how much more you can do with HTMLParser, do check out the documentation. It's far more lenient than Expat when parsing HTML, and it's still fast enough to be used on longer documents. (There's also htmllib available for Python, but I've not used it yet.)
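The leniency difference is easy to see with markup that is acceptable HTML but not well-formed XML. The sketch below is illustrative only: TagCollector, expat_accepts, and the SLOPPY sample string are hypothetical names, and the try/except import is an assumption so that it also runs on Python 3.

```python
import xml.parsers.expat

try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 location, as used in this post

# <br> is never closed: common in HTML, but a well-formedness error in XML
SLOPPY = '<p>one<br>two</p>'


class TagCollector(HTMLParser):
    """Hypothetical helper: remembers the start tags it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def expat_accepts(markup):
    """Return True if Expat parses the markup without raising ExpatError."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(markup, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


collector = TagCollector()
collector.feed(SLOPPY)
collector.close()
print(collector.tags)         # HTMLParser carries on past the unclosed <br>
print(expat_accepts(SLOPPY))  # Expat rejects the mismatched tags
```

Expat stops at the first well-formedness error, while HTMLParser simply reports each tag as it sees it and leaves the bookkeeping to you.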