Prevent lxml from Reordering HTML Tags

Problem

While processing a page with lxml, I noticed that closing </font> tags were being moved relative to the <p> elements they originally wrapped. Whenever a <font> contained <p> elements and sat inside a <div>, the closing tag was relocated, so the paragraphs ended up as siblings of the <font> instead of its children.

Root cause

Searching online turned up nothing, so I checked the libraries involved. I was using lxml.html.soupparser, which delegates to BeautifulSoup. In BeautifulSoup’s source I found:

#According to the HTML standard, each of these inline tags can
#contain another tag of the same type. Furthermore, it's common
#to actually use these tags this way.
NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
                        'center')

#According to the HTML standard, these block tags can contain
#another tag of the same type. Furthermore, it's common
#to actually use these tags this way.
NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')

BeautifulSoup normalizes malformed HTML according to the standard, and these lists define which tags may nest inside themselves. Our HTML did not follow that pattern, hence the rewrite.

Fix

Patch the source

You could delete the offending entries directly in the installed BeautifulSoup, but that would affect every project that uses it. Copying a modified version into your project is possible too, yet any other code in the project that relies on BeautifulSoup would then also see the changed behavior.

Subclass

A better option is to subclass BeautifulSoup and adjust the class attributes. lxml itself needs no changes: lxml.html.soupparser.fromstring() accepts a beautifulsoup argument, so pass your subclass there.

from BeautifulSoup import BeautifulSoup


class PatchedBeautifulSoup(BeautifulSoup):
    # Work on copies so the stock BeautifulSoup class is left untouched.
    # The tables are set at class level because BeautifulSoup 3 parses the
    # markup inside __init__, so they must be in place before construction.
    NESTABLE_TAGS = dict(BeautifulSoup.NESTABLE_TAGS)
    NESTABLE_TAGS.pop('font', None)
    NESTABLE_TAGS.pop('div', None)

    RESET_NESTING_TAGS = dict(BeautifulSoup.RESET_NESTING_TAGS)
    RESET_NESTING_TAGS.pop('div', None)

Curiously, overriding NESTABLE_INLINE_TAGS and NESTABLE_BLOCK_TAGS alone had no effect. Removing font and div from NESTABLE_TAGS and RESET_NESTING_TAGS did the trick.
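A likely reason the tuple overrides are ignored can be reproduced without BeautifulSoup at all: a dict that is built from tuples while a class body runs is not rebuilt when a subclass later replaces those tuples. The sketch below is a toy model, not BeautifulSoup code; ToyParser and build_tag_map are hypothetical stand-ins, and it assumes the real lookup tables are computed once at class-definition time in a similar way.

```python
def build_tag_map(*groups):
    """Hypothetical helper: flatten several tag tuples into one dict."""
    return {tag: [] for group in groups for tag in group}


class ToyParser(object):
    NESTABLE_INLINE_TAGS = ('span', 'font')
    NESTABLE_BLOCK_TAGS = ('div', 'blockquote')
    # Evaluated exactly once, while this class body is executed.
    NESTABLE_TAGS = build_tag_map(NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS)


class TupleOverride(ToyParser):
    # Replacing the tuple does not rebuild the inherited dict.
    NESTABLE_INLINE_TAGS = ('span',)


class DictOverride(ToyParser):
    # Copy the dict, then drop the entry; ToyParser's dict is untouched.
    NESTABLE_TAGS = dict(ToyParser.NESTABLE_TAGS)
    NESTABLE_TAGS.pop('font', None)


print('font' in TupleOverride.NESTABLE_TAGS)  # True: override had no effect
print('font' in DictOverride.NESTABLE_TAGS)   # False: entry removed
print('font' in ToyParser.NESTABLE_TAGS)      # True: base class unchanged
```

Copying the dict before popping also keeps the change local to the subclass, which is the point of subclassing rather than patching the source.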

Beyond reordering

lxml also normalizes other quirks, such as quoting attribute values and reordering attributes. That is usually helpful when dealing with valid HTML, but if you ingest a lot of messy markup, watch for these automatic cleanups.
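One way to watch for structural changes is to compare where each <p> sits before and after parsing. The following is a minimal sketch using only the standard library; AncestorRecorder and p_ancestors are hypothetical names, the "after" string is a hand-written stand-in for parser output rather than real lxml output, and void elements such as <br> are not handled.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class AncestorRecorder(HTMLParser):
    """Record the chain of open tags above every <p> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.paths.append('/'.join(self.stack))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def p_ancestors(markup):
    """Return the ancestor path of each <p> in document order."""
    recorder = AncestorRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.paths


before = '<div><font><p>one</p><p>two</p></font></div>'
# Hand-written stand-in for what a normalizing parser might emit.
after = '<div><font></font><p>one</p><p>two</p></div>'

print(p_ancestors(before))  # ['div/font', 'div/font']
print(p_ancestors(after))   # ['div', 'div']
print(p_ancestors(before) == p_ancestors(after))  # False: structure changed
```

In practice the second string would come from serializing the parsed tree, and a mismatch between the two path lists signals that the parser moved elements.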