Prevent lxml from Reordering HTML Tags
Problem
While processing a page with lxml I noticed that closing </font> tags moved outside the <p> elements they originally wrapped. Whenever a <font> contained <p> elements and sat inside a <div>, the closing tag jumped outside the block so the paragraphs became siblings.
Root cause
Searching online turned up nothing, so I checked the libraries involved. I was using lxml.html.soupparser, which delegates to BeautifulSoup. In BeautifulSoup’s source I found:
    #According to the HTML standard, each of these inline tags can
BeautifulSoup normalizes malformed HTML according to the standard, and these lists define which tags may nest inside themselves. Our HTML did not follow that pattern, hence the rewrite.
Fix
Patch the source
You could delete the offending entries directly in BeautifulSoup, but that would affect every project that uses it. Copying the modified version into your project is possible, yet anything else that relies on BeautifulSoup would behave differently.
Subclass
A better option is to subclass BeautifulSoup and adjust the attributes. lxml itself needs no changes: lxml.html.soupparser.fromstring() accepts a beautifulsoup argument, so pass your subclass there.
    from BeautifulSoup import BeautifulSoup
Curiously, overriding NESTABLE_INLINE_TAGS and NESTABLE_BLOCK_TAGS alone had no effect. Removing font and div from NESTABLE_TAGS and RESET_NESTING_TAGS did the trick.
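A plausible reason the list overrides did nothing: as far as I can tell, BeautifulSoup 3 builds NESTABLE_TAGS and RESET_NESTING_TAGS once, inside the class body, from those lists. Treat that as an assumption about the library rather than a documented guarantee. If it holds, a subclass that only replaces the lists still inherits the maps that were already built. The sketch below reproduces the pattern with a stand-in class so it runs without the library; StandInSoup, build_tag_map, and without are hypothetical names, and the tag lists are shortened for illustration.

```python
# Stand-in for the library's parser class (hypothetical, stdlib only).
# It mimics the assumed pattern: the lookup maps are built once, while the
# class body executes, from the tag lists defined just above them.


def build_tag_map(default, *portions):
    """Merge dicts and sequences of tag names into a single dict."""
    built = {}
    for portion in portions:
        if isinstance(portion, dict):
            built.update(portion)
        else:
            for tag in portion:
                built[tag] = default
    return built


class StandInSoup(object):
    NESTABLE_INLINE_TAGS = ('span', 'font', 'q')
    NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset')
    NON_NESTABLE_BLOCK_TAGS = ('p', 'pre')

    # Computed here, once; subclasses inherit the finished dicts.
    NESTABLE_TAGS = build_tag_map(
        [], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS)
    RESET_NESTING_TAGS = build_tag_map(
        None, NESTABLE_BLOCK_TAGS, NON_NESTABLE_BLOCK_TAGS)


class ListsOnly(StandInSoup):
    # Overriding the source lists does not rebuild the inherited maps.
    NESTABLE_INLINE_TAGS = ('span', 'q')
    NESTABLE_BLOCK_TAGS = ('blockquote', 'fieldset')


def without(tag_map, *tags):
    """Return a copy of tag_map with the given tags removed."""
    return {k: v for k, v in tag_map.items() if k not in tags}


class MapsEdited(StandInSoup):
    # Copy the inherited maps and drop the entries; the base class is untouched.
    NESTABLE_TAGS = without(StandInSoup.NESTABLE_TAGS, 'font', 'div')
    RESET_NESTING_TAGS = without(
        StandInSoup.RESET_NESTING_TAGS, 'font', 'div')
```

With the real library, the class body of MapsEdited would go on a subclass of BeautifulSoup, and that subclass is what gets passed as the beautifulsoup argument to lxml.html.soupparser.fromstring(). Copying the maps before editing them keeps the stock BeautifulSoup class unchanged for any other code that imports it.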
Beyond reordering
lxml also normalizes other quirks, such as quoting attribute values and reordering attributes. That is usually helpful when dealing with valid HTML, but if you ingest a lot of messy markup, watch for these automatic cleanups.