Comparing HTML Parsers in lxml.html
Since version 2.0, lxml ships `lxml.html`, a dedicated HTML module with several parser options. Loading a snippet can look like this:
```python
from lxml.html import fromstring, soupparser, html5parser
```
How they differ
Core differences
- `lxml.html.fromstring` uses lxml's own HTML parser (`etree.HTMLParser`).
- `soupparser.fromstring` runs BeautifulSoup under the hood with Python's built-in `html.parser`. According to the docs, BeautifulSoup excels at detecting encodings when tags lie about them.
- `html5parser.fromstring` relies on the html5lib package, which implements the HTML5 parsing algorithm used by modern browsers.
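The encoding claim is easy to try directly: BeautifulSoup exposes its detection logic as the `UnicodeDammit` helper. A minimal sketch, assuming the `beautifulsoup4` package is installed (the byte string is illustrative, not from the lxml docs):

```python
# Sketch: BeautifulSoup's encoding detection, exposed as UnicodeDammit.
# Assumes beautifulsoup4 is installed.
from bs4 import UnicodeDammit

# Latin-1 bytes with no (or a lying) charset declaration:
data = "Sacr\xe9 bleu!".encode("latin-1")
dammit = UnicodeDammit(data)
print(dammit.unicode_markup)     # the decoded text
print(dammit.original_encoding)  # the encoding it guessed
```

The bytes are invalid as UTF-8, so UnicodeDammit falls back to a Windows-1252-compatible guess and still recovers the accented character.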
BeautifulSoup’s Chinese documentation neatly summarizes the trade-offs:
| Parser | Usage | Pros | Cons |
|---|---|---|---|
| stdlib (`html.parser`) | `BeautifulSoup(markup, "html.parser")` | Built-in, reasonable speed, good tolerance | Older Python versions were less forgiving |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Fast, tolerant | Requires C extensions |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Browser-like behavior, best error recovery, outputs HTML5 | Slow, pure Python |
Handling broken HTML
The same docs show how each parser cleans up malformed markup:
```python
from bs4 import BeautifulSoup

BeautifulSoup("<a></p>", "lxml")
```
html5lib adds the missing tags; lxml and `html.parser` drop them, though `html.parser` does not inject `<html>` and `<body>` wrappers.
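The three behaviors are easy to see side by side. A sketch, assuming `beautifulsoup4` is installed (the `lxml` and `html5lib` backends are skipped if their packages are missing):

```python
# Sketch: one malformed snippet through BeautifulSoup's three backends.
# Assumes beautifulsoup4 is installed; lxml and html5lib are optional here.
from bs4 import BeautifulSoup

markup = "<a></p>"
results = {}
for backend in ("html.parser", "lxml", "html5lib"):
    try:
        results[backend] = BeautifulSoup(markup, backend).decode()
    except Exception:
        continue  # that backend's underlying package is not installed

for backend, html in results.items():
    print(f"{backend:12} -> {html}")
```

`html.parser` returns the bare `<a></a>`, while the other two wrap the fragment in a document skeleton, each repairing it differently.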
The lxml docs add another example: html5lib normalizes certain element structures. Even if a table lacks `<tbody>`, html5lib inserts one:
```python
>>> from lxml.html import tostring, html5parser
>>> tostring(html5parser.fromstring("<table><td>foo"))
b'<table><tbody><tr><td>foo</td></tr></tbody></table>'
```
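For contrast, lxml's own libxml2-based parser does not synthesize a `<tbody>`. A quick sketch, assuming lxml is installed:

```python
# Sketch: lxml's own HTML parser leaves the table structure alone,
# unlike html5parser above. Assumes lxml is installed.
from lxml.html import fromstring, tostring

out = tostring(fromstring("<table><td>foo</td></table>"))
print(out)  # no <tbody> in the serialized result
```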
The standard HTML parser handles slightly broken HTML, but for true tag soup you may prefer BeautifulSoup via the `lxml.html.soupparser` module (ElementSoup).
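A minimal soupparser sketch, assuming both lxml and `beautifulsoup4` are installed (the tag soup is illustrative):

```python
# Sketch: lxml.html.soupparser hands the markup to BeautifulSoup,
# then converts the result into an lxml tree.
# Assumes lxml and beautifulsoup4 are installed.
from lxml.html import soupparser, tostring

tag_soup = "<meta><head><title>Hello</head><body>Hi all<p>"
root = soupparser.fromstring(tag_soup)
print(root.tag)        # fragments come back wrapped in an <html> root
print(tostring(root))
```

Because the soup has several top-level nodes, soupparser wraps the converted tree in a single `<html>` root before returning it.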