Comparing HTML Parsers in lxml.html
Since version 2.0, lxml ships `lxml.html`, a dedicated HTML module with several parser options. Loading a snippet can look like this:
```python
from lxml.html import fromstring, soupparser, html5parser
```
How they differ
Core differences
- `lxml.html.fromstring` uses lxml's own HTML parser (`etree.HTMLParser`).
- `soupparser.fromstring` runs BeautifulSoup under the hood with Python's built-in `html.parser`. According to the docs, BeautifulSoup excels at detecting encodings when tags lie about them.
- `html5parser.fromstring` relies on the html5lib package, which implements the HTML5 parsing algorithm used by modern browsers.
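The encoding claim is easy to try directly: BeautifulSoup exposes its detection logic as the `UnicodeDammit` helper. A minimal sketch, assuming the `beautifulsoup4` package is installed (the byte string is illustrative, not from the lxml docs):

```python
# Sketch: BeautifulSoup's encoding detection, exposed as UnicodeDammit.
# Assumes beautifulsoup4 is installed.
from bs4 import UnicodeDammit

# Latin-1 bytes with no (or a lying) charset declaration:
data = "Sacr\xe9 bleu!".encode("latin-1")
dammit = UnicodeDammit(data)
print(dammit.unicode_markup)     # the decoded text
print(dammit.original_encoding)  # the encoding it guessed
```

The bytes are invalid as UTF-8, so UnicodeDammit falls back to a Windows-1252-compatible guess and still recovers the accented character.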
BeautifulSoup’s Chinese documentation neatly summarizes the trade-offs:
| Parser | Usage | Pros | Cons |
|---|---|---|---|
| stdlib (`html.parser`) | `BeautifulSoup(markup, "html.parser")` | Built-in, reasonable speed, good tolerance | Older Python versions were less forgiving |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Fast, tolerant | Requires C extensions |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Browser-like behavior, best error recovery, outputs HTML5 | Slow, pure Python |
Handling broken HTML
The same docs show how each parser cleans up malformed markup:
```python
from bs4 import BeautifulSoup

BeautifulSoup("<a></p>", "lxml")
```
html5lib adds the missing tags; lxml and `html.parser` drop them, though `html.parser` does not inject `<html>` and `<body>` wrappers.
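The three behaviors are easy to see side by side. A sketch, assuming `beautifulsoup4` is installed (the `lxml` and `html5lib` backends are skipped if their packages are missing):

```python
# Sketch: one malformed snippet through BeautifulSoup's three backends.
# Assumes beautifulsoup4 is installed; lxml and html5lib are optional here.
from bs4 import BeautifulSoup

markup = "<a></p>"
results = {}
for backend in ("html.parser", "lxml", "html5lib"):
    try:
        results[backend] = BeautifulSoup(markup, backend).decode()
    except Exception:
        continue  # that backend's underlying package is not installed

for backend, html in results.items():
    print(f"{backend:12} -> {html}")
```

`html.parser` returns the bare `<a></a>`, while the other two wrap the fragment in a document skeleton, each repairing it differently.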
The lxml docs add another example: html5lib normalizes certain element structures. Even if a table lacks `<tbody>`, html5lib inserts one:
```python
>>> from lxml.html import tostring, html5parser
>>> tostring(html5parser.fromstring("<table><td>foo"))
b'<table><tbody><tr><td>foo</td></tr></tbody></table>'
```
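For contrast, lxml's own libxml2-based parser does not synthesize a `<tbody>`. A quick sketch, assuming lxml is installed:

```python
# Sketch: lxml's own HTML parser leaves the table structure alone,
# unlike html5parser above. Assumes lxml is installed.
from lxml.html import fromstring, tostring

out = tostring(fromstring("<table><td>foo</td></table>"))
print(out)  # no <tbody> in the serialized result
```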
The standard HTML parser handles slightly broken HTML, but for true tag soup you may prefer BeautifulSoup via the `lxml.html.soupparser` module (ElementSoup).
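A minimal soupparser sketch, assuming both lxml and `beautifulsoup4` are installed (the tag soup is illustrative):

```python
# Sketch: lxml.html.soupparser hands the markup to BeautifulSoup,
# then converts the result into an lxml tree.
# Assumes lxml and beautifulsoup4 are installed.
from lxml.html import soupparser, tostring

tag_soup = "<meta><head><title>Hello</head><body>Hi all<p>"
root = soupparser.fromstring(tag_soup)
print(root.tag)        # fragments come back wrapped in an <html> root
print(tostring(root))
```

Because the soup has several top-level nodes, soupparser wraps the converted tree in a single `<html>` root before returning it.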