Fixing `ValueError: Comment may not contain '--' or end with '-'` in lxml.html.soupparser
The initial error
While loading HTML with `lxml.html.soupparser.fromstring()`, I hit:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
Comparing the failing document with others, I found ASCII control characters—the “control characters” mentioned in the error. The workaround below removes them (adapted from https://github.com/html5lib/html5lib-python/issues/96):
import re
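A minimal sketch of a regex-based helper of this kind follows. The function name `strip_control_chars` and the exact character range are my assumptions, not the code from the linked issue: the range covers the ASCII control characters that XML 1.0 disallows and leaves tab, line feed and carriage return alone.

```python
import re

# Assumed range: C0 control characters except \t (0x09), \n (0x0A), \r (0x0D),
# which XML 1.0 permits.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text):
    """Return text with XML-incompatible ASCII control characters removed."""
    return _CONTROL_CHARS.sub("", text)
```

The cleaned string would then be passed to the parser in place of the raw HTML.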
Another option from Stack Overflow removes all characters whose Unicode category starts with `C`:
import unicodedata
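The category-based approach could look roughly like the sketch below. The name `strip_category_c` and the `keep` parameter are hypothetical additions of mine. Tab, line feed and carriage return are themselves in category `Cc`, so a strict version of this filter would also remove line breaks; `keep` lets them survive by default.

```python
import unicodedata


def strip_category_c(text, keep="\t\n\r"):
    """Drop characters whose Unicode category starts with 'C' (Cc, Cf, Cs, Co, Cn).

    Characters listed in `keep` are preserved even if they fall in category C.
    """
    return "".join(
        ch
        for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("C")
    )
```

Calling it with `keep=""` gives the strict behaviour described above. Note that this filter is broader than the regex helper: it also removes format characters such as the zero-width space.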
The first helper solved the initial failure.
A new exception
The production server still crashed with:
ValueError: Comment may not contain '--' or end with '-'
The only difference was the lxml version: my machine ran 3.4.4 while the server used 3.6.4. Upgrading locally to 3.7.3 reproduced the error.
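To see which comments in a document would trip this check, a rough diagnostic like the one below can help. It mirrors only the two conditions named in the message. The regex-based comment extraction and the function names are my assumptions, and it does not reproduce how lxml or BeautifulSoup actually tokenize comments.

```python
import re

# Naive comment matcher: everything between "<!--" and the first "-->".
_COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def violates_comment_rule(text):
    """True if the comment text contains '--' or ends with '-'."""
    return "--" in text or text.endswith("-")


def find_offending_comments(html):
    """Return the bodies of all comments that break the rule."""
    return [
        m.group(1)
        for m in _COMMENT_RE.finditer(html)
        if violates_comment_rule(m.group(1))
    ]
```

Running `find_offending_comments` over the failing page lists the comment bodies worth inspecting by hand.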
Hunting for answers
Search results suggested replacing the local `html5parser.py` with the latest version from GitHub, but that only applies to `html5parser.fromstring`, not `soupparser.fromstring`.
I dove into the documentation and source and noted:
- `soupparser.fromstring` loads the HTML with BeautifulSoup, then calls `html.parser.makeelement` to build the tree.
- The function accepts a custom `makeelement`, but there is no documentation on writing one.
- `makeelement` comes from `lxml.etree`, which is implemented in Cython.
- The exceptions above originate in `etree`.
- BeautifulSoup already mutates comment nodes.
Because `etree` is not pure Python, overriding its behavior would be messy. Passing different BeautifulSoup implementations (bs3, bs4, customized subclasses) also failed.
Back to the beginning
After stripping control characters, the older lxml version worked fine. Checking the changelog showed that starting with 3.5.0b1—when soupparser switched from bs3 to bs4—the comment validation appeared. Rolling back to 3.4.4 on the server and sanitizing the HTML upstream turned out to be the pragmatic fix.
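If rolling back is not an option, the upstream sanitizing step could in principle also rewrite comments so they pass the newer validation. The sketch below is not what I deployed. It is a hypothetical helper (`sanitize_comments` is my own name) that treats comments with a simple regex, collapses runs of hyphens and drops trailing ones, which changes comment text and may not suit every page.

```python
import re

# Naive comment matcher: everything between "<!--" and the first "-->".
_COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def _clean_comment(match):
    body = match.group(1)
    body = re.sub(r"-{2,}", "-", body)  # no "--" left inside the comment
    body = body.rstrip("-")             # comment no longer ends with "-"
    return "<!--" + body + "-->"


def sanitize_comments(html):
    """Rewrite HTML comments so they contain no '--' and do not end with '-'."""
    return _COMMENT_RE.sub(_clean_comment, html)
```

It would run on the raw HTML string before parsing, alongside the control-character cleanup.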