Fixing `ValueError: Comment may not contain '--' or end with '-'` in lxml.html.soupparser

Posted on 2017-03-21 in Programming Languages, Python

The initial error

While loading HTML with lxml.html.soupparser.fromstring() I hit:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

Comparing the failing document with others, I found ASCII control characters—the “control characters” mentioned in the error. The workaround below removes them (adapted from https://github.com/html5lib/html5lib-python/issues/96):

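import re

# Python 2 code: relies on unichr() and ur"" literals.
def remove_control_characters(html):
    def str_to_int(s, default, base=10):
        # Despite its name, this returns the decoded character for code
        # points below 0x10000 and the untouched entity text otherwise.
        if int(s, base) < 0x10000:
            return unichr(int(s, base))
        return default
    # Decode decimal (&#65;) and hexadecimal (&#x41;) character references.
    html = re.sub(ur"&#(\d+);?", lambda c: str_to_int(c.group(1), c.group(0)), html)
    html = re.sub(ur"&#[xX]([0-9a-fA-F]+);?", lambda c: str_to_int(c.group(1), c.group(0), base=16), html)
    # Strip C0 control characters (tab, LF, FF and CR are kept) and DEL.
    html = re.sub(ur"[\x00-\x08\x0b\x0e-\x1f\x7f]", "", html)
    return html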
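
The helper above is Python 2 only, since `unichr` and the `ur""` prefix do not exist in Python 3. A rough Python 3 port might look like the sketch below. The inner name `entity_to_char` is my own choice for illustration; the behavior is meant to mirror the original, not improve on it.

```python
import re


def remove_control_characters(html):
    """Python 3 sketch: decode numeric entities, then drop control characters."""

    def entity_to_char(digits, original, base=10):
        # Decode only code points below 0x10000; anything larger is
        # left as the original entity text, as in the Python 2 version.
        code = int(digits, base)
        if code < 0x10000:
            return chr(code)
        return original

    # Decimal references such as &#65;
    html = re.sub(r"&#(\d+);?",
                  lambda m: entity_to_char(m.group(1), m.group(0)), html)
    # Hexadecimal references such as &#x41;
    html = re.sub(r"&#[xX]([0-9a-fA-F]+);?",
                  lambda m: entity_to_char(m.group(1), m.group(0), base=16), html)
    # Drop C0 control characters (tab, LF, FF and CR are kept) and DEL.
    return re.sub(r"[\x00-\x08\x0b\x0e-\x1f\x7f]", "", html)
```

Like the original, this decodes every numeric reference it can, not only the ones that hide control characters, so `&#60;` becomes a literal `<`. Whether that is acceptable depends on the input.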

Another option from Stack Overflow removes all characters whose Unicode category starts with C:

import unicodedata
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
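
One thing to watch with the category-based approach: tab, newline and carriage return are themselves in category `Cc`, so they get removed along with everything else. The sketch below shows that effect next to a variant that keeps them. The second function, `remove_control_keep_whitespace`, and its `keep` parameter are hypothetical names of mine, not part of the Stack Overflow answer.

```python
import unicodedata


def remove_control_characters(s):
    # Drops every character whose Unicode category starts with "C"
    # (Cc, Cf, Cs, Co, Cn), which includes "\t", "\n" and "\r".
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")


def remove_control_keep_whitespace(s, keep="\t\n\r"):
    # Hypothetical variant: same filter, but characters listed in
    # `keep` are preserved even though they are category Cc.
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch)[0] != "C"
    )


sample = "line one\n\tline two\x00\x1f"
print(repr(remove_control_characters(sample)))
print(repr(remove_control_keep_whitespace(sample)))
```

For HTML where line breaks matter (for example inside `<pre>`), the second form is probably the safer default.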

The first helper solved the initial failure.

A new exception

The production server still crashed with:

ValueError: Comment may not contain '--' or end with '-'

The only difference was the lxml version: my machine ran 3.4.4 while the server used 3.6.4. Upgrading locally to 3.7.3 reproduced the error.

Hunting for answers

Search results suggested replacing the local html5parser.py with the latest version from GitHub, but that only applies to html5parser.fromstring, not soupparser.fromstring.

I dove into the documentation and source and noted:

soupparser.fromstring loads the HTML with BeautifulSoup, then calls lxml.html.html_parser.makeelement to build the tree.
The function accepts a custom makeelement, but there is no documentation on writing one.
makeelement comes from lxml.etree, which is implemented in Cython.
The exceptions above originate in etree.
BeautifulSoup already mutates comment nodes.

Because etree is not pure Python, overriding its behavior would be messy. Passing different BeautifulSoup implementations (bs3, bs4, customized subclasses) also failed.

Back to the beginning

After stripping control characters, the older lxml version worked fine. Checking the changelog showed that starting with 3.5.0b1—when soupparser switched from bs3 to bs4—the comment validation appeared. Rolling back to 3.4.4 on the server and sanitizing the HTML upstream turned out to be the pragmatic fix.
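
If rolling back is not an option, the same "sanitize upstream" idea could in principle be extended to the comments themselves. This is not what I ended up doing, and I have not verified it against every lxml and BeautifulSoup combination; it is a minimal sketch that rewrites comment bodies so they satisfy the rule quoted in the error message (no `--` inside, no trailing `-`). The names `sanitize_comments` and `_clean_comment` are made up for this example.

```python
import re

# Matches an HTML comment and captures its body; DOTALL lets the body
# span several lines.
_COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def _clean_comment(match):
    body = match.group(1)
    # Collapse every run of two or more hyphens into a single hyphen.
    body = re.sub(r"-{2,}", "-", body)
    # A body ending in "-" is rejected as well, so pad it with a space.
    if body.endswith("-"):
        body += " "
    return "<!--" + body + "-->"


def sanitize_comments(html):
    """Rewrite comment bodies so they contain no '--' and do not end in '-'."""
    return _COMMENT_RE.sub(_clean_comment, html)


print(sanitize_comments("<p>a</p><!-- x -- y ---><b>--</b>"))
```

Being regex-based, this will also touch anything that merely looks like a comment, for example inside `<script>` blocks, so treat it as a starting point only.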