Python is good at unicode. Python has a convenient API for parsing XML. Python is even good at parsing unicode XML. Great!

However, it turns out that there are a bunch of unicode characters that are actually illegal in XML. Wha? Oh, I forgot to read section 2.2 of the XML 1.0 standard.

Check this out. If we run the following code:

import xml.dom.minidom
x = u"<foo>text\u001a</foo>"
dom = xml.dom.minidom.parseString(x.encode("utf-8"))

We get a nice exception:

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9

Our friend, the \u001a character, has caused a little problem. While this character is legal utf-8, it is not legal in xml, as I figured out from reading this page: http://boodebr.org/main/python/all-about-python-and-unicode. This page contains lots of info, but it doesn’t contain the simple solution I need: replace illegal unicode characters with another character (say, “?”). Here’s some sample code that does it. Enjoy in good health.

import xml.dom.minidom
import re

# from http://boodebr.org/main/python/all-about-python-and-unicode#UNI_XML
RE_XML_ILLEGAL = u'([\u0000-\u0008\u000b-\u000c\u000e-\u001f\ufffe-\uffff])' + \
                 u'|' + \
                 u'([%s-%s][^%s-%s])|([^%s-%s][%s-%s])|([%s-%s]$)|(^[%s-%s])' % \
                  (unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                   unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                   unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff))
x = u"<foo>text\u001a</foo>"
x = re.sub(RE_XML_ILLEGAL, "?", x)
dom = xml.dom.minidom.parseString(x.encode("utf-8"))

The power is all in the regex, which is thanks to boodebr.org. Change the regex matching as you will. Comments informing me of my code’s crappiness are welcome.

Max