Python’s minidom, xml, and illegal unicode characters

Python is good at unicode. Python has a convenient API for parsing XML. Python is even good at parsing unicode XML. Great!

However, it turns out that there are a bunch of unicode characters that are actually illegal in XML. Wha? Oh, I forgot to read section 2.2 of the XML 1.0 standard.
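
For the record, the Char production in section 2.2 of XML 1.0 allows only tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. Here is a minimal sketch of that rule as a predicate, assuming Python 3. The name is_xml_char is made up for illustration and is not part of any library.

```python
def is_xml_char(ch):
    """Return True if the single character ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # skips U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # characters outside the BMP
    )
```

With this, is_xml_char(u"\u001a") is False, which is exactly the character that causes trouble below.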

Check this out. If we run the following code:

import xml.dom.minidom
x = u"<foo>text\u001a</foo>"
dom = xml.dom.minidom.parseString(x.encode("utf-8"))

We get a nice exception:

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
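
If you would rather report the position than crash, the exception carries it. This is a small sketch, assuming Python 3; try_parse is a hypothetical helper name, while ExpatError and its lineno and offset attributes come from the standard library.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def try_parse(text):
    """Parse text with minidom.

    Returns (dom, None) on success, or (None, (lineno, offset)) when
    expat rejects the input.
    """
    try:
        return xml.dom.minidom.parseString(text.encode("utf-8")), None
    except ExpatError as err:
        return None, (err.lineno, err.offset)


dom, where = try_parse("<foo>text\u001a</foo>")
```

Here dom comes back as None and where holds the line and column that expat complained about.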

Our friend, the \u001a character, has caused a little problem. While this character encodes to UTF-8 without complaint, it is not legal in XML, as I figured out from reading this page: http://boodebr.org/main/python/all-about-python-and-unicode. That page contains lots of info, but it doesn’t contain the simple solution I need: replace the illegal characters with another character (say, “?”). Here’s some sample code that does it. Enjoy in good health.

import xml.dom.minidom
import re

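# Regex from http://boodebr.org/main/python/all-about-python-and-unicode#UNI_XML
# (Python 2 only: unichr() does not exist in Python 3.)
# First group: control characters plus U+FFFE and U+FFFF, which XML 1.0 forbids.
# Remaining groups: unpaired surrogates (a high one with no low one after it, or vice versa).
RE_XML_ILLEGAL = u'([\u0000-\u0008\u000b-\u000c\u000e-\u001f\ufffe-\uffff])' + \
                 u'|' + \
                 u'([%s-%s][^%s-%s])|([^%s-%s][%s-%s])|([%s-%s]$)|(^[%s-%s])' % \
                  (unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                   unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                   unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff))
x = u"<foo>text\u001a</foo>"
# Replace every match with "?" before handing the text to the parser.
x = re.sub(RE_XML_ILLEGAL, u"?", x)
dom = xml.dom.minidom.parseString(x.encode("utf-8"))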

The power is all in the regex, which is thanks to boodebr.org. Change the regex matching as you will. Comments informing me of my code’s crappiness are welcome.
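
For anyone on Python 3, where unichr is gone and a str is a sequence of code points, the same idea can be written as one negated character class built from the Char production in section 2.2. This is a sketch under that assumption, not the boodebr.org regex. The name sanitize_xml_text is made up for illustration.

```python
import re
import xml.dom.minidom

# Matches any single character outside the XML 1.0 Char production.
# Lone surrogates (U+D800 to U+DFFF) fall outside the allowed ranges,
# so they are caught by the same class.
RE_XML_ILLEGAL = re.compile(
    "[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def sanitize_xml_text(text, replacement="?"):
    """Replace every character that XML 1.0 forbids with replacement."""
    return RE_XML_ILLEGAL.sub(replacement, text)


x = sanitize_xml_text("<foo>text\u001a</foo>")
dom = xml.dom.minidom.parseString(x.encode("utf-8"))
```

Because this pattern matches one character at a time, a stray surrogate is replaced on its own and the character next to it is left alone.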

Max


11 thoughts on “Python’s minidom, xml, and illegal unicode characters”

  1. I often skip section 2.2 while reading the XML spec, too, so don’t feel bad.

    Seriously though, that’s pretty weird. I wonder what, if any, xml libraries have built-in options for omitting / ?-replacing the invalid characters?

    Last time I was pythoning, lxml and etree seemed like the new hotness WRT xml processing, so it would be interesting to peek at them. Not interesting enough for me to actually do it, mind you.

  2. A friend told me that the presence of the actual offending character was breaking Opera’s feed reader (probably, it also broke other readers). I removed it. Perhaps it will un-break the feed now.

  3. Thanks for posting this, it was very helpful to me.

    BTW, I would suggest using x = re.sub(regex, "?", x) instead of the for loop.

  4. I tried this piece of code because I was having problems with an apostrophe in the XML. It did not work for me. Not sure what I’m doing wrong!

