Python’s minidom, XML, and illegal Unicode characters
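Python is good at Unicode. Python has a convenient API for parsing XML. Python is even good at parsing Unicode XML. Great!
However, it turns out that there are a bunch of Unicode characters that are actually illegal in XML: most of the ASCII control characters below U+0020 (everything except tab, line feed, and carriage return), lone surrogates, and U+FFFE/U+FFFF. Wha? Oh, I forgot to read section 2.2 of the XML 1.0 standard.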
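For the record, section 2.2 boils down to one rule about which code points may appear in a document. Here is a minimal sketch of that rule as a predicate, assuming Python 3; the name is_legal_xml_char is mine, not something from the standard library.

```python
def is_legal_xml_char(ch):
    """Return True if ch matches the XML 1.0 Char production (section 2.2)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # most of the BMP, minus surrogates
        or 0xE000 <= cp <= 0xFFFD      # rest of the BMP, minus U+FFFE/U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


print(is_legal_xml_char("a"))       # True
print(is_legal_xml_char("\u001a"))  # False
```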
Check this out. If we run the following code:
import xml.dom.minidom
x = u"<foo>text\u001a</foo>"
dom = xml.dom.minidom.parseString(x.encode("utf-8"))
We get a nice exception:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
Our friend, the \u001a character (the SUB control character), has caused a little problem. While this character encodes to perfectly legal UTF-8, it is not legal in XML, as I figured out from reading this page: http://boodebr.org/main/python/all-about-python-and-unicode. This page contains lots of info, but it doesn’t contain the simple solution I need: replace illegal Unicode characters with another character (say, “?”). Here’s some sample code that does it. Enjoy in good health.
import xml.dom.minidom
import re
# from http://boodebr.org/main/python/all-about-python-and-unicode#UNI_XML
RE_XML_ILLEGAL = u'([\u0000-\u0008\u000b-\u000c\u000e-\u001f\ufffe-\uffff])' + \
u'|' + \
u'([%s-%s][^%s-%s])|([^%s-%s][%s-%s])|([%s-%s]$)|(^[%s-%s])' % \
(unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff))
regex = re.compile(RE_XML_ILLEGAL)
x = u"<foo>text\u001a</foo>"
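# Walk the matches from the end so earlier offsets stay valid when a
# two-character match (a broken surrogate pair) shrinks to a single "?".
for match in reversed(list(regex.finditer(x))):
    x = x[:match.start()] + u"?" + x[match.end():]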
dom = xml.dom.minidom.parseString(x.encode("utf-8"))
The power is all in the regex, which is thanks to boodebr.org. Change the regex matching as you will. Comments informing me of my code’s crappiness are welcome.
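If you are on Python 3, where strings are indexed by code point rather than by UTF-16 code unit, a lone surrogate is just a single character in the range U+D800 to U+DFFF, so the pair-matching part of the regex is no longer needed. Here is a rough sketch under that assumption. The names ILLEGAL_XML_CHARS and scrub_xml_text are made up for illustration, and the character class is my own restatement of section 2.2, not the boodebr.org regex.

```python
import re
import xml.dom.minidom

# Every code point outside the XML 1.0 Char production (section 2.2).
# The pattern is a raw string, so the re module interprets the escapes.
ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def scrub_xml_text(text, replacement="?"):
    """Replace each character that XML 1.0 forbids with `replacement`."""
    return ILLEGAL_XML_CHARS.sub(replacement, text)


x = scrub_xml_text("<foo>text\u001a</foo>")
dom = xml.dom.minidom.parseString(x.encode("utf-8"))
print(dom.documentElement.firstChild.data)  # text?
```

Each illegal character is swapped one-for-one, so the legal text around it is left alone.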
Max
5 comments so far
I often skip section 2.2 while reading the XML spec, too, so don’t feel bad.
Seriously though, that’s pretty weird. I wonder what, if any, xml libraries have built-in options for omitting / ?-replacing the invalid characters?
Last time I was pythoning, lxml and etree seemed like the new hotness WRT xml processing, so it would be interesting to peek at them. Not interesting enough for me to actually do it, mind you.
I really want to thank you. It saved my life.
A friend told me that the presence of the actual offending character was breaking Opera’s feed reader (probably, it also broke other readers). I removed it. Perhaps it will un-break the feed now.
Thanks for posting this, it was very helpful to me.
BTW, I would suggest using x = re.sub(regex, '?', x) instead of the for loop.
Thank you for this. Really helped me.