Article:   Cooking with Python, Part 1
Subject:   Source file encoding
Date:   2005-06-18 03:01:49
From:   CraigRinger
When working with unicode, a few other things are worth noting:


If you're writing unicode literal strings in your source code, consider using the magic source encoding comment:


# -*- coding: utf-8 -*-


to tell Python that your source is in UTF-8 rather than the default 7-bit ASCII. This lets you avoid doing ugly things with the unicode(...) constructor when you just want to use a non-ASCII literal. See PEP 263 (http://www.python.org/peps/pep-0263.html) for the gory details.


There is a shorthand for writing a unicode string literal: the u'' prefix, as in u'fred'. It pairs well with the encoding comment when the literal contains non-ASCII characters.
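As a rough illustration of the encoding comment and the u'' prefix working together, here is a minimal sketch. The variable names are made up for the example, and it assumes the file really is saved as UTF-8 with the declaration on its first or second line.

```python
# -*- coding: utf-8 -*-
# The declaration above must sit on the first or second line of the file.

name = u'fred'   # ASCII-only unicode literal
word = u'café'   # non-ASCII literal, read as UTF-8 thanks to the declaration

# Characters and bytes are counted separately: the accented letter is one
# character but takes two bytes once the string is encoded as UTF-8.
char_count = len(word)
byte_count = len(word.encode('utf-8'))
print('%d characters, %d bytes' % (char_count, byte_count))
```

Without the declaration, an older interpreter that defaults to ASCII source would most likely reject the second literal rather than guess at its encoding.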


Also, DO NOT change the default encoding with sys.setdefaultencoding. You should never need to, and it can cause some interesting effects in other modules. The function is deleted from the sys module at startup for a reason. In most cases people change the default encoding because they haven't put an encoding specifier comment in their source.
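Instead of touching the default encoding, it is usually enough to decode and encode explicitly where bytes enter and leave your code. A small sketch, with made-up variable names and a hard-coded byte string standing in for real input:

```python
raw = b'caf\xc3\xa9'   # UTF-8 bytes, as they might arrive from a file or socket

# Decode explicitly instead of relying on the interpreter's default encoding.
text = raw.decode('utf-8')

# Decoding with the wrong codec fails loudly rather than producing garbage.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Encode explicitly again on the way out.
round_tripped = text.encode('utf-8')
```

Because the codec is named at each step, the result does not depend on how the interpreter happens to be configured.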


Another thing to be aware of is that some code, mostly Python/C extension modules, may not handle unicode gracefully. Unlike pure Python, Python/C code must treat Unicode and byte strings as distinct types (e.g., it must specify which one it wants when processing arguments). This means such code won't work with unicode unless it handles it explicitly. Problems aren't common, but they're incredibly annoying when they do turn up.


Overall, I find unicode in Python to be somewhat troublesome, but nowhere near as bad as in most other languages. C++ with Qt probably does it about as well, but Python is more explicit and strict (you'll get an error instead of garbage text when you screw up). The main thing is to *understand* text encodings, and to be explicit about them at every I/O point in your program.
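One way to be explicit at an I/O point is to name the encoding whenever a file is opened. The sketch below assumes a Python version that ships the io module; write_text and read_text are hypothetical helper names, not part of any library.

```python
import io
import os
import tempfile


def write_text(path, text, encoding='utf-8'):
    """Write a unicode string to path, encoding it explicitly."""
    with io.open(path, 'w', encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding='utf-8'):
    """Read path and return a unicode string, decoding it explicitly."""
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()


# Round-trip a non-ASCII string through a temporary file.
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, 'sample.txt')
write_text(tmp_path, u'na\xefve caf\xe9')
loaded = read_text(tmp_path)
```

Inside the program everything stays a unicode string; bytes exist only at the file boundary, where the encoding is stated.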