O'Reilly Book Excerpts: Python Cookbook, Second Edition
Cooking with Python, Part 1
Editor's note: The second edition of Python Cookbook has been updated for Python 2.4 to include more than 200 recipes with solutions to problems that Python programmers face every day. We've selected two new recipes from the book to showcase here; check back next week for two additional recipes on implementing a ring buffer and computing prime numbers.
Recipe 1.20: Handling International Text with Unicode
Credit: Holger Krekel
You need to deal with text strings that include non-ASCII characters.
Python has a first-class unicode type that you can
use in place of the plain bytestring str type.
It's easy, once you accept the need to explicitly
convert between a bytestring and a Unicode string:
>>> german_ae = unicode('\xc3\xa4', 'utf8')
Here german_ae is a unicode string representing the German
lowercase a with umlaut (i.e., diaeresis) character "ä". It has
been constructed by interpreting the bytestring '\xc3\xa4' according
to the specified UTF-8 encoding. There are many encodings, but UTF-8
is often used because it is universal (UTF-8 can encode any Unicode
string) and yet fully compatible with the 7-bit ASCII set (any ASCII
bytestring is a correct UTF-8-encoded string).
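Both properties claimed for UTF-8 can be checked directly. The following is a minimal sketch, not part of the original recipe: the variable names are illustrative, and it is written with the b'' prefix and the decode method (which does the same job as the unicode(bytestring, 'utf8') call above) on the assumption that you may want to run it on Python 2.6 or later, including Python 3.

```python
# ASCII-only bytes decode to the same text under both ASCII and UTF-8.
ascii_bytes = b'plain text'
same_text = ascii_bytes.decode('ascii') == ascii_bytes.decode('utf-8')

# A non-ASCII character takes more than one byte in UTF-8.
german_ae = b'\xc3\xa4'.decode('utf-8')
char_count = len(german_ae)                  # one character
byte_count = len(german_ae.encode('utf-8'))  # two bytes
```

The character count and the byte count differ, which is a good reminder that a bytestring's length says little about the text it holds.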
Once you cross this barrier, life is easy! You can manipulate this
Unicode string in practically the same way as a plain str:
>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])
Note that para is a Unicode string, because operations between a
unicode string and a bytestring always result in a unicode string,
unless they fail and raise an exception:
>>> bytestring = '\xc3\xa4' # Uuh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in
position 0: ordinal not in range(128)
The byte '0xc3' is not a valid character in the
7-bit ASCII encoding, and Python refuses to guess an encoding. So,
being explicit about encodings is the crucial point for successfully
using Unicode strings with Python.
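One way to avoid the error is to decode the bytestring yourself before combining it with Unicode text. The sketch below assumes the bytes are known to be UTF-8. It first reproduces the failure by decoding as ASCII explicitly, and uses the b'' prefix so that it runs on Python 2.6 or later as well as Python 3; the variable names are illustrative only.

```python
raw = b'\xc3\xa4'  # bytes that we know (by assumption) are UTF-8

# Decoding as ASCII fails, because 0xc3 is outside the 7-bit range.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the encoding explicitly succeeds, and the result can be
# combined freely with other Unicode text.
german_ae = raw.decode('utf-8')
combined = german_ae + raw.decode('utf-8')
```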
Unicode is easy to handle in Python, if you respect a few guidelines and learn to deal with common problems. This is not to say that an efficient implementation of Unicode is an easy task. Luckily, as with other hard problems, you don't have to care much: you can just use the efficient implementation of Unicode that Python provides.
The most important issue is to fully accept the distinction between a
bytestring and a unicode string. As exemplified in
this recipe's solution, you often need to explicitly construct a
unicode string by providing a
bytestring and an encoding. Without an encoding, a bytestring is
basically meaningless, unless you happen to be lucky and can just
assume that the bytestring is text in ASCII.
The most common problem with using Unicode in Python arises when you
are doing some text manipulation where only some of your strings are
unicode objects and others are bytestrings. Python
makes a shallow attempt to implicitly convert your bytestrings to
Unicode. It usually assumes an ASCII encoding, though, which gives
UnicodeDecodeError exceptions if you actually
have non-ASCII bytes somewhere. A UnicodeDecodeError
tells you that you mixed Unicode and bytestrings in such a way that
Python cannot (doesn't even try to) guess the text
your bytestring might represent.
Developers from many big Python projects have come up with simple
rules of thumb to prevent such runtime
UnicodeDecodeErrors, and the rules may be
summarized into one sentence: always do the conversion at IO
barriers. To express this same concept a bit more extensively:
Whenever your program receives text data "from the outside" (from the network, from a file, from user input, etc.), construct
unicode objects immediately. Find out the appropriate encoding, for example, from an HTTP header, or look for an appropriate convention to determine the encoding to use.
Whenever your program sends text data "to the outside" (to the network, to some file, to the user, etc.), determine the correct encoding, and convert your text to a bytestring with that encoding. (Otherwise, Python attempts to convert Unicode to an ASCII bytestring, likely producing
UnicodeEncodeErrors, which are just the converse of the
UnicodeDecodeErrors previously mentioned.)
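The two rules can be sketched as a pair of small helpers. This is only an illustration under stated assumptions: read_text and write_text are hypothetical names, the io.BytesIO objects stand in for a real file or socket, and the Latin-1 input and UTF-8 output encodings are chosen arbitrarily for the example. The io module requires Python 2.6 or later.

```python
import io

def read_text(stream, encoding):
    """Decode bytes to text as soon as they enter the program."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding):
    """Encode text to bytes only when it leaves the program."""
    stream.write(text.encode(encoding))

# Simulated input: Latin-1 bytes, as if read from a file or socket.
incoming = io.BytesIO(b'Gr\xfc\xdfe')
text = read_text(incoming, 'latin1')

# Inside the program, work only with Unicode text.
message = text + u'!'

# Simulated output: the receiver is assumed to expect UTF-8.
outgoing = io.BytesIO()
write_text(outgoing, message, 'utf-8')
result = outgoing.getvalue()
```

Keeping the decode and encode calls at the edges means the rest of the program never has to ask which encoding a given string is in.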
With these two rules, you will solve most Unicode problems. If you
still get UnicodeErrors of either kind, look for
where you forgot to properly construct a unicode
object, forgot to properly convert back to an encoded bytestring, or
ended up using an inappropriate encoding due to some mistake. (It is
quite possible that such encoding mistakes are due to the user, or
some other program that is interacting with yours, not following the
proper encoding rules or conventions.)
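An inappropriate encoding can be the hardest of these mistakes to spot, because it does not always raise an exception. The short sketch below (variable names are illustrative) decodes the same UTF-8 bytes once correctly and once as Latin-1. Since Latin-1 assigns a character to every byte value, the second call succeeds silently and yields the wrong text.

```python
utf8_bytes = b'\xc3\xa4'  # the UTF-8 encoding of a single character

right = utf8_bytes.decode('utf-8')   # one character, as intended
wrong = utf8_bytes.decode('latin1')  # two unrelated characters, no error

lengths = (len(right), len(wrong))
```

If text shows up with pairs of odd accented characters where one character was expected, a mismatch of this kind is a likely cause.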
In order to convert a Unicode string back to an encoded bytestring, you usually do something like:
>>> bytestring = german_ae.encode('latin1')
>>> bytestring
'\xe4'
Now bytestring is a German ä character in the
'latin1' encoding. Note how
'\xe4' (in Latin1) and the previously shown
'\xc3\xa4' (in UTF-8) represent the same German
character, but in different encodings.
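As a quick check of this point, the sketch below encodes one Unicode character in both encodings and decodes each result with its own encoding. It also shows what happens when the target encoding cannot represent the character at all. The variable names are illustrative, and the u'' and b'' prefixes assume Python 2.6 or later, or Python 3.3 or later.

```python
german_ae = u'\xe4'

as_latin1 = german_ae.encode('latin1')  # one byte
as_utf8 = german_ae.encode('utf-8')     # two bytes

# Decoding each bytestring with its own encoding gives back the same text.
round_trip_ok = (as_latin1.decode('latin1') == german_ae and
                 as_utf8.decode('utf-8') == german_ae)

# ASCII has no code for this character, so encoding to it fails.
try:
    german_ae.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```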
By now, you can probably imagine why Python refuses to guess among
the hundreds of possible encodings. It's a crucial
design choice, based on one of the Zen of Python
principles: "In the face of ambiguity, resist the
temptation to guess." At any interactive Python
shell prompt, enter the statement
import this to
read all of the important principles that make up the Zen of Python.
Unicode is a huge topic. A recommended book is
Unicode: A Primer, by Tony Graham (Hungry
Minds, Inc.), with details available at http://www.menteith.com/unicode/primer/;
a short but complete article is Joel Spolsky's "The
Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No
Excuses)!," located at http://www.joelonsoftware.com/articles/Unicode.html.
See also the Library Reference and
Python in a Nutshell documentation about the
also, Recipe 1.21 and