ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


O'Reilly Book Excerpts: Python Cookbook, Second Edition

Cooking with Python, Part 1

Editor's note: The second edition to Python Cookbook has been updated for Python 2.4 to include more than 200 recipes with solutions to problems that Python programmers face every day. We've selected two new recipes from the book to showcase here; check back next week for two additional recipes on implementing a ring buffer and computing prime numbers.

Recipe 1.20: Handling International Text with Unicode

Credit: Holger Krekel

Problem

You need to deal with text strings that include non-ASCII characters.

Solution

Python has a first class unicode type that you can use in place of the plain bytestring str type. It's easy, once you accept the need to explicitly convert between a bytestring and a Unicode string:

>>> german_ae = unicode('\xc3\xa4', 'utf8')

Here german_ae is a unicode string representing the German lowercase a with umlaut (i.e., diaeresis) character "ae". It has been constructed from interpreting the bytestring '\xc3\xa4' according to the specified UTF-8 encoding. There are many encodings, but UTF-8 is often used because it is universal (UTF-8 can encode any Unicode string) and yet fully compatible with the 7-bit ASCII set (any ASCII bytestring is a correct UTF-8-encoded string).

Once you cross this barrier, life is easy! You can manipulate this Unicode string in practically the same way as a plain str string:

>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])

Note that para is a Unicode string, because operations between a unicode string and a bytestring always result in a unicode string—unless they fail and raise an exception:

>>> bytestring = '\xc3\xa4'     # Uuh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
  ordinal not in range(128)

The byte '0xc3' is not a valid character in the 7-bit ASCII encoding, and Python refuses to guess an encoding. So, being explicit about encodings is the crucial point for successfully using Unicode strings with Python.

Related Reading

Python Cookbook
By Alex Martelli, Anna Martelli Ravenscroft, David Ascher

Discussion

Unicode is easy to handle in Python, if you respect a few guidelines and learn to deal with common problems. This is not to say that an efficient implementation of Unicode is an easy task. Luckily, as with other hard problems, you don't have to care much: you can just use the efficient implementation of Unicode that Python provides.

The most important issue is to fully accept the distinction between a bytestring and a unicode string. As exemplified in this recipe's solution, you often need to explicitly construct a unicode string by providing a bytestring and an encoding. Without an encoding, a bytestring is basically meaningless, unless you happen to be lucky and can just assume that the bytestring is text in ASCII.

The most common problem with using Unicode in Python arises when you are doing some text manipulation where only some of your strings are unicode objects and others are bytestrings. Python makes a shallow attempt to implicitly convert your bytestrings to Unicode. It usually assumes an ASCII encoding, though, which gives you UnicodeDecodeError exceptions if you actually have non-ASCII bytes somewhere. UnicodeDecodeError tells you that you mixed Unicode and bytestrings in such a way that Python cannot (doesn't even try to) guess the text your bytestring might represent.

Developers from many big Python projects have come up with simple rules of thumb to prevent such runtime UnicodeDecodeErrors, and the rules may be summarized into one sentence: always do the conversion at IO barriers. To express this same concept a bit more extensively:

With these two rules, you will solve most Unicode problems. If you still get UnicodeErrors of either kind, look for where you forgot to properly construct a unicode object, forgot to properly convert back to an encoded bytestring, or ended up using an inappropriate encoding due to some mistake. (It is quite possible that such encoding mistakes are due to the user, or some other program that is interacting with yours, not following the proper encoding rules or conventions.)

In order to convert a Unicode string back to an encoded bytestring, you usually do something like:

>>> bytestring = german_ae.decode('latin1')
>>> bytestring
'\xe4'

Now bytestring is a German ae character in the 'latin1' encoding. Note how '\xe4' (in Latin1) and the previously shown '\xc3\xa4' (in UTF-8) represent the same German character, but in different encodings.

By now, you can probably imagine why Python refuses to guess among the hundreds of possible encodings. It's a crucial design choice, based on one of the Zen of Python principles: "In the face of ambiguity, resist the temptation to guess." At any interactive Python shell prompt, enter the statement import this to read all of the important principles that make up the Zen of Python.

See Also

Unicode is a huge topic, but a recommended book is Unicode: A Primer, by Tony Graham (Hungry Minds, Inc.)--details are available at http://www.menteith.com/unicode/primer/; and a short but complete article from Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses)!," located at http://www.joelonsoftware.com/articles/Unicode.html. See also the Library Reference and Python in a Nutshell documentation about the built-in str and unicode types and modules unidata and codecs; also, Recipe 1.21 and Recipe 1.22.

Recipe 5.10: Selecting the nth Smallest Element of a Sequence

Credit: Raymond Hettinger, David Eppstein, Shane Holloway, Chris Perkins

Problem

You need to get from a sequence the nth item in rank order (e.g., the middle item, known as the median). If the sequence was sorted, you would just use seq[n]. But the sequence isn't sorted, and you wonder if you can do better than just sorting it first.

Solution

Perhaps you can do better, if the sequence is big, has been shuffled enough, and comparisons between its items are costly. Sort is very fast, but in the end (when applied to a thoroughly shuffled sequence of length n) it always takes O(nlogn) time, while there exist algorithms that can be used to get the nth smallest element in time O(n). Here is a function with a solid implementation of such an algorithm:

import random
def select(data, n):
    " Find the nth rank ordered element (the least value has rank 0). "
    # make a new list, deal with <0 indices, check for valid index
    data = list(data)
    if n<0:
        n += len(data)
    if not 0 <= n < len(data):
        raise ValueError, "can't get rank %d out of %d" % (n, len(data))
    # main loop, quicksort-like but with no need for recursion
    while True:
        pivot = random.choice(data)
        pcount = 0
        under, over = [  ], [  ]
        uappend, oappend = under.append, over.append
        for elem in data:
            if elem < pivot:
                uappend(elem)
            elif elem > pivot:
                oappend(elem)
            else:
                pcount += 1
        numunder = len(under)
        if n < numunder:
            data = under
        elif n < numunder + pcount:
            return pivot
        else:
            data = over
            n -= numunder + pcount

Related Reading

Python Cookbook
By Alex Martelli, Anna Martelli Ravenscroft, David Ascher

Discussion

This recipe is meant for cases in which repetitions count. For example, the median of the list [1, 1, 1, 2, 3] is 1 because that is the third one of the five items in rank order. If, for some strange reason, you want to discount duplications, you need to reduce the list to its unique items first (e.g., by applying the Recipe 18.1), after which you may want to come back to this recipe.

Input argument data can be any bounded iterable; the recipe starts by calling list on it to ensure that. The algorithm then loops, implementing at each leg a few key ideas: randomly choosing a pivot element; slicing up the list into two parts, made up of the items that are "under" and "over" the pivot respectively; continuing work for the next leg on just one of the two parts, since we can tell which one of them the nth element will be in, and the other part can safely be ignored. The ideas are very close to that in the classic algorithm known as quicksort (except that quicksort cannot ignore either part, and thus must use recursion, or recursion-removal techniques such as keeping an explicit stack, to make sure it deals with both parts).

The random choice of pivot makes the algorithm robust against unfavorable data orderings (the kind that wreak havoc with naive quicksort); this implementation decision costs about log2N calls to random.choice. Another implementation issue worth pointing out is that the recipe counts the number of occurrences of the pivot: this precaution ensures good performance even in the anomalous case where data contains a high number of repetitions of identical values.

Extracting the bound methods .append of lists under and over as local variables uappend and oappend may look like a pointless, if tiny, complication, but it is, in fact, a very important optimization technique in Python. To keep the compiler simple, straightforward, unsurprising, and robust, Python does not hoist constant computations out of loops, nor does it "cache" the results of method lookup. If you call under.append and over.append in the inner loop, you pay the cost of lookup each and every time. If you want something hoisted, hoist it yourself. When you're considering an optimization, you should always measure the code's performance with and without that optimization, to check that the optimization does indeed make an important difference. According to my measurements, removing this single optimization slows performance down by about 50% for the typical task of picking the 5000th item of range(10000). Considering the tiny amount of complication involved, a difference of 50% is well worth it.

A natural idea for optimization, which just didn't make the grade once carefully measured, is to call cmp(elem, pivot) in the loop body, rather than making separate tests for elem < pivot and elem > pivot. Unfortunately, measurement shows that cmp doesn't speed things up; in fact, it slows them down, at least when the items of data are of elementary types such as numbers and strings.

So, how does select's performance compare with the simpler alternative of:

def selsor(data, n):
    data = list(data)
    data.sort( )
    return data[n]

On thoroughly shuffled lists of 3,001 integers on my machine, this recipe's select takes about 16 milliseconds to find the median, while selsor takes about 13 milliseconds; considering that sort could take advantage of any partial sortedness in the data, for this kind of length, and on elementary data whose comparisons are fast, it's not to your advantage to use select. For a length of 30,001, performance becomes very close between the two approaches—around 170 milliseconds either way. When you push the length all the way to 300,001, select provides an advantage, finding the median in about 2.2 seconds, while selsor takes about 2.5.

The break-even point will be smaller if the items in the sequence have costly comparison methods, since the key difference between the two approaches is in the number of comparisons performed--select takes O(n), selsor takes O(n log n). For example, say we need to compare instances of a class designed for somewhat costly comparisons (simulating four-dimensional points that will often coincide on the first few dimensions):

class X(object):
    def _ _init_ _(self):
        self.a = self.b = self.c = 23.51
        self.d = random.random( )
    def _dats(self):
        return self.a, self.b, self.c, self.d
    def _ _cmp_ _(self, oth):
        return cmp(self._dats, oth._dats)

Here, select already becomes faster than selsor when what we're computing is the median of vectors of 201 such instances.

In other words, although select has more general overhead, when compared to the wondrously efficient coding of lists' sort method, nevertheless, if n is large enough and each comparison is costly enough, select is still well worth considering.

See Also

Library Reference and Python in a Nutshell docs about method sort of type list, and about module random.


View catalog information for Python Cookbook, Second Edition

Return to Python DevCenter.

Copyright © 2009 O'Reilly Media, Inc.