

Epilogue - Algorithms in a Nutshell

by Gary Pollice, George T. Heineman, Stanley Selkow

While we have reached the end of this book, there is almost no limit to the information you can find on the algorithms that interest you. Indeed, there is no end to the kinds of problems to which you can apply the techniques presented in this book.


Overview

We finally have the opportunity to step back and review the nearly three dozen algorithms that we described in detail and by example. We hope you are satisfied that we have accomplished what we set out to do. To show the breadth of material that we’ve covered, we’ll now summarize the principles behind the algorithms presented in this book. In doing so, we can demonstrate the similarities of different algorithms that were designed to solve different problems. Instead of simply summarizing each of the previous chapters, we’ll end this book by focusing on key principles that were instrumental in designing these algorithms in the first place. We also take this opportunity to summarize the concepts used by each algorithm; recall that these were listed in the algorithm fact sheets in the upper-right corner of those figures. In doing so, we provide a quick summary and make it possible to cross-index this book in terms of shared concepts across different algorithms.

Principle: Know Your Data

We discussed a variety of common actions you might need to perform on some data. You might need to sort data to produce a specific ordering. You might need to search through data to locate a specific piece of information. Your data may be accessible in random access (where you can fetch any piece of information at any time) or sequentially using an Iterator (where each element is generated one at a time). Without specific knowledge about your data, it is only possible to recommend algorithms in the most general way.

If you are sorting data, there is no “one size fits all” approach that consistently delivers the best performance. Table 11-1 summarizes the results of the sorting algorithms presented in Chapter 4. Do you have a set of integers from a limited range to be sorted? No sorting algorithm will be faster than COUNTING SORT, although it requires more storage than other sorting algorithms. Do you have a set of complex data that is already mostly sorted? INSERTION SORT will typically outperform any other approach. Does relative ordering of equal elements matter to you? If so, then a stable sorting algorithm is needed. Are you sure that your input data is drawn from a uniform distribution? You must investigate using BUCKET SORT because of its ability to take advantage of this property to provide exceptional sorting performance. You will be able to select the most appropriate algorithm based upon your data as you become more familiar with the available options.
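For the limited-range case just described, a minimal COUNTING SORT sketch in Java might look like the following. The class and method names are ours, and the sketch assumes non-negative integers strictly less than a known bound k; it illustrates the idea rather than reproducing the book's implementation.

    // Minimal Counting Sort sketch: sorts integers known to lie in [0, k).
    // Runs in O(n + k) time but needs O(k) extra storage for the counts.
    public class CountingSortSketch {
        public static void countingSort(int[] a, int k) {
            int[] counts = new int[k];
            for (int value : a) {
                counts[value]++;                  // tally each value
            }
            int idx = 0;
            for (int value = 0; value < k; value++) {
                for (int c = 0; c < counts[value]; c++) {
                    a[idx++] = value;             // rewrite the array in sorted order
                }
            }
        }

        public static void main(String[] args) {
            int[] data = {3, 1, 4, 1, 5, 2, 0};
            countingSort(data, 6);
            System.out.println(java.util.Arrays.toString(data));  // [0, 1, 1, 2, 3, 4, 5]
        }
    }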

Table 11-1. Chapter 4: Sorting algorithms

Algorithm  Best  Average  Worst  Concepts  Page
INSERTION SORT  n  n^2  n^2  Array  64
MEDIAN SORT  n log n  n log n  n^2  Array, Recursion, Divide and Conquer  68
SELECT KTH  n  n  n^2  Divide and Conquer
BLUM-FLOYD-PRATT-RIVEST-TARJAN (BFPRT) SELECT KTH  n  n  n  Recursion, Divide and Conquer
QUICKSORT  n log n  n log n  n^2  Array, Recursion, Divide and Conquer  79
SELECTION SORT  n^2  n^2  n^2  Array, Greedy
HEAP SORT  n log n  n log n  n log n  Array, Recursion, Binary Heap  87
COUNTING SORT  n  n  n  Array  92
BUCKET SORT  n  n  n  Array, Hash  94

Principle: Decompose the Problem into Smaller Problems

When designing an efficient algorithm to solve a problem, it is helpful if the problem can be decomposed into two (or more) smaller subproblems. It is no mistake that QUICKSORT remains one of the most popular sorting algorithms. Even with the well-documented special cases that cause problems, QUICKSORT offers the best average-case performance for sorting large collections of information. Indeed, the very concept of an O(n log n) algorithm is based on the ability to (a) decompose a problem of size n into two subproblems of about n/2 in size, and (b) recombine the solutions of the two subproblems into a solution for the original problem. To properly produce an O(n log n) algorithm, both of these steps must execute in O(n) time.

QUICKSORT was the first in-place sorting algorithm to demonstrate O(n log n) performance. It succeeds by the novel (almost counterintuitive) approach of dividing the problem into two halves, each of which can be solved recursively by applying QUICKSORT to the smaller subproblems.
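To make the divide step concrete, here is a short Java sketch of that recursive structure. The last-element pivot choice is a deliberate simplification for illustration; it is not the partition strategy used in the book, and it is exactly the kind of choice that triggers the worst case on already-sorted input.

    import java.util.Arrays;

    // Sketch of QUICKSORT's divide-and-conquer structure: an O(n) partition
    // step splits the range around a pivot, and each side is sorted recursively.
    public class QuickSortSketch {
        static void quickSort(int[] a, int lo, int hi) {
            if (lo >= hi) return;                 // subproblem of size 0 or 1
            int p = partition(a, lo, hi);         // O(n) divide step
            quickSort(a, lo, p - 1);              // solve the left subproblem
            quickSort(a, p + 1, hi);              // solve the right subproblem
        }

        static int partition(int[] a, int lo, int hi) {
            int pivot = a[hi], store = lo;        // simple last-element pivot
            for (int i = lo; i < hi; i++) {
                if (a[i] <= pivot) swap(a, i, store++);
            }
            swap(a, store, hi);                   // pivot lands in its final position
            return store;
        }

        static void swap(int[] a, int i, int j) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }

        public static void main(String[] args) {
            int[] data = {9, 4, 7, 1, 3, 8};
            quickSort(data, 0, data.length - 1);
            System.out.println(Arrays.toString(data));  // [1, 3, 4, 7, 8, 9]
        }
    }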


Problems often can be simply cut in half, leading to impressive performance savings. Consider how BINARY SEARCH converts a problem of size n into a problem of size n/2. BINARY SEARCH takes advantage of the repetitive nature of the search task to develop a recursive solution to the problem.
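A minimal sketch of that halving step, written as a simple membership test over a sorted int array (our own interface, not the book's):

    // Sketch of BINARY SEARCH: each comparison discards half of the remaining
    // range, turning a problem of size n into one of size n/2.
    public class BinarySearchSketch {
        static boolean contains(int[] sorted, int target) {
            int lo = 0, hi = sorted.length - 1;
            while (lo <= hi) {
                int mid = lo + (hi - lo) / 2;             // midpoint without overflow
                if (sorted[mid] == target) return true;
                if (sorted[mid] < target) lo = mid + 1;   // search the right half
                else hi = mid - 1;                        // search the left half
            }
            return false;
        }

        public static void main(String[] args) {
            int[] data = {2, 5, 8, 13, 21};
            System.out.println(contains(data, 13));  // true
            System.out.println(contains(data, 7));   // false
        }
    }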

Sometimes a problem can be solved by dividing it into two subproblems without resorting to recursion. CONVEX HULL SCAN produces the final convex hull by constructing and merging together two partial hulls (the upper and lower).

Sometimes a problem can be decomposed into repeated iterations of a different (seemingly unconnected) smaller problem over the same input data. FORD-FULKERSON computes the maximum flow in a flow network by repeatedly locating an augmenting path to which flow can be added. Eventually, no augmenting paths remain and the original problem is solved. SELECTION SORT repeatedly locates the maximum value among the unsorted elements of an array and swaps it with the rightmost unsorted element; upon completing n iterations, the array is sorted. Similarly, HEAP SORT repeatedly swaps the largest element in the heap into its proper location in the array.
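As a small illustration of the SELECTION SORT behavior just described (find the maximum, swap it into the rightmost open slot), consider the following Java sketch; it mirrors the description above rather than reproducing the book's code.

    // Sketch of SELECTION SORT: on each pass, find the largest value in the
    // unsorted prefix and swap it into the rightmost open slot. After n passes
    // the array is sorted.
    public class SelectionSortSketch {
        static void selectionSort(int[] a) {
            for (int right = a.length - 1; right > 0; right--) {
                int maxIdx = 0;
                for (int i = 1; i <= right; i++) {
                    if (a[i] > a[maxIdx]) maxIdx = i;     // locate the maximum
                }
                int t = a[maxIdx]; a[maxIdx] = a[right]; a[right] = t;  // swap into place
            }
        }

        public static void main(String[] args) {
            int[] data = {5, 2, 9, 1, 7};
            selectionSort(data);
            System.out.println(java.util.Arrays.toString(data));  // [1, 2, 5, 7, 9]
        }
    }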

Table 11-2 contains a comparison of the searching algorithms discussed in Chapter 5.

Table 11-2. Chapter 5: Searching algorithms

Algorithm Best Average Worst Concepts Page
SEQUENTIAL SEARCH 1 n n Array, Brute Force 107
BINARY SEARCH 1 log n log n Array, Divide and Conquer 112
HASH-BASED SEARCH 1 1 n Array, Hash 117
BINARY TREE SEARCH 1 log n n Binary Tree

Principle: Choose the Right Data Structure

The famed algorithm designer Robert Tarjan was once quoted as saying that any problem can be solved in O(n log n) time with the right data structure. Many algorithms need to use a priority queue to store partial progress and direct future computations. One of the most common means of implementing a priority queue is through a binary heap, which allows for O(log n) behavior when removing the element with lowest priority from the priority queue. However, a binary heap offers no efficient way to determine whether it contains a specific element. We expanded on this very point in the discussion of LINE SWEEP (Chapter 9), since that algorithm can provide O(n log n) performance only because it uses an augmented binary tree to implement the priority queue, which still provides O(log n) performance for removing the minimum element. Another way of stating this principle is to beware of selecting an inappropriate data structure that will prevent an algorithm from achieving its best performance.
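The JDK's java.util.PriorityQueue, which is backed by a binary heap, makes this trade-off easy to see: poll() removes the smallest element in O(log n), but contains() has nothing better than an O(n) scan to fall back on. The snippet below is a quick illustration, not code from the book.

    import java.util.PriorityQueue;

    // Binary-heap trade-off: removing the minimum is cheap, but a heap imposes
    // no ordering that helps locate an arbitrary element.
    public class HeapTradeoffSketch {
        public static void main(String[] args) {
            PriorityQueue<Integer> pq = new PriorityQueue<>();
            for (int v : new int[]{42, 7, 19, 3, 25}) {
                pq.add(v);                        // O(log n) insert
            }
            System.out.println(pq.poll());        // 3, removed in O(log n)
            System.out.println(pq.contains(19));  // true, but found by an O(n) scan
        }
    }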

Table 11-3 shows the graph algorithms discussed in Chapter 6.

Table 11-3. Chapter 6: Graph algorithms

Algorithm  Best  Average  Worst  Concepts  Page
DEPTH-FIRST SEARCH  V+E  V+E  V+E  Graph, Array, Recursion, Backtracking  144
BREADTH-FIRST SEARCH  V+E  V+E  V+E  Graph, Array, Queue  150
DIJKSTRA'S ALGORITHM PQ  (V+E) log V  (V+E) log V  (V+E) log V  Weighted Directed Graph, Array, Priority Queue, Overflow  154
DIJKSTRA'S ALGORITHM DG  V^2+E  V^2+E  V^2+E  Weighted Directed Graph, Array, Overflow  158
BELLMAN-FORD ALGORITHM  V*E  V*E  V*E  Weighted Directed Graph, Array, Overflow  162
FLOYD-WARSHALL ALGORITHM  V^3  V^3  V^3  Dynamic Programming, 2D Array, Weighted Directed Graph, Overflow  166
PRIM'S ALGORITHM  (V+E) log V  (V+E) log V  (V+E) log V  Weighted Graph, Binary Heap, Priority Queue, Greedy, Array  171

Principle: Add Storage to Increase Performance

Many of the computations carried out by the algorithms are optimized by storing information that reflects the results of past computations. PRIM'S ALGORITHM for computing the minimum spanning tree of a graph uses a priority queue to store the unvisited vertices in order of their shortest distance to an initial vertex s. During a key step in the algorithm, one must determine whether a given vertex has already been visited. Because the binary heap implementation of the priority queue fails to provide this operation, a separate Boolean array inQueue is maintained to record the status of each vertex. In the same algorithm, a duplicate key array stores the computed distances to avoid having to search through the priority queue again. This extra storage on the order of O(n) is required to ensure the efficient implementation of the algorithm. In most situations, as long as the overhead is O(n), you are going to be safe.
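The following Java sketch shows that bookkeeping in miniature: a key array caches the best edge weight found so far for each vertex, and a Boolean array answers "has this vertex been taken?" in O(1), which the heap itself cannot do. The tiny hard-coded graph and the lazy-deletion queue are our simplifications, not the book's implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    // Sketch of PRIM'S ALGORITHM with the extra O(n) arrays described above.
    public class PrimSketch {
        public static void main(String[] args) {
            int n = 4;
            int[][] edges = {{0, 1, 4}, {0, 2, 1}, {1, 2, 2}, {1, 3, 5}, {2, 3, 8}};
            List<List<int[]>> adj = new ArrayList<>();
            for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
            for (int[] e : edges) {
                adj.get(e[0]).add(new int[]{e[1], e[2]});   // undirected: record the edge
                adj.get(e[1]).add(new int[]{e[0], e[2]});   // in both directions
            }

            int[] key = new int[n];                 // best known edge weight per vertex
            boolean[] visited = new boolean[n];     // plays the role of the inQueue check
            java.util.Arrays.fill(key, Integer.MAX_VALUE);
            key[0] = 0;

            PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
            pq.add(new int[]{0, 0});
            int total = 0;
            while (!pq.isEmpty()) {
                int[] cur = pq.poll();
                int u = cur[0];
                if (visited[u]) continue;           // stale entry: skipped via an O(1) lookup
                visited[u] = true;
                total += cur[1];
                for (int[] edge : adj.get(u)) {
                    int v = edge[0], w = edge[1];
                    if (!visited[v] && w < key[v]) {
                        key[v] = w;                 // cache the improved weight
                        pq.add(new int[]{v, w});    // lazy insert instead of decrease-key
                    }
                }
            }
            System.out.println("MST weight: " + total);  // 8 for this small graph
        }
    }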

Sometimes an entire computation can be cached to ensure that it never needs to be recomputed. In Chapter 6, we discussed how the hash function for the java.lang.String class stores the computed hash value to speed up its performance.
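That caching idea takes only a few lines of Java to illustrate; the class and field names below are invented for the example, and the pattern simply mirrors a lazily computed, stored hash value.

    // Compute the hash once on first use and reuse the stored result afterward.
    // (As with java.lang.String, a value that legitimately hashes to 0 would be
    // recomputed on every call; that corner case is ignored in this sketch.)
    public final class CachedKey {
        private final int[] data;
        private int hash;                  // 0 means "not yet computed"

        public CachedKey(int[] data) {
            this.data = data.clone();
        }

        @Override
        public int hashCode() {
            int h = hash;
            if (h == 0) {                  // only pay the O(n) cost the first time
                for (int v : data) {
                    h = 31 * h + v;
                }
                hash = h;                  // remember the result for later calls
            }
            return h;
        }
    }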

Sometimes the nature of the input set demands a large amount of storage, such as the dense graphs described in Chapter 6. By using a two-dimensional matrix to store the edge information—rather than using simple adjacency lists—certain algorithms exhibit reasonable performance. You may also note that for undirected graphs the algorithms are made simpler if we use twice as much storage as necessary, recording each edge in the two-dimensional matrix as both edgeInfo[i][j] and edgeInfo[j][i]. It would be possible to eliminate this extra information if one always queried edgeInfo[i][j] with i ≤ j, but this would further complicate each and every algorithm that simply wants to know whether edge (i,j) exists.
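A tiny sketch of that symmetric storage scheme follows; the class is hypothetical, though the edgeInfo field name mirrors the discussion above.

    // Store each undirected edge under both [i][j] and [j][i]: twice the memory,
    // but every lookup can ask about (i, j) in either order.
    public class DenseGraphSketch {
        private final int[][] edgeInfo;    // 0 means "no edge" in this sketch

        public DenseGraphSketch(int numVertices) {
            edgeInfo = new int[numVertices][numVertices];
        }

        public void addEdge(int i, int j, int weight) {
            edgeInfo[i][j] = weight;       // stored twice on purpose...
            edgeInfo[j][i] = weight;       // ...so callers never normalize i and j
        }

        public boolean hasEdge(int i, int j) {
            return edgeInfo[i][j] != 0;
        }
    }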


Sometimes an algorithm cannot operate without more storage than you might expect. BUCKET SORT is able to sort in linear time simply by using up to O(n) extra storage, provided the input set is uniformly distributed. Given that modern computers often have very large amounts of main memory available, you should consider BUCKET SORT even though its memory requirements are high.
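A minimal BUCKET SORT sketch for doubles drawn uniformly from [0, 1) is shown below; the bucket layout and names are illustrative rather than the book's version.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Scatter values into n buckets, sort each bucket, and concatenate. The O(n)
    // extra storage for the buckets buys the expected linear running time when
    // the input is uniformly distributed.
    public class BucketSortSketch {
        static void bucketSort(double[] a) {
            int n = a.length;
            List<List<Double>> buckets = new ArrayList<>();
            for (int i = 0; i < n; i++) buckets.add(new ArrayList<>());
            for (double v : a) {
                buckets.get((int) (v * n)).add(v);   // uniform input spreads values evenly
            }
            int idx = 0;
            for (List<Double> bucket : buckets) {
                Collections.sort(bucket);            // each bucket is expected to be tiny
                for (double v : bucket) a[idx++] = v;
            }
        }

        public static void main(String[] args) {
            double[] data = {0.62, 0.13, 0.88, 0.41, 0.05};
            bucketSort(data);
            System.out.println(java.util.Arrays.toString(data));  // ascending order
        }
    }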

Principle: If No Solution Is Evident, Construct a Search

Early pioneers in the field of artificial intelligence (AI) were often characterized as trying to solve problems for which no known solution existed. One of their most common approaches was to convert a problem into a search over a (very large) graph. We dedicate an entire chapter to this approach because it is so important and such a general technique for solving numerous problems. Be careful, however, to apply it only when no other computational alternative is available. You could use the path-finding approach to discover a sequence of element transpositions that starts from an unsorted array (the initial node) and produces a sorted array (the goal node), but you shouldn't use an algorithm with exponential behavior when numerous O(n log n) algorithms exist to sort the data. Table 11-4 shows the path-finding algorithms discussed in Chapter 7.

Table 11-4. Chapter 7: Path finding in AI

Algorithm  Best  Average  Worst  Concepts  Page
DEPTH-FIRST SEARCH  b*d  b^d  b^d  Stack, Set, Backtracking  182
BREADTH-FIRST SEARCH  b^d  b^d  b^d  Queue, Set  190
A*SEARCH  b*d  b^d  b^d  Priority Queue, Set, Heuristics  195
MINIMAX  b^ply  b^ply  b^ply  Recursion, Backtracking, Brute Force  208
NEGMAX  b^ply  b^ply  b^ply  Recursion, Backtracking, Brute Force  214
ALPHABETA  b^(ply/2)  b^(ply/2)  b^ply  Recursion, Backtracking, Heuristics  218

Principle: If No Solution Is Evident, Reduce Your Problem to Another Problem That Has a Solution

Problem reduction is one of the fundamental approaches used by computer scientists and mathematicians in solving problems. As a simple example, suppose you wanted an algorithm to locate the fourth-largest element in a list. Instead of writing this special-purpose code, you could use any sorting algorithm to sort the list and then read off the fourth-largest element directly. Using this approach, you have defined an algorithm whose running time is O(n log n); although this is not the most efficient way to solve the problem—see the selectKth method described in Chapter 4 instead—it is correct.
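The reduction is short enough to show directly. In the Java sketch below the method name is ours, and it assumes the array holds at least four elements.

    import java.util.Arrays;

    // Reduce "find the fourth-largest element" to sorting: O(n log n), and
    // trivially correct, even if not the fastest possible approach.
    public class FourthLargestSketch {
        static int fourthLargest(int[] values) {
            int[] copy = values.clone();
            Arrays.sort(copy);                        // ascending order
            return copy[copy.length - 4];             // fourth from the end
        }

        public static void main(String[] args) {
            int[] data = {15, 3, 9, 27, 12, 6, 21};
            System.out.println(fourthLargest(data));  // 12
        }
    }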

Chapter 8 presented a set of problems that all seemed related, but there didn’t seem to be any easy way to tie them all together. It is possible to reduce all of these problems into linear programming (LP) and use commercially available software packages, such as Maple, to compute solutions, but the reductions are complicated; in addition, the general-purpose algorithms used to solve LP problems can be outperformed, often significantly, by the FORD-FULKERSON family of algorithms.

We show in Chapter 8 how to solve a single problem type, namely computing the minimum-cost maximum flow in a flow network. With this algorithm in hand, the five other problems are immediately solved.

Table 11-5 shows the network flow algorithms described in Chapter 8.

Table 11-5. Chapter 8: Network flow algorithms

Algorithm Best Average Worst Concepts Page
FORD-FULKERSON E*mf E*mf E*mf Weighted Directed Graph, Array, Greedy 230
EDMONDS-KARP V*E2 V*E2 V*E2 Weighted Directed Graph, Array, Greedy

Principle: Writing Algorithms Is Hard—Testing Algorithms Is Harder

Because the algorithms we describe are predominantly deterministic (except for those from Chapter 11), it was rather straightforward to develop test cases to ensure that they behaved properly. In Chapter 7, we began to encounter difficulties because we were using path-finding algorithms to locate potential solutions that we did not know in advance. For example, although it was straightforward to write test cases to determine whether the GoodEvaluator heuristic was working properly for the 8-puzzle, the only way to test an A*SEARCH using that heuristic is to invoke the search and manually inspect the explored tree to validate that the proper move was selected. Thus, testing A*SEARCH is complicated by having to test the algorithm in the context of a specific problem and heuristic. We have extensive test cases for the path-finding algorithms, but in many cases they exist only to ensure that a “reasonable” move was selected (for either game or search trees), rather than to ensure that a specific move was selected.

Testing the algorithms in Chapter 9 was further complicated because of floating-point computations. Consider our approach to test CONVEX HULL SCAN. The original idea was to execute a BRUTE FORCE CONVEX HULL algorithm—whose performance was O(n^4)—and compare its output with the output from Andrew's CONVEX HULL SCAN. During our extensive testing, we randomly generated two-dimensional data sets uniformly drawn from the [0,1] unit square. However, when the data sets grew sufficiently large, we invariably encountered situations where the results of the two algorithms were different. Was there a subtle defect exposed by the data, or was something else at work? We eventually discovered that the floating-point arithmetic used by the BRUTE FORCE algorithm produced slightly (ever so slightly) different results when compared with CONVEX HULL SCAN. Was this just a fluke? Unfortunately, no. We also noticed that the LINE SWEEP algorithm produced slightly different results when compared against the BRUTE FORCE INTERSECTION algorithm. Which algorithm produced the "right" result? It's not that simple, because using floating-point values led us to develop a consistent notion of comparing floating-point values. Specifically, we (somewhat) arbitrarily defined FloatingPoint.epsilon to be the threshold value below which it becomes impossible to discern differences between two numbers. When the resulting computations lead to values near this threshold (which we set to 10^-9), unexpected behavior would often occur. Eliminating the threshold entirely won't solve the problem, either. We ultimately resorted to statistically checking the results of these algorithms, rather than seeking absolute and definitive answers for all cases.
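A minimal sketch of that style of comparison appears below, assuming the 10^-9 threshold from the discussion; the class is illustrative and is not the book's FloatingPoint utility.

    // Treat two doubles as equal when they differ by less than a small epsilon.
    public class FloatingPointSketch {
        static final double EPSILON = 1e-9;

        static boolean same(double a, double b) {
            return Math.abs(a - b) < EPSILON;   // "equal enough" for geometric tests
        }

        public static void main(String[] args) {
            double x = 0.1 + 0.2;
            System.out.println(x == 0.3);       // false: binary floating point rounds
            System.out.println(same(x, 0.3));   // true under the epsilon test
        }
    }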

Table 11-6 summarizes computational geometry, covered in Chapter 9.

Table 11-6. Chapter 9: Computational geometry

Algorithm  Best  Average  Worst  Concepts  Page
CONVEX HULL SCAN  n  n log n  n log n  Array, Greedy  261
LINE SWEEP  (n+k) log n  (n+k) log n  n^2  Priority Queue, Binary Tree  270, 271
NEAREST NEIGHBOR QUERY  log n  log n  n  kd-tree, Recursion  283
RANGE QUERIES  n^(1-1/d)+r  n^(1-1/d)+r  n  kd-tree, Recursion  292


Copyright © 2009 O'Reilly Media, Inc.