In one of my OSCON blogs,
I objected to Damian Conway’s suggestion to convert
unless is_interesting to if not_interesting,
partially because making a not_interesting wrapper that does nothing
but negate the result of is_interesting is a performance drain
(an extra function call) that doesn’t produce any real benefit.
A couple weeks later, Josef Fortier called me on this, asking “Is perl’s
compiler pass not smart enough to optimize this away?” I began to answer
his question directly — “No, it’s not, and the performance difference can be
significant” — when it occurred to me that I should really be teaching
Josef to fish rather than making him dinner, so to speak.
An hour later I realized that teaching someone to fish in these particular
waters is not nearly as trivial as it might at first appear. Knowing when
and how to optimize code is actually something one has to build up with
time and experience. It helps, however, to have a few guidelines to save
you from the sinking the first few hooks into nearby brush, underwater logs,
and the back of your neck. (Okay, okay, I’ll drop the fishing metaphor
. . . .) There’s a lot of ground to cover, so I’ve decided to break this up
into multiple blog entries.
First some overall goals:
- Make decisions based on knowledge, not guesswork.
- Delay optimization and pessimization decisions, and minimize the
(user) code impact of changing your mind later. - Don’t make it hard for others to make different decisions than
you.
Let’s start with that first one. There are two types of knowledge I’m
referring to — knowledge that applies to optimization in general, and
knowledge that applies to a particular problem. I’ll talk about general
knowledge later. For now, think about a particular optimization problem
that you’ve faced, and imagine that we’re going to tackle it again.
The first step is to determine the optimization goal. In other
words, what are you trying to improve, and how much better does it need
to be? “Make it faster” is a common answer, but much too fuzzy — a better
goal might be “Reduce total run time to 5 minutes or less.” The following
are a few common things to optimize for:
- Total run time
- Response time
- Latency
- Concurrency
- Throughput
- CPU utilization
- Memory usage
- Limited passes over data
- Real resource limitation
- Artificial API limitation
And some common goals:
- Keep within a hard limit
- Usually stay within a soft limit
- Make as small / large as possible
- Be scalable (across data sets, across hardware, etc.)
For example, the control software for a
robotic car would
be optimized for response time, staying within hard limits; failure
to do so would result in disaster when the software spent too much time
thinking about how to negotiate a curve and the car ended up in a ditch.
A less obvious but no less crippling hard limit might be the maximum
number of concurrent open file handles your operating environment allows.
Latency jitter in a VoIP application is an example of a soft limit –
people will generally accept occasional breakups in a phone conversation,
but do it too often and your users will seek a different product.
What if the goal isn’t obvious? For a pretty large problem space, a
good starting goal is “Be scalable, but still try not to suck on the
low end.” Of course, you have to determine what scalable means in the
context of your application. If you are writing a network game server,
scalable may mean handling many concurrent users; or it may mean being
able to process very large or complex game worlds. In many cases,
optimizing both of these may be necessary.
That part about not sucking on the low end is actually important. Price
and openness notwithstanding, I doubt MySQL would have been able to
compete in its early days against vastly more scalable database engines
if not for the fact that it was very fast for simple tasks, and
those more scalable engines frankly sucked at doing the simple stuff.
A complement to knowing the optimization goal is knowing when the current
state is good enough. If the goal is to stay within a hard limit, the
answer is clear. For any of the fuzzier goals, you need to know when to
stop optimizing and go back to working on new features, writing more
documentation, or basking in your well-earned riches at a beach or ski
resort.
At this point, you’ve determined what you need to optimize for, what the
related goal is, and what constitutes “good enough”. The next step is
to find out where you currently stand, and that means profiling and
benchmarking. Profiling means instrumenting your application to
determine where and how various resources are spent, the most obvious of
course being time. Benchmarking, on the other hand, refers to
running a series of tests to determine how your application, or a
relevant but simpler piece of code, performs in various situations.
Profiling and benchmarking go hand in hand, and it’s pretty standard to
cycle back and forth several times while optimizing a system. For example,
imagine you’re working on (yet another) log analyzer app. You would start
by benchmarking the application against a number of logs of different
sizes, different content mixes, and so on. You might have discovered one
or more of the following:
- The analyzer takes a long time to start up, but then it processes
the logs pretty fast. This makes it appear much slower for
small logs. - The analyzer is fast for ASCII logs, but is very slow for UTF-8
logs. - The analyzer takes a lot of memory for big logs, and some logs are
so large that the application crashes for lack of memory.
With this benchmark data, you can turn to profiling to determine what parts
of the application are responsible for the bad behavior. This generally
involves running another application called a profiler that
instruments your program with some additional code that reports on its
resource usage, and then runs your application as usual. When the main
application completes, the profiler (or perhaps a separate program
called a profile analyzer) produces a report indicating what
resources were used and by what parts of your code.
In the above example, the profiler may have determined that almost all
of the memory allocation in the program occurs in the read_log_file
routine. Looking at the code, you realize that read_log_file reads
the entire log file into memory all at once, and only after that happens
does any processing occur. What’s the best way to improve that? You
could change the code to read one line at a time, processing each one
individually. You might instead create a large buffer, and read the
log file in chunks sized to fit the buffer, alternating between reading
a few megabytes into the buffer and processing the buffer contents.
You know that both of these techniques will reduce the memory usage for
huge log files, and you may even have a hunch about which method would be
faster. Don’t follow that hunch; go back to benchmarking. Code up both
new techniques and try them. You may find that the performance difference
is insignificant, or that the one you thought would be faster is actually
slower — because your I/O libraries have well-optimized buffering even for
line-at-a-time usage, or because picking a certain buffer size allows you
to use a special very fast memory mapped file primitive. The balance could
even vary wildly depending on which operating system the log analyzer is
tested on. You won’t know until you benchmark.
Along with the advice to profile and benchmark comes a caveat: it is very
easy to get the wrong answers through bad technique. Here are a few
guidelines that may help:
- Use real (or at least statistically reasonable) data. Far too
often benchmarking is done with “sample data” that bears little
resemblance to real-world usage and in fact turns out to have
a completely different performance profile. - Create many different test scenarios, with different data sets,
concurrency levels, task sizes, data compositions, etc. Most
applications perform better in certain situations than others,
and it is all too easy to miss a bad case, such as the UTF-8
decoding issue with our log analyzer. - Test on as many different platforms, and with as many different
configurations, as you can. Every platform has weaknesses and
strengths, based on the priorities of its designers. Some of
these differences are large enough that it may be necessary to
optimize your design differently on each platform; Apache 2
provides knobs to accommodate different process management
and IPC performance in different operating systems, for example. - Design test runs so that the overhead of profiling or benchmarking
does not seriously skew the results. This usually means medium to
large test scenarios for benchmarking, and not turning on all
profiling options at once (which can bring your application to
a crawl and chew up massive amounts of memory to store the
profiling data). It is best to start with a high-level profile
and then get successively deeper on just the areas and statistics
you are interested in. - Perform many runs, and statistically merge the results. In the
early days of single-task computing, it may have been possible to
guarantee that no outside influences altered the results, allowing
perfectly replicated performance results every time. These days,
that’s just not going to happen. Too much is constantly happening
in the background on every modern computer, and the best you can
do is get a relatively good approximation of ideal behavior by
collecting the results of many runs, and computing statistical
measures such as mean and standard deviation of run time. - That said, clear out all of the background noise that you can.
Close browser windows, turn off periodic mail checkers, close
any music players, and so on. You may even want to isolate the
test computers on their own network so that background network
chatter does not skew your results. - Sometimes the only way to get repeatable results is to disable
sources of randomness inside the application, using hardcoded
values instead. For example, you might turn off AI code for a
computer game and use previously recorded choices instead, or
you might replace access to a remote database server with an embedded
database such as SQLite. Do
not be surprised, however, that this will greatly change your
application’s performance profile. That’s acceptable if for
instance you want to concentrate on the performance of your
game’s graphics engine — but it’s a bad choice if you want to
find the overall biggest CPU hog in the app. - It may be worthwhile to determine the minimum possible work
that your application must do to solve a problem, and treat
that as a baseline for further profiling work. For example,
the log analyzer must at least read the entire log file once,
and split it into records for analysis. How fast are those
operations alone? You may find that your application’s
performance is dominated by these minimum operations (because
of slow disk performance, say), or you may find that the
“extras” are turning a fast application into a slow one. - If your problem space has a known process to follow to find
performance bottlenecks, use it. For example, when programming
a GPU, there is a standard sequence of tests to determine whether
the performance bottleneck is frame buffer bandwidth, CPU to GPU
bandwidth, geometry processing, pixel processing, and so on. - If there’s no standard sequence to follow, try to create a set
of test scenarios that will allow you to separate interrelated
performance issues. Often the best way to do this is to vary
one or two performance-affecting knobs at a time, keeping
everything else constant. Once that piece of the performance
profile is well understood, choose another knob to turn, and so
on.
Reams of benchmark and profile data in hand, you now have a good idea
how the application performs in various different scenarios,
and generally where the code has bottlenecks that determine its
overall performance. The next thing you need to determine is why
the code isn’t performing well. And that, my friends, is the subject of
next week’s entry.
What’s your favorite benchmarking/profiling tool?

