Related link: http://www.oreillynet.com/pub/wlg/8097
Over the last
two
weeks,
I’ve been talking about optimizing code, starting with the information
gathering phase (choosing an optimization target, profiling, benchmarking,
and so on), and continuing with theoretical reasons code is slow (such as
trying to solve a bigger problem than the one at hand, or choosing a poor
algorithm). This week we’ve finally gotten down to the wires and chips:
why real hardware makes seemingly good code run dog slow.
As I mentioned last week, big-O notation ignores the constants that
determine how long each primitive operation takes, and even how many
primitive operations go into each “high level” operation. Even worse,
these constants aren’t really constant. (Osborne’s law: “Variables won’t,
constants aren’t.”) Why? To answer that, let’s take a little tour
of a modern computer.
The heart of each computer is of course its processors. I use the
plural there because the average desktop or video game console contains
at least two powerful processors — the CPU doing most of the system
management and application logic, and the GPU slinging pixels with
abandon. Workstations (and even high end desktops) often contain two
of each, and servers may have dozens of CPUs. There are also various
smaller processors scattered through the system, including DSPs, I/O
processors, network processors, and so on.
All of these processors need to be able to talk to each other, so
the computer is also filled with communication links, either in the
form of shared channels such as PCI, or point to point links such as
HyperTransport. The processors also need external I/O, so another
set of communication channels exist for that; once again, some are
shared channels (SCSI), and some are point to point (SATA).
These communication channels bring us to the first big bottleneck:
communication is slow. Every channel has a maximum bandwidth capacity
as well as a latency involved in transferring data from one end of the
channel to the other. There is also a certain amount of overhead to
initiate a transfer, and often additional overhead to keep a large
transfer going. These three limitations add up to big problems:
- Transferring a large chunk of data will be slowed significantly
by the bandwidth limits of the channel. Transferring 2 GB over
a 1 GB/s link will take (at least) 2 seconds.
- On shared channels, when more than one device wants to transfer
data, they not only have to share bandwidth, but often pay additional
overhead to switch channel users, so each only gets a fraction
of the bandwidth of the channel, adding up to well under 100%
of the total available.
- Transferring small chunks of data will be limited by both the
latency of the channel and the overhead to set up each transfer,
especially if that overhead requires a handshake (a protocol in
which multiple endpoints have to agree to set up the transfer).
In the handshake case, the latency cost will be paid several
times over while the various devices reach agreement.
The data flowing across all these channels needs to be stored somewhere,
so each processor has a chunk of memory to call its own. Slow devices
produce (and consume) little data, so this may only need to be a few KB
or even just a few bytes, and it can be stored right on the processor chip.
More powerful processors, including the CPUs and GPUs, can produce and
consume truly massive piles of data. These processors have sizeable
local memories, from a few dozen MB to a few GB.
That’s too much memory
to put on the processor itself, so it gets separated onto other
devices which — you guessed it — have to communicate with the processor
over a slow channel. Worse yet, big memory is slow to access anyway;
channel latency adds to memory chip latency and memory transaction overhead
to make off-chip memory quite slow.
Because big off-chip memory is slow, fast processors have on-chip cache
memory as well, which tries to keep copies of data that is likely to be
used again soon, so that the processor will not have to go across the
channel to get it. Of course, cache is not a panacea. It must be
smaller than the main memory, of course, and the heuristics used to pick
which pieces of main memory to keep copies of, and which data can be tossed
to make room because it won’t be used again soon, are often wrong.
Processor designers can compensate for this weakness by making the
cache bigger, but that makes it slower (big memory is slow, remember).
Pretty soon the silicon spent on cache takes as much room as the
processor itself, so the processor and cache are once again forced to
communicate over a channel that will slow things down. (Sure, the
channel happens to sit right on the processor chip or perhaps right
next to it in the same plastic package, but even this is a long
enough channel to notice the slowdown.) On very fast processors,
the big caches themselves have smaller caches to try to hide the
performance problems inherent in a large cache. Lather, rinse, repeat.
Even if the heuristics were perfect, cache can’t help at all in some
situations, such as when streaming large chunks of data. In this case,
data is only used once, and then either sent to main memory or to another
processor; keeping a copy of the data won’t speed that up. It’s worse
than that, of course; the cache itself adds latency to every memory
access, so keeping a copy of streamed data actually slows everything
down. Also, in order to simplify and speed up the cache design, caches
typically only copy data in chunks of a certain minimum size. Algorithms
that jump around memory grabbing small bits of data tend to be slowed
significantly by caches, as the cache unnecessarily copies more data
from main memory than it needs to (sometimes a lot more), which chews
up much of the bandwidth of the main memory channel.
Still, there are enough cases in which cache is a win (and sometimes a
significant win) that virtually every processor has one or more
of them. These strata of caches of different sizes and speeds create
what is known as the memory hierarchy. At one end are the
ultra-fast registers, tiny chunks of memory that add up to perhaps 1KB
or less. A little slower and an order of magnitude or two larger is the
L1 (level one) cache; slower and larger still is the L2 cache, and some
processors even have an L3. Beyond that comes the processor’s own main
memory, and then memory belonging to other processors, which must be
reached over yet more channels inside the computer.
Relatively often, all the memory belonging to all the processors in the
computer still isn’t enough to handle all the work the user demands at
once, so every modern operating system supports virtual memory,
in which some of the data is stored on a local disk or even across a
network on another computer. Unfortunately, disks and networks are
even slower than the slowest communications channel inside the computer.
Both have moderate to severe bandwidth limitations (especially networks),
and both have severe to insane latency issues. Processors count time
in fractions of a billionth of a second; disks and cross-country networks
take several milliseconds to set up a data transfer. This is a difference
of 7 or 8 orders of magnitude, and amounts to an eternity from the point
of view of the processor. Sadly, these limitations come partly from
physics and mechanical engineering issues, such as the maximum spin speed
of a disk drive platter using current materials and fabrication techniques,
or the speed of light in a long-haul fiber optic cable. There’s room
for improvement, but not orders of magnitude.
Of course, disks and network devices have caches of their own to hide
some of these issues. In reality, these are mostly just buffers; in
other words, their primary duty is to impedance match a fast, high
bandwidth channel to the processor with a slow channel to the disk or
network, so that the processor can spend as little time as possible
waiting for data to trickle in or out. Disk buffers are actually fairly
sophisticated, reordering incoming and outgoing data requests to
minimize the time spent physically moving disk heads back and forth.
Caching and buffering are great things, but essentially they
are kludges, attempts to hide some of the problems some of the time.
As I mentioned before, there are data use patterns that cannot
possibly be sped up by caching, no matter how sophisticated. And there
will always be data use patterns that will be slow in reality, even
if they could be cached in theory. These tend to trip over tradeoffs
in cache design in which the chip designer chose a different tradeoff
than the use pattern would prefer.
Knowing that processors will be forced to wait, the next
technique for keeping them busy is to do more than one thing at once.
That way, anytime the processor has to wait on a data channel (either
sending or receiving), it can switch to another task and try to get
work done on it. This of course only works until that task
has to wait, at which time the processor tries to go to a third task,
or perhaps just back to the first task to check if its data is ready.
As you might imagine, this works much better if there are lots of
tasks, so that there is likely to always be at least one ready to go.
The downside, is that all of those tasks have to share the
fixed resources of the system. Each will want to use some of the
bandwidth of the various communications channels, each will take some
room in various caches, and so on. Eventually there will be enough
tasks that the system slows down significantly because of contention
on these shared resources.
When the contention is over space in some
layer of the memory hierarchy, it’s known as thrashing, and
it is one of the biggest performance walls out there. There’s a
noticeable drop in performance as each layer of the memory hierarchy
is overfilled; the system essentially slows to the speed of the
next layer in the hierarchy. When this gets all the way to thrashing
virtual memory, it not only slows the machine to a crawl, but is
actually audible — even the quietest disk begins to make a good
deal of noise when it is forced to read and write data in many
different places on the disk as fast as possible.
Of course, thrashing can be caused by a single application, too; as
its data set grows, the cacheable portion will often grow too large
and overwhelm the smaller caches all by itself. Still, a programmer
may be able to control the cache friendliness of his own application;
trying to make many applications play nicely together is a much bigger
problem.
Processor designers have more tricks up their sleeve, however. In
particular, they take advantage of more kinds of parallelism than just
the multitasking described above. Current processors have
at least three more ways to get more work done. Simplest (and most recently),
they can simply add additional processor cores to each processor package.
Each core is essentially a complete processor of its own, able to handle
its own set of tasks. Adding cores has an advantage over adding more
processor packages, because the communications channel between the cores
can be very short and wide, and hence very fast. Multiple cores also
usually have their own copies of the fastest caches, but share the largest
caches. This sharing may raise contention issues, but often is a win
because some tasks will tend to be more cache-intensive than others, and
the same total amount of cache can be apportioned in a better way.
Before there was room for multiple cores, processor designers added
more functional units to each processor core. This allowed the
processor to be simultaneously performing several different calculations
within the same task at the same time. Unfortunately, it’s usually
not helpful to simply add dozens of additional functional units. In
most applications, later calculations depend on the results of earlier
ones, and pretty soon most of the extra functional units are idle
waiting for each other’s results. (It happens that computer graphics
is one very large exception to this, so GPUs have literally hundreds
of functional units in them.)
Before processors were made “wider” by adding more functional units,
they were made “longer”. By splitting every instruction handled by the
processor into a sequence of suboperations, and using dedicated (and
highly optimized) hardware for each suboperation, it is possible to form a
pipeline within the processor through which instructions flow.
At any given time, each piece of dedicated hardware, or pipeline
stage, has a different instruction in it. The processor may be
decoding one instruction, fetching input data for another, performing
various stages of calculation for others, and finally writing the
output data for yet more.
Pipelining has two huge benefits — first, older processors still had
to perform these suboperations, but they could only do one at a time,
so every instruction took several clock cycles. A full pipeline can
overlap these suboperations, and so can complete a new instruction every
clock cycle. Second, breaking instructions down even more, into
smaller and smaller suboperations, allows the processor’s clock speed
to be increased significantly. Of course, this only works up to a
certain point; as Intel discovered, pretty soon physics and various
types of overhead get in the way of achieving faster processors this way.
Pipelines also have a major weakness, in the form of stalls,
also known as bubbles. These occur when one stage of the
pipeline has to wait on another stage, and so must sit idle for one or
more clock cycles. The problem is that since a pipeline is always
flowing, idleness moves downstream too; each stage will be idle at
least as long as its predecessor. If the first stage of the pipeline
stalls at cycle number 1, for example (waiting on data perhaps), then
every stage of the pipeline will waste a cycle later on.
The second stage will be idle at cycle 2, the third stage at cycle 3,
and so on.
There are a great many ways that pipelines can stall. Waiting for
data is an obvious one, but two more are extremely common as well. The
first is when a young instruction in an early stage of the pipeline has
to wait for a calculation being performed by a later stage for an older
instruction. The stages in between will idle out until the calculation
completed and the earlier stage can continue performing useful work.
The second common stall is when a branch occurs; by the time the
processor realizes that it must take a branch, the branch instruction
will be near the end of the pipeline. The instructions that were in
the pipeline behind the branch were from the wrong code path, so they
all have to be thrown away. Then there’s usually some extra time spent
figuring out where the branch destination is in virtual memory, trying
to find the proper next instruction in one or more caches, and so on.
During all of this, the whole pipeline sits idle. Ouch.
It should come as no surprise that processor designers try to hide these
problems as well. The first case is partially hidden by reordering
program instructions on the fly in a way that will calculate the same
results, but prevent as many of these cross-stage waits as possible.
The branching problem is hidden with a host of tricks, including trying
to guess the correct destination from previous history and executing both
paths after a binary branch and then keeping the results from the right
path once the final branch decision is made (this at least keeps the
pipeline half busy, instead of completely idle). Once again,
these tricks make the average case somewhat better, but there are many
reasons they can fail, and even make things worse rather than better.
I’ve really only scratched the surface of all the ways that real computers
can run much slower than algorithmic analysis might indicate. For
example, I’ve almost completely ignored the huge class of performance
gotchas that arise from processor designers trying to save a little space
here and there; these often appear as special cases that can drive you mad.
(”You can do two multiplies and an add, or two adds and a multiply, at
full speed; but it takes an extra cycle to do three in a row of either one.”)
So now what? If there are so many ways that code can run slowly on real
hardware, what can you do to minimize these problems? It turns out that
there are some broad techniques that should help you reduce the affects
of most of what I’ve written; those techniques will be the subject
of my next post.
What’s the most infuriating performance gotcha you’ve come across?