Published on ONLamp.com (http://www.onlamp.com/)


Swordfish

Improving Compilation Times: A Case Study in Performance Analysis

by Gian-Paolo D. Musumeci, coauthor of System Performance Tuning, 2nd Edition
02/21/2002

The software engineers at e-Gizmo, a software development firm developing B2C applications for the emerging, high-performance canine consumables market, are not particularly happy--a library critical to their environment is taking longer to compile than they'd like.

Their development server, polaris, consists of a Sun Enterprise 2 server with a pair of 250MHz UltraSPARC II processors, 512MB of memory, and two 18GB disks--one for 1GB of swap and the root file system and the other for /export/src, where all their compilation work is done. The system is running Solaris 7 with all the current patches, and the developers are using the Forte Developer 6 update 2 compiler suite. They haven't done any tuning work.

The software developers' complaint is that compiling one of their products, which consists of about 100,000 lines of C, takes too long--they've timed it at about 18 minutes. They want to see compilation times around 12 minutes, about a one-third reduction in wall-clock time, without any change to the compiler output. Furthermore, management has established that there are no funds available for this project, so they'll have to make do with the hardware they have.

Because this piece of software is pretty critical to the performance of their entire application suite, it's compiled with a very high level of optimizations: -fast -xarch=v8plusa.
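For context, a compile line for a single file in this build looks something like the following (the file names here are hypothetical, but the flags are the ones e-Gizmo uses):

polaris% cc -fast -xarch=v8plusa -c -o kibble.o kibble.c

With the Forte Developer compilers, -fast is a macro option that enables a set of aggressive optimizations tuned for the build machine, and -xarch=v8plusa targets the UltraSPARC-specific V8plus instruction set.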

Before we start looking at the system's workload, let's think about where bottlenecks are likely to be and plan a tuning strategy. Based on how we think compilation stresses the system, we suspect we'll need to look at memory consumption, CPU usage, and disk I/O. Since NFS isn't involved, networking isn't likely to be a problem. So, we can formulate three questions:

1. Is there a memory shortage?
2. Are we disk I/O bound?
3. Are we making good use of our CPUs?

Let's take a look at the data that we actually get during our first trial run. I'll summarize representative output from vmstat 5, iostat -xnP 5, and mpstat 5 in terms of the three questions we posed above.
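One convenient way to collect this data is to start the monitoring tools in the background, logging to files, and then kick off a timed build; here's a minimal sketch (the log file names are arbitrary):

polaris% vmstat 5 > /tmp/vmstat.log &
polaris% iostat -xnP 5 > /tmp/iostat.log &
polaris% mpstat 5 > /tmp/mpstat.log &
polaris% /bin/time -p make
polaris% kill %1 %2 %3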

Is there a memory shortage? Let's take a look at the vmstat output:

polaris% vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 --   in   sy   cs us sy id
 0 0 0 1425960 396952 0 727  4  1  1  0  0  0  3  0  0  141 2220  121 38 12 50
 0 0 0 1426912 397856 0 1207 8  0  0  0  0  0  4  0  0  148 3513  153 29 20 51
 0 0 0 1426072 396896 0 628  6  0  0  0  0  0  2  0  0  330 7017  930 40 14 46

Our top indicator of a memory shortage is the sr (scan rate) column, but it's consistently zero. If we were short of memory, the first thing we'd try is turning on priority paging. One thing that does look suspicious is the amount of idle time shown in the id column--the system is idle about half the time. We'll have to take a look at that.
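As an aside, enabling priority paging on Solaris 7 is just a matter of setting a kernel tunable in /etc/system and rebooting; a sketch, for reference (we don't need it here, since sr stays at zero):

* /etc/system entry: enable priority paging (takes effect at next boot)
set priority_paging=1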

Are we disk I/O bound? iostat -xnP 5 output seems to indicate otherwise pretty conclusively:

polaris% iostat -xnP 5
...
                             extended device statistics
 r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0s0
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0s1
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0s2
 0.2  2.6    0.6    8.6  0.0  0.0    0.0   14.0   0   2 c0t1d0s0
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0s2
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 polaris:vold(pid208)
...

We really aren't doing much disk activity at all, and it's all being done to /export/src (c0t1d0s0); note that c0t0d0s0 is swap, and c0t0d0s1 is /.
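If you want to double-check which slice backs which file system, df -k shows the device under each mount point and swap -l lists the swap devices (output elided here):

polaris% df -k
polaris% swap -l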

Finally, let's take a look at mpstat and see what's going on with our CPUs:

polaris% mpstat 5
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 0  294   0  115   214   13   92    9    6    0    0   599    9   5   1  85
 1  489   0   67    28   14   39   13    8    1    0  1875   63  22   1  14
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 0  147   0  155   216   12   96   12    5    1    0   871    5   9   0  85
 1 1098   1   71    32   17   60   14   10    1    0  2890   51  32   3  14

It looks like we're keeping only one of our processors busy with useful work. On a multiprocessor system like the one e-Gizmo uses for software development, we can use dmake, which parallelizes the compilation process. Let's see if we can use it to drive CPU utilization up.

To test this, we should start by running a make clean to clean out all the old object files. Then, we'll unmount and remount /export/src in order to clear out any file system cache entries. This way, we'll be running the same test again, but this time with a parallel make. We'll start by trying two threads of compilation since we have two processors:
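Concretely, the preparation for each trial looks something like this; the build directory name is hypothetical, and the unmount requires root privileges (and that nothing else is using the file system):

polaris% cd /export/src/product
polaris% make clean
polaris% cd /
polaris# umount /export/src
polaris# mount /export/src

Then we change back into the build directory and kick off the timed parallel make: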

polaris% /bin/time -p dmake -j 2
...
real 614.14
user 867.14
sys 219.82

Our performance target is the wall-clock compilation time, which /bin/time reports as "real." At 614 seconds--about 10 minutes, 14 seconds--we've already beaten our 12-minute target by almost two minutes! A quick look at the output of mpstat shows that we have indeed gotten rid of almost all the idle time.

polaris% mpstat 5
...
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 0  448   0   80   221   12   83   33    9    1    0  1372   76  15   1   8
 1  440   0  105    39   14   48   24    8    1    0  1218   84  13   1   2
...

We might try "oversubscribing" our processors by adding more parallel execution threads, so we'll run a few more tests, increasing the value of -j each time (a scripted version of this sweep is sketched after the table):

-j      real time (m:s)  user time (m:s)  sys time (m:s)
 1      18:03.03         14:14.34         3:27.34
 2      10:14.14         14:27.14         3:39.82
 3       9:41.86         14:28.37         3:39.45
 4       9:34.66         14:33.45         3:39.22
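A sweep like this is easy to script. Here's a minimal Bourne shell sketch, under the same assumptions as before--a hypothetical build directory, and root privileges for the remount:

#!/bin/sh
# Time the build at several levels of dmake parallelism.
for j in 1 2 3 4; do
    # Clean out the old object files.
    cd /export/src/product && make clean > /dev/null
    # Remount /export/src to flush cached file data between runs.
    cd / && umount /export/src && mount /export/src
    cd /export/src/product
    echo "dmake -j $j:"
    # /bin/time reports on stderr, so the timings stay visible.
    /bin/time -p dmake -j "$j" > /dev/null
done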

At this point, we're starting to see diminishing returns. We could probably experiment some more and drive the compilation time down by another few seconds, but the time that e-Gizmo is willing to give us for testing is starting to run awfully short.

So, it looks like -j 4 is going to be our best bet. Finally, we'll check the output of iostat and vmstat to make sure that we haven't got another easy tuning target elsewhere in the system, but it looks like we're out of luck:

polaris% vmstat 5
procs     memory            page            disk          faults      cpu
r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 --   in   sy   cs us sy id
1 0 0 1412432 368488 0 1583 11 0  0  0  0  0  5  0  0  512 5491 1135 71 28  0
2 0 0 1409768 365920 0 1552 11 0  0  0  0  0 10  0  0  530 5417  812 73 27  0
2 0 0 1412144 367768 0 1460 1  0  0  0  0  0  4  0  0  561 4954  518 73 27  0

polaris% iostat -xnP 5
                             extended device statistics
 r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0s0
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0s1
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0s2
 1.2  5.0    8.6  229.0  0.0  0.2    0.0   32.3   0   3 c0t1d0s0
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0s2
 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 polaris:vold(pid208)

Nope, no easy performance pickings left, and we're about out of time on the client system, so we'll call it a day. The e-Gizmo engineers are pretty happy with their performance boost.

As we look back on this, it was a pretty straightforward problem with a pretty straightforward solution, but we know they aren't all this easy. The important thing to take away from this case study isn't how much speedup we achieved by using dmake to split the compilation across multiple CPUs, and it's especially not the exact value of -j that we found to be optimal; it's how we approached the problem. We figured out what the performance requirements were, formulated some hypotheses about where we might look for tuning opportunities, gathered data while the system was running, and then acted on that data.


Gian-Paolo D. Musumeci is a research engineer in the Performance and Availability Engineering group at Sun Microsystems, where he focuses on network performance.

