Electronic Archaeology
by Steve Oualline, author of Practical C++ Programming, 2nd Edition
04/29/2003
In college, programmers study the science of programming. They learn that you start with good requirements; this leads to a well-thought-out program design; and from this you create a coding plan, which you implement to produce a nearly perfect program.
The real world is nothing like this. In the real world, most programmers spend most of their time going through code that's a hundred years old and extremely messy. If it ever was designed, the design document was lost long ago. It has evolved over the years. Hundreds of people have worked on it. And it appears most of them knew very little about programming. As a result, most professional programmers have to deal with badly designed, badly implemented, uncommented, incomprehensible blobs.
The art of digging through ancient code is called electronic archaeology, and this article discusses some of the tools you can use to make the job easier.

Ryan Hocket -- Big wrench.
Jim Hocket -- Medium wrench.
Chuck Cross -- Little wrench.
Steam engine courtesy of the Poway Midland Railroad.
There are a large number of tools out there designed to make the job of the electronic archaeologist easier. These include:
| Vim | vi-like text editor with lots of extra features |
| grep | Text searching program |
| find/grep | File-finding program coupled with text search |
| glimpse | High-speed text search tool |
| indent | Indents programs |
| lxr | Linux Kernel (or any large source system) cross reference |
| cpp | C/C++ pre-processor |
| Source Navigator | Source browsing tool |
The Vim editor is a vi-like editor. But while vi is a workhorse of an editor, Vim is a space-age, heavy-duty work truck with every high-tech accessory you can imagine.
Perhaps the most useful command that Vim has is the :help command. By itself, it displays a window containing general help. If you give it the name of a command, for example :help j, it will display help for the given command.
Sometimes the simplest things are the most powerful. In the case of Vim it's the quick-search capability. Suppose you are looking at code that uses the variable foo and you want to see where else in your program foo is used. Simply put the cursor on this word and press *. Vim will search forward for the next occurrence of the word. You can then use the normal vi commands n and N to repeat the search until you have scanned the entire file.
The command # works in a similar manner, only it searches backwards.
These commands are extremely useful in understanding how a variable or procedure is used in a particular file.
But suppose you want to check out variables or procedures through multiple files.
That's where :grep comes in.
The :grep command can be used to search through a set of files for a given
string. For example, to search all the C source files for the string
connection_status use the command:
:grep connection_status *.c *.h
The editor will search all the files for the string and position the cursor
on the first line that matches. To go to the next matching line, use the
:cnext command. To go to the previous match, use :cprev.
Finally, if you've moved around and want to return to the current
match, use :cc.
The :copen command opens a new window that lists all the matches found.
You can navigate to an interesting entry using the normal cursor movement commands,
and then press <Enter> to edit that file (at the location matched).
Note: The :grep command is very similar to the :make command,
which integrates program building and editing. For more information see
The top 10 things a vi user
should know about Vim.
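As a rough illustration of what :grep works with, the sketch below runs an ordinary grep -n from the shell and captures its file:line:text output, which is the general shape of the list Vim steps through with :cnext and :cprev. The files net.c and net.h and their contents are made up for the example; only connection_status comes from the text above.

```shell
#!/bin/sh
# Sketch only: net.c and net.h are hypothetical sample files.
workdir=$(mktemp -d)
printf 'int connection_status;\nint other;\n' > "$workdir/net.h"
printf '#include "net.h"\nvoid poll(void) { connection_status = 1; }\n' > "$workdir/net.c"

# -n adds the line number; naming more than one file makes grep
# prefix each match with the file name, giving file:line:text.
matches=$(cd "$workdir" && grep -n connection_status *.c *.h)
printf '%s\n' "$matches"
```

If the same output is saved to a file, starting Vim with vim -q on that file should load the matches as a list in much the same way, assuming the default error format is in effect.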
In programming they are called procedures. For some reason in Vim they are called tags. The Vim editor has a number of commands that let you navigate through the procedures in your code.
The first step, before you start editing, is to create a tags file, which
contains location information about the procedures in your code. This is created
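using the ctags command. (This command has traditionally shipped with Vim; if your installation lacks it, it is available separately, for example as Exuberant Ctags.)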
Simply run this command on all your source files to generate the tags database:
ctags *.c *.h
Now, when you are editing and you want to go to a particular procedure, all you have to do is execute a :tag procedure command. For example, to jump to the definition of the do_it procedure:
:tag do_it
Let's suppose that do_it calls do_part_a. To find out where this procedure is defined, move the cursor onto the procedure call and press CTRL-]. The system will take us to the definition of do_part_a. Should this function call another subroutine, you can go to the definition of that function by putting the cursor on it and pressing CTRL-].
OK, now that you've descended the procedure call stack, you may want to return to
where you started. The CTRL-T command takes you back up through the stack of jumps you made with CTRL-]. (So if you're in do_part_a, the CTRL-T command will take you back to the call in do_it.)
This system works well as long as there is only one definition of each
procedure. Unfortunately, C++ introduced overloading, so there can be multiple definitions of a function. To pick the definition you want, use the :tselect command. This command displays a list of all the functions that match and lets you choose one. For example:
:tselect add_it
# pri kind tag file
1 F f add_it t.c
char add_it(char i1, char i2)
2 F f add_it t.c
fixed_pt add_it(fixed_pt i1, fixed_pt i2)
3 F f add_it t.c
float add_it(float i1, float i2)
4 F f add_it t.c
int add_it(int i1, int i2)
Enter nr of choice (<CR> to abort):
The :tselect command can also be used to search for a function using a regular expression. For example, to select from all functions with the word add in them, use the command:
:tselect /add/
Finally, one of the innovations Vim introduced was multiple windows. To do a tag jump and display the result in a new window, use the command CTRL-W CTRL-]. (For a full discussion of how to use windows, execute the command :help windows.)
The Vim editor has a good understanding of C syntax and knows how to properly
indent C programs. To turn on the C style indentation, use the command :set cindent. Now when you write your programs, indentation will be done automatically for you.
The number of spaces for each indent is determined by the shiftwidth option. So if you want to indent four spaces per level, use the command:
:set sw=4
But suppose you are dealing with legacy code and it's indented badly. The Vim command
= will indent a section of code. The form of the command is =<motion>, where <motion> is a cursor motion command.
Perhaps the most common use of this command is to indent a block of code that's been badly indented.
You start by positioning the cursor on the first curly brace ({) of the block. Then execute the command =% (= -- indent to motion, % -- go to matching brace).
The Vim editor has hundreds of additional commands. Many of these are useful
for programming and electronic archaeology. You can find out more information
using the Vim help system (:help), the Official Vim Web Site, and my web site.
The grep program searches a set of files for a given string. It is very useful for finding out where a variable is defined and used. For example:
grep regdump *.[ch]
embed.h:#define regdump Perl_regdump
embed.h:#define regdump(a) Perl_regdump(aTHX_a)
proto.h:PERL_CALLCONV void Perl_regdump(pTHX_regexp* r);
regcomp.c:# define Perl_regdump my_regdump
regcomp.c: DEBUG_r(regdump(r));
regcomp.c: - regdump - dump a regexp onto Perl_debug_log
in vaguely comprehensible form
regcomp.c:Perl_regdump(pTHX_regexp *r)
regexec.c:# define Perl_regdump my_regdump
This example searches all the .c and .h files for the name regdump.
But sometimes we want to search all the files. The grep command can
do this, but searching binary files produces a lot of junk. Binary
characters can do mean things to terminals, so we need a way to
convert them into something printable.
So, to search a complete set of files (including binary ones), use the command:
grep regdump * | cat -v | cut -c 1-80
cat -v
Turns unprintable characters into something readable.
cut -c 1-80
Binary files have long "lines". This command trims them to 80 characters long for viewing and printing.
Example:
grep regdump * | cat -v | cut -c 1-80
grep: Cross: Is a directory
grep: NetWare: Is a directory
...
embed.fnc:Ap |void |regdump |regexp* r
embed.h:#define regdump Perl_regdump
embed.h:#define regdump(a) Perl_regdump(aTHX_a)
global.sym:Perl_regdump
libperl.a:^@^PM-^K@^PMPM-^KM-^J^@^PM-^KM-^J^@
libperl.a:^^@^@^@^H^@^@^@P^@^@^@^A^@^@^@^@^@^
miniperl:^Hn^@-^N^^@^LM-3^D^H7^@^@^@^R^@^@^@s
perl:^@R^@^M^@`M-F ^H1^@^@^@^R^@^M^@M-^WM-^
....
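The pipeline above can be tried on a small scale. The sketch below fabricates a "binary" file containing control characters and one very long line, then runs the same grep | cat -v | cut chain over it. The file names are made up, and the -a flag is an addition that is not part of the original command: current GNU grep releases tend to report only that a binary file matches unless told to treat it as text, so the flag is assumed here to reproduce the behavior shown in the article.

```shell
#!/bin/sh
# Sketch only: blob.bin and code.c are hypothetical sample files.
workdir=$(mktemp -d)
printf 'call regdump(r);\n' > "$workdir/code.c"
# Fake binary: NUL and control bytes, the search word, then 300 digits on one line.
printf '\000\001\002regdump %0300d\n' 0 > "$workdir/blob.bin"

# -a (GNU/BSD grep) treats binary files as text so the matching line is printed;
# cat -v makes control bytes visible; cut trims each line to 80 characters.
result=$(cd "$workdir" && grep -a regdump * | cat -v | cut -c 1-80)
printf '%s\n' "$result"
```

The control bytes show up as ^@, ^A and ^B, and the long line is cut off at column 80 instead of flooding the terminal.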
The Vim editor can be used to view the results of a grep command. For example:
grep regdump * | cat -v | gvim -

Useful Vim commands:
:set nowrap
Don't wrap lines.
gf
Go to the file whose name is under the cursor.
The find command is useful for going through a directory tree and locating files.
The grep command searches files for a given text string.
You can combine the two to create a system for searching a directory tree for a variable.
find . \( -name "*.cpp" -o -name "*.h" \) \
    -exec fgrep what {} /dev/null \;
| find . | Find starting at current directory |
| \( \) | Group operation |
| -name "*.cpp" -o -name "*.h" | All C++ or H files |
| -exec | Command to execute |
| -exec fgrep | Execute fgrep command |
| {} | On the current file |
| /dev/null | Also search /dev/null |
| \; | End of command |
Note: fgrep prints the filename only when it is given two or more files to search, hence the extra /dev/null argument.
Here's an example:
find . -name "*.[ch]" -exec fgrep regdump {} /dev/null \;
./embed.h:#define regdump Perl_regdump
./embed.h:#define regdump(a)Perl_regdump()
./ext/re/re_exec.c:#define Perl_regdump my_regdump
./ext/re/re_comp.c:#define Perl_regdump my_regdump
./ext/re/re_comp.c: DEBUG_r(regdump(r));
./ext/re/re_comp.c: - regdump - dump a regexp onto Perl_debug_log
in vaguely comprehensible form
./ext/re/re_comp.c:Perl_regdump(pTHX_ regexp *r)
....
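The effect of the /dev/null argument is easy to see on a toy tree. The sketch below builds a hypothetical directory (src/main.c and src/sub/re.c are invented names) and runs the search three ways: without /dev/null, with it, and with the {} + form of -exec that POSIX find provides for passing many files to one command. It uses grep -F, the standard spelling of fgrep, since some newer grep packages print a warning when fgrep is invoked.

```shell
#!/bin/sh
# Sketch only: the directory layout and file contents are made up.
workdir=$(mktemp -d)
mkdir -p "$workdir/src/sub"
printf 'regdump(r);\n' > "$workdir/src/sub/re.c"
printf 'nothing here\n' > "$workdir/src/main.c"

# One file per grep call: the match is printed with no file name.
bare=$(cd "$workdir" && find . -name "*.[ch]" -exec grep -F regdump {} \;)

# /dev/null is a second file argument, so grep labels each match.
named=$(cd "$workdir" && find . -name "*.[ch]" -exec grep -F regdump {} /dev/null \;)

# {} + batches many files into one grep call; /dev/null still covers
# the case where only a single file is found.
batched=$(cd "$workdir" && find . -name "*.[ch]" -exec grep -F regdump /dev/null {} +)

printf 'bare:    %s\nnamed:   %s\nbatched: %s\n' "$bare" "$named" "$batched"
```

The batched form starts far fewer grep processes, which may narrow the speed gap on large trees.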
The GNU version of grep has a -r option, which allows you to recursively search a directory tree. For example:
fgrep -r regdump .| cat -v | cut -c 1-80
find/grep versus grep -r
| | find/grep | grep -r |
| Speed | Not that fast | Faster |
| Can be limited to certain files (e.g. *.c)? | Yes | No |
| Standard on all UNIX systems? | Yes | No |
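Depending on the grep installed, the "No" in the file-limiting row may be softened. GNU grep offers an --include option that restricts a recursive search to file names matching a pattern. It is a GNU extension, so like -r itself it cannot be assumed on every UNIX system. A minimal sketch, using an invented directory tree:

```shell
#!/bin/sh
# Sketch only: assumes GNU grep; the tree and file names are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir/tree/doc"
printf 'regdump();\n' > "$workdir/tree/a.c"
printf 'regdump notes\n' > "$workdir/tree/doc/notes.txt"

# Recursive fixed-string search, limited to files whose names end in .c
only_c=$(cd "$workdir/tree" && grep -r -F --include='*.c' regdump .)
printf '%s\n' "$only_c"
```

Where --include is not available, the find/grep combination remains the portable way to limit the search.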
The Glimpse system is a high-speed text-indexing system. It works by scanning all your files and building up a database of words and their locations. Then, when you ask it where a word is located, all it has to do is go to the database and display the results.
The advantage here is that it is super fast. Even searches through huge volumes of text can be done quickly and easily.
But there are some disadvantages. The first is if you change a file, the changes are not reflected in the database until you rebuild the index. Thus, the information in the database may not be current.
Second, building the database takes time.
Finally, the Glimpse system is distributed with a restrictive license, which may prevent you from using it in certain circumstances.
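One way to live with the stale-index problem is to check, before trusting a search, whether any source file has changed since the index was built. The sketch below does this with find -newer. It assumes the index directory holds a file named .glimpse_index whose timestamp reflects the last glimpseindex run; that file name, and the src and index directories, are assumptions made for the example, and an empty stand-in file is used so the sketch runs without Glimpse installed.

```shell
#!/bin/sh
# Sketch only: directory names and the .glimpse_index stand-in are assumptions.
workdir=$(mktemp -d)
mkdir -p "$workdir/src" "$workdir/index"

# Fixed timestamps: one file older than the index, one newer.
touch -t 200301010000 "$workdir/src/old.c"
touch -t 200302010000 "$workdir/index/.glimpse_index"
touch -t 200303010000 "$workdir/src/changed.c"

# List every source file modified after the index was last built.
stale=$(cd "$workdir" && find src -type f -newer index/.glimpse_index -print)

if [ -n "$stale" ]; then
    printf 'Index is out of date for:\n%s\n' "$stale"
fi
```

If the list is non-empty, the index should be rebuilt before the search results are relied on.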
Let's see how to use Glimpse on a large set of source files. In this example, we've downloaded the source to OpenOffice.org. (This is the largest single GPL project we know of.)
First we need to run the glimpseindex command to build the index database.
find oo_1.0.1_src \( -name "*.h" -o -name "*.c*" \) -print | \
    glimpseindex -H /home/sdo/muck/tools -F
-H
Specify directory for the database files.
-F
Read the list of files to index from standard input.
To perform the search we use the glimpse command:
glimpse -H /home/sdo/muck/tools -n linux
-H
Specify the location of the database.
-n
Print the line number where the match occurs.
The following is a sample glimpseindex run:
find oo_1.0.1_src \( -name "*.h" -o -name "*.c*" \) -print | \
    glimpseindex -H /home/sdo/muck/tools -F
This is glimpseindex version 4.16.2, 2002.
Indexing "oo_1.0.1_src/common/english_us/custom.css" ...
Indexing "oo_1.0.1_src/sbasic/english_us/sbasic.cfg" ...
Indexing "oo_1.0.1_src/parser_i/tokens/tkpcont2.cxx" ...
...
Indexing "oo_1.0.1_src/parser_i/tokens/tkpstam2.cxx" ...
Indexing "oo_1.0.1_src/autodoc/source/tools/tkpchars.cxx" ...
Size of files being indexed = 137736911 B, Total #of files = 7600
Now that we have the index, we can search it. Here's an example:
glimpse -H /home/sdo/muck/tools -n linux
Your query may search about 24% of the total space!
Continue? (y/n) y
oo_1.0.1_src/dbaccess/source/ui/browser/dsbrowserDnD.cxx: 1279:
* #65293# linux ambiguity
oo_1.0.1_src/dbaccess/source/ui/dlg/dsselect.cxx: 223:
* #65293# cant compile for linux
oo_1.0.1_src/dbaccess/source/ui/dlg/indexdialog.cxx: 913:
* Some error checked for linux
oo_1.0.1_src/dbaccess/source/ui/dlg/odbcconfig.cxx: 352:
* #65293# cant compile for linux
oo_1.0.1_src/dbaccess/source/ui/dlg/sqlmessage.cxx: 603:
* Syntax error with linux compiler
oo_1.0.1_src/svx/source/fmcomp/fmgridif.cxx: 1642:
// the same props as in addColumnListeners ... linux has
problems with global static UStrings, so
As code evolves, different people work on it. If the people in charge don't have a standard indentation style (and most don't), different programmers will use different indentation techniques. This makes the code difficult to read and understand.
But by processing the code through indent you can standardize the style. The result is code that is easier to understand.
It should be noted that indent won't work on all programs. There is still some syntax that will fool it.
The indent program came in handy when I once had to deal with "Jeff" code. Jeff was an unusual programmer who believed it was best to put the maximum amount of code on the screen at one time. As a result, he didn't put in any more whitespace than he had to. So his code started in the first column, and extended completely over to the right margin. Also, he didn't believe in comments. After all, the code was what counted, and he got as much of it on the screen as possible.
The result was something unreadable, except to Jeff:
BOOL SbiParser::Parse() { if(bAbort) return FALSE; EnableErrors();
Peek(); if(IsEof()) { if( bNewGblDefs&&nGblChain==0)
nGblChain=aGen.Gen( _JUMP, 0); return FALSE; } if(IsEoln(eCurTok))
{ Next(); return TRUE; } if (!bSingleLineIf&& MayBeLabel(TRUE))
{ if(!pProc) Error( SbERR_NOT_IN_MAIN,aSym); else pProc->
GetLabels().Define( aSym); Next(); Peek(); if( IsEoln(eCurTok))
{ Next(); return TRUE; }} if( eCurTok==eEndTok) { Next(); if(eCurTok!=NIL)
aGen.Statement(); return FALSE; } if (eCurTok== REM) { Next(); return TRUE; }
if (eCurTok==SYMBOL ||eCurTok==DOT) { if (!pProc) Error( SbERR_EXPECTED,SUB );
else { Next(); Push( eCurTok); aGen.Statement(); Symbol(); }}}
When Jeff left the company I got his code. The first thing I did was run the program through indent. The result was something I could deal with:
BOOL SbiParser::Parse ()
{
  if (bAbort)
    return FALSE;
  EnableErrors ();
  Peek ();
  if (IsEof ())
    {
      if (bNewGblDefs && nGblChain == 0)
        nGblChain = aGen.Gen (_JUMP, 0);
      return FALSE;
    }
  if (IsEoln (eCurTok))
    {
      Next ();
      return TRUE;
    }
  if (!bSingleLineIf && MayBeLabel (TRUE))
    {
      if (!pProc)
        Error (SbERR_NOT_IN_MAIN, aSym);
      else
        pProc->GetLabels().Define(aSym);
      Next();
      Peek();
      if (IsEoln(eCurTok))
        {
          Next();
          return TRUE;
        }
    }
  if (eCurTok == eEndTok)
    {
      Next();
      if (eCurTok != NIL)
        aGen.Statement ();
      return FALSE;
    }
  if (eCurTok == REM)
    {
      Next();
      return TRUE;
    }
  if (eCurTok == SYMBOL || eCurTok == DOT)
    {
      if (!pProc)
        Error (SbERR_EXPECTED, SUB);
      else
        {
          Next();
          Push (eCurTok);
          aGen.Statement();
          Symbol();
        }
    }
}
As you can see, the indent program turned "Jeff" code into something that is possible to maintain.
The lxr system is used to provide a cross reference of the Linux Kernel as well as the Mozilla browser. It uses a web browser as a user interface.
The left image shows a code listing. If you select an identifier, you are taken to a page listing every place the identifier occurs. Clicking on one of the items there takes you to that location in the file.
One of the problems with this program is that you must have access to a web server to install it. Also, don't attempt to install it unless you understand Perl well. (It has some of the most complex and uncommented regular expressions I've ever seen.)
It takes time to make a good mess. Programs start out simple. But then the code gets ported to a new platform. A few #ifdef directives are added to handle platform differences. But then marketing wants a new variant of the program. So more #ifdef directives are added.
As time goes on, more and more conditional compilation directives are added, resulting in more and more complex code. Many of these directives are for platforms or variants that haven't been used in years, but they still remain because no one knows how to take them out.
It soon becomes impossible to tell what's compiled and what's not. That's where the pre-processor comes in. By running the code through the pre-processor, you can tell what's really compiled.
There is a simple technique for making sure that your pre-processor results reflect your current compilation.
First, compile the code normally and redirect the output of make to
a log file:
$ make strange.o
g++ -g -DUNIX -DM8K -DPROD -DECHO -DSPEEDUP_CODE -c strange.cpp
Find the compilation command in the log file. Write it to a shell script.
Edit the script and change -c (and other flags) to -E (pre-processor
output).
Finally, run the shell script:
$ sh -x pre.sh
g++ -DUNIX -DM8K -DPROD -DECHO -DSPEEDUP_CODE -E strange.cpp > strange.pre
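The hand edit described above can also be scripted. The sketch below is one possible way to do it, not a general solution: to_preprocess_cmd is a hypothetical helper that takes a compile command copied from the make log, swaps -c for -E, drops -g, and appends a redirect to the output file. It only handles flags separated by single spaces, as in the logged command shown here.

```shell
#!/bin/sh
# Sketch only: to_preprocess_cmd is a hypothetical helper name.
# $1: compile command copied from the make log
# $2: file that should receive the pre-processor output
to_preprocess_cmd() {
    converted=$(printf '%s\n' "$1" | sed -e 's/ -c / -E /' -e 's/ -g / /')
    printf '%s > %s\n' "$converted" "$2"
}

logged='g++ -g -DUNIX -DM8K -DPROD -DECHO -DSPEEDUP_CODE -c strange.cpp'

# Print the converted command; redirect this into pre.sh to get the script.
to_preprocess_cmd "$logged" strange.pre
```

Because the -D flags are carried over untouched, the pre-processor sees the same definitions as the real compilation, which is the point of the technique.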
The following listings illustrate the results. The original source comes first, followed by the pre-processor output:
#ifdef LOCATE_CALLS
char *save_block(char *block, int size,
char *file, int line)
#endif
#ifndef LOCATE_CALLS
char *save_block(char *block, int size)
#endif
{
#ifndef NATIVE_MALLOC
char *result;
#endif
#ifdef LOG_CALLS
log_message(file, line, "save_block(%p, %d)", block, size);
#endif /* LOG_CALLS */
#ifdef NATIVE_MALLOC
return (memcpy(malloc(size), block, size));
#endif
#ifdef SINGLE_BLOCK_ALLOC
result = find_memory(size);
result += sizeof(struct mem_header);
memcpy(result, block, size);
return (result);
#endif
#ifdef OOO_INTERFACE
#ifdef LOCATE_CALLS
return ((*alloc)(file, line, block, size));
#else
return ((*alloc)(block, size));
#endif
#endif
}

# 1 "cpp_hell.cpp"
char *save_block(char *block, int size)
{
return (memcpy(malloc(size), block, size));
}
The Vim editor contains some nice commands that make it easy to view your source file alongside the pre-processor output.
Start by editing both files with gvim, using the command:
gvim strange.cpp strange.pre
Set the scrollbind option (:set scrollbind).
Split windows vertically (:vsplit) and go to the next file (:next).
You now have two windows side by side, with the source file and the
pre-processed file. (If you need help with window navigation in
Vim, execute the command :help windows.)

The Source Navigator tool provides an integrated development environment. It provides a nice editor, a source browser, a class browser, and a cross reference.
But it's not perfect. Building the index is slow and the command interface is not the easiest to use. It also breaks on large, complex source packages.
I don't have much experience with this package. Every time I've tried it, the sources have been just too large and weird and have caused Source Navigator to crash.
O'Reilly & Associates recently released (December 2002) Practical C++ Programming, 2nd Edition.
Steve Oualline wrote his first program when he was eleven. He has written almost a dozen books on programming and Linux software.
Copyright © 2007 O'Reilly Media, Inc.