Secure Programming Cookbook for C and C++

Input Validation in C and C++

by John Viega and Matt Messier, authors of O'Reilly's upcoming Secure Programming Cookbook for C and C++
05/20/2003

Eavesdropping attacks are often easy to launch, but most people don't worry about them in their applications. Instead, they tend to worry about what malicious things can be done to the machine on which the application is running. Most people are far more worried about active attacks than they are about passive attacks.

Nearly every active attack out there is the result of some kind of input from an attacker. Secure programming is about making sure that inputs from bad people do not do bad things. Indeed, most of the soon-to-be-released Secure Programming Cookbook for C and C++ addresses how to deal with malicious inputs. For example, cryptography and a strong authentication protocol can help prevent attackers from capturing someone's login credentials and sending those credentials as input to the program.

If this entire cookbook focuses primarily on preventing malicious inputs, then why do we have a chapter of recipes specifically devoted to this topic? It's because this chapter is about one important class of defensive techniques: input validation.

In Recipe 3.3 below on preventing buffer overflows, and in all of the recipes in the book's "Input Validation" chapter, we assume that people are connected to our software, and that some of them may send malicious data (even if we think there is a trusted client on the other end). One thing we really care about is this: "What does our application do with that data? In particular, does the program take data that should be untrusted and do something potentially security-critical with it? More importantly, can any untrusted data be used to manipulate the application or the underlying system in a way that has security implications?"


Recipe 3.3: Preventing Buffer Overflows

Problem

C and C++ do not perform array-bounds checking, which turns out to be a security-critical issue, particularly in handling strings. The risks increase even more dramatically when user-controlled data is on the program stack (i.e., is a local variable).

Solution

There are many solutions to this problem, but none that are satisfying in every situation. You may want to rely on operational protections (such as StackGuard from Immunix), use a library for safe string handling, or even use a different programming language.

Discussion

Buffer overflows get a lot of attention in the technical world, partially because they constitute one of the largest classes of security problems in code, but also because they have been around for a long time, are easy to get rid of, and yet still are a huge problem.

Buffer overflows are generally very easy for a C or C++ programmer to understand. An experienced programmer has invariably written off the end of an array, or indexed into the wrong memory because he improperly checked the value of the index variable.

Because we assume that you are a C or C++ programmer, we won't insult your intelligence by explaining buffer overflows to you. If you do not already understand the concept, you can consult many other software security books, including Building Secure Software. In this recipe, we won't even focus so much on why buffer overflows are such a big deal. Other resources can help you understand that if you're insatiably curious. Instead, we'll focus on state-of-the-art strategies for mitigating these problems.

Most languages do not have this problem at all, because they ensure that writes to memory are always in bounds. Sometimes, this can be done at compile time, but generally it is done dynamically, right before data gets written. The C and C++ philosophy is different -- you are given the ability to eke out more speed, even if it means that you risk shooting yourself in the foot.

String Handling in C and C++

Unfortunately, in C and C++, it is not only possible to overflow buffers -- it is easy, particularly when dealing with strings. The problem is that C strings are not high-level data types; they are arrays of characters. The major consequence of this nonabstraction is that the language does not manage the length of strings; you have to do it yourself. The only time C ever cares about the length of a string is in the standard library, and the length is not related to the allocated size at all -- instead, it is delimited by a 0-valued (NULL) byte. Needless to say, this can be extremely error-prone.

One of the simplest examples is the ANSI C standard library function, gets():

char *gets(char *str);

This function reads data from the standard input device into the memory pointed to by str until there is a newline or until the end of file is reached. It then returns a pointer to the buffer. In addition, the function NULL-terminates the buffer.

The problem with this function is that no matter how big the buffer is, an attacker can always stick more data into the buffer than it is designed to hold, simply by avoiding the newline.

If the buffer in question is a local variable or otherwise lives on the program stack, then the attacker can often force the program to execute arbitrary code by overwriting important data on the stack. This is called a stack-smashing attack. Even when the buffer is heap allocated (that is, it is allocated with malloc() or new), a buffer overflow can be security-critical if an attacker can write over critical data that happens to be in nearby memory.
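As a minimal sketch of the bounded alternative, the following wraps the standard fgets() function, which takes the size of the destination buffer and never writes past it. The helper name read_line() is hypothetical and is not part of the original recipe; a long line is simply split across successive calls.

```cpp
#include <climits>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not from the original recipe): reads at most
// size - 1 characters of one line from `in` into buf and always
// NUL-terminates it. Returns false on end of file or read error.
bool read_line(std::FILE *in, char *buf, std::size_t size) {
  if (size == 0 || size > INT_MAX) return false;
  if (std::fgets(buf, static_cast<int>(size), in) == nullptr) return false;
  buf[std::strcspn(buf, "\n")] = '\0';  // drop the newline, if one was read
  return true;
}
```

Unlike gets(), the caller states how big the buffer is, so extra input stays in the stream instead of landing in adjacent memory.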

There are plenty of other places where it is easy to overflow strings. Pretty much any time you perform an operation that writes to a "string," there is room for a problem. One famous example is strcpy():

char *strcpy(char *dst, const char *src);

This function copies bytes from the address indicated by src into the buffer pointed to by dst, up to and including the first NULL byte in src. Then it returns dst. No effort is made to ensure that the dst buffer is big enough to hold the contents of the src buffer. Because the language does not track allocated sizes, there is no way for the function to do so.

To help alleviate the problems with functions like strcpy() that have no way of determining whether the destination buffer is big enough to hold the result from their respective operations, there are also functions like strncpy():

char *strncpy(char *dst, const char *src, size_t len);

The strncpy() function is certainly an improvement over strcpy(), but there are still problems with it. Most notably, if the source buffer contains more data than the limit imposed by the len argument, the destination buffer will not be NULL-terminated. This leads to the need for the programmer to ensure that the destination buffer is NULL-terminated. Unfortunately, the programmer often forgets to do so. There are two reasons for this failure:

  • It's an additional step for what should be a simple operation.

  • Many programmers do not realize that the destination buffer may not be NULL-terminated.

The problems with strncpy() are further complicated by the fact that a similar function, strncat(), treats its length-limiting argument in a completely different manner. The difference in behavior serves only to confuse programmers, and more often than not, mistakes are made. Certainly, we recommend using strncpy() over using strcpy(); however, there are better solutions.
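The difference between the two length arguments is easier to see side by side. The sketch below is illustrative only: the helper names copy_truncating() and append_truncating() are ours, and append_truncating() assumes that dst already holds a properly terminated string that fits within dstsize.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helpers (names are ours, for illustration only).

// Copy src into dst (capacity dstsize) with strncpy(), then terminate
// explicitly, because strncpy() leaves dst unterminated when src holds
// dstsize or more characters.
void copy_truncating(char *dst, const char *src, std::size_t dstsize) {
  if (dstsize == 0) return;
  std::strncpy(dst, src, dstsize - 1);
  dst[dstsize - 1] = '\0';
}

// Append src to the string already in dst (capacity dstsize). Unlike
// strncpy(), the limit passed to strncat() is the number of characters
// to append, not the buffer size; strncat() then adds the terminator.
void append_truncating(char *dst, const char *src, std::size_t dstsize) {
  std::size_t used = std::strlen(dst);
  if (used + 1 >= dstsize) return;  // no room left for more characters
  std::strncat(dst, src, dstsize - used - 1);
}
```

Both helpers silently truncate, so a caller that needs to know whether data was lost has to compare lengths itself.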

OpenBSD 2.4 introduced two new functions, strlcpy() and strlcat(), that are consistent in their behavior, and they provide an indication back to the caller of how much space in the destination buffer would be required to successfully complete their respective operations without truncating the results. For both functions, the length limit indicates the maximum size of the destination buffer, and the destination buffer is always NULL-terminated, even if the result must be truncated to fit.

Unfortunately, strlcpy() and strlcat() are not available on all platforms; at present, they seem to be available only on Darwin, FreeBSD, NetBSD, and OpenBSD. Fortunately, they are easy to implement yourself -- but you don't have to do so, because we provide implementations here:

#include <sys/types.h>
#include <string.h>

size_t strlcpy(char *dst, const char *src, size_t size) {
  char       *dstptr = dst;
  size_t     tocopy  = size;
  const char *srcptr = src;

  if (tocopy && --tocopy) {
    do {
      if (!(*dstptr++ = *srcptr++)) break;
    } while (--tocopy);
  }
  if (!tocopy) {
    if (size) *dstptr = 0;
    while (*srcptr++);
  }

  return (srcptr - src - 1);
}

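size_t strlcat(char *dst, const char *src, size_t size) {
  char       *dstptr = dst;
  size_t     dstlen, tocopy = size;
  const char *srcptr = src;

  /* Find the end of dst without reading more than size bytes. */
  while (tocopy-- && *dstptr) dstptr++;
  dstlen = dstptr - dst;
  /* No room left: report the length the result would have had. */
  if (!(tocopy = size - dstlen)) return (dstlen + strlen(src));
  /* Append while leaving room for the terminator; keep counting src. */
  while (*srcptr) {
    if (tocopy != 1) {
      *dstptr++ = *srcptr;
      tocopy--;
    }
    srcptr++;
  }
  *dstptr = 0;

  return (dstlen + (srcptr - src));
}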
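The return value is what makes truncation detectable: if it is greater than or equal to the size passed in, the result did not fit. The sketch below shows that check. To stay self-contained and to avoid clashing with a system-provided strlcpy(), it uses a stand-in named demo_strlcpy() that we assume follows the same contract; both names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Stand-in with the same contract as strlcpy(), under a different
// (hypothetical) name so it cannot clash with a system strlcpy().
std::size_t demo_strlcpy(char *dst, const char *src, std::size_t size) {
  std::size_t srclen = std::strlen(src);
  if (size != 0) {
    std::size_t n = (srclen < size - 1) ? srclen : size - 1;
    std::memcpy(dst, src, n);
    dst[n] = '\0';
  }
  return srclen;  // length the copy would need, excluding the terminator
}

// Returns true only if src fit into dst without truncation.
bool copy_exact(char *dst, const char *src, std::size_t size) {
  return demo_strlcpy(dst, src, size) < size;
}
```

A caller can treat a false result as an error, or use the returned length plus one to allocate a buffer that is large enough and try again.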

As part of its security push, Microsoft has developed a new set of string-handling functions for C and C++ that are defined in the header file strsafe.h. The new functions handle both ANSI and Unicode character sets, and each function is available in byte-count and character-count versions. For more information regarding using strsafe.h functions in your Windows programs, visit the MSDN reference for strsafe.h.

All of the string-handling improvements we've discussed so far operate using traditional C-style NULL-terminated strings. While strlcat(), strlcpy(), and Microsoft's new string-handling functions are vast improvements over the traditional C string-handling functions, they all still require diligence on the part of the programmer to maintain information regarding the allocated size of destination buffers.

An alternative to using traditional C-style strings is to use the SafeStr library, which is available from www.zork.org/safestr. The library is a safe string implementation that provides a new, high-level data type for strings, tracks accounting information for strings, and performs many other operations. For interoperability purposes, SafeStr strings can be passed to C string calls, as long as those calls use the string in a read-only manner. (We discuss SafeStr in some detail in Recipe 3.4 of the upcoming Secure Programming Cookbook for C and C++.)

Finally, applications that transfer strings across a network should consider including a string's length along with the string itself, rather than requiring the recipient to rely on finding the NULL-terminating character to determine the length of the string. If the length of the string is known up front, the recipient can allocate a buffer of the proper size up front, and read the appropriate amount of data into it. The alternative is to read byte-by-byte, looking for the NULL-terminator, and possibly repeatedly resizing the buffer. Dan J. Bernstein has defined a convention called Netstrings for encoding the length of a string along with the string itself. This protocol simply would have you send the length of the string represented in ASCII, then a colon, then the string itself, then a trailing comma. For example, if you were to send the 13-character string "Hello, world!" over a network, you would send:

13:Hello, world!,

Note that the Netstring representation does not include the NULL-terminator, as that is really part of the machine-specific representation of a string, and is not necessary on the network.
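A rough sketch of encoding and parsing this format follows. The function names netstring_encode() and netstring_decode() are hypothetical, and the parser is deliberately small: it checks the length prefix, the colon, and the trailing comma against the number of bytes actually received, but it does not reject leading zeros in the length, which a stricter implementation might.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Encode len bytes of data as a netstring into out (capacity outsize).
// Returns the encoded length, or 0 if out is too small. The encoded
// form is not NUL-terminated.
std::size_t netstring_encode(const char *data, std::size_t len,
                             char *out, std::size_t outsize) {
  int hdr = std::snprintf(out, outsize, "%zu:", len);
  if (hdr < 0) return 0;
  std::size_t h = static_cast<std::size_t>(hdr);
  if (h >= outsize || len + 1 > outsize - h) return 0;
  std::memcpy(out + h, data, len);
  out[h + len] = ',';
  return h + len + 1;
}

// Parse a netstring held in in[0..inlen). On success, stores a pointer
// to the payload (inside `in`) and its length, and returns true.
bool netstring_decode(const char *in, std::size_t inlen,
                      const char **payload, std::size_t *len) {
  std::size_t i = 0, n = 0;
  if (inlen == 0 || in[0] < '0' || in[0] > '9') return false;
  while (i < inlen && in[i] >= '0' && in[i] <= '9') {
    if (n > (SIZE_MAX - 9) / 10) return false;  // length would overflow
    n = n * 10 + static_cast<std::size_t>(in[i] - '0');
    i++;
  }
  if (i >= inlen || in[i] != ':') return false;
  i++;
  if (n >= inlen - i) return false;  // need n payload bytes plus ','
  if (in[i + n] != ',') return false;
  *payload = in + i;
  *len = n;
  return true;
}
```

The key point is that the receiver never trusts the declared length on its own: it is always compared against the amount of data that actually arrived before any byte is read or copied.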

Using C++

In C++, you generally have a lot less to worry about if you use the standard C++ string class, std::string. It is designed in such a way that buffer overflows are less likely. Standard I/O using the stream operators (>> and <<) is safe when using the standard C++ string type.

However, buffer overflows when using strings in C++ are not out of the question. First, the programmer may choose to use old-fashioned C API calls, which work fine in C++ but are just as risky as they are in C.

Second, while C++ usually throws an out_of_range exception when an operation would overflow a buffer, there are two cases where it doesn't.

The first problem area occurs when using the subscript operator []. This operator doesn't perform bounds checking for you, so be careful with it.
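When an index comes from untrusted input, the checked accessor at() is the safer choice, because it throws std::out_of_range instead of reading or writing outside the string. A minimal sketch, with a hypothetical helper name:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the character at position i, or
// `fallback` if i is out of range. at() checks the index and throws
// std::out_of_range; operator[] performs no such check.
char char_at_or(const std::string &s, std::size_t i, char fallback) {
  try {
    return s.at(i);
  } catch (const std::out_of_range &) {
    return fallback;
  }
}
```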

The second problem area is when using C-style strings with the C++ standard library. C-style strings are always a risk, because even C++ doesn't know how much memory is allocated to a string. Consider the following C++ program:

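#include <iostream>

using namespace std;

// WARNING: This code has a buffer overflow in it.
// (Since C++20, operator>> on a char array is limited to the array's
// size, so the overflow occurs when building for an earlier standard.)
int main() {
   char buf[12];

   cin >> buf;
   cout << "You said... " << buf << endl;
}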

If you compile the above program without optimization and then run it, typing in more than 11 printable ASCII characters (remember that C++ will add a NULL to the end of the string), the program will either crash or print out more characters than buf can store. Those extra characters get written past the end of buf.
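Two ways to avoid the overflow in the program above are sketched here, with hypothetical helper names. The first reads into a std::string, which grows as needed. The second keeps the fixed 12-byte buffer but uses the standard std::setw manipulator to cap how many characters the extraction may store.

```cpp
#include <iomanip>
#include <istream>
#include <sstream>
#include <string>

// Reads one whitespace-delimited word. std::string grows as needed,
// so long input cannot overrun a fixed buffer.
std::string read_word(std::istream &in) {
  std::string word;
  in >> word;
  return word;
}

// Reads a word into a fixed 12-byte buffer. setw() limits extraction
// to 11 characters plus the terminator; the rest stays in the stream.
void read_word_fixed(std::istream &in, char (&buf)[12]) {
  in >> std::setw(static_cast<int>(sizeof buf)) >> buf;
}

// Convenience wrapper for trying the helper on an in-memory line.
std::string first_word_of(const std::string &line) {
  std::istringstream in(line);
  return read_word(in);
}
```

With std::cin in place of the string stream, read_word(std::cin) is a drop-in replacement for the unsafe extraction into buf.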

Also, when indexing a C-style string through C++, C++ always assumes that the indexing is valid, even if it isn't.

Another problem occurs when converting C++-style strings to C-style strings. If you use string::c_str() to do the conversion, you will get a properly NULL-terminated, C-style string. However, if you use string::data(), which returns a pointer to an array holding the string's characters, the original C++ standard does not guarantee that the array is NULL-terminated. That is, the only difference between c_str() and data() is that c_str() adds a trailing NULL. (Since C++11, data() is required to return the same NULL-terminated array as c_str(), so this pitfall applies only to older compilers and libraries.)

One final point with regard to C++ is that plenty of applications do not use the standard string library and instead use third-party libraries. Such libraries are of varying quality when it comes to security. We recommend using the standard library, if at all possible. Otherwise, take care to understand the semantics of the library you do use, and the possibilities for buffer overflow.

Stack-Protection Technologies

In C and C++, memory for local variables is allocated on the stack. In addition, information pertaining to the control flow of a program is also maintained on the stack. If an array is allocated on the stack, and that array is overrun, an attacker can overwrite the control flow information that is also stored on the stack. As we mentioned above, this type of attack is often referred to as a stack-smashing attack.

Recognizing the gravity of stack-smashing attacks, several technologies have been developed that attempt to protect programs against them. These technologies take various approaches. Some are implemented in the compiler (such as Microsoft's /GS compiler flag and IBM's ProPolice), while others are dynamic runtime solutions (such as Avaya Labs' LibSafe).

All of the compiler-based solutions work in much the same way, although there are some differences in the implementations. They work by placing a "canary" (which is typically some random value) on the stack between the control flow information and the local variables. The code that is normally generated by the compiler to return from the function is modified to check the value of the canary on the stack, and if it is not what it is supposed to be, the program is terminated immediately.

The idea behind using a canary is that an attacker attempting to mount a stack-smashing attack will have to overwrite the canary in order to overwrite the control flow information. Because the canary is a random value, the attacker cannot know what it is and therefore cannot include it in the data used to "smash" the stack.
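The following toy model illustrates the check. It is only a simulation: real compilers place and verify the canary in generated function prologue and epilogue code, and the structure and function names here are hypothetical. The "frame" is an ordinary struct, and the deliberately careless write is capped at the size of the struct so that the demonstration itself stays within its own memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy stand-in for a stack frame: locals first, then the canary, then
// the control flow information an attacker wants to reach.
struct ToyFrame {
  char          locals[16];
  std::uint32_t canary;
  std::uint32_t return_addr;
};

// Copies len bytes starting at the locals without checking them against
// the size of locals[], the way an overflowing strcpy() would. The cap
// at sizeof f only keeps this demo inside the struct.
void toy_unsafe_write(ToyFrame &f, const void *data, std::size_t len) {
  if (len > sizeof f) len = sizeof f;
  std::memcpy(&f, data, len);
}

// The check a protected function performs before returning.
bool toy_canary_intact(const ToyFrame &f, std::uint32_t expected) {
  return f.canary == expected;
}
```

Any write long enough to reach return_addr must pass through canary first, so the check fails before the corrupted control flow information would be used.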

When a program is distributed in source form, the developer of the program cannot enforce the use of StackGuard or ProPolice because they are both nonstandard extensions to the GCC compiler. It is the responsibility of the person compiling the program to make use of one of these technologies. On the other hand, although it is rare for Windows programs to be distributed in source form, the /GS compiler flag is a standard part of the Microsoft Visual C++ compiler, and the program's build scripts (whether they are make files, DevStudio project files, or something else entirely) can enforce the use of the flag.

For Linux systems, Avaya Labs' LibSafe technology is not implemented as a compiler extension, but instead takes advantage of a feature of the dynamic loader that causes a dynamic library to be preloaded with every executable. Using LibSafe does not require the source code for the programs it protects, and it can be deployed on a system-wide basis.

LibSafe works by replacing the implementation of several standard functions that are known to be vulnerable to buffer overflows, such as gets(), strcpy(), and scanf(). The replacement implementations attempt to compute the maximum possible size of a statically allocated buffer used as a destination buffer for writing using a GCC built-in function that returns the address of the frame pointer. That address is normally the first piece of information on the stack after local variables. If an attempt is made to write more than the estimated size of the buffer, the program is terminated.

Unfortunately, there are several problems with the approach taken by LibSafe. One problem is that it cannot accurately compute the size of a buffer; the best that it can do is limit the size of the buffer to the difference between the start of the buffer and the frame pointer. The other problem is that LibSafe's protections will not work with programs that were compiled using the -fomit-frame-pointer flag to GCC, an optimization that causes the compiler to not put a frame pointer on the stack. Although relatively useless, this is a popular optimization for programmers to employ.

In addition to providing protection against conventional stack-smashing attacks, the newest versions of LibSafe also provide some protection against format-string attacks. The format-string protection also requires access to the frame pointer, because it attempts to filter out arguments that are not pointers into the heap or the local variables on the stack.

John Viega is CTO of the SaaS Business Unit at McAfee and the author of many security books, including Building Secure Software (Addison-Wesley), Network Security with OpenSSL (O'Reilly), and the forthcoming Myths of Security (O'Reilly).

Matt Messier is Director of Engineering at Secure Software, and coauthor of O'Reilly's "Network Security with OpenSSL."