ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


Linux Compatibility on BSD for the PPC Platform: Part 4

by Emmanuel Dreyfus
06/21/2001

In part 3, we had a look at several bug fixes that helped us get some interesting Linux applications working. However, we had a few bugs remaining that broke even more interesting binaries such as the Java Development Kit (JDK) or RealPlayer. In this part, we'll focus on the bug fixes that are needed in order to get a working Java Virtual Machine (JVM) for PowerPC-based ports of NetBSD. Surprisingly, most of the bugs we encountered here were located in machine-independent code, and they did not cause any known problems on the alpha, i386, and m68k Linux emulations.

A tricky bug fix: The brk() issue

At first, the Java Virtual Machine for Linux seemed like a very interesting binary to run on the PowerPC, but the Linux compatibility was not accurate enough to get it working. We tried the Blackdown team's JDK, and when invoking the JVM, it hung. This was caused by the JVM native thread model, which makes use of Linux Real Time signals. We have not had a look at RT signal emulation yet, and there were obviously bugs in it. Fortunately, the JVM provides an alternative threading model, enabled by the -green flag -- the Green Threads.

Green Threads do not make use of RT signals so they are more likely to run under emulation. But a simple test quickly exhibits a new problem:

$ java -green -version
java: ../../../../../src/solaris/hpi/green_threads/src/dl-malloc.c:1636:
malloc_extend_top: Assertion '((size_t)((char*)(((mbinptr)(&(av_[2 *
(0)])))->fd) + top_size) & (pagesz - 1)) == 0' failed.
Abort (core dumped)

Tracing the process with ktrace(1), we can see that this error is triggered just after a set of brk() calls. The brk() system call is used by Unix processes to increase the size of the heap. Upon a brk() call, the kernel maps new pages of memory on top of the process's heap. Then the kernel returns the new heap top address to the process. This address is called the break value.

The brk() call sets the break value to an absolute address. When called with a null value, brk() returns the current break value. There is also an sbrk() library call, which moves the break value by a relative offset. sbrk() is implemented as two brk() calls: the first is made with a null value, to get the current break address, and the second sets the break to the current value augmented by the offset. For more information, please have a look at the brk(2) man page.
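The relationship between the two calls can be sketched in a few lines of C. This is only a user-space simulation under stated assumptions: sim_break, sim_brk(), and sim_sbrk() are hypothetical names invented for the illustration, the starting address is borrowed from the Linux trace shown later in this article, and no real system call is made.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's record of one process's heap top. */
static uintptr_t sim_break = 0x100109d8;

/* Models the brk() behavior described above: 0 queries, anything else sets. */
static uintptr_t sim_brk(uintptr_t addr)
{
    if (addr != 0)
        sim_break = addr;
    return sim_break;
}

/* sbrk() as two brk() calls: query the break, then move it by the offset. */
static uintptr_t sim_sbrk(intptr_t offset)
{
    uintptr_t old = sim_brk(0);          /* first call: read current break */

    sim_brk(old + (uintptr_t)offset);    /* second call: set the new break */
    return old;                          /* like sbrk(3), report the previous break */
}
```

Calling sim_sbrk(0x21) on the initial state moves the simulated break from 0x100109d8 to 0x100109f9, which is the first pair of calls in the Linux trace.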

Debugging the brk() problem in the JVM without a look at the sources could have been tricky, but thanks to Kevin Hendricks from the Blackdown team, it was possible to work on a test program that reproduced the problem.

The failure was caused by alignment issues: The JVM wants the break value to be aligned on a page boundary. It therefore makes two sbrk() calls: one to allocate some space and get the resulting break value, and the second to adjust the break value so that it ends up on a page boundary. The final value is then tested for page alignment, and a non-aligned value triggers the assertion we saw. When tracing the JVM calls to brk() while running natively on a Linux system, we get this:

brk(0)          = 0x100109d8
brk(0x100109f9) = 0x100109f9
brk(0)          = 0x100109f9
brk(0x20021000) = 0x20021000

Each pair of brk() calls is the result of one sbrk() library call. And here is the result when running the JVM in emulation on NetBSD:

brk(0)          = 0x10011000
brk(0x10011021) = 0x10011021
brk(0)          = 0x10012000
brk(0x12012fdf) = 0x12012fdf

The difference is obvious: NetBSD ended up with a non-aligned break value, and this is what led to the assertion failure. Why did NetBSD get a non-aligned break value? Because it simply returned the value requested by the calling process, here 0x12012fdf.

Previously in this series

Linux Compatibility on BSD for the PPC Platform -- The Linux compatibility layer allows BSD to run Linux binary applications. Emmanuel Dreyfus explains how he implemented this on NetBSD for the PowerPC.

Linux Compatibility on BSD for the PPC Platform: Part 2 -- Emmanuel Dreyfus takes a look at how to prevent dynamic Linux binary compatibility problems on the NetBSD/PowerPC platform.

Linux Compatibility on BSD for the PPC Platform: Part 3 -- Signals are the interactions between the kernel and the user program -- a program can't run without them. Emmanuel Dreyfus explains how to make your signals Linux-compatible.

In fact, the problem is that between the second and the third brk() calls, the break value presented by the NetBSD kernel to the user process has changed. The JVM uses the return value of the second brk() call to compute the offset needed to page align the break value. Here, with a return value of 0x10011021, the JVM knows that a 0xfdf adjustment is needed in order to reach a page-aligned address, and it includes that adjustment in its next sbrk() request. Unfortunately, the third call reports a break value of 0x10012000, which is already page aligned. Applying an adjustment computed from the stale 0x10011021 value leads us to 0x12012fdf, and this address is not page aligned.
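The adjustment itself is simple mask arithmetic. Here is a minimal sketch, assuming 4 KB pages (the traces are consistent with that size, but it is an assumption here). PAGE_SIZE_SKETCH, page_adjust(), and is_page_aligned() are illustrative names, not functions from the JVM sources.

```c
#include <stdint.h>

#define PAGE_SIZE_SKETCH ((uintptr_t)0x1000)   /* assumption: 4 KB pages */

/* Bytes to add to addr to reach the next page boundary (0 if already aligned). */
static uintptr_t page_adjust(uintptr_t addr)
{
    uintptr_t mask = PAGE_SIZE_SKETCH - 1;

    return (PAGE_SIZE_SKETCH - (addr & mask)) & mask;
}

/* Nonzero when addr sits exactly on a page boundary. */
static int is_page_aligned(uintptr_t addr)
{
    return (addr & (PAGE_SIZE_SKETCH - 1)) == 0;
}
```

With the values from the NetBSD trace, page_adjust(0x10011021) yields 0xfdf, while is_page_aligned(0x12012fdf) is false: the adjustment is only correct if the break has not moved since it was read.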

The explanation of this break value inconsistency is that NetBSD always sets the break value to a page-aligned address. On a brk() system call with a non-aligned address, it returns the requested value to the user process while the real break value is set to a page-aligned address. This is why, on the next brk() call, the break value read is not at the same address. The idea behind this behavior is that you only have to call brk() once to get a page-aligned break value. Linux, on the other hand, can set the break value to a non-page-aligned address, and you need to call brk() at least twice to get a page-aligned break.

The brk() emulation can be fixed by keeping track of each Linux process's idea of its break value. The kernel keeps setting the real break on a page-aligned address while returning the requested address, which may not be page aligned, to the user process. The fix is to record this returned value and to report it as the break address on the next brk() call.

We just have to find a place to store the process's idea of the break value. It fits nicely in struct linux_emuldata (defined in sys/compat/linux/common/linux_emuldata.h), which is referenced by the p_emuldata member of the Linux process's struct proc (defined in sys/sys/proc.h). The new field we introduce in struct linux_emuldata to keep track of the process's idea of the break value is called p_break.
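The idea behind the fix can be modelled in user space. The following is a simplified sketch and not the kernel code: only the p_break field name comes from the text, while struct emuldata_sketch, its real_break field, round_page_sketch(), linux_brk_sketch(), and the 4 KB page size are assumptions made for the illustration.

```c
#include <stdint.h>

#define PAGE_MASK_SKETCH ((uintptr_t)0xfff)   /* assumption: 4 KB pages */

/* Hypothetical, reduced per-process state; only p_break is named in the text. */
struct emuldata_sketch {
    uintptr_t real_break;   /* what the kernel actually mapped (page aligned) */
    uintptr_t p_break;      /* what the Linux process believes the break is */
};

/* Round an address up to the next page boundary. */
static uintptr_t round_page_sketch(uintptr_t addr)
{
    return (addr + PAGE_MASK_SKETCH) & ~PAGE_MASK_SKETCH;
}

/* Emulated Linux brk(): a zero argument queries, anything else sets. */
static uintptr_t linux_brk_sketch(struct emuldata_sketch *ed, uintptr_t addr)
{
    if (addr != 0) {
        ed->real_break = round_page_sketch(addr);  /* kernel still page aligns */
        ed->p_break = addr;                        /* remember the request */
    }
    return ed->p_break;   /* the process always reads back its own value */
}
```

Replaying the start of the NetBSD trace against this model, a query after brk(0x10011021) now returns 0x10011021 instead of 0x10012000, so an adjustment of 0xfdf lands on a page boundary.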

With this fix to the way Linux brk() is emulated, we are able to get minimal support for the Java Virtual Machine. A simple program such as "Hello world" now works:

/* Hello.java -- A simple test for the JVM */
public class Hello {
        static public void main (String[] args) {
                System.out.println("Hello");
        }
}

Inconsistent signal delivery

While working on Java tests, new problems appeared. The most obvious was that it was impossible to pass any test featuring a native program launch from a Java program. In fact, the native program was launched, but the Java program was not notified of its child's death. This problem was not Java specific, but it has only been possible to reproduce it with a Java program.

// exec_test.java -- run a native program
import java.lang.*;
import java.io.*;

class exec_test
{
  static Process pid;
  static String cmdstring = "/bin/ps";

  public static boolean execIt(String argv, String pname)
    {
        try {
          System.out.println(" Will call "+pname);
          pid = Runtime.getRuntime().exec(argv);
        } catch (IOException e) {
          System.out.println("Failed to execute "+pname);
          return false;
        }
        System.out.println("Waiting for "+pname+" to die");
        try{pid.waitFor();}
        catch(InterruptedException e){return false;}
        System.out.println("end of "+pname);
        return true;
    }

    public static void main(String args[])
    {
        System.out.println("In exec_test");
        execIt(cmdstring,"Testing /bin/ps");
    }
  }

The program basically launches /bin/ps and waits for it to exit. Sometimes it works; sometimes it fails. Success is loosely correlated with the load average, but not entirely. The bug only showed up under a particular race condition between different signals, which made it extremely difficult to spot.

Hendricks was finally able to find what was wrong by using the logging feature of the JDK. This is done by executing the Java program with the java_g command instead of java. Note that this feature has been disabled in JDK-1.3.0. We used JDK-1.1.8 for the tests.

Using java_g -green -l6 exec_test, we got a lot of output, including a line complaining about an unexpected signal 20 where a SIGCHLD was expected. A quick look at NetBSD's sys/sys/signal.h shows that signal 20 is NetBSD's SIGCHLD. In Linux, SIGCHLD is signal 17. The trace also complained about a bad signal 23 instead of a SIGIO. For NetBSD, signal 23 is SIGIO, and for Linux, it's signal 29.

Obviously, the signal numbers were not being correctly translated between NetBSD and Linux. The first idea I tried was to carefully check the native_to_linux_sig[] array in sys/compat/linux/common/linux_signal.c in case signal numbers were mixed up. This was not the case.

The next step was to check the linux_sendsig() function in sys/compat/linux/arch/powerpc/linux_machdep.c, which is responsible for sending signals to Linux processes. This function takes a sig parameter, which is the signal number, and uses it twice.

First, it builds the Linux struct sigcontext on the process's stack. This structure has a field to hold the signal number, and the signal number is copied there with the appropriate translation:

/*
 * Prepare a sigcontext for later.
 */
sc.lsignal = (int)native_to_linux_sig[sig];
sc.lhandler = (unsigned long)catcher;
native_to_linux_old_sigset(mask, &sc.lmask);
sc.lregs = (struct linux_pt_regs*)fp;

Second, when setting up the trap frame prior to transferring control to the signal trampoline, the appropriate translation was missing:

/*
 * Set the registers according to how the Linux 
 * process expects them
 */
tf->fixreg[1] = (int)fp;
tf->lr = (int)catcher;
tf->fixreg[3] = (int)sig;
tf->fixreg[4] = (int)&fp->lgp_regs;
tf->srr0 = (int)p->p_sigctx.ps_sigcode;

Once the problem was found, it was quite easy to fix by changing the third line in the above code fragment:

tf->fixreg[3] = (int)native_to_linux_sig[sig];

With this fix, Java programs forking native programs are able to work without suffering random failures. It also has the side effect of fixing the mail and news part of Netscape Communicator that was previously broken.
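To make the effect of the missing translation concrete, here is a small sketch of a lookup table in the spirit of native_to_linux_sig[]. Only the two mappings quoted earlier (20 to 17 and 23 to 29) come from the text. The table size, the zero default for all other entries, and the names native_to_linux_sig_sketch and linux_signo_for_handler() are assumptions for the illustration and do not reflect the contents of the real kernel table.

```c
#define NSIG_SKETCH 32   /* assumption: table size chosen for the example */

/*
 * Only the two mappings quoted in the text are filled in. Every other
 * entry defaults to 0 here, which is NOT what the real table contains.
 */
static const int native_to_linux_sig_sketch[NSIG_SKETCH] = {
    [20] = 17,   /* NetBSD SIGCHLD -> Linux SIGCHLD */
    [23] = 29,   /* NetBSD SIGIO   -> Linux SIGIO   */
};

/* Signal number the Linux handler should receive as its first argument. */
static int linux_signo_for_handler(int native_sig)
{
    if (native_sig < 0 || native_sig >= NSIG_SKETCH)
        return 0;   /* out of range: no translation available */
    return native_to_linux_sig_sketch[native_sig];
}
```

Before the fix, the handler's first argument carried the untranslated number, so a Linux handler waiting for signal 17 was handed 20 and did not recognize it as SIGCHLD.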

Non-standard behavior of asynchronous I/O

The previous fix helps the JDK a lot, but there are still some rare hangs. One can be observed when building the Apache Foundation's Jakarta-Ant, the make(1)-like build utility for Java. Another hang occurred when attempting to run Jakarta-Tomcat, the Apache Foundation's JSP server. In this section, we will focus on the problem with Jakarta-Ant.

The offending program here was javac, the Java compiler. The problem was obviously emulation related because it was possible to successfully build Jakarta-Ant using a native build of Jikes, the Java compiler written in C.

The JDK-1.2.2 logging feature was again very useful. For the Java compiler, this can be enabled by invoking javac_g -J-Xl6 (no space after the J) instead of just javac. This is worth the comment because the -J flag is not documented except in the JDK sources.

Note that anyone can get the JDK sources; the only requirement is to make an agreement with Sun. But be aware that reading the JDK sources will make you unable to contribute to any open-source Java implementation such as Kaffe.

Running ktrace(1) against javac_g with full logging enabled, Hendricks was able to discover that the hang was caused by a spurious SIGIO. We then tried a few C programs that reproduced what the JDK was doing, and we ended with this test program:

/*
 * sigio2.c -- Test asynchronous I/O for pipes
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <fcntl.h>

void io_sighandler (int sig) {
  printf ("pid=%d got sigio\n", getpid ());
  printf ("I GOT SIGIO\n");
  exit (-1);
}

int main (int argc, char** argv) {
  struct sigaction aio;
  int fdsync[2];
  int err;
  char c;
  sigset_t set;

  sigemptyset(&set);
  sigaddset(&set,SIGIO);

  aio.sa_flags = SA_RESTART;
  aio.sa_handler = io_sighandler;
  sigemptyset(&aio.sa_mask);
  if (sigaction(SIGIO, &aio, 0) == -1) {
      printf("Error: Bad return value from sigaction call\n");
      exit(1);
  }

  if (pipe(fdsync) < 0) {
      printf("Error: bad pipe call\n");
      exit(1);
  }

  /* now set the pipe write end to be non-blocking async */
  fcntl(fdsync[1],F_SETFL, O_NONBLOCK | FASYNC);
  fcntl(fdsync[1],F_SETOWN, getpid());

  err = write(fdsync[1], "AAAA", 4);
  if (err < 0) {
      printf("write() got err=%d\n", err);
  }
  printf ("written %d bytes\n", err);

  sleep (1);

  do
      err = read(fdsync[0], &c, 1);
  while (!err);
  printf ("read %d bytes\n", err);

  printf("NO SIGIO\n");
  exit(0);
}

This test program reproduces in one process what it takes the JVM two processes to synchronize: It makes use of asynchronous I/O through a pipe. Running this program natively on Linux and NetBSD gives different results. When we ran the program as a Linux binary on NetBSD, we got the NetBSD behavior and not the Linux behavior. This is where the problem was: NetBSD delivers a SIGIO on the read() call, whereas Linux does not. That is how the JVM got confused by the unexpected SIGIO.

At a glance, this may appear to be a bug in the way NetBSD handles asynchronous I/O. The writer has written 4 bytes to the pipe, and it is not blocked on the write operation. Thus, there is no reason why it needs to know that the reader has read one byte. Linux seems to implement a better behavior here.

In fact, this is not a real bug, because there seems to be no standard such as POSIX that explains how asynchronous I/O is supposed to work. With no such standard available, there can't be a "standard" behavior, and thus, this isn't considered a bug in the way NetBSD handles asynchronous I/O for native programs. Of course, there is still a bug in the way NetBSD was emulating asynchronous I/O for Linux binaries.

The lack of a standard is reflected by the diversity of the behaviors implemented by the different Unix systems. Running the test program on eight different operating systems gives the following results:

The systems triggering SIGIO on read() for pipes are in fact the Unix systems that still use the original pipe implementation from Berkeley's BSD Unix, which is based on a pair of Unix domain sockets. Solaris uses an AT&T Unix System V implementation that does not implement asynchronous I/O on pipes, and Linux has an implementation written from scratch that also ignores asynchronous I/O requests on pipes.

Digital Unix and MacOS X both have strong BSD roots, and it is not surprising that they behave in the same way that NetBSD does. What is surprising is that the two operating systems that are the closest to NetBSD, that is, FreeBSD and OpenBSD, implement a different behavior. This is just because they both use a new optimized pipe implementation, written by John Dyson for the FreeBSD project. This new implementation does not implement asynchronous I/O on read() operations for pipes. FreeBSD pipes are currently being integrated in NetBSD, and using the new pipe implementation leads to the same behavior that Linux and other OSes have.

Once the problem has been identified, it is time to propose a fix. Let's have a look at the way the SIGIO is issued.

When a process makes a read() system call, the kernel runs the sys_read() function, which is located in sys/kern/sys_generic.c. sys_read() in turn calls dofileread() from the same file.

dofileread() invokes the function pointed to by the fo_read field of the file operation structure. This file operation structure, struct fileops, is defined in struct file, in sys/sys/file.h. For pipes, it is initialized when the pipe() system call is called. pipe() is implemented in the kernel as sys_pipe(), which is located in the sys/kern/uipc_syscalls.c file. sys_pipe() sets the file operation structure for pipes to a static fileops structure called socketops.

The underlying idea of the fileops structure is to use an object-oriented scheme for file handling. All files are handled the same way by the kernel, through the standard methods available in the struct fileops: read, write, ioctl, etc. The struct fileops is initialized when the file object (that is, the struct file) is created, and the methods in it depend on the file type. Read operations on a pipe, a plain file, or a block device are hence requested the same way, but implemented by different functions.
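This dispatch scheme can be illustrated with a reduced user-space model. It is a sketch only: struct fileops_sketch and struct file_sketch are hypothetical, much smaller than the kernel structures, and the two toy file types (plain_ops and null_ops) and the f_data field are invented for the example.

```c
#include <stddef.h>
#include <string.h>

struct file_sketch;   /* forward declaration */

/* Hypothetical, reduced method table; the real struct fileops has more entries. */
struct fileops_sketch {
    size_t (*fo_read)(struct file_sketch *fp, char *buf, size_t len);
};

struct file_sketch {
    const struct fileops_sketch *f_ops;   /* chosen when the file object is created */
    const char *f_data;                   /* toy backing store for the example */
};

/* "Plain file" read: copy out of the backing string. */
static size_t plain_read(struct file_sketch *fp, char *buf, size_t len)
{
    size_t n = strlen(fp->f_data);

    if (n > len)
        n = len;
    memcpy(buf, fp->f_data, n);
    return n;
}

/* "Null device" read: always end-of-file. */
static size_t null_read(struct file_sketch *fp, char *buf, size_t len)
{
    (void)fp; (void)buf; (void)len;
    return 0;
}

static const struct fileops_sketch plain_ops = { plain_read };
static const struct fileops_sketch null_ops = { null_read };

/* Generic entry point: the caller never needs to know the file type. */
static size_t dofileread_sketch(struct file_sketch *fp, char *buf, size_t len)
{
    return fp->f_ops->fo_read(fp, buf, len);
}
```

The caller of dofileread_sketch() uses the same call for both objects; which function actually runs is decided once, when the f_ops pointer is set.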

Although this is not really related to the compatibility subsystem, it is probably worth mentioning that this scheme is widely used in the Unix kernel. The most popular application is probably the Virtual File System (VFS) interface. The VFS uses pointers to functions to provide the same programming interface to access regular files and directories stored on various filesystem types. The operations are pointed to by the v_op field of the struct vnode (defined in sys/sys/vnode.h). Whether the regular file is on an FFS, NFS, or NTFS filesystem, file operations are requested the same way through pointers to filesystem-specific methods, and the operations are implemented by different functions depending on the filesystem.

But let's come back to pipe file operations. socketops contains the file operation functions for all sockets. It is defined in sys/kern/sys_socket.c, and its fo_read field is a pointer to the soo_read() function, located in the same file. soo_read() invokes the function pointed to by the so_receive field of the struct socket (defined in sys/sys/socketvar.h), which defines the receive method for the socket on which we want to read (remember that pipes are implemented as Unix domain sockets in NetBSD).

We need to make one more long journey into the kernel sources to find out where so_receive is pointing. In sys_pipe(), we can see that the kernel is creating two Unix sockets to build the pipe. It does this by invoking socreate(), which is located in sys/kern/uipc_socket.c. In this function, there is some black magic to set the so_receive field. Its value is copied from the pr_usrreq field of a struct protosw variable. struct protosw is defined in sys/sys/protosw.h. It defines per-protocol properties for sockets. The struct protosw variable used by socreate() is obtained by a call to pffindproto(). pffindproto() can be found in sys/kern/uipc_domain.c, and its job is to return the struct protosw for a given protocol.

The protosw structures are statically initialized in sys/kern/uipc.c. For a Unix socket, the pr_usrreq field is pointing to sys/kern/uipc_usrreq.c:uipc_usrreq(). Now we finally know that so_receive is pointing to uipc_usrreq().

uipc_usrreq() is responsible for dispatching various socket operations: receiving, sending, connecting, and so on. On a receive operation (case PRU_RCVD in the function), it ends by calling sowwakeup(), which is a macro defined in sys/sys/socketvar.h, and which calls sowakeup() in sys/kern/uipc_socket2.c. sowakeup()'s job is to wake up the peer process, issue a SIGIO, and make any appropriate upcall.

Modifying something in uipc_usrreq() is not a good idea; it is complex enough already. Care should be taken to fold in our fix somewhere else. In fact, the easiest way of fixing the problem would just be to ignore asynchronous I/O requests for binaries of operating systems that do not implement it for pipes. This fix would take place in the fcntl() implementation. Let's have a look at the kernel sources.

sys_fcntl() is implemented in sys/kern/kern_descrip.c. It basically calls the function pointed to by the fo_ioctl field of the struct fileops of the underlying object. Here, it is a Unix socket, and we saw that its struct fileops is the socketops static variable defined in sys/kern/sys_socket.c. Thus, fo_ioctl points to soo_ioctl(), which is also defined in sys/kern/sys_socket.c. To request asynchronous I/O, the calling process calls fcntl(), and the request reaches soo_ioctl() as the FIOASYNC command. In soo_ioctl(), the FIOASYNC command was implemented like this:

case FIOASYNC:
    if (*(int *)data) {
        so->so_state |= SS_ASYNC;
        so->so_rcv.sb_flags |= SB_ASYNC;
        so->so_snd.sb_flags |= SB_ASYNC;
    } else {
        so->so_state &= ~SS_ASYNC;
        so->so_rcv.sb_flags &= ~SB_ASYNC;
        so->so_snd.sb_flags &= ~SB_ASYNC;
    }
    return (0);

We wanted to prevent soo_ioctl() from setting asynchronous flags when the socket was in fact a pipe and when the emulation was not NetBSD or Digital Unix. (There is no MacOS X emulation yet.) To achieve this, we needed to recognize sockets implementing a pipe. This was done by adding an SS_ISAPIPE flag to the so_state field of struct socket. SS_ISAPIPE is defined in sys/sys/socketvar.h:

#define  SS_ISAPIPE     0x800 /* socket is implementing a pipe */

This flag is set in sys_pipe(), in sys/kern/uipc_syscalls.c, so that we will be able to tell that this socket is a pipe:

if ((error = socreate(AF_LOCAL, &rso, SOCK_STREAM, 0)) != 0)
   return (error);
if ((error = socreate(AF_LOCAL, &wso, SOCK_STREAM, 0)) != 0)
   goto free1;
/* remember this socket pair implements a pipe */
wso->so_state |= SS_ISAPIPE;
rso->so_state |= SS_ISAPIPE;

Then we needed to know if a given emulation required the original BSD behavior for pipes or not. This was done by introducing another new flag, this time in the e_flags field of struct emul, which is defined in sys/sys/proc.h:

/*
 * No BSD style async I/O pipes. Async I/O requests through
 * fcntl() for pipes will be ignored.
 */
#define  EMUL_NO_BSD_ASYNCIO_PIPE   0x002

This flag is set or left clear in the struct emul definition for each OS. For NetBSD native, the struct emul is called emul_netbsd, and it is initialized in sys/kern/kern_exec.c. For Linux, it is emul_linux, initialized in sys/compat/linux/common/linux_exec.c. The scheme is similar for other emulations.

With these two additional flags, we can now do the job, and we end up with this implementation of FIOASYNC in soo_ioctl():

case FIOASYNC:
   if (
#ifndef __HAVE_MINIMAL_EMUL
     (!(so->so_state & SS_ISAPIPE) ||
     (!(p->p_emul->e_flags & EMUL_NO_BSD_ASYNCIO_PIPE))) &&
#endif
     *(int *)data) {
       so->so_state |= SS_ASYNC;
       so->so_rcv.sb_flags |= SB_ASYNC;
       so->so_snd.sb_flags |= SB_ASYNC;
    } else {
       so->so_state &= ~SS_ASYNC;
       so->so_rcv.sb_flags &= ~SB_ASYNC;
       so->so_snd.sb_flags &= ~SB_ASYNC;
    }
 return (0);

The __HAVE_MINIMAL_EMUL ifdef is here because the e_flags field in struct emul is also in a __HAVE_MINIMAL_EMUL ifdef.
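The condition in the new FIOASYNC case is easier to check when it is isolated as a pure function. The sketch below reuses the two flag values quoted above. The function name async_enabled() and its plain integer parameters are assumptions for the illustration, with requested standing in for *(int *)data, and the __HAVE_MINIMAL_EMUL case is left out.

```c
/* Flag values quoted in the text. */
#define SS_ISAPIPE                0x800
#define EMUL_NO_BSD_ASYNCIO_PIPE  0x002

/*
 * Returns nonzero when a FIOASYNC request should turn asynchronous I/O on.
 * so_state and e_flags mirror the fields tested in soo_ioctl().
 */
static int async_enabled(int so_state, int e_flags, int requested)
{
    int is_pipe = (so_state & SS_ISAPIPE) != 0;
    int emul_refuses = (e_flags & EMUL_NO_BSD_ASYNCIO_PIPE) != 0;

    /* Honor the request unless this is a pipe under a refusing emulation. */
    return (!is_pipe || !emul_refuses) && requested;
}
```

Only one combination changes behavior: a pipe owned by a process whose emulation sets EMUL_NO_BSD_ASYNCIO_PIPE. In that case the else branch runs and the asynchronous flags are cleared, even though the process asked for them to be set.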

With this implementation, the pipe behavior was fixed for Linux and Solaris binaries, and probably other emulations as well. This fixed our problem with the Jakarta-Ant build, and it greatly improved the usability of Jakarta-Tomcat, because it was then able to work with JDK-1.3.0 and Green Threads. Thanks to Linux emulation, it is now possible to play with servlets and JSP on NetBSD/PowerPC.

Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.


Copyright © 2009 O'Reilly Media, Inc.