From f54f0bfd764de0ae6e92a7a8a5d4f387db580ac1 Mon Sep 17 00:00:00 2001
From: Luke Shumaker
Date: Mon, 13 Jan 2014 20:50:19 -0500
Subject: new post: My favorite bug: segfaults in Java

---
 public/java-segfault.md | 113 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)
 create mode 100644 public/java-segfault.md

diff --git a/public/java-segfault.md b/public/java-segfault.md
new file mode 100644
index 0000000..e9ddf65
--- /dev/null
+++ b/public/java-segfault.md
@@ -0,0 +1,113 @@
+My favorite bug: segfaults in Java
+==================================
+---
+date: 2014-01-13
+---
+
+I've told this story orally a number of times, but realized that I
+have never written it down. This is my favorite bug story; it might
+not be my hardest bug, but it is the one I am most proud of.
+
+The context
+-----------
+
+In 2012, I was a Senior programmer on the FIRST Robotics Competition
+team 1024. For the unfamiliar, the relevant part of the setup is that
+there are 2 minute and 15 second matches in which you have a 120-pound
+robot that sometimes runs autonomously, and sometimes is controlled
+over WiFi by a person at a laptop running stock "driver station"
+software and modifiable "dashboard" software.
+
+That year, we mostly used the dashboard software to let us monitor
+sensors on the robot, one of them being a video feed from a webcam
+mounted on it. This was really easy because the new standard
+dashboard program had a click-and-drag interface to add stock widgets;
+you just had to make sure the code on the robot was actually sending
+the data.
+
+That was great until, while debugging things, the dashboard would
+suddenly vanish. If it was run manually from a terminal (instead of
+letting the driver station software launch it), you would see a core
+dump indicating a segmentation fault.
+
+This wasn't just us either; I spoke with people on other teams, and
+everyone who was streaming video had this issue. But because it only
+happened every couple of minutes, and a match is only 2:15, the
+dashboard didn't need to run very long; they just crossed their
+fingers and hoped it didn't happen during a match.
+
+The dashboard was written in Java, and the source was available (under
+a 3-clause BSD license), so I dove in, hunting for the bug. Now, the
+program did use the Java Native Interface to talk to OpenCV, which the
+video ran through; so I figured that it must be a bug in the C/C++
+code that was being called. It was especially a pain to track down
+the pointers that were causing the issue, because it was hard with
+native debuggers to see through all of the JVM stuff to the OpenCV
+code, and the OpenCV stuff is opaque to Java debuggers.
+
+Eventually the issue led me back into the Java code--there was a
+native pointer being stored in a Java variable; Java code called the
+native routine to `free()` the structure, but then tried to feed it to
+another routine later. This led to difficulty again--tracking
+objects with Java debuggers was hard because they don't expect the
+program to suddenly segfault; it's Java code, Java doesn't segfault,
+it throws exceptions!
+
+With the help of `println()` I was eventually able to see that some
+code was executing in an order that simply didn't make sense.
+
+The bug
+-------
+
+The issue was that Java was making an unsafe optimization (I never
+bothered to figure out whether it was the compiler or the JVM making
+the mistake; I was satisfied once I had a work-around).
+
+Java was doing something similar to tail-call optimization with regard
+to garbage collection. You see, if it is waiting for the return value
+of a method `m()` of object `o`, and code in `m()` that is yet to be
+executed doesn't access any other methods or properties of `o`, then
+it will go ahead and consider `o` eligible for garbage collection
+before `m()` has finished running.
+
+That is normally a safe optimization to make... except when a
+destructor method (`finalize()`) is defined for the object; the
+destructor can have side effects, and Java has no way to know whether
+it is safe for them to happen before `m()` has finished running.
+
+The work-around
+---------------
+
+The routine that the segmentation fault was occurring in was something
+like:
+
+    public type1 getFrame() {
+        type2 child = this.getChild();
+        type3 var = this.something();
+        // `this` may now be garbage collected
+        return child.somethingElse(var); // segfault comes here
+    }
+
+Where the destructor method of `this` calls a method that will
+`free()` native memory that is also accessed by `child`; if `this` is
+garbage collected before `child.somethingElse()` runs, the backing
+native code will try to access memory that has been `free()`ed, and
+receive a segmentation fault. That usually didn't happen, as the
+routines were pretty fast. However, running 30 times a second,
+eventually bad luck with the garbage collector happens, and the
+program crashes.
+
+The work-around was to insert a bogus call on `this` to keep it
+around until after we were also done with `child`:
+
+    public type1 getFrame() {
+        type2 child = this.getChild();
+        type3 var = this.something();
+        type1 ret = child.somethingElse(var);
+        this.getSize(); // bogus call to keep `this` around
+        return ret;
+    }
+
+Yeah. After spending weeks wading through thousands of lines
+of Java, C, and C++, a bogus call to a method I didn't care about was
+the fix.
-- 
cgit v1.2.3