author Luke Shumaker <LukeShu@sbcglobal.net> 2014-01-13 20:50:19 -0500
committer Luke Shumaker <LukeShu@sbcglobal.net> 2014-01-13 20:50:19 -0500
commit f54f0bfd764de0ae6e92a7a8a5d4f387db580ac1 (patch)
tree 8cd90620e6f1efad68cd02f2a326debaa43b0fe0
parent 3117e385233579834f636b4b5a7bfb93b405d517 (diff)
new post: My favorite bug: segfaults in Java
-rw-r--r-- public/java-segfault.md 113
1 files changed, 113 insertions, 0 deletions
diff --git a/public/java-segfault.md b/public/java-segfault.md
new file mode 100644
index 0000000..e9ddf65
--- /dev/null
+++ b/public/java-segfault.md
@@ -0,0 +1,113 @@
+My favorite bug: segfaults in Java
+==================================
+---
+date: 2014-01-13
+---
+
+I've told this story orally a number of times, but realized that I
+have never written it down. This is my favorite bug story; it might
+not be my hardest bug, but it is the one I am most proud of.
+
+The context
+-----------
+
+In 2012, I was a Senior programmer on the FIRST Robotics Competition
+team 1024. For the unfamiliar, the relevant part of the setup is that
+there are 2 minute and 15 second matches in which you have a 120-pound
+robot that sometimes runs autonomously, and sometimes is controlled
+over WiFi by a person at a laptop running stock "driver station"
+software and modifiable "dashboard" software.
+
+That year, we mostly used the dashboard software to monitor sensors
+on the robot, one of them being a video feed from a web-cam
+mounted on it. This was really easy because the new standard
+dashboard program had a click-and-drag interface to add stock widgets;
+you just had to make sure the code on the robot was actually sending
+the data.
+
+That was great, until the dashboard would suddenly vanish while we
+were debugging things. If it was run manually from a terminal (instead of
+letting the driver station software launch it), you would see a core
+dump indicating a segmentation fault.
+
+This wasn't just us, either; I spoke with people on other teams, and
+everyone who was streaming video had this issue. But because it only
+happened every couple of minutes, and a match lasts only 2:15, the
+dashboard didn't need to run very long; they just crossed their
+fingers and hoped it didn't happen during a match.
+
+The dashboard was written in Java, and the source was available (under
+a 3-clause BSD license), so I dove in, hunting for the bug. Now, the
+program did use the Java Native Interface to talk to OpenCV, which the
+video ran through; so I figured that it must be a bug in the C/C++
+code that was being called. It was especially a pain to track down
+the pointers that were causing the issue, because it was hard with
+native debuggers to see through all of the JVM stuff to the OpenCV
+code, and the OpenCV stuff is opaque to Java debuggers.
+
+Eventually the issue led me back into the Java code--there was a
+native pointer being stored in a Java variable; Java code called the
+native routine to `free()` the structure, but then tried to feed it to
+another routine later. This led to difficulty again--tracking
+objects with Java debuggers was hard because they don't expect the
+program to suddenly segfault; it's Java code, Java doesn't segfault,
+it throws exceptions!
+
+With the help of `println()` I was eventually able to see that some
+code was executing in an order that straight didn't make sense.
+
+The bug
+-------
+
+The issue was that Java was making an unsafe optimization. (I never
+bothered to figure out whether it was the compiler or the JVM making
+the mistake; I was satisfied once I had a work-around.)
+
+Java was doing something similar to tail-call optimization with regard
+to garbage collection. You see, if it is waiting for the return value
+of a method `m()` of object `o`, and code in `m()` that is yet to be
+executed doesn't access any other methods or properties of `o`, then
+it will go ahead and consider `o` eligible for garbage collection
+before `m()` has finished running.
+
+That is normally a safe optimization to make... except for when a
+destructor method (`finalize()`) is defined for the object; the
+destructor can have side effects, and Java has no way to know whether
+it is safe for them to happen before `m()` has finished running.
+
+The work-around
+---------------
+
+The routine that the segmentation fault was occurring in was something
+like:
+
+    public type1 getFrame() {
+        type2 child = this.getChild();
+        type3 var = this.something();
+        // `this` may now be garbage collected
+        return child.somethingElse(var); // segfault comes here
+    }
+
+Where the destructor method of `this` calls a method that will
+`free()` native memory that is also accessed by `child`; if `this` is
+garbage collected before `child.somethingElse()` runs, the backing
+native code will try to access memory that has been `free()`ed, and
+receive a segmentation fault. That usually didn't happen, as the
+routines were pretty fast. However, running 30 times a second,
+eventually bad luck with the garbage collector happens, and the
+program crashes.
+
+The work-around was to insert a bogus call to a method of `this`, to
+keep `this` around until after we were also done with `child`:
+
+    public type1 getFrame() {
+        type2 child = this.getChild();
+        type3 var = this.something();
+        type1 ret = child.somethingElse(var);
+        this.getSize(); // bogus call to keep `this` around
+        return ret;
+    }
+
+Yeah. After spending weeks wading through thousands of lines
+of Java, C, and C++, a bogus call to a method I didn't care about was
+the fix.
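
For readers on Java 9 or later, the intent of the bogus call in the work-around can be stated directly with `java.lang.ref.Reference.reachabilityFence()`, which keeps its argument strongly reachable up to the point of the call. The sketch below is not the dashboard's code. `NativeBuffer`, `FrameSource`, and `release()` are hypothetical names, and the "native memory" is simulated with a boolean flag, so a use-after-free shows up as an exception instead of a segfault. It does not reproduce the garbage-collector race; it only shows where the fence would go.

```java
import java.lang.ref.Reference;

// Hypothetical stand-in for a block of native memory; `freed` simulates free().
final class NativeBuffer {
    private boolean freed = false;

    void free() {
        freed = true;
    }

    int read() {
        if (freed) {
            // Real JNI code would touch freed memory here and could segfault.
            throw new IllegalStateException("use after free");
        }
        return 42;
    }
}

// Hypothetical owner whose cleanup frees the buffer, as the finalizer did in the post.
final class FrameSource {
    private final NativeBuffer buffer = new NativeBuffer();

    // Stands in for the cleanup that finalize() performed.
    void release() {
        buffer.free();
    }

    int getFrame() {
        NativeBuffer child = this.buffer;
        try {
            return child.read();
        } finally {
            // Keeps `this` strongly reachable until this point, so its
            // cleanup cannot run while `child` is still being used.
            Reference.reachabilityFence(this);
        }
    }

    public static void main(String[] args) {
        FrameSource source = new FrameSource();
        System.out.println(source.getFrame()); // prints 42
    }
}
```

The `try`/`finally` shape means the fence runs even if `child.read()` throws, and unlike a bogus method call, it states its purpose at the call site.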