summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorLuke Shumaker <lukeshu@sbcglobal.net>2016-02-28 06:55:56 -0500
committerLuke Shumaker <lukeshu@sbcglobal.net>2016-02-28 06:55:56 -0500
commit9ea8568d0c333916833bee1eaf021334090486fd (patch)
treed2dd2c12b156b09909e376f89ad7a69a155725c3
parent0eb83b19ed075a07b86549d2938c4224ca1d5df6 (diff)
write a revised version of the Java segfault article
-rw-r--r--public/java-segfault-redux.md228
-rw-r--r--public/java-segfault.md16
2 files changed, 238 insertions, 6 deletions
diff --git a/public/java-segfault-redux.md b/public/java-segfault-redux.md
new file mode 100644
index 0000000..cc4254d
--- /dev/null
+++ b/public/java-segfault-redux.md
@@ -0,0 +1,228 @@
+My favorite bug: segfaults in Java (redux)
+==========================================
+---
+date: "2016-02-28"
+---
+
+Two years ago, I [wrote][original] about one of my favorite bugs that
+I'd squashed two years before that. About a year after that, someone
+posted it [on Hacker News][HN]
+
+There was some fun discussion about it, but also some confusion.
+After finishing a season of mentoring team 4272, I've decided that it
+would be fun to re-visit the article, and dig up the old actual code,
+instead of pseudo-code, hopefully improving the clarity (and providing
+a light introduction for anyone wanting to get into modifying the
+current SmartDashbaord).
+
+The context
+-----------
+
+In 2012, I was a high school senior, and lead programmer programmer on
+the FIRST Robotics Competition team 1024. For the unfamiliar, the
+relevant part of the setup is that there are 2 minute and 15 second
+matches in which you have a 120 pound robot that sometimes runs
+autonomously, and sometimes is controlled over WiFi from a person at a
+laptop running stock "driver station" software and modifiable
+"dashboard" software.
+
+That year, we mostly used the dashboard software to allow the human
+driver and operator to monitor sensors on the robot, one of them being
+a video feed from a web-cam mounted on it. This was really easy
+because the new standard dashboard program had a click-and drag
+interface to add stock widgets; you just had to make sure the code on
+the robot was actually sending the data.
+
+That's great, until when debugging things, the dashboard would
+suddenly vanish. If it was run manually from a terminal (instead of
+letting the driver station software launch it), you would see a core
+dump indicating a segmentation fault.
+
+This wasn't just us either; I spoke with people on other teams,
+everyone who was streaming video had this issue. But, because it only
+happened every couple of minutes, and a match is only 2:15, it didn't
+need to run very long, they just crossed their fingers and hoped it
+didn't happen during a match.
+
+The dashboard was written in Java, and the source was available (under
+a 3-clause BSD license) via read-only SVN at
+`http://firstforge.wpi.edu/svn/repos/smart_dashboard/trunk` (which is
+unfortunately no longer online, fortunately I'd posted some snapshots
+on the web). So I dove in, hunting for the bug.
+
+The repository was divided into several NetBeans projects (not
+exhaustively listed):
+
+ - [`client/smartdashboard`][svn/smartdashboard]: The main dashboard
+ program, has a plugin architecture.
+ - [`WPIJavaCV`][svn/wpijavacv]: A higher-level wrapper around JavaCV,
+ itself a Java Native Interface (JNI) wrapper to talk to OpenCV (C
+ and C++).
+ - [`extensions/camera/WPICameraExtension`][svn/wpicameraextension]:
+ The standard camera feed plugin, processes the video through
+ WPIJavaCV.
+
+I figured that the bug must be somewhere in the C or C++ code that was
+being called by JavaCV, because that's the language where segfaults
+happen. It was especially a pain to track down the pointers that were
+causing the issue, because it was hard with native debuggers to see
+through all of the JVM stuff to the OpenCV code, and the OpenCV stuff
+is opaque to Java debuggers.
+
+Eventually the issue lead me back into the WPICameraExtension, then
+into WPIJavaCV--there was a native pointer being stored in a Java
+variable; Java code called the native routine to `free()` the
+structure, but then tried to feed it to another routine later. This
+lead to difficulty again--tracking objects with Java debuggers was
+hard because they don't expect the program to suddenly segfault; it's
+Java code, Java doesn't segfault, it throws exceptions!
+
+With the help of `println()` I was eventually able to see that some
+code was executing in an order that straight didn't make sense.
+
+The bug
+-------
+
+The basic flow of WPIJavaCV is you have a `WPICamera`, and you call
+`.getNewImage()` on it, which gives you a `WPIImage`, which you could
+do all kinds of fancy OpenCV things on, but then ultimately call
+`.getBufferedImage()`, which gives you a
+`java.awt.image.BufferedImage` that you can pass to Swing to draw on
+the screen. You do this every for frame. Which is exactly what
+`WPICameraExtension.java` did, except that "all kinds of fancy OpenCV
+things" consisted only of:
+
+ public WPIImage processImage(WPIColorImage rawImage) {
+ return rawImage;
+ }
+
+The idea was that you would extend the class, overriding that one
+method, if you wanted to do anything fancy.
+
+One of the neat things about WPIJavaCV was that every OpenCV object
+class extended had a `finalize()` method (via inheriting from the
+abstract class `WPIDisposable`) that freed the underlying C/C++
+memory, so you didn't have to worry about memory leaks like in plain
+JavaCV. To inherit from `WPIDisposable`, you had to write a
+`disposed()` method that actually freed the memory. This was better
+than writing `finalize()` directly, because it did some safety with
+NULL pointers and idempotency if you wanted to manually free something
+early.
+
+Now, `edu.wpi.first.WPIImage.disposed()` called
+<code>[com.googlecode.javacv.cpp.opencv_core.IplImage][].release()</code>,
+which called (via JNI) `IplImage:::release()`, which called libc
+`free()`:
+
+ @Override
+ protected void disposed() {
+ image.release();
+ }
+
+Elsewhere, the C buffer for the image was copied into a Java buffer
+via a similar chain kicked off by
+`edu.wpi.first.WPIImage.getBufferedImage()`:
+
+ /**
+ * Copies this {@link WPIImage} into a {@link BufferedImage}.
+ * This method will always generate a new image.
+ * @return a copy of the image
+ */
+ public BufferedImage getBufferedImage() {
+ validateDisposed();
+
+ return image.getBufferedImage();
+ }
+
+The `println()` output I saw that didn't make sense was that
+`someFrame.finalize()` was running before
+`someFrame.getBuffereImage()` had returned!
+
+You see, if it is waiting for the return value of a method `m()` of
+object `a`, and code in `m()` that is yet to be executed doesn't
+access any other methods or properties of `a`, then it will go ahead
+and consider `a` eligible for garbage collection before `m()` has
+finished running.
+
+Put another way, `this` is passed to a method just like any other
+argument. If a method is done accessing `this`, then it's "safe" for
+the JVM to go ahead and garbage collect it.
+
+That is normally a safe "optimization" to makeā€¦ except for when a
+destructor method (`finalize()`) is defined for the object; the
+destructor can have side effects, and Java has no way to know whether
+it is safe for them to happen before `m()` has finished running.
+
+I'm not entirely sure if this is a "bug" in the compiler or the
+language specification, but I do believe that it's broken behavior.
+
+Anyway, in this case it's unsafe with WPI's code.
+
+My work-around
+---------------
+
+My work-around was to change this function in `WPIImage`:
+
+ public BufferedImage getBufferedImage() {
+ validateDisposed();
+
+ return image.getBufferedImage(); // `this` may get garbage collected before it returns!
+ }
+
+In the above code, `this` is a `WPIImage`, and it may get garbage
+collected between the time that `image.getBufferedImage()` is
+dispatched, and the time that `image.getBufferedImage()` accesses
+native memory. When it is garbage collected, it calls
+`image.release()`, which `free()`s that native memory. That seems
+pretty unlikely to happen; that's a very small gap of time. However,
+running 30 times a second, eventually bad luck with the garbage
+collector happens, and the program crashes.
+
+The work-around was to insert a bogus call to this to keep `this`
+around until after we were also done with `image`:
+
+to this:
+
+ public BufferedImage getBufferedImage() {
+ validateDisposed();
+ BufferedImage ret = image.getBufferedImage();
+ getWidth(); // bogus call to keep `this` around
+ return ret;
+ }
+
+Yeah. After spending weeks wading through though thousands of lines
+of Java, C, and C++, a bogus call to a method I didn't care about was
+the fix.
+
+TheLoneWolfling on Hacker News noted that they'd be worried about the
+JVM optimizing out the call to `getWidth()`. I'm not, because
+`WPIImage.getWidth()` calls `IplImage.width()`, which is declared as
+`native`; the JVM must run it because it might have side effects. On
+the other hand, looking back, I think I just shrunk the window for
+things to go wrong: it may be possible for the garbage collection to
+trigger in the time between `getWidth()` being dispatched and
+`width()` running. Perhaps there was something in the C/C++ code that
+made it safe, I don't recall, and don't care quite enough to dig into
+OpenCV internals again. Or perhaps I'm mis-remembering the fix (which
+I don't actually have a file of), and I called some other method that
+_could_ get optimized out (though I _do_ believe that it was either
+`getWidth()` or `getHeight()`).
+
+WPI's fix
+---------
+
+Four years later, the SmartDashboard is still being used! But it no
+longer has this bug, and it's not using my workaround. So, how did
+the WPILib developers fix it?
+
+Well, the code now lives [in git at collab.net][git], so I decided to
+take a look. They put a mutex around all accesses to to `image`,
+which should have been the obvious solution to me.
+
+[original]: ./java-segfault.html
+[HN]: https://news.ycombinator.com/item?id=9283571
+[com.googlecode.javacv.cpp.opencv_core.IplImage]: https://github.com/bytedeco/javacv/blob/svn/src/com/googlecode/javacv/cpp/opencv_core.java#L398
+[svn/smartdashboard]: https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=smartdashboard-client-2012-1-any.src.tar.xz;hb=HEAD
+[svn/wpijavacv]: https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=wpijavacv-208-1-any.src.tar.xz;hb=HEAD
+[svn/wpicameraextension]: https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=smartdashboard-extension-wpicameraextension-210-1-any.src.tar.xz;hb=HEAD
+[git]: https://usfirst.collab.net/gerrit/#/admin/projects/
diff --git a/public/java-segfault.md b/public/java-segfault.md
index 0801d60..295ef48 100644
--- a/public/java-segfault.md
+++ b/public/java-segfault.md
@@ -4,6 +4,10 @@ My favorite bug: segfaults in Java
date: "2014-01-13"
---
+> Update: Two years later, I wrote a more detailed version of this
+> article:
+> [My favorite bug: segfaults in Java (redux)](./java-segfault-redux.html).
+
I've told this story orally a number of times, but realized that I
have never written it down. This is my favorite bug story; it might
not be my hardest bug, but it is the one I most like to tell.
@@ -18,12 +22,12 @@ robot that sometimes runs autonomously, and sometimes is controlled
over WiFi from a person at a laptop running stock "driver station"
software and modifiable "dashboard" software.
-That year, we mostly used the dashboard software to allow the monitor
-sensors on the robot, one of them being a video feed from a web-cam
-mounted on it. This was really easy because the new standard
-dashboard program had a click-and drag interface to add stock widgets;
-you just had to make sure the code on the robot was actually sending
-the data.
+That year, we mostly used the dashboard software to allow the human
+driver and operator to monitor sensors on the robot, one of them being
+a video feed from a web-cam mounted on it. This was really easy
+because the new standard dashboard program had a click-and drag
+interface to add stock widgets; you just had to make sure the code on
+the robot was actually sending the data.
That's great, until when debugging things, the dashboard would
suddenly vanish. If it was run manually from a terminal (instead of