From 9ea8568d0c333916833bee1eaf021334090486fd Mon Sep 17 00:00:00 2001 From: Luke Shumaker Date: Sun, 28 Feb 2016 06:55:56 -0500 Subject: write a revised version of the Java segfault article --- public/java-segfault-redux.md | 228 ++++++++++++++++++++++++++++++++++++++++++ public/java-segfault.md | 16 +-- 2 files changed, 238 insertions(+), 6 deletions(-) create mode 100644 public/java-segfault-redux.md diff --git a/public/java-segfault-redux.md b/public/java-segfault-redux.md new file mode 100644 index 0000000..cc4254d --- /dev/null +++ b/public/java-segfault-redux.md @@ -0,0 +1,228 @@ +My favorite bug: segfaults in Java (redux) +========================================== +--- +date: "2016-02-28" +--- + +Two years ago, I [wrote][original] about one of my favorite bugs that +I'd squashed two years before that. About a year after that, someone +posted it [on Hacker News][HN] + +There was some fun discussion about it, but also some confusion. +After finishing a season of mentoring team 4272, I've decided that it +would be fun to re-visit the article, and dig up the old actual code, +instead of pseudo-code, hopefully improving the clarity (and providing +a light introduction for anyone wanting to get into modifying the +current SmartDashbaord). + +The context +----------- + +In 2012, I was a high school senior, and lead programmer programmer on +the FIRST Robotics Competition team 1024. For the unfamiliar, the +relevant part of the setup is that there are 2 minute and 15 second +matches in which you have a 120 pound robot that sometimes runs +autonomously, and sometimes is controlled over WiFi from a person at a +laptop running stock "driver station" software and modifiable +"dashboard" software. + +That year, we mostly used the dashboard software to allow the human +driver and operator to monitor sensors on the robot, one of them being +a video feed from a web-cam mounted on it. This was really easy +because the new standard dashboard program had a click-and drag +interface to add stock widgets; you just had to make sure the code on +the robot was actually sending the data. + +That's great, until when debugging things, the dashboard would +suddenly vanish. If it was run manually from a terminal (instead of +letting the driver station software launch it), you would see a core +dump indicating a segmentation fault. + +This wasn't just us either; I spoke with people on other teams, +everyone who was streaming video had this issue. But, because it only +happened every couple of minutes, and a match is only 2:15, it didn't +need to run very long, they just crossed their fingers and hoped it +didn't happen during a match. + +The dashboard was written in Java, and the source was available (under +a 3-clause BSD license) via read-only SVN at +`http://firstforge.wpi.edu/svn/repos/smart_dashboard/trunk` (which is +unfortunately no longer online, fortunately I'd posted some snapshots +on the web). So I dove in, hunting for the bug. + +The repository was divided into several NetBeans projects (not +exhaustively listed): + + - [`client/smartdashboard`][svn/smartdashboard]: The main dashboard + program, has a plugin architecture. + - [`WPIJavaCV`][svn/wpijavacv]: A higher-level wrapper around JavaCV, + itself a Java Native Interface (JNI) wrapper to talk to OpenCV (C + and C++). + - [`extensions/camera/WPICameraExtension`][svn/wpicameraextension]: + The standard camera feed plugin, processes the video through + WPIJavaCV. + +I figured that the bug must be somewhere in the C or C++ code that was +being called by JavaCV, because that's the language where segfaults +happen. It was especially a pain to track down the pointers that were +causing the issue, because it was hard with native debuggers to see +through all of the JVM stuff to the OpenCV code, and the OpenCV stuff +is opaque to Java debuggers. + +Eventually the issue lead me back into the WPICameraExtension, then +into WPIJavaCV--there was a native pointer being stored in a Java +variable; Java code called the native routine to `free()` the +structure, but then tried to feed it to another routine later. This +lead to difficulty again--tracking objects with Java debuggers was +hard because they don't expect the program to suddenly segfault; it's +Java code, Java doesn't segfault, it throws exceptions! + +With the help of `println()` I was eventually able to see that some +code was executing in an order that straight didn't make sense. + +The bug +------- + +The basic flow of WPIJavaCV is you have a `WPICamera`, and you call +`.getNewImage()` on it, which gives you a `WPIImage`, which you could +do all kinds of fancy OpenCV things on, but then ultimately call +`.getBufferedImage()`, which gives you a +`java.awt.image.BufferedImage` that you can pass to Swing to draw on +the screen. You do this every for frame. Which is exactly what +`WPICameraExtension.java` did, except that "all kinds of fancy OpenCV +things" consisted only of: + + public WPIImage processImage(WPIColorImage rawImage) { + return rawImage; + } + +The idea was that you would extend the class, overriding that one +method, if you wanted to do anything fancy. + +One of the neat things about WPIJavaCV was that every OpenCV object +class extended had a `finalize()` method (via inheriting from the +abstract class `WPIDisposable`) that freed the underlying C/C++ +memory, so you didn't have to worry about memory leaks like in plain +JavaCV. To inherit from `WPIDisposable`, you had to write a +`disposed()` method that actually freed the memory. This was better +than writing `finalize()` directly, because it did some safety with +NULL pointers and idempotency if you wanted to manually free something +early. + +Now, `edu.wpi.first.WPIImage.disposed()` called +[com.googlecode.javacv.cpp.opencv_core.IplImage][].release(), +which called (via JNI) `IplImage:::release()`, which called libc +`free()`: + + @Override + protected void disposed() { + image.release(); + } + +Elsewhere, the C buffer for the image was copied into a Java buffer +via a similar chain kicked off by +`edu.wpi.first.WPIImage.getBufferedImage()`: + + /** + * Copies this {@link WPIImage} into a {@link BufferedImage}. + * This method will always generate a new image. + * @return a copy of the image + */ + public BufferedImage getBufferedImage() { + validateDisposed(); + + return image.getBufferedImage(); + } + +The `println()` output I saw that didn't make sense was that +`someFrame.finalize()` was running before +`someFrame.getBuffereImage()` had returned! + +You see, if it is waiting for the return value of a method `m()` of +object `a`, and code in `m()` that is yet to be executed doesn't +access any other methods or properties of `a`, then it will go ahead +and consider `a` eligible for garbage collection before `m()` has +finished running. + +Put another way, `this` is passed to a method just like any other +argument. If a method is done accessing `this`, then it's "safe" for +the JVM to go ahead and garbage collect it. + +That is normally a safe "optimization" to makeā€¦ except for when a +destructor method (`finalize()`) is defined for the object; the +destructor can have side effects, and Java has no way to know whether +it is safe for them to happen before `m()` has finished running. + +I'm not entirely sure if this is a "bug" in the compiler or the +language specification, but I do believe that it's broken behavior. + +Anyway, in this case it's unsafe with WPI's code. + +My work-around +--------------- + +My work-around was to change this function in `WPIImage`: + + public BufferedImage getBufferedImage() { + validateDisposed(); + + return image.getBufferedImage(); // `this` may get garbage collected before it returns! + } + +In the above code, `this` is a `WPIImage`, and it may get garbage +collected between the time that `image.getBufferedImage()` is +dispatched, and the time that `image.getBufferedImage()` accesses +native memory. When it is garbage collected, it calls +`image.release()`, which `free()`s that native memory. That seems +pretty unlikely to happen; that's a very small gap of time. However, +running 30 times a second, eventually bad luck with the garbage +collector happens, and the program crashes. + +The work-around was to insert a bogus call to this to keep `this` +around until after we were also done with `image`: + +to this: + + public BufferedImage getBufferedImage() { + validateDisposed(); + BufferedImage ret = image.getBufferedImage(); + getWidth(); // bogus call to keep `this` around + return ret; + } + +Yeah. After spending weeks wading through though thousands of lines +of Java, C, and C++, a bogus call to a method I didn't care about was +the fix. + +TheLoneWolfling on Hacker News noted that they'd be worried about the +JVM optimizing out the call to `getWidth()`. I'm not, because +`WPIImage.getWidth()` calls `IplImage.width()`, which is declared as +`native`; the JVM must run it because it might have side effects. On +the other hand, looking back, I think I just shrunk the window for +things to go wrong: it may be possible for the garbage collection to +trigger in the time between `getWidth()` being dispatched and +`width()` running. Perhaps there was something in the C/C++ code that +made it safe, I don't recall, and don't care quite enough to dig into +OpenCV internals again. Or perhaps I'm mis-remembering the fix (which +I don't actually have a file of), and I called some other method that +_could_ get optimized out (though I _do_ believe that it was either +`getWidth()` or `getHeight()`). + +WPI's fix +--------- + +Four years later, the SmartDashboard is still being used! But it no +longer has this bug, and it's not using my workaround. So, how did +the WPILib developers fix it? + +Well, the code now lives [in git at collab.net][git], so I decided to +take a look. They put a mutex around all accesses to to `image`, +which should have been the obvious solution to me. + +[original]: ./java-segfault.html +[HN]: https://news.ycombinator.com/item?id=9283571 +[com.googlecode.javacv.cpp.opencv_core.IplImage]: https://github.com/bytedeco/javacv/blob/svn/src/com/googlecode/javacv/cpp/opencv_core.java#L398 +[svn/smartdashboard]: https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=smartdashboard-client-2012-1-any.src.tar.xz;hb=HEAD +[svn/wpijavacv]: https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=wpijavacv-208-1-any.src.tar.xz;hb=HEAD +[svn/wpicameraextension]: https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=smartdashboard-extension-wpicameraextension-210-1-any.src.tar.xz;hb=HEAD +[git]: https://usfirst.collab.net/gerrit/#/admin/projects/ diff --git a/public/java-segfault.md b/public/java-segfault.md index 0801d60..295ef48 100644 --- a/public/java-segfault.md +++ b/public/java-segfault.md @@ -4,6 +4,10 @@ My favorite bug: segfaults in Java date: "2014-01-13" --- +> Update: Two years later, I wrote a more detailed version of this +> article: +> [My favorite bug: segfaults in Java (redux)](./java-segfault-redux.html). + I've told this story orally a number of times, but realized that I have never written it down. This is my favorite bug story; it might not be my hardest bug, but it is the one I most like to tell. @@ -18,12 +22,12 @@ robot that sometimes runs autonomously, and sometimes is controlled over WiFi from a person at a laptop running stock "driver station" software and modifiable "dashboard" software. -That year, we mostly used the dashboard software to allow the monitor -sensors on the robot, one of them being a video feed from a web-cam -mounted on it. This was really easy because the new standard -dashboard program had a click-and drag interface to add stock widgets; -you just had to make sure the code on the robot was actually sending -the data. +That year, we mostly used the dashboard software to allow the human +driver and operator to monitor sensors on the robot, one of them being +a video feed from a web-cam mounted on it. This was really easy +because the new standard dashboard program had a click-and drag +interface to add stock widgets; you just had to make sure the code on +the robot was actually sending the data. That's great, until when debugging things, the dashboard would suddenly vanish. If it was run manually from a terminal (instead of -- cgit v1.2.3