summaryrefslogtreecommitdiff
path: root/public/java-segfault-redux.html
diff options
context:
space:
mode:
Diffstat (limited to 'public/java-segfault-redux.html')
-rw-r--r--public/java-segfault-redux.html219
1 files changed, 219 insertions, 0 deletions
diff --git a/public/java-segfault-redux.html b/public/java-segfault-redux.html
new file mode 100644
index 0000000..e111f5e
--- /dev/null
+++ b/public/java-segfault-redux.html
@@ -0,0 +1,219 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+ <meta charset="utf-8">
+ <title>My favorite bug: segfaults in Java (redux) — Luke T. Shumaker</title>
+ <meta name="viewport" content="width=device-width, initial-scale=1">
+ <link rel="stylesheet" href="assets/style.css">
+ <link rel="alternate" type="application/atom+xml" href="./index.atom" name="web log entries"/>
+</head>
+<body>
+<header><a href="/">Luke T. Shumaker</a> » <a href=/blog>blog</a> » java-segfault-redux</header>
+<article>
+<h1 id="my-favorite-bug-segfaults-in-java-redux">My favorite bug:
+segfaults in Java (redux)</h1>
+<p>Two years ago, I <a href="./java-segfault.html">wrote</a> about one
+of my favorite bugs that I’d squashed two years before that. About a
+year after that, someone posted it <a
+href="https://news.ycombinator.com/item?id=9283571">on Hacker
+News</a>.</p>
+<p>There was some fun discussion about it, but also some confusion.
+After finishing a season of mentoring team 4272, I’ve decided that it
+would be fun to re-visit the article, and dig up the old actual code,
+instead of pseudo-code, hopefully improving the clarity (and providing a
+light introduction for anyone wanting to get into modifying the current
+SmartDashbaord).</p>
+<h2 id="the-context">The context</h2>
+<p>In 2012, I was a high school senior, and lead programmer programmer
+on the FIRST Robotics Competition team 1024. For the unfamiliar, the
+relevant part of the setup is that there are 2 minute and 15 second
+matches in which you have a 120 pound robot that sometimes runs
+autonomously, and sometimes is controlled over WiFi from a person at a
+laptop running stock “driver station” software and modifiable
+“dashboard” software.</p>
+<p>That year, we mostly used the dashboard software to allow the human
+driver and operator to monitor sensors on the robot, one of them being a
+video feed from a web-cam mounted on it. This was really easy because
+the new standard dashboard program had a click-and drag interface to add
+stock widgets; you just had to make sure the code on the robot was
+actually sending the data.</p>
+<p>That’s great, until when debugging things, the dashboard would
+suddenly vanish. If it was run manually from a terminal (instead of
+letting the driver station software launch it), you would see a core
+dump indicating a segmentation fault.</p>
+<p>This wasn’t just us either; I spoke with people on other teams,
+everyone who was streaming video had this issue. But, because it only
+happened every couple of minutes, and a match is only 2:15, it didn’t
+need to run very long, they just crossed their fingers and hoped it
+didn’t happen during a match.</p>
+<p>The dashboard was written in Java, and the source was available
+(under a 3-clause BSD license) via read-only SVN at
+<code>http://firstforge.wpi.edu/svn/repos/smart_dashboard/trunk</code>
+(which is unfortunately no longer online, fortunately I’d posted some
+snapshots on the web). So I dove in, hunting for the bug.</p>
+<p>The repository was divided into several NetBeans projects (not
+exhaustively listed):</p>
+<ul>
+<li><a
+href="https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=smartdashboard-client-2012-1-any.src.tar.xz;hb=HEAD"><code>client/smartdashboard</code></a>:
+The main dashboard program, has a plugin architecture.</li>
+<li><a
+href="https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=wpijavacv-208-1-any.src.tar.xz;hb=HEAD"><code>WPIJavaCV</code></a>:
+A higher-level wrapper around JavaCV, itself a Java Native Interface
+(JNI) wrapper to talk to OpenCV (C and C++).</li>
+<li><a
+href="https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=smartdashboard-extension-wpicameraextension-210-1-any.src.tar.xz;hb=HEAD"><code>extensions/camera/WPICameraExtension</code></a>:
+The standard camera feed plugin, processes the video through
+WPIJavaCV.</li>
+</ul>
+<p>I figured that the bug must be somewhere in the C or C++ code that
+was being called by JavaCV, because that’s the language where segfaults
+happen. It was especially a pain to track down the pointers that were
+causing the issue, because it was hard with native debuggers to see
+through all of the JVM stuff to the OpenCV code, and the OpenCV stuff is
+opaque to Java debuggers.</p>
+<p>Eventually the issue lead me back into the WPICameraExtension, then
+into WPIJavaCV—there was a native pointer being stored in a Java
+variable; Java code called the native routine to <code>free()</code> the
+structure, but then tried to feed it to another routine later. This lead
+to difficulty again—tracking objects with Java debuggers was hard
+because they don’t expect the program to suddenly segfault; it’s Java
+code, Java doesn’t segfault, it throws exceptions!</p>
+<p>With the help of <code>println()</code> I was eventually able to see
+that some code was executing in an order that straight didn’t make
+sense.</p>
+<h2 id="the-bug">The bug</h2>
+<p>The basic flow of WPIJavaCV is you have a <code>WPICamera</code>, and
+you call <code>.getNewImage()</code> on it, which gives you a
+<code>WPIImage</code>, which you could do all kinds of fancy OpenCV
+things on, but then ultimately call <code>.getBufferedImage()</code>,
+which gives you a <code>java.awt.image.BufferedImage</code> that you can
+pass to Swing to draw on the screen. You do this every for frame. Which
+is exactly what <code>WPICameraExtension.java</code> did, except that
+“all kinds of fancy OpenCV things” consisted only of:</p>
+<pre><code>public WPIImage processImage(WPIColorImage rawImage) {
+ return rawImage;
+}</code></pre>
+<p>The idea was that you would extend the class, overriding that one
+method, if you wanted to do anything fancy.</p>
+<p>One of the neat things about WPIJavaCV was that every OpenCV object
+class extended had a <code>finalize()</code> method (via inheriting from
+the abstract class <code>WPIDisposable</code>) that freed the underlying
+C/C++ memory, so you didn’t have to worry about memory leaks like in
+plain JavaCV. To inherit from <code>WPIDisposable</code>, you had to
+write a <code>disposed()</code> method that actually freed the memory.
+This was better than writing <code>finalize()</code> directly, because
+it did some safety with NULL pointers and idempotency if you wanted to
+manually free something early.</p>
+<p>Now, <code>edu.wpi.first.WPIImage.disposed()</code> called <code><a
+href="https://github.com/bytedeco/javacv/blob/svn/src/com/googlecode/javacv/cpp/opencv_core.java#L398">com.googlecode.javacv.cpp.opencv_core.IplImage</a>.release()</code>,
+which called (via JNI) <code>IplImage:::release()</code>, which called
+libc <code>free()</code>:</p>
+<pre><code>@Override
+protected void disposed() {
+ image.release();
+}</code></pre>
+<p>Elsewhere, the C buffer for the image was copied into a Java buffer
+via a similar chain kicked off by
+<code>edu.wpi.first.WPIImage.getBufferedImage()</code>:</p>
+<pre><code>/**
+ * Copies this {@link WPIImage} into a {@link BufferedImage}.
+ * This method will always generate a new image.
+ * @return a copy of the image
+ */
+public BufferedImage getBufferedImage() {
+ validateDisposed();
+
+ return image.getBufferedImage();
+}</code></pre>
+<p>The <code>println()</code> output I saw that didn’t make sense was
+that <code>someFrame.finalize()</code> was running before
+<code>someFrame.getBuffereImage()</code> had returned!</p>
+<p>You see, if it is waiting for the return value of a method
+<code>m()</code> of object <code>a</code>, and code in <code>m()</code>
+that is yet to be executed doesn’t access any other methods or
+properties of <code>a</code>, then it will go ahead and consider
+<code>a</code> eligible for garbage collection before <code>m()</code>
+has finished running.</p>
+<p>Put another way, <code>this</code> is passed to a method just like
+any other argument. If a method is done accessing <code>this</code>,
+then it’s “safe” for the JVM to go ahead and garbage collect it.</p>
+<p>That is normally a safe “optimization” to make… except for when a
+destructor method (<code>finalize()</code>) is defined for the object;
+the destructor can have side effects, and Java has no way to know
+whether it is safe for them to happen before <code>m()</code> has
+finished running.</p>
+<p>I’m not entirely sure if this is a “bug” in the compiler or the
+language specification, but I do believe that it’s broken behavior.</p>
+<p>Anyway, in this case it’s unsafe with WPI’s code.</p>
+<h2 id="my-work-around">My work-around</h2>
+<p>My work-around was to change this function in
+<code>WPIImage</code>:</p>
+<pre><code>public BufferedImage getBufferedImage() {
+ validateDisposed();
+
+ return image.getBufferedImage(); // `this` may get garbage collected before it returns!
+}</code></pre>
+<p>In the above code, <code>this</code> is a <code>WPIImage</code>, and
+it may get garbage collected between the time that
+<code>image.getBufferedImage()</code> is dispatched, and the time that
+<code>image.getBufferedImage()</code> accesses native memory. When it is
+garbage collected, it calls <code>image.release()</code>, which
+<code>free()</code>s that native memory. That seems pretty unlikely to
+happen; that’s a very small gap of time. However, running 30 times a
+second, eventually bad luck with the garbage collector happens, and the
+program crashes.</p>
+<p>The work-around was to insert a bogus call to this to keep
+<code>this</code> around until after we were also done with
+<code>image</code>:</p>
+<p>to this:</p>
+<pre><code>public BufferedImage getBufferedImage() {
+ validateDisposed();
+ BufferedImage ret = image.getBufferedImage();
+ getWidth(); // bogus call to keep `this` around
+ return ret;
+}</code></pre>
+<p>Yeah. After spending weeks wading through though thousands of lines
+of Java, C, and C++, a bogus call to a method I didn’t care about was
+the fix.</p>
+<p>TheLoneWolfling on Hacker News noted that they’d be worried about the
+JVM optimizing out the call to <code>getWidth()</code>. I’m not, because
+<code>WPIImage.getWidth()</code> calls <code>IplImage.width()</code>,
+which is declared as <code>native</code>; the JVM must run it because it
+might have side effects. On the other hand, looking back, I think I just
+shrunk the window for things to go wrong: it may be possible for the
+garbage collection to trigger in the time between
+<code>getWidth()</code> being dispatched and <code>width()</code>
+running. Perhaps there was something in the C/C++ code that made it
+safe, I don’t recall, and don’t care quite enough to dig into OpenCV
+internals again. Or perhaps I’m mis-remembering the fix (which I don’t
+actually have a file of), and I called some other method that
+<em>could</em> get optimized out (though I <em>do</em> believe that it
+was either <code>getWidth()</code> or <code>getHeight()</code>).</p>
+<h2 id="wpis-fix">WPI’s fix</h2>
+<p>Four years later, the SmartDashboard is still being used! But it no
+longer has this bug, and it’s not using my workaround. So, how did the
+WPILib developers fix it?</p>
+<p>Well, the code now lives <a
+href="https://usfirst.collab.net/gerrit/#/admin/projects/">in git at
+collab.net</a>, so I decided to take a look.</p>
+<p>The stripped out WPIJavaCV from the main video feed widget, and now
+use a purely Java implementation of MPJPEG streaming.</p>
+<p>However, the old video feed widget is still available as an extension
+(so that you can still do cool things with <code>processImage</code>),
+and it also no longer has this bug. Their fix was to put a mutex around
+all accesses to <code>image</code>, which should have been the obvious
+solution to me.</p>
+
+</article>
+<footer>
+ <aside class="sponsor"><p>I'd love it if you <a class="em"
+ href="/sponsor/">sponsored me</a>. It will allow me to continue
+ <a class="em" href="/imworkingon/">my work</a> on the GNU/Linux
+ ecosystem. Thanks!</p></aside>
+
+<p>The content of this page is Copyright © 2016 <a href="mailto:lukeshu@lukeshu.com">Luke T. Shumaker</a>.</p>
+<p>This page is licensed under the <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a> license.</p>
+</footer>
+</body>
+</html>