summaryrefslogtreecommitdiff
path: root/public/java-segfault-redux.html
blob: 41328df9e18925e01b5a6c122dacef3ca7b90324 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>My favorite bug: segfaults in Java (redux) — Luke T. Shumaker</title>
  <link rel="stylesheet" href="assets/style.css">
  <link rel="alternate" type="application/atom+xml" href="./index.atom" name="web log entries"/>
</head>
<body>
<header><a href="/">Luke T. Shumaker</a> » <a href=/blog>blog</a> » java-segfault-redux</header>
<article>
<h1 id="my-favorite-bug-segfaults-in-java-redux">My favorite bug:
segfaults in Java (redux)</h1>
<p>Two years ago, I <a href="./java-segfault.html">wrote</a> about one
of my favorite bugs that I’d squashed two years before that. About a
year after that, someone posted it <a
href="https://news.ycombinator.com/item?id=9283571">on Hacker
News</a>.</p>
<p>There was some fun discussion about it, but also some confusion.
After finishing a season of mentoring team 4272, I’ve decided that it
would be fun to re-visit the article, and dig up the old actual code,
instead of pseudo-code, hopefully improving the clarity (and providing a
light introduction for anyone wanting to get into modifying the current
SmartDashbaord).</p>
<h2 id="the-context">The context</h2>
<p>In 2012, I was a high school senior, and lead programmer programmer
on the FIRST Robotics Competition team 1024. For the unfamiliar, the
relevant part of the setup is that there are 2 minute and 15 second
matches in which you have a 120 pound robot that sometimes runs
autonomously, and sometimes is controlled over WiFi from a person at a
laptop running stock “driver station” software and modifiable
“dashboard” software.</p>
<p>That year, we mostly used the dashboard software to allow the human
driver and operator to monitor sensors on the robot, one of them being a
video feed from a web-cam mounted on it. This was really easy because
the new standard dashboard program had a click-and drag interface to add
stock widgets; you just had to make sure the code on the robot was
actually sending the data.</p>
<p>That’s great, until when debugging things, the dashboard would
suddenly vanish. If it was run manually from a terminal (instead of
letting the driver station software launch it), you would see a core
dump indicating a segmentation fault.</p>
<p>This wasn’t just us either; I spoke with people on other teams,
everyone who was streaming video had this issue. But, because it only
happened every couple of minutes, and a match is only 2:15, it didn’t
need to run very long, they just crossed their fingers and hoped it
didn’t happen during a match.</p>
<p>The dashboard was written in Java, and the source was available
(under a 3-clause BSD license) via read-only SVN at
<code>http://firstforge.wpi.edu/svn/repos/smart_dashboard/trunk</code>
(which is unfortunately no longer online, fortunately I’d posted some
snapshots on the web). So I dove in, hunting for the bug.</p>
<p>The repository was divided into several NetBeans projects (not
exhaustively listed):</p>
<ul>
<li><a
href="https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=smartdashboard-client-2012-1-any.src.tar.xz;hb=HEAD"><code>client/smartdashboard</code></a>:
The main dashboard program, has a plugin architecture.</li>
<li><a
href="https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=wpijavacv-208-1-any.src.tar.xz;hb=HEAD"><code>WPIJavaCV</code></a>:
A higher-level wrapper around JavaCV, itself a Java Native Interface
(JNI) wrapper to talk to OpenCV (C and C++).</li>
<li><a
href="https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=smartdashboard-extension-wpicameraextension-210-1-any.src.tar.xz;hb=HEAD"><code>extensions/camera/WPICameraExtension</code></a>:
The standard camera feed plugin, processes the video through
WPIJavaCV.</li>
</ul>
<p>I figured that the bug must be somewhere in the C or C++ code that
was being called by JavaCV, because that’s the language where segfaults
happen. It was especially a pain to track down the pointers that were
causing the issue, because it was hard with native debuggers to see
through all of the JVM stuff to the OpenCV code, and the OpenCV stuff is
opaque to Java debuggers.</p>
<p>Eventually the issue lead me back into the WPICameraExtension, then
into WPIJavaCV—there was a native pointer being stored in a Java
variable; Java code called the native routine to <code>free()</code> the
structure, but then tried to feed it to another routine later. This lead
to difficulty again—tracking objects with Java debuggers was hard
because they don’t expect the program to suddenly segfault; it’s Java
code, Java doesn’t segfault, it throws exceptions!</p>
<p>With the help of <code>println()</code> I was eventually able to see
that some code was executing in an order that straight didn’t make
sense.</p>
<h2 id="the-bug">The bug</h2>
<p>The basic flow of WPIJavaCV is you have a <code>WPICamera</code>, and
you call <code>.getNewImage()</code> on it, which gives you a
<code>WPIImage</code>, which you could do all kinds of fancy OpenCV
things on, but then ultimately call <code>.getBufferedImage()</code>,
which gives you a <code>java.awt.image.BufferedImage</code> that you can
pass to Swing to draw on the screen. You do this every for frame. Which
is exactly what <code>WPICameraExtension.java</code> did, except that
“all kinds of fancy OpenCV things” consisted only of:</p>
<pre><code>public WPIImage processImage(WPIColorImage rawImage) {
    return rawImage;
}</code></pre>
<p>The idea was that you would extend the class, overriding that one
method, if you wanted to do anything fancy.</p>
<p>One of the neat things about WPIJavaCV was that every OpenCV object
class extended had a <code>finalize()</code> method (via inheriting from
the abstract class <code>WPIDisposable</code>) that freed the underlying
C/C++ memory, so you didn’t have to worry about memory leaks like in
plain JavaCV. To inherit from <code>WPIDisposable</code>, you had to
write a <code>disposed()</code> method that actually freed the memory.
This was better than writing <code>finalize()</code> directly, because
it did some safety with NULL pointers and idempotency if you wanted to
manually free something early.</p>
<p>Now, <code>edu.wpi.first.WPIImage.disposed()</code> called <code><a
href="https://github.com/bytedeco/javacv/blob/svn/src/com/googlecode/javacv/cpp/opencv_core.java#L398">com.googlecode.javacv.cpp.opencv_core.IplImage</a>.release()</code>,
which called (via JNI) <code>IplImage:::release()</code>, which called
libc <code>free()</code>:</p>
<pre><code>@Override
protected void disposed() {
    image.release();
}</code></pre>
<p>Elsewhere, the C buffer for the image was copied into a Java buffer
via a similar chain kicked off by
<code>edu.wpi.first.WPIImage.getBufferedImage()</code>:</p>
<pre><code>/**
 * Copies this {@link WPIImage} into a {@link BufferedImage}.
 * This method will always generate a new image.
 * @return a copy of the image
 */
public BufferedImage getBufferedImage() {
    validateDisposed();

    return image.getBufferedImage();
}</code></pre>
<p>The <code>println()</code> output I saw that didn’t make sense was
that <code>someFrame.finalize()</code> was running before
<code>someFrame.getBuffereImage()</code> had returned!</p>
<p>You see, if it is waiting for the return value of a method
<code>m()</code> of object <code>a</code>, and code in <code>m()</code>
that is yet to be executed doesn’t access any other methods or
properties of <code>a</code>, then it will go ahead and consider
<code>a</code> eligible for garbage collection before <code>m()</code>
has finished running.</p>
<p>Put another way, <code>this</code> is passed to a method just like
any other argument. If a method is done accessing <code>this</code>,
then it’s “safe” for the JVM to go ahead and garbage collect it.</p>
<p>That is normally a safe “optimization” to make… except for when a
destructor method (<code>finalize()</code>) is defined for the object;
the destructor can have side effects, and Java has no way to know
whether it is safe for them to happen before <code>m()</code> has
finished running.</p>
<p>I’m not entirely sure if this is a “bug” in the compiler or the
language specification, but I do believe that it’s broken behavior.</p>
<p>Anyway, in this case it’s unsafe with WPI’s code.</p>
<h2 id="my-work-around">My work-around</h2>
<p>My work-around was to change this function in
<code>WPIImage</code>:</p>
<pre><code>public BufferedImage getBufferedImage() {
    validateDisposed();

    return image.getBufferedImage(); // `this` may get garbage collected before it returns!
}</code></pre>
<p>In the above code, <code>this</code> is a <code>WPIImage</code>, and
it may get garbage collected between the time that
<code>image.getBufferedImage()</code> is dispatched, and the time that
<code>image.getBufferedImage()</code> accesses native memory. When it is
garbage collected, it calls <code>image.release()</code>, which
<code>free()</code>s that native memory. That seems pretty unlikely to
happen; that’s a very small gap of time. However, running 30 times a
second, eventually bad luck with the garbage collector happens, and the
program crashes.</p>
<p>The work-around was to insert a bogus call to this to keep
<code>this</code> around until after we were also done with
<code>image</code>:</p>
<p>to this:</p>
<pre><code>public BufferedImage getBufferedImage() {
    validateDisposed();
    BufferedImage ret = image.getBufferedImage();
    getWidth(); // bogus call to keep `this` around
    return ret;
}</code></pre>
<p>Yeah. After spending weeks wading through though thousands of lines
of Java, C, and C++, a bogus call to a method I didn’t care about was
the fix.</p>
<p>TheLoneWolfling on Hacker News noted that they’d be worried about the
JVM optimizing out the call to <code>getWidth()</code>. I’m not, because
<code>WPIImage.getWidth()</code> calls <code>IplImage.width()</code>,
which is declared as <code>native</code>; the JVM must run it because it
might have side effects. On the other hand, looking back, I think I just
shrunk the window for things to go wrong: it may be possible for the
garbage collection to trigger in the time between
<code>getWidth()</code> being dispatched and <code>width()</code>
running. Perhaps there was something in the C/C++ code that made it
safe, I don’t recall, and don’t care quite enough to dig into OpenCV
internals again. Or perhaps I’m mis-remembering the fix (which I don’t
actually have a file of), and I called some other method that
<em>could</em> get optimized out (though I <em>do</em> believe that it
was either <code>getWidth()</code> or <code>getHeight()</code>).</p>
<h2 id="wpis-fix">WPI’s fix</h2>
<p>Four years later, the SmartDashboard is still being used! But it no
longer has this bug, and it’s not using my workaround. So, how did the
WPILib developers fix it?</p>
<p>Well, the code now lives <a
href="https://usfirst.collab.net/gerrit/#/admin/projects/">in git at
collab.net</a>, so I decided to take a look.</p>
<p>The stripped out WPIJavaCV from the main video feed widget, and now
use a purely Java implementation of MPJPEG streaming.</p>
<p>However, the old video feed widget is still available as an extension
(so that you can still do cool things with <code>processImage</code>),
and it also no longer has this bug. Their fix was to put a mutex around
all accesses to <code>image</code>, which should have been the obvious
solution to me.</p>

</article>
<footer>
  <aside class="sponsor"><p>I'd love it if you <a class="em"
      href="/sponsor/">sponsored me</a>.  It will allow me to continue
      my work on the GNU/Linux ecosystem.  Thanks!</p></aside>

<p>The content of this page is Copyright © 2016 <a href="mailto:lukeshu@lukeshu.com">Luke T. Shumaker</a>.</p>
<p>This page is licensed under the <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a> license.</p>
</footer>
</body>
</html>