summaryrefslogtreecommitdiff
path: root/public/java-segfault-redux.md
blob: cc4254d893d34f72466dd3ceede8b1e546859633 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
My favorite bug: segfaults in Java (redux)
==========================================
---
date: "2016-02-28"
---

Two years ago, I [wrote][original] about one of my favorite bugs that
I'd squashed two years before that.  About a year after that, someone
posted it [on Hacker News][HN]

There was some fun discussion about it, but also some confusion.
After finishing a season of mentoring team 4272, I've decided that it
would be fun to re-visit the article, and dig up the old actual code,
instead of pseudo-code, hopefully improving the clarity (and providing
a light introduction for anyone wanting to get into modifying the
current SmartDashbaord).

The context
-----------

In 2012, I was a high school senior, and lead programmer programmer on
the FIRST Robotics Competition team 1024.  For the unfamiliar, the
relevant part of the setup is that there are 2 minute and 15 second
matches in which you have a 120 pound robot that sometimes runs
autonomously, and sometimes is controlled over WiFi from a person at a
laptop running stock "driver station" software and modifiable
"dashboard" software.

That year, we mostly used the dashboard software to allow the human
driver and operator to monitor sensors on the robot, one of them being
a video feed from a web-cam mounted on it.  This was really easy
because the new standard dashboard program had a click-and drag
interface to add stock widgets; you just had to make sure the code on
the robot was actually sending the data.

That's great, until when debugging things, the dashboard would
suddenly vanish.  If it was run manually from a terminal (instead of
letting the driver station software launch it), you would see a core
dump indicating a segmentation fault.

This wasn't just us either; I spoke with people on other teams,
everyone who was streaming video had this issue.  But, because it only
happened every couple of minutes, and a match is only 2:15, it didn't
need to run very long, they just crossed their fingers and hoped it
didn't happen during a match.

The dashboard was written in Java, and the source was available (under
a 3-clause BSD license) via read-only SVN at
`http://firstforge.wpi.edu/svn/repos/smart_dashboard/trunk` (which is
unfortunately no longer online, fortunately I'd posted some snapshots
on the web).  So I dove in, hunting for the bug.

The repository was divided into several NetBeans projects (not
exhaustively listed):

 - [`client/smartdashboard`][svn/smartdashboard]: The main dashboard
   program, has a plugin architecture.
 - [`WPIJavaCV`][svn/wpijavacv]: A higher-level wrapper around JavaCV,
   itself a Java Native Interface (JNI) wrapper to talk to OpenCV (C
   and C++).
 - [`extensions/camera/WPICameraExtension`][svn/wpicameraextension]:
   The standard camera feed plugin, processes the video through
   WPIJavaCV.

I figured that the bug must be somewhere in the C or C++ code that was
being called by JavaCV, because that's the language where segfaults
happen.  It was especially a pain to track down the pointers that were
causing the issue, because it was hard with native debuggers to see
through all of the JVM stuff to the OpenCV code, and the OpenCV stuff
is opaque to Java debuggers.

Eventually the issue lead me back into the WPICameraExtension, then
into WPIJavaCV--there was a native pointer being stored in a Java
variable; Java code called the native routine to `free()` the
structure, but then tried to feed it to another routine later.  This
lead to difficulty again--tracking objects with Java debuggers was
hard because they don't expect the program to suddenly segfault; it's
Java code, Java doesn't segfault, it throws exceptions!

With the help of `println()` I was eventually able to see that some
code was executing in an order that straight didn't make sense.

The bug
-------

The basic flow of WPIJavaCV is you have a `WPICamera`, and you call
`.getNewImage()` on it, which gives you a `WPIImage`, which you could
do all kinds of fancy OpenCV things on, but then ultimately call
`.getBufferedImage()`, which gives you a
`java.awt.image.BufferedImage` that you can pass to Swing to draw on
the screen.  You do this every for frame.  Which is exactly what
`WPICameraExtension.java` did, except that "all kinds of fancy OpenCV
things" consisted only of:

    public WPIImage processImage(WPIColorImage rawImage) {
        return rawImage;
    }

The idea was that you would extend the class, overriding that one
method, if you wanted to do anything fancy.

One of the neat things about WPIJavaCV was that every OpenCV object
class extended had a `finalize()` method (via inheriting from the
abstract class `WPIDisposable`) that freed the underlying C/C++
memory, so you didn't have to worry about memory leaks like in plain
JavaCV.  To inherit from `WPIDisposable`, you had to write a
`disposed()` method that actually freed the memory.  This was better
than writing `finalize()` directly, because it did some safety with
NULL pointers and idempotency if you wanted to manually free something
early.

Now, `edu.wpi.first.WPIImage.disposed()` called
<code>[com.googlecode.javacv.cpp.opencv_core.IplImage][].release()</code>,
which called (via JNI) `IplImage:::release()`, which called libc
`free()`:

    @Override
    protected void disposed() {
        image.release();
    }

Elsewhere, the C buffer for the image was copied into a Java buffer
via a similar chain kicked off by
`edu.wpi.first.WPIImage.getBufferedImage()`:

    /**
     * Copies this {@link WPIImage} into a {@link BufferedImage}.
     * This method will always generate a new image.
     * @return a copy of the image
     */
    public BufferedImage getBufferedImage() {
        validateDisposed();

        return image.getBufferedImage();
    }

The `println()` output I saw that didn't make sense was that
`someFrame.finalize()` was running before
`someFrame.getBuffereImage()` had returned!

You see, if it is waiting for the return value of a method `m()` of
object `a`, and code in `m()` that is yet to be executed doesn't
access any other methods or properties of `a`, then it will go ahead
and consider `a` eligible for garbage collection before `m()` has
finished running.

Put another way, `this` is passed to a method just like any other
argument.  If a method is done accessing `this`, then it's "safe" for
the JVM to go ahead and garbage collect it.

That is normally a safe "optimization" to make… except for when a
destructor method (`finalize()`) is defined for the object; the
destructor can have side effects, and Java has no way to know whether
it is safe for them to happen before `m()` has finished running.

I'm not entirely sure if this is a "bug" in the compiler or the
language specification, but I do believe that it's broken behavior.

Anyway, in this case it's unsafe with WPI's code.

My work-around
---------------

My work-around was to change this function in `WPIImage`:

    public BufferedImage getBufferedImage() {
        validateDisposed();

        return image.getBufferedImage(); // `this` may get garbage collected before it returns!
    }

In the above code, `this` is a `WPIImage`, and it may get garbage
collected between the time that `image.getBufferedImage()` is
dispatched, and the time that `image.getBufferedImage()` accesses
native memory.  When it is garbage collected, it calls
`image.release()`, which `free()`s that native memory.  That seems
pretty unlikely to happen; that's a very small gap of time.  However,
running 30 times a second, eventually bad luck with the garbage
collector happens, and the program crashes.

The work-around was to insert a bogus call to this to keep `this`
around until after we were also done with `image`:

to this:

    public BufferedImage getBufferedImage() {
        validateDisposed();
        BufferedImage ret = image.getBufferedImage();
        getWidth(); // bogus call to keep `this` around
        return ret;
    }
	
Yeah.  After spending weeks wading through though thousands of lines
of Java, C, and C++, a bogus call to a method I didn't care about was
the fix.

TheLoneWolfling on Hacker News noted that they'd be worried about the
JVM optimizing out the call to `getWidth()`.  I'm not, because
`WPIImage.getWidth()` calls `IplImage.width()`, which is declared as
`native`; the JVM must run it because it might have side effects.  On
the other hand, looking back, I think I just shrunk the window for
things to go wrong: it may be possible for the garbage collection to
trigger in the time between `getWidth()` being dispatched and
`width()` running.  Perhaps there was something in the C/C++ code that
made it safe, I don't recall, and don't care quite enough to dig into
OpenCV internals again.  Or perhaps I'm mis-remembering the fix (which
I don't actually have a file of), and I called some other method that
_could_ get optimized out (though I _do_ believe that it was either
`getWidth()` or `getHeight()`).

WPI's fix
---------

Four years later, the SmartDashboard is still being used!  But it no
longer has this bug, and it's not using my workaround.  So, how did
the WPILib developers fix it?

Well, the code now lives [in git at collab.net][git], so I decided to
take a look.  They put a mutex around all accesses to to `image`,
which should have been the obvious solution to me.

[original]: ./java-segfault.html
[HN]: https://news.ycombinator.com/item?id=9283571
[com.googlecode.javacv.cpp.opencv_core.IplImage]: https://github.com/bytedeco/javacv/blob/svn/src/com/googlecode/javacv/cpp/opencv_core.java#L398
[svn/smartdashboard]: https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=smartdashboard-client-2012-1-any.src.tar.xz;hb=HEAD
[svn/wpijavacv]: https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=wpijavacv-208-1-any.src.tar.xz;hb=HEAD
[svn/wpicameraextension]: https://gitorious.org/absfrc/sources.git/?p=absfrc:sources.git;a=blob_plain;f=smartdashboard-extension-wpicameraextension-210-1-any.src.tar.xz;hb=HEAD
[git]: https://usfirst.collab.net/gerrit/#/admin/projects/