Bash arrays
===========
---
date: "2013-10-13"
---

Way too many people don't understand Bash arrays.  Many of them argue
that if you need arrays, you shouldn't be using Bash.  If we reject
the notion that one should never use Bash for scripting, then thinking
you don't need Bash arrays is what I like to call "wrong".  I don't
even mean real scripting; even these little stubs in `/usr/bin`:

	#!/bin/sh
	java -jar /…/something.jar $* # WRONG!

Command line arguments are exposed as an array, that little `$*` is
accessing it, and is doing the wrong thing (for the lazy, the correct
thing is `-- "$@"`).  Arrays in Bash offer a safe way to preserve field
separation.
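
To make the difference concrete, here is a minimal sketch.  The names
`count_args`, `wrapper_bad`, and `wrapper_good` are made up for this
illustration; `count_args` just reports how many arguments it was
handed, standing in for the real program.

```bash
# count_args is a hypothetical stand-in for the wrapped program:
# it prints how many arguments it received.
count_args() { echo "$#"; }

wrapper_bad()  { count_args $*; }     # unquoted: every argument is re-split on $IFS
wrapper_good() { count_args "$@"; }   # quoted @: every argument is passed through intact

bad=$(wrapper_bad 'my file.jar' second)
good=$(wrapper_good 'my file.jar' second)
echo "bad=$bad good=$good"   # prints "bad=3 good=2"
```

The caller passed two arguments; only the `"$@"` version delivers two.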

One of the main sources of bugs (and security holes) in shell scripts
is field separation.  That's what arrays are about.

What?  Field separation?
------------------------

Field separation is just splitting a larger unit into a list of
"fields".  The most common case is when Bash splits a "simple command"
(in the Bash manual's terminology) into a list of arguments.
Understanding how this works is an important prerequisite to
understanding arrays, and even why they are important.

Dealing with lists is something that is very common in Bash scripts;
from dealing with lists of arguments, to lists of files; they pop up a
lot, and each time, you need to think about how the list is
separated.  In the case of `$PATH`, the list is separated by colons.
In the case of `$CFLAGS`, the list is separated by whitespace.  In the
case of actual arrays, it's easy, there's no special character to
worry about, just quote it, and you're good to go.

Bash word splitting
-------------------

When Bash reads a "simple command", it splits the whole thing into a
list of "words".  "The first word specifies the command to be
executed, and is passed as argument zero.  The remaining words are
passed as arguments to the invoked command." (to quote `bash(1)`)

It is often hard for those unfamiliar with Bash to understand when
something is multiple words, and when it is a single word that just
contains a space or newline.  To help gain an intuitive understanding,
I recommend using the following command to print a bullet list of
words, to see how Bash splits them up:

<pre><code>printf ' -> %s\n' <var>words…</var><hr> -&gt; word one
 -&gt; multiline
word
 -&gt; third word
</code></pre>

In a simple command, in absence of quoting, Bash separates the "raw"
input into words by splitting on spaces and tabs.  In other places,
such as when expanding a variable, it uses the same process, but
splits on the characters in the `$IFS` variable (which has the default
value of space/tab/newline).  This process is, creatively enough,
called "word splitting".
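
As a rough sketch of how `$IFS` drives that splitting (the `count`
helper and the `list` variable are names invented for this example):

```bash
# count is a hypothetical helper: it prints how many words it received.
count() { echo "$#"; }

list='a:b c:d'

n_default=$(count $list)        # default IFS: split at the space -> "a:b" "c:d"
n_colon=$(IFS=:; count $list)   # IFS is a colon: split at colons -> "a" "b c" "d"
n_quoted=$(count "$list")       # quoted: no word splitting at all
echo "$n_default $n_colon $n_quoted"   # prints "2 3 1"
```

The `IFS=:` assignment lives inside the command substitution's
subshell, so it does not leak into the rest of the script.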

In most discussions of Bash arrays, one of the frequent criticisms is
all the footnotes and "gotchas" about when to quote things.  That's
because they usually don't set the context of word splitting.
**Double quotes (`"`) inhibit Bash from doing word splitting.**
That's it, that's all they do.  Arrays are already split into words;
without wrapping them in double quotes Bash re-word splits them,
which is almost *never* what you want; otherwise, you wouldn't be
working with an array.
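
A small sketch of that re-splitting, using a made-up `files` array and
a hypothetical `count` helper:

```bash
# count is a hypothetical helper: it prints how many words it received.
count() { echo "$#"; }

files=('one file.txt' 'two.txt')   # two elements, one contains a space

quoted=$(count "${files[@]}")    # the two elements stay two words
unquoted=$(count ${files[@]})    # re-split on $IFS: "one" "file.txt" "two.txt"
echo "quoted=$quoted unquoted=$unquoted"   # prints "quoted=2 unquoted=3"
```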

Normal array syntax
-------------------

<table>
  <caption>
    <h1>Setting an array</h1>
    <p><var>words…</var> is expanded and subject to word splitting
       based on <code>$IFS</code>.</p>
  </caption>
  <tbody>
    <tr>
      <td><code>array=(<var>words…</var>)</code></td>
      <td>Set the contents of the entire array.</td>
    </tr><tr>
      <td><code>array+=(<var>words…</var>)</code></td>
      <td>Appends <var>words…</var> to the end of the array.</td>
    </tr><tr>
      <td><code>array[<var>n</var>]=<var>word</var></code></td>
      <td>Sets an individual entry in the array, the first entry is at
          <var>n</var>=0.</td>
    </tr>
  </tbody>
</table>
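
For instance, the three forms in the table can be combined like this
(the `fruits` array is just an illustrative name):

```bash
fruits=(apple 'dragon fruit')   # set the entire array; quoting keeps "dragon fruit" as one word
fruits+=(cherry)                # append to the end (needs Bash 3.1 or newer)
fruits[1]='banana'              # overwrite the entry at n=1

printf ' -> %s\n' "${fruits[@]}"
# prints:
#  -> apple
#  -> banana
#  -> cherry
```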

Now, for accessing the array.  The most important things for
understanding arrays are to quote them, and to understand the
difference between `@` and `*`.

<table>
  <caption>
    <h1>Getting an entire array</h1>
    <p>Unless these are wrapped in double quotes, they are subject to
       word splitting, which defeats the purpose of arrays.</p>
    <p>I guess it's worth mentioning that if you don't quote them, and
       word splitting is applied, <code>@</code> and <code>*</code>
       end up being equivalent.</p>
    <p>With <code>*</code>, when joining the elements into a single
       string, the elements are separated by the first character in
       <code>$IFS</code>, which is, by default, a space.</p>
  </caption>
  <tbody>
    <tr>
      <td><code>"${array[@]}"</code></td>
      <td>Evaluates to every element of the array, as separate
          words.</td>
    </tr><tr>
      <td><code>"${array[*]}"</code></td>
      <td>Evaluates to every element of the array, as a single
          word.</td>
    </tr>
  </tbody>
</table>

It's really that simple—that covers most usages of arrays, and most of
the mistakes made with them.

To help you understand the difference between `@` and `*`, here is a
sample of each:

<table>
  <tbody>
    <tr><th><code>@</code></th><th><code>*</code></th></tr>
    <tr>
      <td>Input:<pre><code>#!/bin/bash
array=(foo bar baz)
for item in "${array[@]}"; do
        echo " - &lt;${item}&gt;"
done</code></pre></td>
      <td>Input:<pre><code>#!/bin/bash
array=(foo bar baz)
for item in "${array[*]}"; do
        echo " - &lt;${item}&gt;"
done</code></pre></td>
    </tr>
    <tr>
      <td>Output:<pre><code> - &lt;foo&gt;
 - &lt;bar&gt;
 - &lt;baz&gt;</code></pre></td>
     <td>Output:<pre><code> - &lt;foo bar baz&gt;<br><br><br></code></pre></td>
    </tr>
  </tbody>
</table>

In most cases, `@` is what you want, but `*` comes up often enough
too.
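
One place where `*` earns its keep is joining elements with a
delimiter, since it uses the first character of `$IFS` as the
separator.  A sketch, with an invented `dirs` array, that builds a
`$PATH`-style string:

```bash
dirs=(/usr/local/bin /usr/bin /bin)

# Set IFS only inside the command substitution's subshell,
# so the rest of the script keeps the default value.
joined=$(IFS=:; echo "${dirs[*]}")
echo "$joined"   # prints "/usr/local/bin:/usr/bin:/bin"
```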

To get individual entries, the syntax is
<code>${array[<var>n</var>]}</code>, where <var>n</var> starts at 0.

<table>
  <caption>
    <h1>Getting a single entry from an array</h1>
    <p>Also subject to word splitting if you don't wrap it in
       quotes.</p>
  </caption>
  <tbody>
    <tr>
      <td><code>"${array[<var>n</var>]}"</code></td>
      <td>Evaluates to the <var>n</var><sup>th</sup> entry of the
          array, where the first entry is at <var>n</var>=0.</td>
    </tr>
  </tbody>
</table>

To get a subset of the array, there are a few options:

<table>
  <caption>
    <h1>Getting subsets of an array</h1>
    <p>Substitute <code>*</code> for <code>@</code> to get the subset
       as a <code>$IFS</code>-separated string instead of separate
       words, as described above.</p>
    <p>Again, if you don't wrap these in double quotes, they are
       subject to word splitting, which defeats the purpose of
       arrays.</p>
  </caption>
  <tbody>
    <tr>
      <td><code>"${array[@]:<var>start</var>}"</code></td>
      <td>Evaluates to the entries from <var>n</var>=<var>start</var> to the end
          of the array.</td>
    </tr><tr>
      <td><code>"${array[@]:<var>start</var>:<var>count</var>}"</code></td>
      <td>Evaluates to <var>count</var> entries, starting at
          <var>n</var>=<var>start</var>.</td>
    </tr><tr>
      <td><code>"${array[@]::<var>count</var>}"</code></td>
      <td>Evaluates to <var>count</var> entries from the beginning of
          the array.</td>
    </tr>
  </tbody>
</table>

Notice that `"${array[@]}"` is equivalent to `"${array[@]:0}"`.
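
The three subset forms, sketched on a throwaway `letters` array (all
variable names here are made up for the example):

```bash
letters=(a b c d e)

from_two=("${letters[@]:2}")     # from n=2 to the end
middle=("${letters[@]:1:3}")     # 3 entries, starting at n=1
first_two=("${letters[@]::2}")   # 2 entries from the beginning

echo "${from_two[*]}"    # prints "c d e"
echo "${middle[*]}"      # prints "b c d"
echo "${first_two[*]}"   # prints "a b"
```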

<table>
  <caption>
    <h1>Getting the length of an array</h1>
    <p>This is the only situation with arrays where quoting doesn't
       make a difference.</p>
    <p>True to my earlier statement, when unquoted, there is no
       difference between <code>@</code> and <code>*</code>.</p>
  </caption>
  <tbody>
    <tr>
      <td>
        <code>${#array[@]}</code>
        <br>or<br>
        <code>${#array[*]}</code>
      </td>
      <td>
        Evaluates to the length of the array
      </td>
    </tr>
  </tbody>
</table>
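
A quick check, on an invented `items` array, that the length counts
elements rather than words and that quoting does not change it:

```bash
items=('first item' 'second item' third)

len_quoted="${#items[@]}"
len_unquoted=${#items[*]}
echo "$len_quoted $len_unquoted"   # prints "3 3"
```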

Argument array syntax
---------------------

Accessing the arguments is mostly that simple, but that array doesn't
actually have a variable name.  It's special.  Instead, it is exposed
through a series of special variables (normal variables can only start
with letters and underscore), that *mostly* match up with the normal
array syntax.

Setting the arguments array, on the other hand, is pretty different.
That's fine, because setting the arguments array is less useful
anyway.

<table>
  <caption>
    <h1>Accessing the arguments array</h1>
    <aside>Note that for values of <var>n</var> with more than 1
           digit, you need to wrap it in <code>{}</code>.
           Otherwise, <code>"$10"</code> would be parsed
           as <code>"${1}0"</code>.</aside>
  </caption>
  <tbody>
    <tr><th colspan=2>Individual entries</th></tr>
    <tr><td><code>${array[0]}</code></td><td><code>$0</code></td></tr>
    <tr><td><code>${array[1]}</code></td><td><code>$1</code></td></tr>
    <tr><td colspan=2 style="text-align:center">…</td></tr>
    <tr><td><code>${array[9]}</code></td><td><code>$9</code></td></tr>
    <tr><td><code>${array[10]}</code></td><td><code>${10}</code></td></tr>
    <tr><td colspan=2 style="text-align:center">…</td></tr>
    <tr><td><code>${array[<var>n</var>]}</code></td><td><code>${<var>n</var>}</code></td></tr>
    <tr><th colspan=2>Subset arrays (array)</th></tr>
    <tr><td><code>"${array[@]}"</code></td><td><code>"${@:0}"</code></td></tr>
    <tr><td><code>"${array[@]:1}"</code></td><td><code>"$@"</code></td></tr>
    <tr><td><code>"${array[@]:<var>pos</var>}"</code></td><td><code>"${@:<var>pos</var>}"</code></td></tr>
    <tr><td><code>"${array[@]:<var>pos</var>:<var>len</var>}"</code></td><td><code>"${@:<var>pos</var>:<var>len</var>}"</code></td></tr>
    <tr><td><code>"${array[@]::<var>len</var>}"</code></td><td><code>"${@::<var>len</var>}"</code></td></tr>
    <tr><th colspan=2>Subset arrays (string)</th></tr>
    <tr><td><code>"${array[*]}"</code></td><td><code>"${*:0}"</code></td></tr>
    <tr><td><code>"${array[*]:1}"</code></td><td><code>"$*"</code></td></tr>
    <tr><td><code>"${array[*]:<var>pos</var>}"</code></td><td><code>"${*:<var>pos</var>}"</code></td></tr>
    <tr><td><code>"${array[*]:<var>pos</var>:<var>len</var>}"</code></td><td><code>"${*:<var>pos</var>:<var>len</var>}"</code></td></tr>
    <tr><td><code>"${array[*]::<var>len</var>}"</code></td><td><code>"${*::<var>len</var>}"</code></td></tr>
    <tr><th colspan=2>Array length</th></tr>
    <tr><td><code>${#array[@]}</code></td><td><code>$#</code> + 1</td></tr>
    <tr><th colspan=2>Setting the array</th></tr>
    <tr><td><code>array=("${array[0]}" <var>words…</var>)</code></td><td><code>set -- <var>words…</var></code></td></tr>
    <tr><td><code>array=("${array[0]}" "${array[@]:2}")</code></td><td><code>shift</code></td></tr>
    <tr><td><code>array=("${array[0]}" "${array[@]:<var>n+1</var>}")</code></td><td><code>shift <var>n</var></code></td></tr>
  </tbody>
</table>

Did you notice what was inconsistent? The variables `$*`, `$@`, and
`$#` behave like the <var>n</var>=0 entry doesn't exist.

<table>
  <caption>
    <h1>Inconsistencies</h1>
  </caption>
  <tbody>
    <tr>
      <th colspan=3><code>@</code> or <code>*</code></th>
    </tr><tr>
      <td><code>"${array[@]}"</code></td>
      <td>→</td>
      <td><code>"${array[@]:0}"</code></td>
    </tr><tr>
      <td><code>"${@}"</code></td>
      <td>→</td>
      <td><code>"${@:1}"</code></td>
    </tr><tr>
      <th colspan=3><code>#</code></th>
    </tr><tr>
      <td><code>"${#array[@]}"</code></td>
      <td>→</td>
      <td>length</td>
    </tr><tr>
      <td><code>"${#}"</code></td>
      <td>→</td>
      <td>length-1</td>
    </tr> 
  </tbody>
</table>

These make sense because argument 0 is the name of the script—we
almost never want that when parsing arguments. You'd spend more code
getting the values that it currently gives you.
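
Here is a sketch of reading the arguments array from inside a
function, which gets its own arguments 1 and up.  The function name
`demo` and the variables it sets are invented for this example.

```bash
demo() {
	n=$#                    # number of arguments, not counting argument 0
	tenth=${10}             # braces are required past $9
	wrong=$10               # parsed as ${1}0: argument 1 followed by a literal 0
	last_two=("${@:9:2}")   # 2 entries, starting at n=9
}

demo a b c d e f g h i j
echo "$n $tenth $wrong ${last_two[*]}"   # prints "10 j a0 i j"
```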

Now, for an explanation of setting the arguments array.  You cannot
set argument <var>n</var>=0.  The `set` command is used to manipulate
the arguments passed to Bash after the fact; similarly, you could use
`set -x` to make Bash behave like you ran it as `bash -x`.  Like most
GNU programs, the `--` tells it not to parse any of the following
words as flags.  The `shift` command shifts each entry <var>n</var>
spots to the left, using <var>n</var>=1 if no value is specified, and
leaving argument 0 alone.
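
A short sketch of `set --` and `shift` working together; the values
and the variable names `before`, `first`, and `after` are arbitrary:

```bash
set -- alpha 'beta gamma' delta   # replace arguments 1 and up
before=$#                         # 3

shift                             # drop argument 1; everything moves left one spot
first=$1                          # now "beta gamma", still a single word

set -- "$@" epsilon               # append by re-setting the whole list
after=$#                          # 3 again: "beta gamma" delta epsilon
last=$3

echo "$before / $first / $after / $last"   # prints "3 / beta gamma / 3 / epsilon"
```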

But you mentioned "gotchas" about quoting!
------------------------------------------

But I explained that quoting simply inhibits word splitting, which you
pretty much never want when working with arrays.  If, for some odd
reason, you do want word splitting, then that's when you don't quote.
Simple, easy to understand.

I think possibly the only case where you do want word splitting with
an array is when you didn't want an array, but it's what you get
(arguments are, by necessity, an array).  For example:

	# Usage: path_ls PATH1 PATH2…
	# Description:
	#   Takes any number of PATH-style values; that is,
	#   colon-separated lists of directories, and prints a
	#   newline-separated list of executables found in them.
	# Bugs:
	#   Does not correctly handle programs with a newline in the name,
	#   as the output is newline-separated.
	path_ls() {
		local dirs
		local IFS=: # local, so the changed IFS doesn't leak out of the function
		dirs=($@) # The odd-ball time that it needs to be unquoted
		find -L "${dirs[@]}" -maxdepth 1 -type f -executable \
			-printf '%f\n' 2>/dev/null | sort -u
	}

Logically, there shouldn't be multiple arguments, just a single
`$PATH` value; but, we can't enforce that, as the array can have any
size.  So, we do the robust thing, and just act on the entire array,
not really caring about the fact that it is an array.  Alas, there is
still a field-separation bug in the program, with the output.

I still don't think I need arrays in my scripts
-----------------------------------------------

Consider the common code:

	ARGS=' -f -q'
	…
	command $ARGS  # unquoted variables are a bad code-smell anyway

Here, `$ARGS` is field-separated by `$IFS`, which we are assuming has
the default value.  This is fine, as long as `$ARGS` is known to never
need an embedded space; which you know as long as it isn't based on
anything outside of the program.  But wait until you want to do this:

	ARGS=' -f -q'
	…
	if [[ -f "$filename" ]]; then
		ARGS+=" -F $filename"
	fi
	…
	command $ARGS

Now you're hosed if `$filename` contains a space!  More than just
breaking, it could have unwanted side effects, such as when someone
figures out how to make `filename='foo --dangerous-flag'`.

Compare that with the array version:

	ARGS=(-f -q)
	…
	if [[ -f "$filename" ]]; then
		ARGS+=(-F "$filename")
	fi
	…
	command "${ARGS[@]}"
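
To see the two versions side by side, here is a sketch that swaps
`command` for a hypothetical `count` helper and skips the
`[[ -f … ]]` check so that it runs without any files on disk.  The
variable names are invented for the comparison.

```bash
# count is a hypothetical stand-in for `command`: it prints its argument count.
count() { echo "$#"; }

filename='foo --dangerous-flag'

string_args=' -f -q'
string_args+=" -F $filename"        # the filename is pasted into one flat string

array_args=(-f -q)
array_args+=(-F "$filename")        # the filename stays a single element

n_string=$(count $string_args)        # -f -q -F foo --dangerous-flag
n_array=$(count "${array_args[@]}")   # -f -q -F 'foo --dangerous-flag'
echo "string=$n_string array=$n_array"   # prints "string=5 array=4"
```

In the string version, `--dangerous-flag` arrives as its own argument;
in the array version it is still part of the filename.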

What about portability?
-----------------------

Except for the little stubs that call another program with `"$@"` at
the end, trying to write for multiple shells (including the ambiguous
`/bin/sh`) is not a task for mere mortals.  If you do try that, your
best bet is probably sticking to POSIX.  Arrays are not POSIX; except
for the arguments array, which is; though getting subset arrays from
`$@` and `$*` is not (tip: use `set --` to re-purpose the arguments array).
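
As a sketch of that tip, this builds up a list using only the
arguments array; it sticks to syntax I believe is plain POSIX `sh`,
and the file names are made up:

```bash
set --                      # start with an empty arguments array
for f in 'a b.txt' c.txt; do
	set -- "$@" "$f"    # append one entry, keeping earlier entries intact
done

count=$#
first=$1
echo "$count: $first"       # prints "2: a b.txt"
```

The catch is that there is only one such array per function or script,
so the original arguments are gone once you re-purpose it.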

Writing for various versions of Bash, though, is pretty do-able.
Everything here works all the way back in bash-2.0 (December 1996),
with the following exceptions:

 * The `+=` operator wasn't added until Bash 3.1.

    * As a work-around, use
      <code>array[${#array[*]}]=<var>word</var></code> to append a
      single element.

 * Accessing subset arrays of the arguments array is inconsistent if
   <var>pos</var>=0 in <code>${@:<var>pos</var>:<var>len</var>}</code>.

    * In Bash 2.x and 3.x, it works as expected, except that argument
      0 is silently missing. For example `${@:0:3}` gives arguments 1
      and 2; where `${@:1:3}` gives arguments 1, 2, and 3.  This means
      that if <var>pos</var>=0, then only <var>len</var>-1 arguments
      are given back.
    * In Bash 4.0, argument 0 can be accessed, but if
      <var>pos</var>=0, then it only gives back <var>len</var>-1
      arguments.  So, `${@:0:3}` gives arguments 0 and 1.
    * In Bash 4.1 and higher, it works in the way described in the
      main part of this document.
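
The `+=` work-around from the list above, sketched on an invented
`queue` array.  It assumes the array has no gaps in its indices, so
that the length is also the next free index.

```bash
queue=(first second)

# ${#queue[*]} is 2 here, which is the next unused index.
queue[${#queue[*]}]='third item'

printf ' -> %s\n' "${queue[@]}"
# prints:
#  -> first
#  -> second
#  -> third item
```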

Now, Bash 1.x doesn't have arrays at all.  `$@` and `$*` work, but
using `:` to select a range of elements from them doesn't.  Good thing
most boxes have been updated since 1996!