Diffstat (limited to 'public/bash-arrays.html')
-rw-r--r-- public/bash-arrays.html 410
1 files changed, 410 insertions, 0 deletions
diff --git a/public/bash-arrays.html b/public/bash-arrays.html
new file mode 100644
index 0000000..76132f5
--- /dev/null
+++ b/public/bash-arrays.html
@@ -0,0 +1,410 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+ <meta charset="utf-8">
+ <title>Bash arrays — Luke T. Shumaker</title>
+ <meta name="viewport" content="width=device-width, initial-scale=1">
+ <link rel="stylesheet" href="assets/style.css">
+ <link rel="alternate" type="application/atom+xml" href="./index.atom" name="web log entries"/>
+</head>
+<body>
+<header><a href="/">Luke T. Shumaker</a> » <a href=/blog>blog</a> » bash-arrays</header>
+<article>
+<h1 id="bash-arrays">Bash arrays</h1>
+<p>Way too many people don’t understand Bash arrays. Many of them argue
+that if you need arrays, you shouldn’t be using Bash. If we reject the
+notion that one should never use Bash for scripting, then thinking you
+don’t need Bash arrays is what I like to call “wrong”. I don’t even mean
+real scripting; even these little stubs in <code>/usr/bin</code>:</p>
+<pre><code>#!/bin/sh
+java -jar /…/something.jar $* # WRONG!</code></pre>
+<p>Command line arguments are exposed as an array, that little
+<code>$*</code> is accessing it, and is doing the wrong thing (for the
+lazy, the correct thing is <code>-- "$@"</code>). Arrays in Bash offer a
+safe way to preserve field separation.</p>
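+<p>As a rough sketch of the difference, here are two wrappers that
+differ only in how they forward their arguments. The names
+<code>count_args</code>, <code>forward_star</code>, and
+<code>forward_at</code> are made up for this illustration; they are
+not part of any real program.</p>

```shell
#!/bin/bash
# count_args is a hypothetical helper: it reports how many arguments it received.
count_args() { echo "$#"; }

# Two hypothetical wrappers, differing only in how they forward arguments.
forward_star() { count_args $*; }     # unquoted $*: every argument is re-split on $IFS
forward_at()   { count_args "$@"; }   # quoted $@: each argument stays a single word

forward_star 'my file.txt' other   # prints 3
forward_at   'my file.txt' other   # prints 2
```

+<p>The filename with a space in it arrives as two arguments in the
+first case, and as one in the second.</p>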
+<p>One of the main sources of bugs (and security holes) in shell scripts
+is field separation. That’s what arrays are about.</p>
+<h2 id="what-field-separation">What? Field separation?</h2>
+<p>Field separation is just splitting a larger unit into a list of
+“fields”. The most common case is when Bash splits a “simple command”
+(in the Bash manual’s terminology) into a list of arguments.
+Understanding how this works is an important prerequisite to
+understanding arrays, and even why they are important.</p>
+<p>Dealing with lists is something that is very common in Bash scripts;
+from dealing with lists of arguments, to lists of files; they pop up a
+lot, and each time, you need to think about how the list is separated.
+In the case of <code>$PATH</code>, the list is separated by colons. In
+the case of <code>$CFLAGS</code>, the list is separated by whitespace.
+In the case of actual arrays, it’s easy: there’s no special character to
+worry about; just quote it, and you’re good to go.</p>
+<h2 id="bash-word-splitting">Bash word splitting</h2>
+<p>When Bash reads a “simple command”, it splits the whole thing into a
+list of “words”. “The first word specifies the command to be executed,
+and is passed as argument zero. The remaining words are passed as
+arguments to the invoked command.” (to quote <code>bash(1)</code>)</p>
+<p>It is often hard for those unfamiliar with Bash to understand when
+something is multiple words, and when it is a single word that just
+contains a space or newline. To help gain an intuitive understanding, I
+recommend using the following command to print a bullet list of words,
+to see how Bash splits them up:</p>
+<pre><code>printf ' -&gt; %s\n' <var>words…</var><hr> -&gt; word one
+ -&gt; multiline
+word
+ -&gt; third word
+</code></pre>
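+<p>For instance, an invocation along these lines should reproduce the
+sample output above. The function name <code>show_words</code> is just
+a convenience I am assuming here, and <code>$'…'</code> is Bash’s
+syntax for a string with backslash escapes.</p>

```shell
#!/bin/bash
# show_words is a hypothetical wrapper around the printf command above.
show_words() { printf ' -> %s\n' "$@"; }

# Three words: one with a space, one with an embedded newline, one more with a space.
show_words 'word one' $'multiline\nword' 'third word'
```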
+<p>In a simple command, in the absence of quoting, Bash separates the “raw”
+input into words by splitting on spaces and tabs. In other places, such
+as when expanding a variable, it uses the same process, but splits on
+the characters in the <code>$IFS</code> variable (which has the default
+value of space/tab/newline). This process is, creatively enough, called
+“word splitting”.</p>
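+<p>A small sketch of how <code>$IFS</code> changes the result when a
+variable is expanded; <code>count_words</code> and the variable names
+are hypothetical, chosen only for this example.</p>

```shell
#!/bin/bash
# count_words is a hypothetical helper: it reports how many words it was given.
count_words() { echo "$#"; }

var='one two three:four'

default_split=$(count_words $var)        # default IFS: split on the spaces
colon_split=$(IFS=:; count_words $var)   # IFS=: (in the subshell only): split on the colon
unsplit=$(count_words "$var")            # quoted: no word splitting at all

echo "$default_split $colon_split $unsplit"   # prints: 3 2 1
```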
+<p>In most discussions of Bash arrays, one of the frequent criticisms is
+all the footnotes and “gotchas” about when to quote things. That’s
+because they usually don’t set the context of word splitting.
+<strong>Double quotes (<code>"</code>) inhibit Bash from doing word
+splitting.</strong> That’s it, that’s all they do. Arrays are already
+split into words; without wrapping them in double quotes Bash re-word
+splits them, which is almost <em>never</em> what you want; otherwise,
+you wouldn’t be working with an array.</p>
+<h2 id="normal-array-syntax">Normal array syntax</h2>
+<table>
+ <caption>
+ <h1>Setting an array</h1>
+ <p><var>words…</var> is expanded and subject to word splitting
+ based on <code>$IFS</code>.</p>
+ </caption>
+ <tbody>
+ <tr>
+ <td><code>array=(<var>words…</var>)</code></td>
+ <td>Set the contents of the entire array.</td>
+ </tr><tr>
+ <td><code>array+=(<var>words…</var>)</code></td>
+ <td>Appends <var>words…</var> to the end of the array.</td>
+ </tr><tr>
+ <td><code>array[<var>n</var>]=<var>word</var></code></td>
+ <td>Sets an individual entry in the array, the first entry is at
+ <var>n</var>=0.</td>
+ </tr>
+ </tbody>
+</table>
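+<p>Put together, the three forms in the table might be used like this
+(the element values are arbitrary):</p>

```shell
#!/bin/bash
array=(foo 'bar baz')    # set the whole array: two elements
array+=(qux)             # append: now three elements
array[1]='replaced'      # overwrite the second element (n=1)

printf ' -> %s\n' "${array[@]}"
# prints:
#  -> foo
#  -> replaced
#  -> qux
```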
+<p>Now, for accessing the array. The most important things to
+understand about arrays are that you need to quote them, and the difference
+between <code>@</code> and <code>*</code>.</p>
+<table>
+ <caption>
+ <h1>Getting an entire array</h1>
+ <p>Unless these are wrapped in double quotes, they are subject to
+ word splitting, which defeats the purpose of arrays.</p>
+ <p>I guess it's worth mentioning that if you don't quote them, and
+ word splitting is applied, <code>@</code> and <code>*</code>
+ end up being equivalent.</p>
+ <p>With <code>*</code>, when joining the elements into a single
+ string, the elements are separated by the first character in
+ <code>$IFS</code>, which is, by default, a space.</p>
+ </caption>
+ <tbody>
+ <tr>
+ <td><code>"${array[@]}"</code></td>
+ <td>Evaluates to every element of the array, as separate
+ words.</td>
+ </tr><tr>
+ <td><code>"${array[*]}"</code></td>
+ <td>Evaluates to every element of the array, as a single
+ word.</td>
+ </tr>
+ </tbody>
+</table>
+<p>It’s really that simple—that covers most usages of arrays, and most
+of the mistakes made with them.</p>
+<p>To help you understand the difference between <code>@</code> and
+<code>*</code>, here is a sample of each:</p>
+<table>
+ <tbody>
+ <tr><th><code>@</code></th><th><code>*</code></th></tr>
+ <tr>
+ <td>Input:<pre><code>#!/bin/bash
+array=(foo bar baz)
+for item in "${array[@]}"; do
+ echo " - &lt;${item}&gt;"
+done</code></pre></td>
+ <td>Input:<pre><code>#!/bin/bash
+array=(foo bar baz)
+for item in "${array[*]}"; do
+ echo " - &lt;${item}&gt;"
+done</code></pre></td>
+ </tr>
+ <tr>
+ <td>Output:<pre><code> - &lt;foo&gt;
+ - &lt;bar&gt;
+ - &lt;baz&gt;</code></pre></td>
+ <td>Output:<pre><code> - &lt;foo bar baz&gt;<br><br><br></code></pre></td>
+ </tr>
+ </tbody>
+</table>
+<p>In most cases, <code>@</code> is what you want, but <code>*</code>
+comes up often enough too.</p>
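+<p>One place where <code>*</code> earns its keep is joining elements
+with a separator. A minimal sketch, assuming a comma is the separator
+you want; <code>$IFS</code> is changed only inside the subshell, so
+the rest of the script is unaffected.</p>

```shell
#!/bin/bash
array=(foo bar baz)

# "${array[*]}" joins the elements with the first character of $IFS.
joined=$(IFS=,; echo "${array[*]}")

echo "$joined"   # prints: foo,bar,baz
```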
+<p>To get individual entries, the syntax is
+<code>${array[<var>n</var>]}</code>, where <var>n</var> starts at 0.</p>
+<table>
+ <caption>
+ <h1>Getting a single entry from an array</h1>
+ <p>Also subject to word splitting if you don't wrap it in
+ quotes.</p>
+ </caption>
+ <tbody>
+ <tr>
+ <td><code>"${array[<var>n</var>]}"</code></td>
+ <td>Evaluates to the <var>n</var><sup>th</sup> entry of the
+ array, where the first entry is at <var>n</var>=0.</td>
+ </tr>
+ </tbody>
+</table>
+<p>To get a subset of the array, there are a few options:</p>
+<table>
+ <caption>
+ <h1>Getting subsets of an array</h1>
+ <p>Substitute <code>*</code> for <code>@</code> to get the subset
+ as a <code>$IFS</code>-separated string instead of separate
+ words, as described above.</p>
+ <p>Again, if you don't wrap these in double quotes, they are
+ subject to word splitting, which defeats the purpose of
+ arrays.</p>
+ </caption>
+ <tbody>
+ <tr>
+ <td><code>"${array[@]:<var>start</var>}"</code></td>
+ <td>Evaluates to the entries from <var>n</var>=<var>start</var> to the end
+ of the array.</td>
+ </tr><tr>
+ <td><code>"${array[@]:<var>start</var>:<var>count</var>}"</code></td>
+ <td>Evaluates to <var>count</var> entries, starting at
+ <var>n</var>=<var>start</var>.</td>
+ </tr><tr>
+ <td><code>"${array[@]::<var>count</var>}"</code></td>
+ <td>Evaluates to <var>count</var> entries from the beginning of
+ the array.</td>
+ </tr>
+ </tbody>
+</table>
+<p>Notice that <code>"${array[@]}"</code> is equivalent to
+<code>"${array[@]:0}"</code>.</p>
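+<p>Here is a short sketch of the three subset forms, storing each
+result in a new array. The variable names are arbitrary.</p>

```shell
#!/bin/bash
array=(a b c d e)

tail_part=("${array[@]:2}")     # from n=2 to the end:      c d e
middle=("${array[@]:1:3}")      # 3 entries starting at 1:  b c d
head_part=("${array[@]::2}")    # first 2 entries:          a b

printf '%s\n' "${tail_part[*]}" "${middle[*]}" "${head_part[*]}"
```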
+<table>
+ <caption>
+ <h1>Getting the length of an array</h1>
+ <p>This is the only situation with arrays where quoting doesn't
+ make a difference.</p>
+ <p>True to my earlier statement, when unquoted, there is no
+ difference between <code>@</code> and <code>*</code>.</p>
+ </caption>
+ <tbody>
+ <tr>
+ <td>
+ <code>${#array[@]}</code>
+ <br>or<br>
+ <code>${#array[*]}</code>
+ </td>
+ <td>
+ Evaluates to the length of the array
+ </td>
+ </tr>
+ </tbody>
+</table>
+<h2 id="argument-array-syntax">Argument array syntax</h2>
+<p>Accessing the arguments is mostly that simple, but that array doesn’t
+actually have a variable name. It’s special. Instead, it is exposed
+through a series of special variables (normal variables can only start
+with letters and underscore), that <em>mostly</em> match up with the
+normal array syntax.</p>
+<p>Setting the arguments array, on the other hand, is pretty different.
+That’s fine, because setting the arguments array is less useful
+anyway.</p>
+<table>
+ <caption>
+ <h1>Accessing the arguments array</h1>
+ <aside>Note that for values of <var>n</var> with more than 1
+ digit, you need to wrap it in <code>{}</code>.
+ Otherwise, <code>"$10"</code> would be parsed
+ as <code>"${1}0"</code>.</aside>
+ </caption>
+ <tbody>
+ <tr><th colspan=2>Individual entries</th></tr>
+ <tr><td><code>${array[0]}</code></td><td><code>$0</code></td></tr>
+ <tr><td><code>${array[1]}</code></td><td><code>$1</code></td></tr>
+ <tr><td colspan=2 style="text-align:center">…</td></tr>
+ <tr><td><code>${array[9]}</code></td><td><code>$9</code></td></tr>
+ <tr><td><code>${array[10]}</code></td><td><code>${10}</code></td></tr>
+ <tr><td colspan=2 style="text-align:center">…</td></tr>
+ <tr><td><code>${array[<var>n</var>]}</code></td><td><code>${<var>n</var>}</code></td></tr>
+ <tr><th colspan=2>Subset arrays (array)</th></tr>
+ <tr><td><code>"${array[@]}"</code></td><td><code>"${@:0}"</code></td></tr>
+ <tr><td><code>"${array[@]:1}"</code></td><td><code>"$@"</code></td></tr>
+ <tr><td><code>"${array[@]:<var>pos</var>}"</code></td><td><code>"${@:<var>pos</var>}"</code></td></tr>
+ <tr><td><code>"${array[@]:<var>pos</var>:<var>len</var>}"</code></td><td><code>"${@:<var>pos</var>:<var>len</var>}"</code></td></tr>
+ <tr><td><code>"${array[@]::<var>len</var>}"</code></td><td><code>"${@::<var>len</var>}"</code></td></tr>
+ <tr><th colspan=2>Subset arrays (string)</th></tr>
+ <tr><td><code>"${array[*]}"</code></td><td><code>"${*:0}"</code></td></tr>
+ <tr><td><code>"${array[*]:1}"</code></td><td><code>"$*"</code></td></tr>
+ <tr><td><code>"${array[*]:<var>pos</var>}"</code></td><td><code>"${*:<var>pos</var>}"</code></td></tr>
+ <tr><td><code>"${array[*]:<var>pos</var>:<var>len</var>}"</code></td><td><code>"${*:<var>pos</var>:<var>len</var>}"</code></td></tr>
+ <tr><td><code>"${array[*]::<var>len</var>}"</code></td><td><code>"${*::<var>len</var>}"</code></td></tr>
+ <tr><th colspan=2>Array length</th></tr>
+ <tr><td><code>${#array[@]}</code></td><td><code>$#</code> + 1</td></tr>
+ <tr><th colspan=2>Setting the array</th></tr>
+ <tr><td><code>array=("${array[0]}" <var>words…</var>)</code></td><td><code>set -- <var>words…</var></code></td></tr>
+ <tr><td><code>array=("${array[0]}" "${array[@]:2}")</code></td><td><code>shift</code></td></tr>
+ <tr><td><code>array=("${array[0]}" "${array[@]:<var>n+1</var>}")</code></td><td><code>shift <var>n</var></code></td></tr>
+ </tbody>
+</table>
+<p>Did you notice what was inconsistent? The variables <code>$*</code>,
+<code>$@</code>, and <code>$#</code> behave like the <var>n</var>=0
+entry doesn’t exist.</p>
+<table>
+ <caption>
+ <h1>Inconsistencies</h1>
+ </caption>
+ <tbody>
+ <tr>
+ <th colspan=3><code>@</code> or <code>*</code></th>
+ </tr><tr>
+ <td><code>"${array[@]}"</code></td>
+ <td>→</td>
+ <td><code>"${array[@]:0}"</code></td>
+ </tr><tr>
+ <td><code>"${@}"</code></td>
+ <td>→</td>
+ <td><code>"${@:1}"</code></td>
+ </tr><tr>
+ <th colspan=3><code>#</code></th>
+ </tr><tr>
+ <td><code>"${#array[@]}"</code></td>
+ <td>→</td>
+ <td>length</td>
+ </tr><tr>
+ <td><code>"${#}"</code></td>
+ <td>→</td>
+ <td>length-1</td>
+ </tr>
+ </tbody>
+</table>
+<p>These make sense because argument 0 is the name of the script—we
+almost never want that when parsing arguments. You’d spend more code
+getting the values that it currently gives you.</p>
+<p>Now, for an explanation of setting the arguments array. You cannot
+set argument <var>n</var>=0. The <code>set</code> command is used to
+manipulate the arguments passed to Bash after the fact—similarly, you
+could use <code>set -x</code> to make Bash behave like you ran it as
+<code>bash -x</code>; as with most GNU programs, the <code>--</code> tells
+it not to parse any of the arguments that follow as flags. The <code>shift</code>
+command shifts each entry <var>n</var> spots to the left, using
+<var>n</var>=1 if no value is specified; and leaving argument 0
+alone.</p>
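+<p>A minimal sketch of <code>set --</code> and <code>shift</code> in
+action. It is wrapped in a function (hypothetically named
+<code>demo</code>) because a function has its own arguments array, so
+the script’s own arguments are left untouched.</p>

```shell
#!/bin/bash
demo() {
    set -- first 'second arg' third   # replace arguments 1..n; argument 0 is untouched
    echo "$#"                         # prints 3
    shift                             # drop argument 1
    echo "$1"                         # prints: second arg
    shift 2                           # drop the remaining two
    echo "$#"                         # prints 0
}

demo ignored
```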
+<h2 id="but-you-mentioned-gotchas-about-quoting">But you mentioned
+“gotchas” about quoting!</h2>
+<p>But I explained that quoting simply inhibits word splitting, which
+you pretty much never want when working with arrays. If, for some odd
+reason, you do want word splitting, then that’s when you don’t quote.
+Simple, easy to understand.</p>
+<p>I think possibly the only case where you do want word splitting with
+an array is when you didn’t want an array, but it’s what you get
+(arguments are, by necessity, an array). For example:</p>
+<pre><code># Usage: path_ls PATH1 PATH2…
+# Description:
+# Takes any number of PATH-style values; that is,
+# colon-separated lists of directories, and prints a
+# newline-separated list of executables found in them.
+# Bugs:
+# Does not correctly handle programs with a newline in the name,
+# as the output is newline-separated.
+path_ls() {
+ local IFS dirs
+ IFS=:
+ dirs=($@) # The odd-ball time that it needs to be unquoted
+ find -L &quot;${dirs[@]}&quot; -maxdepth 1 -type f -executable \
+ -printf &#39;%f\n&#39; 2&gt;/dev/null | sort -u
+}</code></pre>
+<p>Logically, there shouldn’t be multiple arguments, just a single
+<code>$PATH</code> value; but, we can’t enforce that, as the array can
+have any size. So, we do the robust thing, and just act on the entire
+array, not really caring about the fact that it is an array. Alas, there
+is still a field-separation bug in the program, with the output.</p>
+<h2 id="i-still-dont-think-i-need-arrays-in-my-scripts">I still don’t
+think I need arrays in my scripts</h2>
+<p>Consider the common code:</p>
+<pre><code>ARGS=&#39; -f -q&#39;
+…
+command $ARGS # unquoted variables are a bad code-smell anyway</code></pre>
+<p>Here, <code>$ARGS</code> is field-separated by <code>$IFS</code>,
+which we are assuming has the default value. This is fine, as long as
+<code>$ARGS</code> is known to never need an embedded space; which you
+can be sure of as long as it isn’t based on anything outside of the program.
+But wait until you want to do this:</p>
+<pre><code>ARGS=&#39; -f -q&#39;
+…
+if [[ -f &quot;$filename&quot; ]]; then
+ ARGS+=&quot; -F $filename&quot;
+fi
+…
+command $ARGS</code></pre>
+<p>Now you’re hosed if <code>$filename</code> contains a space! More
+than just breaking, it could have unwanted side effects, such as when
+someone figures out how to make
+<code>filename='foo --dangerous-flag'</code>.</p>
+<p>Compare that with the array version:</p>
+<pre><code>ARGS=(-f -q)
+…
+if [[ -f &quot;$filename&quot; ]]; then
+ ARGS+=(-F &quot;$filename&quot;)
+fi
+…
+command &quot;${ARGS[@]}&quot;</code></pre>
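+<p>To see the two versions side by side, here is a sketch that uses a
+made-up <code>show_args</code> function as a stand-in for
+<code>command</code>; it just prints each argument it receives in
+angle brackets. The <code>-f</code> test from the snippets above is
+left out so that the example runs without any files on disk.</p>

```shell
#!/bin/bash
# show_args is a hypothetical stand-in for "command": it prints each argument as <arg>.
show_args() { printf '<%s>' "$@"; echo; }

filename='foo --dangerous-flag'

# String version: the filename is re-split when $string_args is expanded.
string_args=' -f -q'
string_args+=" -F $filename"
as_string=$(show_args $string_args)

# Array version: the filename stays a single argument.
array_args=(-f -q)
array_args+=(-F "$filename")
as_array=$(show_args "${array_args[@]}")

echo "$as_string"   # prints: <-f><-q><-F><foo><--dangerous-flag>
echo "$as_array"    # prints: <-f><-q><-F><foo --dangerous-flag>
```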
+<h2 id="what-about-portability">What about portability?</h2>
+<p>Except for the little stubs that call another program with
+<code>"$@"</code> at the end, trying to write for multiple shells
+(including the ambiguous <code>/bin/sh</code>) is not a task for mere
+mortals. If you do try that, your best bet is probably sticking to
+POSIX. Arrays are not POSIX; except for the arguments array, which is;
+though getting subset arrays from <code>$@</code> and <code>$*</code> is
+not (tip: use <code>set --</code> to re-purpose the arguments
+array).</p>
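+<p>As a sketch of that tip, here is roughly how the
+<code>ARGS</code> example could be written using only the arguments
+array. The function name <code>build_args</code> and the variable
+<code>file</code> are assumptions made for this illustration, and
+<code>printf</code> again stands in for the real command.</p>

```shell
#!/bin/sh
# build_args is a hypothetical function; inside it, the arguments array
# is re-purposed as the list of flags to pass along.
build_args() {
    file=$1                    # save what we need before overwriting the arguments
    set -- -f -q               # start a fresh list
    set -- "$@" -F "$file"     # append to it, keeping each entry a single word
    printf '<%s>' "$@"         # stand-in for the real command
    printf '\n'
}

build_args 'my file'   # prints: <-f><-q><-F><my file>
```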
+<p>Writing for various versions of Bash, though, is pretty do-able.
+Everything here works all the way back to bash-2.0 (December 1996), with
+the following exceptions:</p>
+<ul>
+<li><p>The <code>+=</code> operator wasn’t added until Bash 3.1.</p>
+<ul>
+<li>As a work-around, use
+<code>array[${#array[*]}]=<var>word</var></code> to append a single
+element.</li>
+</ul></li>
+<li><p>Accessing subset arrays of the arguments array is inconsistent if
+<var>pos</var>=0 in <code>${@:<var>pos</var>:<var>len</var>}</code>.</p>
+<ul>
+<li>In Bash 2.x and 3.x, it works as expected, except that argument 0 is
+silently missing. For example <code>${@:0:3}</code> gives arguments 1
+and 2; whereas <code>${@:1:3}</code> gives arguments 1, 2, and 3. This
+means that if <var>pos</var>=0, then only <var>len</var>-1 arguments are
+given back.</li>
+<li>In Bash 4.0, argument 0 can be accessed, but if <var>pos</var>=0,
+then it only gives back <var>len</var>-1 arguments. So,
+<code>${@:0:3}</code> gives arguments 0 and 1.</li>
+<li>In Bash 4.1 and higher, it works in the way described in the main
+part of this document.</li>
+</ul></li>
+</ul>
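+<p>The append work-around from the first item could look like this. It
+assumes the array has no gaps in its indices, so that the length is
+also the next free index.</p>

```shell
#!/bin/bash
array=(foo bar)

# Equivalent to array+=('baz qux') when the indices are 0..length-1.
array[${#array[*]}]='baz qux'

printf ' -> %s\n' "${array[@]}"
# prints:
#  -> foo
#  -> bar
#  -> baz qux
```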
+<p>Now, Bash 1.x doesn’t have arrays at all. <code>$@</code> and
+<code>$*</code> work, but using <code>:</code> to select a range of
+elements from them doesn’t. Good thing most boxes have been updated
+since 1996!</p>
+
+</article>
+<footer>
+
+<p>The content of this page is Copyright © 2013 <a href="mailto:lukeshu@lukeshu.com">Luke T. Shumaker</a>.</p>
+<p>This page is licensed under the <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a> license.</p>
+</footer>
+</body>
+</html>