Diffstat (limited to 'public/bash-arrays.html')
-rw-r--r-- | public/bash-arrays.html | 410 |
1 files changed, 410 insertions, 0 deletions
diff --git a/public/bash-arrays.html b/public/bash-arrays.html new file mode 100644 index 0000000..76132f5 --- /dev/null +++ b/public/bash-arrays.html @@ -0,0 +1,410 @@ +<!DOCTYPE html> +<html lang="en"> +<head> + <meta charset="utf-8"> + <title>Bash arrays — Luke T. Shumaker</title> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <link rel="stylesheet" href="assets/style.css"> + <link rel="alternate" type="application/atom+xml" href="./index.atom" name="web log entries"/> +</head> +<body> +<header><a href="/">Luke T. Shumaker</a> » <a href=/blog>blog</a> » bash-arrays</header> +<article> +<h1 id="bash-arrays">Bash arrays</h1> +<p>Way too many people don’t understand Bash arrays. Many of them argue +that if you need arrays, you shouldn’t be using Bash. If we reject the +notion that one should never use Bash for scripting, then thinking you +don’t need Bash arrays is what I like to call “wrong”. I don’t even mean +real scripting; even these little stubs in <code>/usr/bin</code>:</p> +<pre><code>#!/bin/sh +java -jar /…/something.jar $* # WRONG!</code></pre> +<p>Command line arguments are exposed as an array; that little +<code>$*</code> is accessing it, and is doing the wrong thing (for the +lazy, the correct thing is <code>-- "$@"</code>). Arrays in Bash offer a +safe way to preserve field separation.</p> +<p>One of the main sources of bugs (and security holes) in shell scripts +is field separation. That’s what arrays are about.</p> +<h2 id="what-field-separation">What? Field separation?</h2> +<p>Field separation is just splitting a larger unit into a list of +“fields”. The most common case is when Bash splits a “simple command” +(in the Bash manual’s terminology) into a list of arguments. 
+Understanding how this works is an important prerequisite to +understanding arrays, and even why they are important.</p> +<p>Dealing with lists is something that is very common in Bash scripts; +from dealing with lists of arguments, to lists of files; they pop up a +lot, and each time, you need to think about how the list is separated. +In the case of <code>$PATH</code>, the list is separated by colons. In +the case of <code>$CFLAGS</code>, the list is separated by whitespace. +In the case of actual arrays, it’s easy, there’s no special character to +worry about, just quote it, and you’re good to go.</p> +<h2 id="bash-word-splitting">Bash word splitting</h2> +<p>When Bash reads a “simple command”, it splits the whole thing into a +list of “words”. “The first word specifies the command to be executed, +and is passed as argument zero. The remaining words are passed as +arguments to the invoked command.” (to quote <code>bash(1)</code>)</p> +<p>It is often hard for those unfamiliar with Bash to understand when +something is multiple words, and when it is a single word that just +contains a space or newline. To help gain an intuitive understanding, I +recommend using the following command to print a bullet list of words, +to see how Bash splits them up:</p> +<pre><code>printf ' -> %s\n' <var>words…</var><hr> -> word one + -> multiline +word + -> third word +</code></pre> +<p>In a simple command, in absence of quoting, Bash separates the “raw” +input into words by splitting on spaces and tabs. In other places, such +as when expanding a variable, it uses the same process, but splits on +the characters in the <code>$IFS</code> variable (which has the default +value of space/tab/newline). This process is, creatively enough, called +“word splitting”.</p> +<p>In most discussions of Bash arrays, one of the frequent criticisms is +all the footnotes and “gotchas” about when to quote things. That’s +because they usually don’t set the context of word splitting. 
+<strong>Double quotes (<code>"</code>) inhibit Bash from doing word +splitting.</strong> That’s it, that’s all they do. Arrays are already +split into words; without wrapping them in double quotes Bash re-word +splits them, which is almost <em>never</em> what you want; otherwise, +you wouldn’t be working with an array.</p> +<h2 id="normal-array-syntax">Normal array syntax</h2> +<table> + <caption> + <h1>Setting an array</h1> + <p><var>words…</var> is expanded and subject to word splitting + based on <code>$IFS</code>.</p> + </caption> + <tbody> + <tr> + <td><code>array=(<var>words…</var>)</code></td> + <td>Set the contents of the entire array.</td> + </tr><tr> + <td><code>array+=(<var>words…</var>)</code></td> + <td>Appends <var>words…</var> to the end of the array.</td> + </tr><tr> + <td><code>array[<var>n</var>]=<var>word</var></code></td> + <td>Sets an individual entry in the array, the first entry is at + <var>n</var>=0.</td> + </tr> + </tbody> +</table> +<p>Now, for accessing the array. 
The most important things for +understanding arrays are quoting them and understanding the difference +between <code>@</code> and <code>*</code>.</p> +<table> + <caption> + <h1>Getting an entire array</h1> + <p>Unless these are wrapped in double quotes, they are subject to + word splitting, which defeats the purpose of arrays.</p> + <p>I guess it's worth mentioning that if you don't quote them, and + word splitting is applied, <code>@</code> and <code>*</code> + end up being equivalent.</p> + <p>With <code>*</code>, when joining the elements into a single + string, the elements are separated by the first character in + <code>$IFS</code>, which is, by default, a space.</p> + </caption> + <tbody> + <tr> + <td><code>"${array[@]}"</code></td> + <td>Evaluates to every element of the array, as separate + words.</td> + </tr><tr> + <td><code>"${array[*]}"</code></td> + <td>Evaluates to every element of the array, as a single + word.</td> + </tr> + </tbody> +</table> +<p>It’s really that simple; that covers most usages of arrays, and most +of the mistakes made with them.</p> +<p>To help you understand the difference between <code>@</code> and +<code>*</code>, here is a sample of each:</p> +<table> + <tbody> + <tr><th><code>@</code></th><th><code>*</code></th></tr> + <tr> + <td>Input:<pre><code>#!/bin/bash +array=(foo bar baz) +for item in "${array[@]}"; do + echo " - <${item}>" +done</code></pre></td> + <td>Input:<pre><code>#!/bin/bash +array=(foo bar baz) +for item in "${array[*]}"; do + echo " - <${item}>" +done</code></pre></td> + </tr> + <tr> + <td>Output:<pre><code> - <foo> + - <bar> + - <baz></code></pre></td> + <td>Output:<pre><code> - <foo bar baz><br><br><br></code></pre></td> + </tr> + </tbody> +</table> +<p>In most cases, <code>@</code> is what you want, but <code>*</code> +comes up often enough too.</p> +<p>To get individual entries, the syntax is +<code>${array[<var>n</var>]}</code>, where <var>n</var> starts at 0.</p> +<table> + <caption> + <h1>Getting a 
single entry from an array</h1> + <p>Also subject to word splitting if you don't wrap it in + quotes.</p> + </caption> + <tbody> + <tr> + <td><code>"${array[<var>n</var>]}"</code></td> + <td>Evaluates to the <var>n</var><sup>th</sup> entry of the + array, where the first entry is at <var>n</var>=0.</td> + </tr> + </tbody> +</table> +<p>To get a subset of the array, there are a few options:</p> +<table> + <caption> + <h1>Getting subsets of an array</h1> + <p>Substitute <code>*</code> for <code>@</code> to get the subset + as a <code>$IFS</code>-separated string instead of separate + words, as described above.</p> + <p>Again, if you don't wrap these in double quotes, they are + subject to word splitting, which defeats the purpose of + arrays.</p> + </caption> + <tbody> + <tr> + <td><code>"${array[@]:<var>start</var>}"</code></td> + <td>Evaluates to the entries from <var>n</var>=<var>start</var> to the end + of the array.</td> + </tr><tr> + <td><code>"${array[@]:<var>start</var>:<var>count</var>}"</code></td> + <td>Evaluates to <var>count</var> entries, starting at + <var>n</var>=<var>start</var>.</td> + </tr><tr> + <td><code>"${array[@]::<var>count</var>}"</code></td> + <td>Evaluates to <var>count</var> entries from the beginning of + the array.</td> + </tr> + </tbody> +</table> +<p>Notice that <code>"${array[@]}"</code> is equivalent to +<code>"${array[@]:0}"</code>.</p> +<table> + <caption> + <h1>Getting the length of an array</h1> + <p>This is the only situation with arrays where quoting doesn't + make a difference.</p> + <p>True to my earlier statement, when unquoted, there is no + difference between <code>@</code> and <code>*</code>.</p> + </caption> + <tbody> + <tr> + <td> + <code>${#array[@]}</code> + <br>or<br> + <code>${#array[*]}</code> + </td> + <td> + Evaluates to the length of the array. + </td> + </tr> + </tbody> +</table> +<h2 id="argument-array-syntax">Argument array syntax</h2> +<p>Accessing the arguments is mostly that simple, but that array doesn’t 
+actually have a variable name. It’s special. Instead, it is exposed +through a series of special variables (normal variables can only start +with letters and underscore), that <em>mostly</em> match up with the +normal array syntax.</p> +<p>Setting the arguments array, on the other hand, is pretty different. +That’s fine, because setting the arguments array is less useful +anyway.</p> +<table> + <caption> + <h1>Accessing the arguments array</h1> + <aside>Note that for values of <var>n</var> with more than 1 + digit, you need to wrap it in <code>{}</code>. + Otherwise, <code>"$10"</code> would be parsed + as <code>"${1}0"</code>.</aside> + </caption> + <tbody> + <tr><th colspan=2>Individual entries</th></tr> + <tr><td><code>${array[0]}</code></td><td><code>$0</code></td></tr> + <tr><td><code>${array[1]}</code></td><td><code>$1</code></td></tr> + <tr><td colspan=2 style="text-align:center">…</td></tr> + <tr><td><code>${array[9]}</code></td><td><code>$9</code></td></tr> + <tr><td><code>${array[10]}</code></td><td><code>${10}</code></td></tr> + <tr><td colspan=2 style="text-align:center">…</td></tr> + <tr><td><code>${array[<var>n</var>]}</code></td><td><code>${<var>n</var>}</code></td></tr> + <tr><th colspan=2>Subset arrays (array)</th></tr> + <tr><td><code>"${array[@]}"</code></td><td><code>"${@:0}"</code></td></tr> + <tr><td><code>"${array[@]:1}"</code></td><td><code>"$@"</code></td></tr> + <tr><td><code>"${array[@]:<var>pos</var>}"</code></td><td><code>"${@:<var>pos</var>}"</code></td></tr> + <tr><td><code>"${array[@]:<var>pos</var>:<var>len</var>}"</code></td><td><code>"${@:<var>pos</var>:<var>len</var>}"</code></td></tr> + <tr><td><code>"${array[@]::<var>len</var>}"</code></td><td><code>"${@::<var>len</var>}"</code></td></tr> + <tr><th colspan=2>Subset arrays (string)</th></tr> + <tr><td><code>"${array[*]}"</code></td><td><code>"${*:0}"</code></td></tr> + <tr><td><code>"${array[*]:1}"</code></td><td><code>"$*"</code></td></tr> + 
<tr><td><code>"${array[*]:<var>pos</var>}"</code></td><td><code>"${*:<var>pos</var>}"</code></td></tr> + <tr><td><code>"${array[*]:<var>pos</var>:<var>len</var>}"</code></td><td><code>"${*:<var>pos</var>:<var>len</var>}"</code></td></tr> + <tr><td><code>"${array[*]::<var>len</var>}"</code></td><td><code>"${*::<var>len</var>}"</code></td></tr> + <tr><th colspan=2>Array length</th></tr> + <tr><td><code>${#array[@]}</code></td><td><code>$#</code> + 1</td></tr> + <tr><th colspan=2>Setting the array</th></tr> + <tr><td><code>array=("${array[0]}" <var>words…</var>)</code></td><td><code>set -- <var>words…</var></code></td></tr> + <tr><td><code>array=("${array[0]}" "${array[@]:2}")</code></td><td><code>shift</code></td></tr> + <tr><td><code>array=("${array[0]}" "${array[@]:<var>n+1</var>}")</code></td><td><code>shift <var>n</var></code></td></tr> + </tbody> +</table> +<p>Did you notice what was inconsistent? The variables <code>$*</code>, +<code>$@</code>, and <code>$#</code> behave like the <var>n</var>=0 +entry doesn’t exist.</p> +<table> + <caption> + <h1>Inconsistencies</h1> + </caption> + <tbody> + <tr> + <th colspan=3><code>@</code> or <code>*</code></th> + </tr><tr> + <td><code>"${array[@]}"</code></td> + <td>→</td> + <td><code>"${array[@]:0}"</code></td> + </tr><tr> + <td><code>"${@}"</code></td> + <td>→</td> + <td><code>"${@:1}"</code></td> + </tr><tr> + <th colspan=3><code>#</code></th> + </tr><tr> + <td><code>"${#array[@]}"</code></td> + <td>→</td> + <td>length</td> + </tr><tr> + <td><code>"${#}"</code></td> + <td>→</td> + <td>length-1</td> + </tr> + </tbody> +</table> +<p>These make sense because argument 0 is the name of the script—we +almost never want that when parsing arguments. You’d spend more code +getting the values that it currently gives you.</p> +<p>Now, for an explanation of setting the arguments array. You cannot +set argument <var>n</var>=0. 
The <code>set</code> command is used to +manipulate the arguments passed to Bash after the fact (similarly, you +could use <code>set -x</code> to make Bash behave as if you had run it as +<code>bash -x</code>). Like most GNU programs, the <code>--</code> tells +it not to parse any of the words that follow as flags. The <code>shift</code> +command shifts each entry <var>n</var> spots to the left, using +<var>n</var>=1 if no value is specified, and leaving argument 0 +alone.</p> +<h2 id="but-you-mentioned-gotchas-about-quoting">But you mentioned +“gotchas” about quoting!</h2> +<p>But I explained that quoting simply inhibits word splitting, which +you pretty much never want when working with arrays. If, for some odd +reason, you do want word splitting, then that’s when you don’t quote. +Simple, easy to understand.</p> +<p>I think possibly the only case where you do want word splitting with +an array is when you didn’t want an array, but it’s what you get +(arguments are, by necessity, an array). For example:</p> +<pre><code># Usage: path_ls PATH1 PATH2… +# Description: +# Takes any number of PATH-style values; that is, +# colon-separated lists of directories, and prints a +# newline-separated list of executables found in them. +# Bugs: +# Does not correctly handle programs with a newline in the name, +# as the output is newline-separated. +path_ls() { + local IFS dirs + IFS=: + dirs=($@) # The odd-ball time that it needs to be unquoted + find -L "${dirs[@]}" -maxdepth 1 -type f -executable \ + -printf '%f\n' 2>/dev/null | sort -u +}</code></pre> +<p>Logically, there shouldn’t be multiple arguments, just a single +<code>$PATH</code> value; but we can’t enforce that, as the array can +have any size. So, we do the robust thing, and just act on the entire +array, not really caring about the fact that it is an array. 
Alas, there +is still a field-separation bug in the program, with the output.</p> +<h2 id="i-still-dont-think-i-need-arrays-in-my-scripts">I still don’t +think I need arrays in my scripts</h2> +<p>Consider the common code:</p> +<pre><code>ARGS=' -f -q' +… +command $ARGS # unquoted variables are a bad code-smell anyway</code></pre> +<p>Here, <code>$ARGS</code> is field-separated by <code>$IFS</code>, +which we are assuming has the default value. This is fine, as long as +<code>$ARGS</code> is known to never need an embedded space, which you +can know as long as it isn’t based on anything outside of the program. But +wait until you want to do this:</p> +<pre><code>ARGS=' -f -q' +… +if [[ -f "$filename" ]]; then + ARGS+=" -F $filename" +fi +… +command $ARGS</code></pre> +<p>Now you’re hosed if <code>$filename</code> contains a space! More +than just breaking, it could have unwanted side effects, such as when +someone figures out how to make +<code>filename='foo --dangerous-flag'</code>.</p> +<p>Compare that with the array version:</p> +<pre><code>ARGS=(-f -q) +… +if [[ -f "$filename" ]]; then + ARGS+=(-F "$filename") +fi +… +command "${ARGS[@]}"</code></pre> +<h2 id="what-about-portability">What about portability?</h2> +<p>Except for the little stubs that call another program with +<code>"$@"</code> at the end, trying to write for multiple shells +(including the ambiguous <code>/bin/sh</code>) is not a task for mere +mortals. If you do try that, your best bet is probably sticking to +POSIX. Arrays are not POSIX, except for the arguments array, which is; +getting subset arrays from <code>$@</code> and <code>$*</code>, however, is +not (tip: use <code>set --</code> to re-purpose the arguments +array).</p> +<p>Writing for various versions of Bash, though, is pretty do-able. 
+Everything here works all the way back in bash-2.0 (December 1996), with +the following exceptions:</p> +<ul> +<li><p>The <code>+=</code> operator wasn’t added until Bash 3.1.</p> +<ul> +<li>As a work-around, use +<code>array[${#array[*]}]=<var>word</var></code> to append a single +element.</li> +</ul></li> +<li><p>Accessing subset arrays of the arguments array is inconsistent if +<var>pos</var>=0 in <code>${@:<var>pos</var>:<var>len</var>}</code>.</p> +<ul> +<li>In Bash 2.x and 3.x, it works as expected, except that argument 0 is +silently missing. For example <code>${@:0:3}</code> gives arguments 1 +and 2; where <code>${@:1:3}</code> gives arguments 1, 2, and 3. This +means that if <var>pos</var>=0, then only <var>len</var>-1 arguments are +given back.</li> +<li>In Bash 4.0, argument 0 can be accessed, but if <var>pos</var>=0, +then it only gives back <var>len</var>-1 arguments. So, +<code>${@:0:3}</code> gives arguments 0 and 1.</li> +<li>In Bash 4.1 and higher, it works in the way described in the main +part of this document.</li> +</ul></li> +</ul> +<p>Now, Bash 1.x doesn’t have arrays at all. <code>$@</code> and +<code>$*</code> work, but using <code>:</code> to select a range of +elements from them doesn’t. Good thing most boxes have been updated +since 1996!</p> + +</article> +<footer> + <aside class="sponsor"><p>I'd love it if you <a class="em" + href="/sponsor/">sponsored me</a>. It will allow me to continue + <a class="em" href="/imworkingon/">my work</a> on the GNU/Linux + ecosystem. Thanks!</p></aside> + +<p>The content of this page is Copyright © 2013 <a href="mailto:lukeshu@lukeshu.com">Luke T. Shumaker</a>.</p> +<p>This page is licensed under the <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a> license.</p> +</footer> +</body> +</html> |