Menu Home

Reservoir Sampling with bash

So I have a fairly extensive .bashrc file, and one thing you might notice while scrolling through it, is that I have a whole bunch of “step-in” functions. These are functions that replicate some other command for some reason. Primarily this is because I work on both Linux and Solaris, and Solaris is rather lacking. Usually these functions are there to prop up other functions, and quite often I’ll evaluate and rebuild some of the code. Usually I might look at my code and think “what the fuck was I doing? It’d be better to do it this way”, or I might learn a new trick that I immediately merge in while it’s fresh in memory (unintentional pun).

Now, I get that a lot of people would rather split their functions out into separate scripts and maybe use some form of dotfile management. This doesn’t work for me, sadly, as I’m talking about work i.e. client servers with no internet access. And you can’t just login to a random HP-UX host and git clone your dotfiles. If you can do that, then you can already install things, and you may as well just install the GNU coreutils and go on with your life. Dotfile management is a cool concept, don’t get me wrong, but it’s not so useful for a corporate/enterprise level *nix sysadmin – the better option IMHO is the one that I use: a monolithic .bashrc that acts as my digital toolbox.

Recently, I made major performance gains with my passphrase generator, genphrase. I realised that my previous approach was a bit stupid. For example:

while (( n < PphraseNum )); do
  wordArray=( "${SeedWord}" )
  for line in $(rand -N "${PphraseWords}" -M "$(wc -l < ~/.pwords.dict)"); do
    wordArray+=( $(printline "${line}" ~/.pwords.dict | capitalise) )
  for word in "${wordArray[@]}"; do
    printf '%s\n' "${RANDOM} ${word}"
  done | sort | awk '{print $2}' | tr -d "\n"
  printf "\n"
  ((n = n + 1))
done | "${PphraseCols}"

There’s a lot going on here, but basically we’re taking a seed word (or not) and adding it to an array. Then we’re randomly selecting some words from a dictionary and capitalising them. Then we’re assigning a number to each word using $RANDOM, followed by a sort and then straightening everything out. Rinse and repeat for any number of words. Churning through a loop like this is expensive, especially with multiple calls to rand, printline, capitalise, sort, awk and tr. Not to mention opening and closing the dictionary multiple times. This is really stupid – in my defence, though, genphrase was initially written to be compatible down to bash 2.04 on Solaris 8, so a lot of stupid workarounds were implemented.

I realised that if we know how many passphrases we want, and we know how many words we want in each passphrase, well… then we can simply use two arrays. Our dictionary file is small enough to easily fit into memory, and mapping it into an array is near-instant. Then we generate another array with sufficient random integers between 1 and the size of the dictionary file. Then simply walk through the array of random integers and pluck out the relative elements from the dictionary array. A simplified version looks something like this:

mapfile -t dictArray < ~/.pwords.dict
mapfile -t numArray < <(rand -M "${#dictArray[@]}" -r -N "${totalWords}")

hiWord=$(( PphraseWords - 1 ))

while (( hiWord <= totalWords )); do
   printf '%s\n' "${RANDOM} ${SeedWord}"
   for randInt in "${numArray[@]:loWord:PphraseWords}"; do
     printf '%s\n' "${RANDOM} ${dictArray[randInt]^}"
 } | sort | awk '{print $2}' | paste -sd '\0' -
 loWord=$(( hiWord + 1 ))
 hiWord=$(( hiWord + PphraseWords ))
 done | "${PphraseCols}"

And the difference was dramatic. 1000x 3-word phrases went from 38 seconds to 5 seconds using the above bash method. Even more dramatic was on a test Solaris host, where the same test went from approximately 48 minutes (!!!) down to 1 minute 30!

This is relevant to the post title, I promise.

Right now, I’m working on upgrading my shuf step-in function. It was half-arsed, rubbish, slow and lacked most of the arguments. It was nowhere near a drop-in alternative to the real thing. And, crucially, it was incapable of handling n-size inputs, where n is either too large for memory or unknown.

Unfortunately, my brain is overloaded like a busy train station with Dad stuff and work stuff, so when I look at examples of Reservoir Sampling, even Wikipedia’s effort, my eyes just glaze over. I really struggled to wrap my head around it, so instead I took the approach of going with the Knuth/Fisher-Yates family of shuffling algorithms, which I’m already familiar with, and tacking on a reservoir. Now, that in and of itself isn’t too hard, but the issue I’ve had has been with how to handle either stdin or a file. Usually, you’d use this form:

while read -r line; do
done < "${1:-/dev/stdin}"

For those who haven’t seen that kind of variable in bash before, it essentially means “use $1, and if it’s not assigned, default to /dev/stdin.” But in our case, we’re grabbing the first chunk of the input using mapfile to quickly fill our reservoir. Then we’re following up with a while read loop until the input is exhausted. This works fine with stdin, but with a file, you have to manage it in two chunks – the first chunk to fill the reservoir and the second chunk for the while read loop, starting at reservoir size + 1.

I tried multiple ways to get this to work and they were all fragile, and then I remembered a trick from my rand script: cat - will stream its stdin, so we can very simply group our mapfile and while read loop, and then cat either a file or stdin to that group of commands. So a proof of concept script for this approach looks like this:

#set -x

cat "${1:--}" | {
mapfile -n 5 -t lineArray

while read -r inLine; do
  randInt=$(rand -M "${#lineArray[@]}")
  (( randInt-- ))
  printf '%s\n' "${lineArray[randInt]}"

randArray=( $(rand -M "${#lineArray[@]}" -N "${#lineArray[@]}") )

for randInt in ${randArray[@]}; do
  (( randInt-- ))
  printf '%s\n' "${lineArray[randInt]}"

Walking through it; cat either a file or stdin to our group of commands. mapfile grabs the first 5 lines and puts it into lineArray[@] (this is our ‘reservoir’, I’ll reference them interchangeably). Because it’s a test script we’ll give it small inputs, so 5 elements in the array is fine. And, as an example, let’s feed it the alphabet.

After the reservoir is filled, lineArray[@] looks like this:

'([0]="a" [1]="b" [2]="c" [3]="d" [4]="e")'

In bash, arrays are indexed starting at 0, so index 0 is the letter a (i.e. [0]="a"), index 1 is the letter b (i.e. [1]="b") and so on. The while read loop takes over, we generate a random integer between 1 and the size of lineArray[@], in this case 5. So let’s say that random integer is 3, we subtract 1 from it to compensate for the 0-starting point of the array, and we obviously get the number 2. Then we print out that index from the reservoir, i.e. [2]="c". Next we insert the next line of input, which is obviously the letter f. lineArray[@] now looks like this:

'([0]="a" [1]="b" [2]="f" [3]="d" [4]="e")'

Rinse and repeat until the stream is exhausted. Now, once the stream is exhausted, we still have elements in our reservoir, so we spin up another array of random integers the same size as our reservoir and… you can see where this is going? Told you my genphrase backstory was relevant 🙂

Ok, so let’s demonstrate this proof of concept script, which I’ve called readtest. First, I’ll generate a test file with the letters of the alphabet, to prove that file reading works:

$ printf '%s\n' {a..z} > alphabet

I’ll use paste here to put the output all on one line for your readability:

$ readtest alphabet | paste -sd ' '
c d g h a f j i e b m n l o s p r v k x y t u w z q

And the same, but using stdin:

$ readtest alphabet < <(printf '%s\n' {a..z}) | paste -sd ' '
b a g h f j d c e i m n l o p t k q v x y s r w u z

And it gives different results:

$ readtest alphabet < <(printf '%s\n' {a..z}) | paste -sd ' '
d a b g f c h i m j o l q k e p u r t s y w z x n v

It’s obvious even at this low scale that the reservoir size of 5 is insufficient, as the letters tend to be nearer to their actual location in the alphabetical order. If we put the reservoir size up to 10, we get more pleasing results:

$ readtest alphabet < <(printf '%s\n' {a..z}) | paste -sd ' '
a d e g i c k p o n h u v q x b w j z l m y r f t s
$ readtest alphabet < <(printf '%s\n' {a..z}) | paste -sd ' '
e b c h a k n l g r s q j u m i y d w f x v o z p t

Obviously this needs a little bit more work – one issue to contend with is the pipe buffering, and I don’t know if this completely/technically classifies as reservoir sampling per se, or simply a Fisher-Yates variant + a reservoir.  This has allowed me to complete the first draft of my shuf function though, and as I’ve found no other shell based instances of reservoir sampling, I figured it’s best to put this out there.


UPDATE: cat is an external tool and generally you should try to avoid calling external tools. The approach for not-cat‘ing this is to use a file descriptor. One of my rules-of-thumb for shell scripting is that if you’re reaching for a custom file descriptor, you should probably reach for another language. Example implementation as follows, it doesn’t seem to have any appreciable performance difference, however:

#set -x

if [[ "$1" ]]; then
  exec 6< "$1"
  exec 6<&0

mapfile -u 6 -n "${2:-100}" -t lineArray

while read -r -u 6 inLine; do
  randInt=$(rand -M "${#lineArray[@]}")
  (( randInt-- ))
  printf '%s\n' "${lineArray[randInt]}"

randArray=( $(rand -M "${#lineArray[@]}" -N "${#lineArray[@]}") )

for randInt in "${randArray[@]}"; do
  (( randInt-- ))
  printf '%s\n' "${lineArray[randInt]}"

exec 0<&6 6<&-

Probably the performance impact is caused by repeated calls to rand...

Categories: Geeking Out Lunix Lunacy