• Alexios
  • Linux
  • unix
  • bash
  • shell

Sometimes you encounter weird problems, and drilling down reveals something weird and interesting which of course makes sense in retrospect. This is an account of one of them.

Some context: this is a backup helper script, in bash. It uses gpg to decrypt multiple backups in parallel, to take better advantage of modern multi-threaded processors—using GNU parallel, in case you're curious. Because there are hundreds of files to decrypt, the scripts loads its own gpg-agent so the operator won't need to enter a long passphrase multiple times.

The problem itself can be demonstrated with a minuscule subset of this script. Call this script foo.sh:


killall -9 gpg-agent # We always start a new agent.
exec 4>&1            # Simulate script logging
gpg-agent --daemon
echo 'Done!'

Now, if you run foo.sh on its own:

$ ./foo.sh
+ exec
+ gpg-agent --daemon
+ echo 'Done!'
$ █

it works fine: you get back to the shell prompt, and if you check the process table, you'll find gpg-agent running.

Now try piping the output to something like tee:

$ ./foo.sh | tee foo.log
+ exec
+ gpg-agent --daemon
+ echo 'Done!'

Where's the prompt gone? The first person who encountered this problem discovered two fixes empirically:

  1. Breaking out of the script with Ctrl-C.
  2. Switching to another session and invoking the script's clean-up mode, which kills gpg-agent.

In both cases, the script ended and the user got the shell prompt back. Even without those hints, and given I'm probably half as clever as you, you've already spotted the problem. gpg-agent is somehow holding stdout open, and since tee's standard input is still open, it's still waiting for the script's output.

There are a number of ways to diagnose this issue:

  1. Use lsof -c gpg-agent to see what file descriptors the agent has open.
  2. On Linux, take a gander inside /proc/PID/fd: (where PID is the process ID of gpg-agent)
  3. Use strace to see what the agent is doing on start-up.
  4. Check out the source.

My immediate instinct (knowing the script intimately) was to check inside /proc/ for file descriptor 4. In there, I found something like this:

$ sudo ls -la /proc/1696/fd
total 0
dr-x------ 2 alexios alexios  0 Mar  5 02:58 .
dr-xr-xr-x 8 alexios alexios  0 Mar  5 02:57 ..
lr-x------ 1 alexios alexios 64 Mar  5 02:58 0 -> /dev/null
l-wx------ 1 alexios alexios 64 Mar  5 02:58 1 -> /dev/null
l-wx------ 1 alexios alexios 64 Mar  5 02:58 2 -> /dev/null
lrwx------ 1 alexios alexios 64 Mar  5 02:58 3 -> socket:[47834223]
l-wx------ 1 alexios alexios 64 Mar  5 02:58 4 -> pipe:[47834101]
lrwx------ 1 alexios alexios 64 Mar  5 02:58 5 -> socket:[47834225]
lrwx------ 1 alexios alexios 64 Mar  5 02:58 6 -> socket:[47834227]
lrwx------ 1 alexios alexios 64 Mar  5 02:58 7 -> socket:[47834229]
lr-x------ 1 alexios alexios 64 Mar  5 02:58 8 -> inotify
lr-x------ 1 alexios alexios 64 Mar  5 02:58 9 -> inotify
$ █

Note that the dæmon has closed standard input, output and error (FDs 0, 1 and 2). However, FD 4 is still open, and even allocated to a pipe. Obviously. this daemon is misbehaving somehow.

The answer lies in what FD 4 is: the original script uses a library that provides logging facilities to standard output, standard error, or a log file. By default, log output goes to standard output, file descriptor 1, so the library executes this when the script starts up:

exec 4>&1

This creates FD 4, and redirects it to FD 1 (the standard output). The script in question doesn't use a log file, so FD 4 remains at the default state.

When a dæmon starts up, it's customary to close all file descriptors and start a new session using the setsid() system call. This detaches it from the terminal and makes PID 1 (init) become its parent. However, gpg-agent needs the terminal so its ancillary ‘PIN’ entry programs can ask the user for a passphrase (there are GUI PIN entry programs too, but gpg-agent keeps the terminal open regardless). So, it doesn't close all file descriptors.

Why pipe the output to a log file if you expect to enter a passphrase?
I'm not sure! You'll have to ask the script's end user for that. Personally, I'd run an interactive script inside a screen(1) or tmux(1) with logging enabled, or just use script(1). I'm pretty sure a curses-based program would still work, too. But there's no accounting for taste, and maybe they were using the text-only PIN entry helper.

Solving the problem at the source is trivial, once we've found the root of the problem. We modify the script so it no longer passes FD 4 on to gpg-agent:

gpg-agent --daemon 4>/dev/null

And then it works like a charm! So, what can we learn from this?

  • There's no such thing as ‘one size fits all’! Even a thoroughly tested library can exhibit surprising behaviours.
  • Not all dæmons are created equal. The traditional Unix runtime model might not be a perfect fit for all workflows.
  • Know your diagnostic tools! /proc is a little unfashionable these days, decades after it was first introduced to Linux, but never underestimate how useful it can be!
  • There's no substitute for knowing the fundaments of how Unix works, even if you're diagnosing user-space issues.

And most importantly: even a relatively small, easily solved issue like this can be a learning experience!