Wait channels as debugging aid for threading bugs

Aug 14, 2015
Debugging
Eugene Zaikonnikov

..In every system programmer’s career inevitably comes a moment when they encounter a non-obvious, asymptomatic, poorly reproducible bug in critical software running on remote site without remote debug instrumentation.

Well there was a symptom. The device would put up a diagnostic message amounting to “oops” and stop responding to stimuli. The log file would stop growing while the generating thread is still around. The process would not react to SIGQUIT (denying me a core dump). In other words, a very good hint of a deadlock.

To top that, the issue would appear only on few devices out of a hundred, and after running for at least few weeks.

In cases like this, however, there’s still a great tool at disposal: wchan (wait channel) of procfs. Per the manpage:

 (35) wchan  %lu
      This  is the "channel" in which the process is waiting.  It is the address
	  of a location in the kernel where the process is sleeping.  The corresponding
	  symbolic name can be found in /proc/[pid]/wchan.

Each process has also associated tasks, in this case threads, found in /proc/[pid]/tasks/, which is again a list of PIDs with a wchan field associated. I start with skimming through the application threads on an affected system, looking for something out of ordinary.

The application is structured around a central SYSV message queue, where a state machine dispatches the messages and notifies the subscribed functional modules. Some of the modules are separate threads, but a number of simpler tasks are executed directly in the dispatcher thread.

Looking at the one of the threads reveals it is blocked in do_msgsnd(): the kernel call for sending SysV IPC message:

 # cat /proc/306/task/311/wchan 
 do_msgsnd

Now, sending a message can block in one case only: the queue is full. Again, we can check that via procfs.

 # cat /proc/sysvipc/msg
       key      msqid perms      cbytes       qnum lspid lrpid 
    151273          0     0       16384        128   306   306

Yep, qnum is at 128, the limit on the system, so the queue is full. The only place that consumes the queue is the dispatcher thread, and the most likely way for that part to lock up is for notification loop to block on some call.

After checking the wchan value of the dispatcher thread versus one on a healthy system, the likely offender is found:

 # cat /proc/306/task/317/wchan 
 unix_wait_for_peer

This is the kernel call handling the opening of Unix socket. The only place where communication over AF_UNIX happens is the code forwarding system status over to snmpd. Upon examination it turns out cough the socket is being opened and closed for every message session. Normally this shouldn’t be a problem, but the system runs with pretty ancient kernel, affected by certain bugs like CVE-2009-3621, and this is just not a good practice in general.

The conclusion is, there is a way out of most desperate situations. If you think you are stuck, remember it’s nothing compared to what Apollo 13 crew had to go through.