This is a problem found during my work in some company. I would like to share it and give some solutions or debug processes.
The Problem
The original problem is that a complex testing framework hangs for no reason. In my first step for debugging, I immedialy checked the process tree using pstree -ap <pid>
. It shows that the process hang on its ssh
subprocess. The command is something like ssh <username>@<ip> command
and is just a normal ssh execution command. It should never go on forever, since this command usually needs only 5 seconds.
It’s hard, but I finally check the code and found the Python script for making this subprocess. The code is as follows.
import subprocess
# ... code ...
try:
proc = subprocess.Popen(["ssh", "<username>@<ip>", "<command>"],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT)
proc.wait()
output = proc.stdout.read()
retcode = proc.returncode
return retcode, output
except:
# ... code ...
I tested the code on my machine, and it worked fine! How is it possible! Why will it hang there? Especially since the process is not over, it hangs at the wait
call. Why is that?
Debugging
Well it’s easy to guess that the problem in in the output stream. My first guess is that it needs to wait for all output in the PIPE
buffer to be cleared, but since the code works on other places, this seems to be implausible. I decided to check explicitly for the buffer.
Fortunately, we have /dev/urandom
to test for outputs. For controlling the output size, I used both head
and dd
(for controlling output more precisely). Let’s write a code and test it. Note that Python 3 is used for this test. Both Linux and Windows with bash can be used for this test.
import subprocess
block_size = 65536
proc = subprocess.Popen(["bash", "-c", f"dd if=/dev/urandom bs={block_size} count=1 2>/dev/null"],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT)
proc.wait()
output = proc.stdout.read()
retcode = proc.returncode
print(len(output))
If we run this test, we can see 65536
printed as expected. However, let’s change block_size
to 65537, or any value above that. Now we have a program that will never stop running! Now we can say that the buffer for subprocess.PIPE
is exactly 65536. Therefore, we now know why the problem appears. The process is waiting for the buffer to clear up and only after it can stop running. Also, as long as the total amount of received bytes exceeds 65536, the program will hang, even if block size is so small and count is large, i.e., dd if=/dev/urandom bs=1 count=65537 2>/dev/null
will hang as well.
Solution
The solution is so easy that I am surprised. Just swapping two lines and the problem is solved:
import subprocess
block_size = 65536
proc = subprocess.Popen(["bash", "-c", f"dd if=/dev/urandom bs={block_size} count=1 2>/dev/null"],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT)
output = proc.stdout.read()
proc.wait() # put it here!
retcode = proc.returncode
print(len(output))
Now, the program will first read all the bytes until the end, and then wait for the program to exit instead of waiting for someone to take away the output from the buffer! It’s a hard problem with an easy solution. I hope it will never happen again.
Testing is hard, but it’s even harder to test a testing framework!
Comment below with different methods!
GitHub Issues