Bash: why wait returns prematurely with code 145
up vote
1
down vote
favorite
This problem is very strange and I cannot find any documentation about this online. In the following code snippet I am merely trying to run a bunch of sub-processes in parallel, printing something when they exit and collect/print their exit code at the end. I find that without catching SIGCHLD things work as I would expect however, things break when I catch the signal. Here is the code:
#!/bin/bash
#enabling job control
set -m
cmd_array=( "$@" ) #array of commands to run in parallel
cmd_count=$# #number of commands to run
cmd_idx=0; #current index of command
cmd_pids=() #array of child proc pids
trap 'echo "Child job existed"' SIGCHLD #setting up signal handler on SIGCHLD
#running jobs in parallel
while [ $cmd_idx -lt $cmd_count ]; do
cmd=${cmd_array[$cmd_idx]} #retreiving the job command as a string
eval "$cmd" &
cmd_pids[$cmd_idx]=$! #keeping track of the job pid
echo "Job #$cmd_idx launched '$cmd']"
(( cmd_idx++ ))
done
#all jobs have been launched, collecting exit codes
idx=0
for pid in "${cmd_pids[@]}"; do
wait $pid
child_exit_code=$?
if [ $child_exit_code -ne 0 ]; then
echo "ERROR: Job #$idx failed with return code $child_exit_code. [job_command: '${cmd_array[$idx]}']"
fi
(( idx++ ))
done
You can tell something is wrong when you try to run this the following command:
./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"
The interesting thing here is that you can tell as soon as the signal handler is called (when sleep 3 is done), the wait (which is waiting on sleep 20) is interrupted right away with a return code 145. I can tell the sleep 20 is still running even after the script is done.
I can't find any documentation about such a return code from wait. Can anyone shed some light as to what is going on here?
(By the way if I add a while loop when I wait and keep on waiting while the return code is 145, I actually get the result I expect)
bash shell subprocess signals wait
|
show 2 more comments
up vote
1
down vote
favorite
This problem is very strange and I cannot find any documentation about this online. In the following code snippet I am merely trying to run a bunch of sub-processes in parallel, printing something when they exit and collect/print their exit code at the end. I find that without catching SIGCHLD things work as I would expect however, things break when I catch the signal. Here is the code:
#!/bin/bash
#enabling job control
set -m
cmd_array=( "$@" ) #array of commands to run in parallel
cmd_count=$# #number of commands to run
cmd_idx=0; #current index of command
cmd_pids=() #array of child proc pids
trap 'echo "Child job existed"' SIGCHLD #setting up signal handler on SIGCHLD
#running jobs in parallel
while [ $cmd_idx -lt $cmd_count ]; do
cmd=${cmd_array[$cmd_idx]} #retreiving the job command as a string
eval "$cmd" &
cmd_pids[$cmd_idx]=$! #keeping track of the job pid
echo "Job #$cmd_idx launched '$cmd']"
(( cmd_idx++ ))
done
#all jobs have been launched, collecting exit codes
idx=0
for pid in "${cmd_pids[@]}"; do
wait $pid
child_exit_code=$?
if [ $child_exit_code -ne 0 ]; then
echo "ERROR: Job #$idx failed with return code $child_exit_code. [job_command: '${cmd_array[$idx]}']"
fi
(( idx++ ))
done
You can tell something is wrong when you try to run this the following command:
./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"
The interesting thing here is that you can tell as soon as the signal handler is called (when sleep 3 is done), the wait (which is waiting on sleep 20) is interrupted right away with a return code 145. I can tell the sleep 20 is still running even after the script is done.
I can't find any documentation about such a return code from wait. Can anyone shed some light as to what is going on here?
(By the way if I add a while loop when I wait and keep on waiting while the return code is 145, I actually get the result I expect)
bash shell subprocess signals wait
2
#/!bin/bashis not a valid shebang line.
– l0b0
2 days ago
3
145 = 128 + 17 => "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (Bash manual on Exit Status)
– muru
2 days ago
@l0b0, fixed the typo, which was not a the problem here.
– Markus L.
2 days ago
@muru, you definitely gave me a good clue here but it doesn't help explaining what happens. From what I can tell wait returns even though the process it is waiting on is not done. So the next question is, where is this fatal signal coming from? And who is sending that "+17" and why? Still quite strange to me.
– Markus L.
2 days ago
1
@muru actually scratch my previous comment, your first comment should be the accepted answer. The signal I am trapping is SIGCHLD (meaning 17). This is where the 17 comes from. Mister y solved, it is actually expected and documented as you pointed out in your link.
– Markus L.
2 days ago
|
show 2 more comments
up vote
1
down vote
favorite
up vote
1
down vote
favorite
This problem is very strange and I cannot find any documentation about this online. In the following code snippet I am merely trying to run a bunch of sub-processes in parallel, printing something when they exit and collect/print their exit code at the end. I find that without catching SIGCHLD things work as I would expect however, things break when I catch the signal. Here is the code:
#!/bin/bash
#enabling job control
set -m
cmd_array=( "$@" ) #array of commands to run in parallel
cmd_count=$# #number of commands to run
cmd_idx=0; #current index of command
cmd_pids=() #array of child proc pids
trap 'echo "Child job existed"' SIGCHLD #setting up signal handler on SIGCHLD
#running jobs in parallel
while [ $cmd_idx -lt $cmd_count ]; do
cmd=${cmd_array[$cmd_idx]} #retreiving the job command as a string
eval "$cmd" &
cmd_pids[$cmd_idx]=$! #keeping track of the job pid
echo "Job #$cmd_idx launched '$cmd']"
(( cmd_idx++ ))
done
#all jobs have been launched, collecting exit codes
idx=0
for pid in "${cmd_pids[@]}"; do
wait $pid
child_exit_code=$?
if [ $child_exit_code -ne 0 ]; then
echo "ERROR: Job #$idx failed with return code $child_exit_code. [job_command: '${cmd_array[$idx]}']"
fi
(( idx++ ))
done
You can tell something is wrong when you try to run this the following command:
./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"
The interesting thing here is that you can tell as soon as the signal handler is called (when sleep 3 is done), the wait (which is waiting on sleep 20) is interrupted right away with a return code 145. I can tell the sleep 20 is still running even after the script is done.
I can't find any documentation about such a return code from wait. Can anyone shed some light as to what is going on here?
(By the way if I add a while loop when I wait and keep on waiting while the return code is 145, I actually get the result I expect)
bash shell subprocess signals wait
This problem is very strange and I cannot find any documentation about this online. In the following code snippet I am merely trying to run a bunch of sub-processes in parallel, printing something when they exit and collect/print their exit code at the end. I find that without catching SIGCHLD things work as I would expect however, things break when I catch the signal. Here is the code:
#!/bin/bash
#enabling job control
set -m
cmd_array=( "$@" ) #array of commands to run in parallel
cmd_count=$# #number of commands to run
cmd_idx=0; #current index of command
cmd_pids=() #array of child proc pids
trap 'echo "Child job existed"' SIGCHLD #setting up signal handler on SIGCHLD
#running jobs in parallel
while [ $cmd_idx -lt $cmd_count ]; do
cmd=${cmd_array[$cmd_idx]} #retreiving the job command as a string
eval "$cmd" &
cmd_pids[$cmd_idx]=$! #keeping track of the job pid
echo "Job #$cmd_idx launched '$cmd']"
(( cmd_idx++ ))
done
#all jobs have been launched, collecting exit codes
idx=0
for pid in "${cmd_pids[@]}"; do
wait $pid
child_exit_code=$?
if [ $child_exit_code -ne 0 ]; then
echo "ERROR: Job #$idx failed with return code $child_exit_code. [job_command: '${cmd_array[$idx]}']"
fi
(( idx++ ))
done
You can tell something is wrong when you try to run this the following command:
./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"
The interesting thing here is that you can tell as soon as the signal handler is called (when sleep 3 is done), the wait (which is waiting on sleep 20) is interrupted right away with a return code 145. I can tell the sleep 20 is still running even after the script is done.
I can't find any documentation about such a return code from wait. Can anyone shed some light as to what is going on here?
(By the way if I add a while loop when I wait and keep on waiting while the return code is 145, I actually get the result I expect)
bash shell subprocess signals wait
bash shell subprocess signals wait
edited 9 hours ago
asked Nov 20 at 23:36
Markus L.
16319
16319
2
#/!bin/bashis not a valid shebang line.
– l0b0
2 days ago
3
145 = 128 + 17 => "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (Bash manual on Exit Status)
– muru
2 days ago
@l0b0, fixed the typo, which was not a the problem here.
– Markus L.
2 days ago
@muru, you definitely gave me a good clue here but it doesn't help explaining what happens. From what I can tell wait returns even though the process it is waiting on is not done. So the next question is, where is this fatal signal coming from? And who is sending that "+17" and why? Still quite strange to me.
– Markus L.
2 days ago
1
@muru actually scratch my previous comment, your first comment should be the accepted answer. The signal I am trapping is SIGCHLD (meaning 17). This is where the 17 comes from. Mister y solved, it is actually expected and documented as you pointed out in your link.
– Markus L.
2 days ago
|
show 2 more comments
2
#/!bin/bashis not a valid shebang line.
– l0b0
2 days ago
3
145 = 128 + 17 => "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (Bash manual on Exit Status)
– muru
2 days ago
@l0b0, fixed the typo, which was not a the problem here.
– Markus L.
2 days ago
@muru, you definitely gave me a good clue here but it doesn't help explaining what happens. From what I can tell wait returns even though the process it is waiting on is not done. So the next question is, where is this fatal signal coming from? And who is sending that "+17" and why? Still quite strange to me.
– Markus L.
2 days ago
1
@muru actually scratch my previous comment, your first comment should be the accepted answer. The signal I am trapping is SIGCHLD (meaning 17). This is where the 17 comes from. Mister y solved, it is actually expected and documented as you pointed out in your link.
– Markus L.
2 days ago
2
2
#/!bin/bash is not a valid shebang line.– l0b0
2 days ago
#/!bin/bash is not a valid shebang line.– l0b0
2 days ago
3
3
145 = 128 + 17 => "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (Bash manual on Exit Status)
– muru
2 days ago
145 = 128 + 17 => "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (Bash manual on Exit Status)
– muru
2 days ago
@l0b0, fixed the typo, which was not a the problem here.
– Markus L.
2 days ago
@l0b0, fixed the typo, which was not a the problem here.
– Markus L.
2 days ago
@muru, you definitely gave me a good clue here but it doesn't help explaining what happens. From what I can tell wait returns even though the process it is waiting on is not done. So the next question is, where is this fatal signal coming from? And who is sending that "+17" and why? Still quite strange to me.
– Markus L.
2 days ago
@muru, you definitely gave me a good clue here but it doesn't help explaining what happens. From what I can tell wait returns even though the process it is waiting on is not done. So the next question is, where is this fatal signal coming from? And who is sending that "+17" and why? Still quite strange to me.
– Markus L.
2 days ago
1
1
@muru actually scratch my previous comment, your first comment should be the accepted answer. The signal I am trapping is SIGCHLD (meaning 17). This is where the 17 comes from. Mister y solved, it is actually expected and documented as you pointed out in your link.
– Markus L.
2 days ago
@muru actually scratch my previous comment, your first comment should be the accepted answer. The signal I am trapping is SIGCHLD (meaning 17). This is where the 17 comes from. Mister y solved, it is actually expected and documented as you pointed out in your link.
– Markus L.
2 days ago
|
show 2 more comments
1 Answer
1
active
oldest
votes
up vote
0
down vote
accepted
Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:
#!/bin/bash
set -m
trap "echo child_exit" SIGCHLD
function test() {
sleep $1
echo "'sleep $1' just returned now"
}
echo sleeping for 6 seconds in the background
test 6 &
pid=$!
echo sleeping for 2 second in the background
test 2 &
echo waiting on the 6 second sleep
wait $pid
echo "wait return code: $?"
If you run this you will get the following output:
linux:~$ sh test2.sh
sleeping for 6 seconds in the background
sleeping for 2 second in the background
waiting on the 6 second sleep
'sleep 2' just returned now
child_exit
wait return code: 145
lunux:~$ 'sleep 6' just returned now
Explanation:
As @muru pointed out "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (c.f. Bash manual on Exit Status).
Now what mislead me here is the "fatal" signal. I was looking for a command to fail somewhere when nothing did.
Digging a little deeper in Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
So there you have it, what happens in the script above is the following:
sleep 6starts in the background
sleep 3starts in the background
waitstarts waiting onsleep 6
sleep 3terminates and the SIGCHLD trap if fired interrupting wait, which returns 128 + SIGCHLD = 145- my script exits since it does not wait anymore
- the background
sleep 6terminates hence the "'sleep 6' just returned now" after the script already exited
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
accepted
Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:
#!/bin/bash
set -m
trap "echo child_exit" SIGCHLD
function test() {
sleep $1
echo "'sleep $1' just returned now"
}
echo sleeping for 6 seconds in the background
test 6 &
pid=$!
echo sleeping for 2 second in the background
test 2 &
echo waiting on the 6 second sleep
wait $pid
echo "wait return code: $?"
If you run this you will get the following output:
linux:~$ sh test2.sh
sleeping for 6 seconds in the background
sleeping for 2 second in the background
waiting on the 6 second sleep
'sleep 2' just returned now
child_exit
wait return code: 145
lunux:~$ 'sleep 6' just returned now
Explanation:
As @muru pointed out "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (c.f. Bash manual on Exit Status).
Now what mislead me here is the "fatal" signal. I was looking for a command to fail somewhere when nothing did.
Digging a little deeper in Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
So there you have it, what happens in the script above is the following:
sleep 6starts in the background
sleep 3starts in the background
waitstarts waiting onsleep 6
sleep 3terminates and the SIGCHLD trap if fired interrupting wait, which returns 128 + SIGCHLD = 145- my script exits since it does not wait anymore
- the background
sleep 6terminates hence the "'sleep 6' just returned now" after the script already exited
add a comment |
up vote
0
down vote
accepted
Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:
#!/bin/bash
set -m
trap "echo child_exit" SIGCHLD
function test() {
sleep $1
echo "'sleep $1' just returned now"
}
echo sleeping for 6 seconds in the background
test 6 &
pid=$!
echo sleeping for 2 second in the background
test 2 &
echo waiting on the 6 second sleep
wait $pid
echo "wait return code: $?"
If you run this you will get the following output:
linux:~$ sh test2.sh
sleeping for 6 seconds in the background
sleeping for 2 second in the background
waiting on the 6 second sleep
'sleep 2' just returned now
child_exit
wait return code: 145
lunux:~$ 'sleep 6' just returned now
Explanation:
As @muru pointed out "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (c.f. Bash manual on Exit Status).
Now what mislead me here is the "fatal" signal. I was looking for a command to fail somewhere when nothing did.
Digging a little deeper in Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
So there you have it, what happens in the script above is the following:
sleep 6starts in the background
sleep 3starts in the background
waitstarts waiting onsleep 6
sleep 3terminates and the SIGCHLD trap if fired interrupting wait, which returns 128 + SIGCHLD = 145- my script exits since it does not wait anymore
- the background
sleep 6terminates hence the "'sleep 6' just returned now" after the script already exited
add a comment |
up vote
0
down vote
accepted
up vote
0
down vote
accepted
Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:
#!/bin/bash
set -m
trap "echo child_exit" SIGCHLD
function test() {
sleep $1
echo "'sleep $1' just returned now"
}
echo sleeping for 6 seconds in the background
test 6 &
pid=$!
echo sleeping for 2 second in the background
test 2 &
echo waiting on the 6 second sleep
wait $pid
echo "wait return code: $?"
If you run this you will get the following output:
linux:~$ sh test2.sh
sleeping for 6 seconds in the background
sleeping for 2 second in the background
waiting on the 6 second sleep
'sleep 2' just returned now
child_exit
wait return code: 145
lunux:~$ 'sleep 6' just returned now
Explanation:
As @muru pointed out "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (c.f. Bash manual on Exit Status).
Now what mislead me here is the "fatal" signal. I was looking for a command to fail somewhere when nothing did.
Digging a little deeper in Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
So there you have it, what happens in the script above is the following:
sleep 6starts in the background
sleep 3starts in the background
waitstarts waiting onsleep 6
sleep 3terminates and the SIGCHLD trap if fired interrupting wait, which returns 128 + SIGCHLD = 145- my script exits since it does not wait anymore
- the background
sleep 6terminates hence the "'sleep 6' just returned now" after the script already exited
Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:
#!/bin/bash
set -m
trap "echo child_exit" SIGCHLD
function test() {
sleep $1
echo "'sleep $1' just returned now"
}
echo sleeping for 6 seconds in the background
test 6 &
pid=$!
echo sleeping for 2 second in the background
test 2 &
echo waiting on the 6 second sleep
wait $pid
echo "wait return code: $?"
If you run this you will get the following output:
linux:~$ sh test2.sh
sleeping for 6 seconds in the background
sleeping for 2 second in the background
waiting on the 6 second sleep
'sleep 2' just returned now
child_exit
wait return code: 145
lunux:~$ 'sleep 6' just returned now
Explanation:
As @muru pointed out "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (c.f. Bash manual on Exit Status).
Now what mislead me here is the "fatal" signal. I was looking for a command to fail somewhere when nothing did.
Digging a little deeper in Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
So there you have it, what happens in the script above is the following:
sleep 6starts in the background
sleep 3starts in the background
waitstarts waiting onsleep 6
sleep 3terminates and the SIGCHLD trap if fired interrupting wait, which returns 128 + SIGCHLD = 145- my script exits since it does not wait anymore
- the background
sleep 6terminates hence the "'sleep 6' just returned now" after the script already exited
answered 2 days ago
Markus L.
16319
16319
add a comment |
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53403220%2fbash-why-wait-returns-prematurely-with-code-145%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
#/!bin/bashis not a valid shebang line.– l0b0
2 days ago
3
145 = 128 + 17 => "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (Bash manual on Exit Status)
– muru
2 days ago
@l0b0, fixed the typo, which was not a the problem here.
– Markus L.
2 days ago
@muru, you definitely gave me a good clue here but it doesn't help explaining what happens. From what I can tell wait returns even though the process it is waiting on is not done. So the next question is, where is this fatal signal coming from? And who is sending that "+17" and why? Still quite strange to me.
– Markus L.
2 days ago
1
@muru actually scratch my previous comment, your first comment should be the accepted answer. The signal I am trapping is SIGCHLD (meaning 17). This is where the 17 comes from. Mister y solved, it is actually expected and documented as you pointed out in your link.
– Markus L.
2 days ago