Bash: why does wait return prematurely with code 145?

This problem is very strange and I cannot find any documentation about it online. In the following code snippet I am simply trying to run a bunch of subprocesses in parallel, print something when each one exits, and collect/print their exit codes at the end. Without catching SIGCHLD, things work as I would expect; however, things break when I catch the signal. Here is the code:



#!/bin/bash

#enabling job control
set -m

cmd_array=( "$@" ) #array of commands to run in parallel
cmd_count=$# #number of commands to run
cmd_idx=0; #current index of command
cmd_pids=() #array of child proc pids
trap 'echo "Child job exited"' SIGCHLD #setting up signal handler on SIGCHLD

#running jobs in parallel
while [ $cmd_idx -lt $cmd_count ]; do
cmd=${cmd_array[$cmd_idx]} #retrieving the job command as a string
eval "$cmd" &
cmd_pids[$cmd_idx]=$! #keeping track of the job pid
echo "Job #$cmd_idx launched '$cmd'"
(( cmd_idx++ ))
done

#all jobs have been launched, collecting exit codes
idx=0
for pid in "${cmd_pids[@]}"; do
wait "$pid"
child_exit_code=$?
if [ $child_exit_code -ne 0 ]; then
echo "ERROR: Job #$idx failed with return code $child_exit_code. [job_command: '${cmd_array[$idx]}']"
fi
(( idx++ ))
done


You can see that something is wrong when you run the following command:



./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"



Interestingly, as soon as the signal handler is called (when sleep 3 finishes), the wait (which is waiting on sleep 20) is interrupted immediately and returns 145. I can tell that sleep 20 is still running even after the script has finished.
I can't find any documentation about such a return code from wait. Can anyone shed some light on what is going on here?



(By the way, if I put the wait in a while loop and keep waiting while the return code is 145, I actually get the result I expect.)

  • #/!bin/bash is not a valid shebang line.
    – l0b0
    2 days ago

  • 145 = 128 + 17 => "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (Bash manual on Exit Status)
    – muru
    2 days ago

  • @l0b0, fixed the typo, which was not the problem here.
    – Markus L.
    2 days ago

  • @muru, you definitely gave me a good clue, but it doesn't explain what is happening. From what I can tell, wait returns even though the process it is waiting on is not done. So the next question is: where is this fatal signal coming from? And who is sending that "+17", and why? Still quite strange to me.
    – Markus L.
    2 days ago

  • @muru actually, scratch my previous comment; your first comment should be the accepted answer. The signal I am trapping is SIGCHLD (signal 17), which is where the 17 comes from. Mystery solved: it is actually expected and documented, as you pointed out in your link.
    – Markus L.
    2 days ago

bash shell subprocess signals wait

edited 9 hours ago

asked Nov 20 at 23:36

Markus L.


1 Answer
accepted

Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:



#!/bin/bash

set -m
trap "echo child_exit" SIGCHLD

# sleep for $1 seconds, then report it (renamed so it does not shadow the `test` builtin)
sleep_and_report() {
  sleep "$1"
  echo "'sleep $1' just returned now"
}

echo sleeping for 6 seconds in the background
sleep_and_report 6 &
pid=$!
echo sleeping for 2 second in the background
sleep_and_report 2 &
echo waiting on the 6 second sleep
wait "$pid"
echo "wait return code: $?"


If you run this you will get the following output:



linux:~$ sh test2.sh
sleeping for 6 seconds in the background
sleeping for 2 second in the background
waiting on the 6 second sleep
'sleep 2' just returned now
child_exit
wait return code: 145
linux:~$ 'sleep 6' just returned now


Explanation:



As @muru pointed out, "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (cf. Bash manual on Exit Status)
What misled me here was the word "fatal": I was looking for a command that had failed somewhere, when nothing actually did.
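The 128 + N arithmetic can be checked from the shell itself. A minimal sketch, assuming bash on Linux, where SIGCHLD is signal 17 (it has a different number elsewhere, e.g. 20 on macOS and the BSDs); `describe_status` is a hypothetical helper, not part of the original scripts:

```shell
#!/bin/bash
# Hypothetical helper: translate a status returned by `wait` into something readable.
# Statuses above 128 are reported by bash as 128 + N, where N is a signal number.
describe_status() {
  local status=$1
  if (( status > 128 )); then
    # In bash, `kill -l N` prints the name of signal N (without the SIG prefix)
    echo "signal $(( status - 128 )) ($(kill -l "$(( status - 128 ))"))"
  else
    echo "exit code $status"
  fi
}

describe_status 145   # on Linux: signal 17 (CHLD)
describe_status 3     # exit code 3
```

bash's `kill -l` is also documented to accept such an exit status directly, so `kill -l 145` should print the same signal name.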



Digging a little deeper into the Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
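Since a trapped SIGCHLD makes wait return 128 plus SIGCHLD's number, one way to flag that case is to compare against that value, looking the number up with `kill -l` instead of hardcoding 17. A sketch only; `chld_status` and `is_chld_interrupt` are hypothetical names, and a job that itself exits with that same status cannot be told apart by the status alone:

```shell
#!/bin/bash
# Look up SIGCHLD's number instead of hardcoding 17 (it differs between systems).
chld_status=$(( 128 + $(kill -l CHLD) ))   # 145 on Linux

# Hypothetical helper: succeeds if a `wait` status equals 128 + SIGCHLD.
# Caveat: a job that runs `exit 145` (on Linux) produces the same status.
is_chld_interrupt() {
  (( $1 == chld_status ))
}

wait_status=145
if is_chld_interrupt "$wait_status"; then
  echo "wait was probably interrupted by the SIGCHLD trap"
fi
```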



So here is what happens in the script above:





  1. sleep 6 starts in the background.

  2. sleep 2 starts in the background.

  3. wait starts waiting on sleep 6.

  4. sleep 2 terminates and the shell receives SIGCHLD, for which a trap is set, so wait returns immediately with 128 + 17 (SIGCHLD's number on Linux) = 145, and the trap then runs.

  5. The script exits, since it is no longer waiting.

  6. The background sleep 6 terminates, hence the "'sleep 6' just returned now" line printed after the script has already exited.
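If the goal is only to print something when each job finishes, one way to sidestep the interrupted wait entirely is to drop the SIGCHLD trap and let each background job report its own exit. A minimal sketch, assuming bash; `run_parallel` is a hypothetical name and this is not the original poster's script:

```shell
#!/bin/bash
# Sketch: run each command in the background and let each job report its own
# exit, so no SIGCHLD trap is needed and `wait "$pid"` only returns once that
# job has finished.
run_parallel() {
  local -a cmds=( "$@" ) pids=()
  local i rc failures=0

  for i in "${!cmds[@]}"; do
    (
      # Inner subshell: an `exit` inside the command cannot skip the report below
      ( eval "${cmds[$i]}" )
      rc=$?
      echo "Job #$i exited with code $rc"
      exit "$rc"
    ) &
    pids[i]=$!
    echo "Job #$i launched: '${cmds[$i]}'"
  done

  # Collect exit codes in launch order
  for i in "${!pids[@]}"; do
    wait "${pids[$i]}"
    rc=$?
    if (( rc != 0 )); then
      echo "ERROR: Job #$i failed with return code $rc. [job_command: '${cmds[$i]}']"
      failures=$(( failures + 1 ))
    fi
  done

  # Return 1 if any job failed, 0 otherwise
  (( failures == 0 ))
}

# Run only when executed directly, e.g.:
#   ./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
  run_parallel "$@"
fi
```

Because nothing traps SIGCHLD here, the manual's "trap interrupts wait" rule never applies, and job control (`set -m`) is not needed either.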

        answered 2 days ago

        Markus L.