CUDA kernel fails to launch after making simple code changes inside the kernel












-1















I have a templated CUDA kernel for calculating and setting values at the interface between 2 computational meshes. The values are calculated using 3 separate contributions, obtained from class member functions with class instances passed to the kernel. If I obtain any one of these contributions alone to set in the output the kernel works. As soon as I add 2 (or all) of these contributions to set in the output the kernel simply does not launch at all.



I've inserted the full kernel code at the end, but I'll try to exemplify the above first.



First define the first 2 contributions:



//contribution 1
VType value1 = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)) / (b_val_sec + b_val_pri);
//contribution 2
VType value2 = (Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR) / (b_val_sec + b_val_pri);


Now set output:



Case 1 - kernel launches and sets expected values:



V_pri[cell1_idx] = value1;


Case 2 - kernel launches and sets expected values:



V_pri[cell1_idx] = value2;


Case 3 - kernel does not launch:



V_pri[cell1_idx] = value1 + value2;


I am completely stumped as this seems to defy logic and would really like to understand what is happening. Has anyone encountered anything similar, or any idea what could be causing this?



I'm using CUDA 9.2 with Visual Studio 2017 and I've tested the code on GTX 980 Ti (compute 5.2) and GTX 1060 (compute 6.1) with identical results.



Here is the full kernel code:



template <typename VType, typename Class_CMBND>
__global__ void set_cmbnd_values_kernel(
cuVEC_VC<VType>& V_sec, cuVEC_VC<VType>& V_pri,
Class_CMBND& cmbndFuncs_sec, Class_CMBND& cmbndFuncs_pri,
CMBNDInfoCUDA& contact)
{

int box_idx = blockIdx.x * blockDim.x + threadIdx.x;

cuINT3 box_sizes = contact.cells_box.size();

if (box_idx < box_sizes.dim()) {

int i = (box_idx % box_sizes.x) + contact.cells_box.s.i;
int j = ((box_idx / box_sizes.x) % box_sizes.y) + contact.cells_box.s.j;
int k = (box_idx / (box_sizes.x * box_sizes.y)) + contact.cells_box.s.k;

cuReal hL = contact.hshift_secondary.norm();
cuReal hR = contact.hshift_primary.norm();
cuReal hmax = (hL > hR ? hL : hR);

int cell1_idx = i + j * V_pri.n.x + k * V_pri.n.x*V_pri.n.y;

if (V_pri.is_empty(cell1_idx) || V_pri.is_not_cmbnd(cell1_idx)) return;

int cell2_idx = (i + contact.cell_shift.i) + (j + contact.cell_shift.j) * V_pri.n.x + (k + contact.cell_shift.k) * V_pri.n.x*V_pri.n.y;

cuReal3 relpos_m1 = V_pri.rect.s - V_sec.rect.s + ((cuReal3(i, j, k) + cuReal3(0.5)) & V_pri.h) + (contact.hshift_primary + contact.hshift_secondary) / 2;

cuReal3 stencil = V_pri.h - cu_mod(contact.hshift_primary) + cu_mod(contact.hshift_secondary);

VType V_2 = V_pri[cell2_idx];
VType V_m2 = V_sec.weighted_average(relpos_m1 + contact.hshift_secondary, stencil);

//a values
VType a_val_sec = cmbndFuncs_sec.a_func_sec(relpos_m1, contact.hshift_secondary, stencil);
VType a_val_pri = cmbndFuncs_pri.a_func_pri(cell1_idx, cell2_idx, contact.hshift_secondary);

//b values adjusted with weights
cuReal b_val_sec = cmbndFuncs_sec.b_func_sec(relpos_m1, contact.hshift_secondary, stencil) * contact.weights.i;
cuReal b_val_pri = cmbndFuncs_pri.b_func_pri(cell1_idx, cell2_idx) * contact.weights.j;

//V'' values at cell positions -1 and 1
VType Vdiff2_sec = cmbndFuncs_sec.diff2_sec(relpos_m1, stencil);
VType Vdiff2_pri = cmbndFuncs_pri.diff2_pri(cell1_idx);

//Formula for V1
V_pri[cell1_idx] = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)
- Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR
+ (a_val_pri - a_val_sec) * hmax) / (b_val_sec + b_val_pri);
}
}


It's almost as if kernels with too many lines of code in them (in the above kernel there's additional code in the various functions used) fail to launch under certain conditions.










share|improve this question





























    -1















    I have a templated CUDA kernel for calculating and setting values at the interface between 2 computational meshes. The values are calculated using 3 separate contributions, obtained from class member functions with class instances passed to the kernel. If I obtain any one of these contributions alone to set in the output the kernel works. As soon as I add 2 (or all) of these contributions to set in the output the kernel simply does not launch at all.



    I've inserted the full kernel code at the end, but I'll try to exemplify the above first.



    First define the first 2 contributions:



    //contribution 1
    VType value1 = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)) / (b_val_sec + b_val_pri);
    //contribution 2
    VType value2 = (Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR) / (b_val_sec + b_val_pri);


    Now set output:



    Case 1 - kernel launches and sets expected values:



    V_pri[cell1_idx] = value1;


    Case 2 - kernel launches and sets expected values:



    V_pri[cell1_idx] = value2;


    Case 3 - kernel does not launch:



    V_pri[cell1_idx] = value1 + value2;


    I am completely stumped as this seems to defy logic and would really like to understand what is happening. Has anyone encountered anything similar, or any idea what could be causing this?



    I'm using CUDA 9.2 with Visual Studio 2017 and I've tested the code on GTX 980 Ti (compute 5.2) and GTX 1060 (compute 6.1) with identical results.



    Here is the full kernel code:



    template <typename VType, typename Class_CMBND>
    __global__ void set_cmbnd_values_kernel(
    cuVEC_VC<VType>& V_sec, cuVEC_VC<VType>& V_pri,
    Class_CMBND& cmbndFuncs_sec, Class_CMBND& cmbndFuncs_pri,
    CMBNDInfoCUDA& contact)
    {

    int box_idx = blockIdx.x * blockDim.x + threadIdx.x;

    cuINT3 box_sizes = contact.cells_box.size();

    if (box_idx < box_sizes.dim()) {

    int i = (box_idx % box_sizes.x) + contact.cells_box.s.i;
    int j = ((box_idx / box_sizes.x) % box_sizes.y) + contact.cells_box.s.j;
    int k = (box_idx / (box_sizes.x * box_sizes.y)) + contact.cells_box.s.k;

    cuReal hL = contact.hshift_secondary.norm();
    cuReal hR = contact.hshift_primary.norm();
    cuReal hmax = (hL > hR ? hL : hR);

    int cell1_idx = i + j * V_pri.n.x + k * V_pri.n.x*V_pri.n.y;

    if (V_pri.is_empty(cell1_idx) || V_pri.is_not_cmbnd(cell1_idx)) return;

    int cell2_idx = (i + contact.cell_shift.i) + (j + contact.cell_shift.j) * V_pri.n.x + (k + contact.cell_shift.k) * V_pri.n.x*V_pri.n.y;

    cuReal3 relpos_m1 = V_pri.rect.s - V_sec.rect.s + ((cuReal3(i, j, k) + cuReal3(0.5)) & V_pri.h) + (contact.hshift_primary + contact.hshift_secondary) / 2;

    cuReal3 stencil = V_pri.h - cu_mod(contact.hshift_primary) + cu_mod(contact.hshift_secondary);

    VType V_2 = V_pri[cell2_idx];
    VType V_m2 = V_sec.weighted_average(relpos_m1 + contact.hshift_secondary, stencil);

    //a values
    VType a_val_sec = cmbndFuncs_sec.a_func_sec(relpos_m1, contact.hshift_secondary, stencil);
    VType a_val_pri = cmbndFuncs_pri.a_func_pri(cell1_idx, cell2_idx, contact.hshift_secondary);

    //b values adjusted with weights
    cuReal b_val_sec = cmbndFuncs_sec.b_func_sec(relpos_m1, contact.hshift_secondary, stencil) * contact.weights.i;
    cuReal b_val_pri = cmbndFuncs_pri.b_func_pri(cell1_idx, cell2_idx) * contact.weights.j;

    //V'' values at cell positions -1 and 1
    VType Vdiff2_sec = cmbndFuncs_sec.diff2_sec(relpos_m1, stencil);
    VType Vdiff2_pri = cmbndFuncs_pri.diff2_pri(cell1_idx);

    //Formula for V1
    V_pri[cell1_idx] = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)
    - Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR
    + (a_val_pri - a_val_sec) * hmax) / (b_val_sec + b_val_pri);
    }
    }


    It's almost as if kernels with too many lines of code in them (in the above kernel there's additional code in the various functions used) fail to launch under certain conditions.










    share|improve this question



























      -1












      -1








      -1








      I have a templated CUDA kernel for calculating and setting values at the interface between 2 computational meshes. The values are calculated using 3 separate contributions, obtained from class member functions with class instances passed to the kernel. If I obtain any one of these contributions alone to set in the output the kernel works. As soon as I add 2 (or all) of these contributions to set in the output the kernel simply does not launch at all.



      I've inserted the full kernel code at the end, but I'll try to exemplify the above first.



      First define the first 2 contributions:



      //contribution 1
      VType value1 = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)) / (b_val_sec + b_val_pri);
      //contribution 2
      VType value2 = (Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR) / (b_val_sec + b_val_pri);


      Now set output:



      Case 1 - kernel launches and sets expected values:



      V_pri[cell1_idx] = value1;


      Case 2 - kernel launches and sets expected values:



      V_pri[cell1_idx] = value2;


      Case 3 - kernel does not launch:



      V_pri[cell1_idx] = value1 + value2;


      I am completely stumped as this seems to defy logic and would really like to understand what is happening. Has anyone encountered anything similar, or any idea what could be causing this?



      I'm using CUDA 9.2 with Visual Studio 2017 and I've tested the code on GTX 980 Ti (compute 5.2) and GTX 1060 (compute 6.1) with identical results.



      Here is the full kernel code:



      template <typename VType, typename Class_CMBND>
      __global__ void set_cmbnd_values_kernel(
      cuVEC_VC<VType>& V_sec, cuVEC_VC<VType>& V_pri,
      Class_CMBND& cmbndFuncs_sec, Class_CMBND& cmbndFuncs_pri,
      CMBNDInfoCUDA& contact)
      {

      int box_idx = blockIdx.x * blockDim.x + threadIdx.x;

      cuINT3 box_sizes = contact.cells_box.size();

      if (box_idx < box_sizes.dim()) {

      int i = (box_idx % box_sizes.x) + contact.cells_box.s.i;
      int j = ((box_idx / box_sizes.x) % box_sizes.y) + contact.cells_box.s.j;
      int k = (box_idx / (box_sizes.x * box_sizes.y)) + contact.cells_box.s.k;

      cuReal hL = contact.hshift_secondary.norm();
      cuReal hR = contact.hshift_primary.norm();
      cuReal hmax = (hL > hR ? hL : hR);

      int cell1_idx = i + j * V_pri.n.x + k * V_pri.n.x*V_pri.n.y;

      if (V_pri.is_empty(cell1_idx) || V_pri.is_not_cmbnd(cell1_idx)) return;

      int cell2_idx = (i + contact.cell_shift.i) + (j + contact.cell_shift.j) * V_pri.n.x + (k + contact.cell_shift.k) * V_pri.n.x*V_pri.n.y;

      cuReal3 relpos_m1 = V_pri.rect.s - V_sec.rect.s + ((cuReal3(i, j, k) + cuReal3(0.5)) & V_pri.h) + (contact.hshift_primary + contact.hshift_secondary) / 2;

      cuReal3 stencil = V_pri.h - cu_mod(contact.hshift_primary) + cu_mod(contact.hshift_secondary);

      VType V_2 = V_pri[cell2_idx];
      VType V_m2 = V_sec.weighted_average(relpos_m1 + contact.hshift_secondary, stencil);

      //a values
      VType a_val_sec = cmbndFuncs_sec.a_func_sec(relpos_m1, contact.hshift_secondary, stencil);
      VType a_val_pri = cmbndFuncs_pri.a_func_pri(cell1_idx, cell2_idx, contact.hshift_secondary);

      //b values adjusted with weights
      cuReal b_val_sec = cmbndFuncs_sec.b_func_sec(relpos_m1, contact.hshift_secondary, stencil) * contact.weights.i;
      cuReal b_val_pri = cmbndFuncs_pri.b_func_pri(cell1_idx, cell2_idx) * contact.weights.j;

      //V'' values at cell positions -1 and 1
      VType Vdiff2_sec = cmbndFuncs_sec.diff2_sec(relpos_m1, stencil);
      VType Vdiff2_pri = cmbndFuncs_pri.diff2_pri(cell1_idx);

      //Formula for V1
      V_pri[cell1_idx] = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)
      - Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR
      + (a_val_pri - a_val_sec) * hmax) / (b_val_sec + b_val_pri);
      }
      }


      It's almost as if kernels with too many lines of code in them (in the above kernel there's additional code in the various functions used) fail to launch under certain conditions.










      share|improve this question
















      I have a templated CUDA kernel for calculating and setting values at the interface between 2 computational meshes. The values are calculated using 3 separate contributions, obtained from class member functions with class instances passed to the kernel. If I obtain any one of these contributions alone to set in the output the kernel works. As soon as I add 2 (or all) of these contributions to set in the output the kernel simply does not launch at all.



      I've inserted the full kernel code at the end, but I'll try to exemplify the above first.



      First define the first 2 contributions:



      //contribution 1
      VType value1 = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)) / (b_val_sec + b_val_pri);
      //contribution 2
      VType value2 = (Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR) / (b_val_sec + b_val_pri);


      Now set output:



      Case 1 - kernel launches and sets expected values:



      V_pri[cell1_idx] = value1;


      Case 2 - kernel launches and sets expected values:



      V_pri[cell1_idx] = value2;


      Case 3 - kernel does not launch:



      V_pri[cell1_idx] = value1 + value2;


      I am completely stumped as this seems to defy logic and would really like to understand what is happening. Has anyone encountered anything similar, or any idea what could be causing this?



      I'm using CUDA 9.2 with Visual Studio 2017 and I've tested the code on GTX 980 Ti (compute 5.2) and GTX 1060 (compute 6.1) with identical results.



      Here is the full kernel code:



      template <typename VType, typename Class_CMBND>
      __global__ void set_cmbnd_values_kernel(
      cuVEC_VC<VType>& V_sec, cuVEC_VC<VType>& V_pri,
      Class_CMBND& cmbndFuncs_sec, Class_CMBND& cmbndFuncs_pri,
      CMBNDInfoCUDA& contact)
      {

      int box_idx = blockIdx.x * blockDim.x + threadIdx.x;

      cuINT3 box_sizes = contact.cells_box.size();

      if (box_idx < box_sizes.dim()) {

      int i = (box_idx % box_sizes.x) + contact.cells_box.s.i;
      int j = ((box_idx / box_sizes.x) % box_sizes.y) + contact.cells_box.s.j;
      int k = (box_idx / (box_sizes.x * box_sizes.y)) + contact.cells_box.s.k;

      cuReal hL = contact.hshift_secondary.norm();
      cuReal hR = contact.hshift_primary.norm();
      cuReal hmax = (hL > hR ? hL : hR);

      int cell1_idx = i + j * V_pri.n.x + k * V_pri.n.x*V_pri.n.y;

      if (V_pri.is_empty(cell1_idx) || V_pri.is_not_cmbnd(cell1_idx)) return;

      int cell2_idx = (i + contact.cell_shift.i) + (j + contact.cell_shift.j) * V_pri.n.x + (k + contact.cell_shift.k) * V_pri.n.x*V_pri.n.y;

      cuReal3 relpos_m1 = V_pri.rect.s - V_sec.rect.s + ((cuReal3(i, j, k) + cuReal3(0.5)) & V_pri.h) + (contact.hshift_primary + contact.hshift_secondary) / 2;

      cuReal3 stencil = V_pri.h - cu_mod(contact.hshift_primary) + cu_mod(contact.hshift_secondary);

      VType V_2 = V_pri[cell2_idx];
      VType V_m2 = V_sec.weighted_average(relpos_m1 + contact.hshift_secondary, stencil);

      //a values
      VType a_val_sec = cmbndFuncs_sec.a_func_sec(relpos_m1, contact.hshift_secondary, stencil);
      VType a_val_pri = cmbndFuncs_pri.a_func_pri(cell1_idx, cell2_idx, contact.hshift_secondary);

      //b values adjusted with weights
      cuReal b_val_sec = cmbndFuncs_sec.b_func_sec(relpos_m1, contact.hshift_secondary, stencil) * contact.weights.i;
      cuReal b_val_pri = cmbndFuncs_pri.b_func_pri(cell1_idx, cell2_idx) * contact.weights.j;

      //V'' values at cell positions -1 and 1
      VType Vdiff2_sec = cmbndFuncs_sec.diff2_sec(relpos_m1, stencil);
      VType Vdiff2_pri = cmbndFuncs_pri.diff2_pri(cell1_idx);

      //Formula for V1
      V_pri[cell1_idx] = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)
      - Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR
      + (a_val_pri - a_val_sec) * hmax) / (b_val_sec + b_val_pri);
      }
      }


      It's almost as if kernels with too many lines of code in them (in the above kernel there's additional code in the various functions used) fail to launch under certain conditions.







      cuda scientific-computing






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 24 '18 at 8:47







      qshn

















      asked Nov 23 '18 at 20:44









      qshnqshn

      61




      61
























          1 Answer
          1






          active

          oldest

          votes


















          0














          Right, seems I found the answer to my problem.



          Looking at the generated errors I get "Too many resources requested for launch".



          I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.






          share|improve this answer























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53452753%2fcuda-kernel-fails-to-launch-after-making-simple-code-changes-inside-the-kernel%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            0














            Right, seems I found the answer to my problem.



            Looking at the generated errors I get "Too many resources requested for launch".



            I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.






            share|improve this answer




























              0














              Right, seems I found the answer to my problem.



              Looking at the generated errors I get "Too many resources requested for launch".



              I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.






              share|improve this answer


























                0












                0








                0







                Right, seems I found the answer to my problem.



                Looking at the generated errors I get "Too many resources requested for launch".



                I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.






                share|improve this answer













                Right, seems I found the answer to my problem.



                Looking at the generated errors I get "Too many resources requested for launch".



                I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 24 '18 at 8:50









                qshnqshn

                61




                61






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53452753%2fcuda-kernel-fails-to-launch-after-making-simple-code-changes-inside-the-kernel%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Berounka

                    Different font size/position of beamer's navigation symbols template's content depending on regular/plain...

                    Sphinx de Gizeh