What is happening during the “down time” when Spark is reading in big data sets on S3?











I have a bunch of JSON data in AWS S3 - let's say 100k files, each around 5MB - and I'm using Spark 2.2's DataFrameReader to read and process them via:



sparkSession.read.json(...)



I've found that Spark will just sort of hang for 5 minutes or so before beginning the computation. This can take hours for larger data sets. When I say "hang" I mean that the console progress display, which shows what stage the cluster is working on and how far along it is, doesn't appear; as far as I can tell Spark is somehow in between stages.




What is Spark doing during this period, and how can I help it go faster?




I had two ideas, but both of them appear to be wrong.



My first idea was that Spark is attempting to list all of the files that it will need to do the computation. I tested this by actually creating a list of files offline and feeding them to Spark directly rather than using glob syntax:



val fileList = loadFiles()
sparkSession.read.json(fileList:_*)



This actually caused the "hanging" period to last longer!
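The body of `loadFiles()` is not shown above. A minimal sketch of such a helper, assuming the offline listing was saved to a local text file with one path per line (the manifest file and its format are assumptions made for illustration):

```scala
import scala.io.Source

// Hypothetical stand-in for loadFiles(): reads one path per line from a
// local manifest file, trimming whitespace and skipping blank lines.
def loadFiles(manifestPath: String): Seq[String] = {
  val source = Source.fromFile(manifestPath, "UTF-8")
  // Materialize the lines into a List before the file is closed.
  try source.getLines().map(_.trim).filter(_.nonEmpty).toList
  finally source.close()
}

// Usage with Spark (not run here; "manifest.txt" is a made-up name):
//   val fileList = loadFiles("manifest.txt")
//   sparkSession.read.json(fileList: _*)
```

The `: _*` ascription expands the sequence into the varargs `json(paths: String*)` overload, so every path is handed to the reader individually.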



My second idea was that Spark is using this time to create a schema for all of the data. But I ruled this out by manually specifying a schema:



val schema = createSchema()
sparkSession.read.schema(schema).json(...)



Here the "hanging" period was the same as before, though the computation overall was much quicker.
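For concreteness, a manually specified schema can be built from Spark SQL's `StructType` and `StructField`. The sketch below is only an illustration: the real `createSchema()` is not shown in this post, and the field names `id`, `timestamp` and `payload` are hypothetical.

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical schema; the actual fields of the JSON data are not shown here.
def createSchema(): StructType = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("timestamp", LongType, nullable = true),
  StructField("payload", StringType, nullable = true)
))

// Usage with Spark (not run here; the path is elided as in the post):
//   val schema = createSchema()
//   sparkSession.read.schema(schema).json(...)
```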



So I'm not really sure what's going on or how to diagnose it. Anyone else run into this?
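One way to make the "hang" measurable might be to time the `read.json(...)` call separately from the first action, to see which of the two the delay falls into. The `timed` helper below is a hypothetical utility written for this purpose, not part of Spark:

```scala
// Runs `block`, prints how long it took in milliseconds, and returns its result.
def timed[A](label: String)(block: => A): A = {
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  println(s"$label took $elapsedMs ms")
  result
}

// Usage with Spark (not run here; the path is elided as in the post):
//   val df = timed("read.json") { sparkSession.read.json(...) }
//   val n  = timed("count")     { df.count() }
```

If most of the wall-clock time is reported for `read.json`, the delay is happening on the driver before any job is submitted; if it is reported for `count`, it is part of the job itself.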










apache-spark

edited 1 hour ago by thebluephantom
asked 4 hours ago by Paul Siegel




























