Spark dataset processing: Scala vs DataFrame API

I have an application that processes roughly 100 million Customer records after gathering them into a Dataset[Customer]. A Customer has a Seq[Order] field, and some operations look at the Orders in aggregate. For example, one operation finds a "primary" Order, loops through the Seq to find a matching "secondary" Order, and writes out the matching primary/secondary pairs. All processing happens within the context of a single Customer.
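
Roughly, the first approach looks like the sketch below. This is heavily simplified: Customer, Order, their fields, and the "matching" rule (same amount) are illustrative stand-ins for our real types and business logic.

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Order(orderId: Long, customerId: Long, kind: String, amount: Double)
    case class Customer(customerId: Long, orders: Seq[Order])

    // Plain Scala, testable without a SparkSession: pair each "primary" Order
    // with a matching "secondary" Order from the same Customer.
    def findPairs(c: Customer): Seq[(Order, Order)] = {
      val primaries = c.orders.filter(_.kind == "primary")
      primaries.flatMap { p =>
        c.orders.find(o => o.kind == "secondary" && o.amount == p.amount)
                .map(s => (p, s))
      }
    }

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val customers: Dataset[Customer] = Seq(
      Customer(1L, Seq(
        Order(10L, 1L, "primary", 99.0),
        Order(11L, 1L, "secondary", 99.0)))
    ).toDS()

    val pairs: Dataset[(Order, Order)] = customers.flatMap(findPairs _)

The appeal is that findPairs is a pure function over one Customer, so the unit tests don't need Spark at all.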



In another part of the application, we received working Spark SQL that did something similar, and due to deadline pressure we converted the SQL directly into DataFrame API code (written as a series of DataFrame => DataFrame transform functions). That code flatMaps each Customer into a Dataset[Order] and then repeatedly applies partitionBy/groupBy steps to add the necessary columns.
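
Continuing from the sketch above, the converted code looks something like this (again simplified; only one window step is shown where the real code chains several, and the column names are made up):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Explode each Customer into its Orders, then add per-customer columns
    // with window functions -- each partitionBy/groupBy step is a shuffle.
    val orders: Dataset[Order] = customers.flatMap(_.orders)

    val byCustomer = Window.partitionBy($"customerId")

    val annotated = orders.toDF()
      .withColumn("maxAmount", max($"amount").over(byCustomer))
      .withColumn("isPrimary", $"kind" === "primary")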



We are now trying to consolidate these two approaches, rewriting one if necessary. I think the former approach is better: gathering everything into a single unit of Customer and then writing basic, easily testable Scala functions to process it seems clean and straightforward, and it reduces our dependency on Spark APIs. Are there arguments for the other way, i.e. would the DataFrame API perform better? At the very least, I think the second approach would need to be rewritten so that we gather a Dataset[Order] up front and minimize the groupBys in the transform functions.
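
For instance, one hypothetical restructuring I had in mind (reusing orders and findPairs from the sketches above) would group by customer once and do all the per-customer work in plain Scala:

    // groupByKey + flatMapGroups replaces the repeated partitionBy/groupBy
    // steps with a single shuffle, and reuses the same plain-Scala logic.
    val pairsFromOrders: Dataset[(Order, Order)] = orders
      .groupByKey(_.customerId)
      .flatMapGroups((id, os) => findPairs(Customer(id, os.toSeq)))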



Apologies if there's not enough context to answer either way.



Thanks in advance!

      scala apache-spark

      asked Nov 21 at 2:01

      ITnotIT

      42