Spark dataset processing: Scala vs DataFrame API

I have an application that processes roughly 100 million Customer records after gathering them into a Dataset[Customer]. A Customer has a Seq[Order] field, and some operations look at the Orders in aggregate. For example, one operation finds a "primary" Order, loops through the Seq to find a matching "secondary" Order, and writes out the matching primary/secondary pairs. All processing happens within the context of a single Customer.
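
Roughly, the first approach looks like the sketch below. This is heavily simplified: Customer, Order, their fields, and the "matching" rule (same amount) are illustrative stand-ins for our real types and business logic.

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Order(orderId: Long, customerId: Long, kind: String, amount: Double)
    case class Customer(customerId: Long, orders: Seq[Order])

    // Plain Scala, testable without a SparkSession: pair each "primary" Order
    // with a matching "secondary" Order from the same Customer.
    def findPairs(c: Customer): Seq[(Order, Order)] = {
      val primaries = c.orders.filter(_.kind == "primary")
      primaries.flatMap { p =>
        c.orders.find(o => o.kind == "secondary" && o.amount == p.amount)
                .map(s => (p, s))
      }
    }

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val customers: Dataset[Customer] = Seq(
      Customer(1L, Seq(
        Order(10L, 1L, "primary", 99.0),
        Order(11L, 1L, "secondary", 99.0)))
    ).toDS()

    val pairs: Dataset[(Order, Order)] = customers.flatMap(findPairs _)

The appeal is that findPairs is a pure function over one Customer, so the unit tests don't need Spark at all.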



In another part of the application, we received working Spark SQL that did something similar, and due to deadline pressure we converted the SQL directly into DataFrame API code (written as a series of DataFrame => DataFrame transform functions). That code flatMaps each Customer into a Dataset[Order] and then repeatedly applies partitionBy/groupBy steps to add the necessary columns.
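
Continuing from the sketch above, the converted code looks something like this (again simplified; only one window step is shown where the real code chains several, and the column names are made up):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Explode each Customer into its Orders, then add per-customer columns
    // with window functions -- each partitionBy/groupBy step is a shuffle.
    val orders: Dataset[Order] = customers.flatMap(_.orders)

    val byCustomer = Window.partitionBy($"customerId")

    val annotated = orders.toDF()
      .withColumn("maxAmount", max($"amount").over(byCustomer))
      .withColumn("isPrimary", $"kind" === "primary")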



We are now trying to consolidate these two approaches, rewriting one if necessary. I think the former approach is better: gathering everything into a single unit of Customer and then writing basic, easily testable Scala functions to process it seems clean and straightforward, and it reduces our dependency on Spark APIs. Are there arguments for the other way, i.e. would the DataFrame API perform better? At the very least, I think the second approach would need to be rewritten so that we gather a Dataset[Order] up front and minimize the groupBys in the transform functions.
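
For instance, one hypothetical restructuring I had in mind (reusing orders and findPairs from the sketches above) would group by customer once and do all the per-customer work in plain Scala:

    // groupByKey + flatMapGroups replaces the repeated partitionBy/groupBy
    // steps with a single shuffle, and reuses the same plain-Scala logic.
    val pairsFromOrders: Dataset[(Order, Order)] = orders
      .groupByKey(_.customerId)
      .flatMapGroups((id, os) => findPairs(Customer(id, os.toSeq)))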



Apologies if there's not enough context to answer either way.



Thanks in advance!

      scala apache-spark

      asked Nov 21 at 2:01

      ITnotIT

      42