Spark dataset processing: Scala vs DataFrame API
I have an application that processes roughly 100 million Customer records after gathering them into a Dataset[Customer]. Assume a Customer has a Seq[Order] field; some operations look at the Orders in aggregate. For example, one operation finds a "primary" Order, loops through the Seq to find a matching "secondary" Order, and writes out the matching primary/secondary pairs. All processing happens within the context of a single Customer.
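To give a feel for it, the per-Customer logic looks roughly like this (the case classes, the matching rule, and the input path are simplified placeholders, but the shape is the same):

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Order(orderId: Long, role: String, key: String)   // role: "primary" / "secondary"
    case class Customer(customerId: Long, orders: Seq[Order])
    case class OrderPair(customerId: Long, primaryId: Long, secondaryId: Long)

    // Plain Scala, unit-testable without a SparkSession.
    def pairOrders(c: Customer): Seq[OrderPair] =
      c.orders.filter(_.role == "primary").flatMap { primary =>
        c.orders
          .find(o => o.role == "secondary" && o.key == primary.key)  // hypothetical matching rule
          .map(secondary => OrderPair(c.customerId, primary.orderId, secondary.orderId))
      }

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val customers: Dataset[Customer] = spark.read.parquet("/data/customers").as[Customer]
    val pairs: Dataset[OrderPair]    = customers.flatMap(pairOrders _)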
In another part of the application, we received working Spark SQL that did something similar, and due to deadline pressure we converted the SQL directly into DataFrame API code (written as a series of DataFrame => DataFrame transform functions). To do so, the code flatMaps each Customer into its Orders (effectively a Dataset[Order]) and then repeatedly applies partitionBy/groupBy to add the necessary columns.
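The SQL-derived side looks roughly like this (column names are placeholders, and in reality there are many more of these transform functions chained together):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Explode each Customer row's orders array into one row per Order.
    def explodeOrders(customers: DataFrame): DataFrame =
      customers
        .select(col("customerId"), explode(col("orders")).as("order"))
        .select(col("customerId"), col("order.*"))

    // One of the DataFrame => DataFrame transforms: use a window partitioned by
    // customer to pull the primary order's id onto every row for that customer.
    def addPrimaryOrderId(orders: DataFrame): DataFrame = {
      val byCustomer = Window.partitionBy("customerId")
      orders.withColumn(
        "primaryOrderId",
        max(when(col("role") === "primary", col("orderId"))).over(byCustomer))
    }

    // customersDF: a DataFrame with customerId and an orders array-of-struct column.
    val orderLevel = explodeOrders(customersDF).transform(addPrimaryOrderId)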
We are trying to consolidate these two approaches, rewriting one of them if necessary. I think the former approach is better: gathering everything into a single Customer unit and then writing plain, easily testable Scala functions to process it seems clean and straightforward, and it reduces our dependence on Spark APIs. Are there arguments for the other way, i.e. would the DataFrame API perform better? At the very least I think the DataFrame version would need to be rewritten so that we start from a Dataset[Order] and minimize the groupBys in the transform functions, along the lines of the sketch below.
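What I'm picturing for the consolidation is something like this: keep the order-level input that the SQL-derived code uses, but do a single shuffle with groupByKey and reuse the same plain-Scala logic per customer (types reuse the hypothetical case classes from the first sketch):

    import org.apache.spark.sql.Dataset

    case class FlatOrder(customerId: Long, orderId: Long, role: String, key: String)

    // Single shuffle: regroup the flat orders by customer, then run the same
    // plain-Scala pairing logic that the Dataset[Customer] version uses.
    def pairByCustomer(orders: Dataset[FlatOrder]): Dataset[OrderPair] = {
      import orders.sparkSession.implicits._
      orders
        .groupByKey(_.customerId)
        .flatMapGroups { (customerId, orderIter) =>
          val all       = orderIter.toSeq
          val primaries = all.filter(_.role == "primary")
          primaries.flatMap { p =>
            all.find(o => o.role == "secondary" && o.key == p.key)
              .map(s => OrderPair(customerId, p.orderId, s.orderId))
          }
        }
    }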
Apologies if there's not enough context to answer either way.
Thanks in advance!
scala apache-spark
asked Nov 21 at 2:01
ITnotIT