What is happening during the “down time” when Spark is reading in big data sets on S3?
I have a bunch of JSON data in AWS S3 - let's say 100k files, each around 5MB - and I'm using Spark 2.2's DataFrameReader to read and process them via:
sparkSession.read.json(...)
I've found that Spark will just sort of hang for 5 minutes or so before beginning the computation. This can take hours for larger data sets. When I say "hang" I mean that the terminal visualization indicating what stage the cluster is working on and how far along it is doesn't appear - as far as I can tell it is somehow in between stages.
What is Spark doing during this period, and how can I help it go faster?
I had two ideas, but both of them appear to be wrong.
My first idea was that Spark is attempting to list all of the files that it will need to do the computation. I tested this by actually creating a list of files offline and feeding them to Spark directly rather than using glob syntax:
val fileList = loadFiles()
sparkSession.read.json(fileList:_*)
This actually caused the "hanging" period to last longer!
My second idea was that Spark is using this time to create a schema for all of the data. But I ruled this out by manually specifying a schema:
val schema = createSchema()
sparkSession.read.schema(schema).json(...)
Here the "hanging" period was the same as before, though the computation overall was much quicker.
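One way to narrow this down further, as a minimal sketch rather than a definitive diagnosis: time the call that builds the DataFrame separately from the first action, to see which of the two accounts for the "hanging" period. The `timed` helper below is hypothetical (it is not part of Spark); `sparkSession`, `schema`, and `fileList` in the commented usage are the values from the snippets above.

```scala
// Hypothetical helper (not a Spark API): runs a block of code and
// returns its result together with the elapsed wall-clock time in seconds.
def timed[A](label: String)(block: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = block
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"$label%s took $seconds%.1f s")
  (result, seconds)
}

// Usage against the reads described above (requires a running SparkSession):
//
//   val (df, setupSecs) = timed("read.schema(...).json(...)") {
//     sparkSession.read.schema(schema).json(fileList: _*)
//   }
//   val (rows, actionSecs) = timed("first action") {
//     df.count()
//   }
```

If most of the time is reported for the first call, the delay happens before any job is submitted; if it is reported for the action, the delay is part of the job itself.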
So I'm not really sure what's going on or how to diagnose it. Anyone else run into this?
apache-spark
edited 1 hour ago
thebluephantom
asked 4 hours ago
Paul Siegel