What is happening during the “down time” when Spark is reading in big data sets on S3?

I have a bunch of JSON data in AWS S3 - let's say 100k files, each around 5MB - and I'm using Spark 2.2's DataFrameReader to read and process them via:

sparkSession.read.json(...)

I've found that Spark will just sort of hang for 5 minutes or so before beginning the computation. This can take hours for larger data sets. When I say "hang" I mean that the terminal visualization indicating what stage the cluster is working on and how far along it is doesn't appear - as far as I can tell it is somehow in between stages.

What is Spark doing during this period, and how can I help it go faster?

I had two ideas, but both of them appear to be wrong.

My first idea was that Spark is attempting to list all of the files that it will need to do the computation. I tested this by actually creating a list of files offline and feeding them to Spark directly rather than using glob syntax:

val fileList = loadFiles()
sparkSession.read.json(fileList: _*)

This actually caused the "hanging" period to last longer!

My second idea was that Spark is using this time to create a schema for all of the data. But I ruled this out by manually specifying a schema:

val schema = createSchema()
sparkSession.read.schema(schema).json(...)

Here the "hanging" period was the same as before, though the computation overall was much quicker.

So I'm not really sure what's going on or how to diagnose it. Anyone else run into this?
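To narrow down where the pause sits, one option might be to time the read call and the first action separately. The sketch below uses only the Scala standard library; `ReadTiming` and `timed` are hypothetical names, not part of Spark, and the usage in the trailing comment assumes the same `sparkSession` and elided paths as above.

```scala
object ReadTiming {
  // Hypothetical helper (not part of Spark): runs `block` once and returns
  // its result together with the elapsed wall-clock time in milliseconds.
  def timed[A](label: String)(block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    println(s"$label took $elapsedMs ms")
    (result, elapsedMs)
  }
}

// Intended use (assumes an existing `sparkSession`; paths elided as above):
//   val (df, _) = ReadTiming.timed("read.json")(sparkSession.read.json(...))
//   val (n, _)  = ReadTiming.timed("count")(df.count())
```

Comparing the two timings would at least show whether the "hang" happens inside the `read.json` call itself or only once an action runs.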

asked 4 hours ago by Paul Siegel
edited 1 hour ago by thebluephantom

apache-spark

