Translate function from python to pyspark

I would like to compare two PySpark DataFrames and get the differences in a new table.

I tested a way to do it with pandas:

My dataframe:

import numpy as np
import pandas as pd

data = {'name': ['NO_VALUE', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, -999999, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df3 = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df3

My reference dataframe:

data_ref = {'name': ['Jaso', 'Molly', 'Tina', 'Jake', 'Amy'],
            'year': [2012, 202, 2013, 2014, 2014],
            'reports': [4, 24, 31, 2, 3]}
df_ref3 = pd.DataFrame(data_ref, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df_ref3

Compare rows:

def compare_datasets(df, df_ref):
    # Boolean mask of differing cells, stacked into an (id, col) MultiIndex
    ne_stacked = (df != df_ref).stack()
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['id', 'col']
    # Row/column positions of the differing cells
    difference_locations = np.where(df != df_ref)
    changed_from = df.values[difference_locations]
    changed_to = df_ref.values[difference_locations]
    # One row per differing cell, with the value in df and in df_ref
    error_test = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    return error_test

compare_datasets(df3, df_ref3)

However, I would like to do this in PySpark. Does anyone have an idea?

Thanks!

edited Nov 23 '18 at 10:41 by Ali AzG
asked Nov 23 '18 at 9:57 by MVachelard

It seems you are using pandas DataFrames, not PySpark. In PySpark you have to convert your function into a UDF.
– Ali AzG, Nov 23 '18 at 10:20

I know these are pandas DataFrames. The point is that I now want to write the same function for PySpark DataFrames.
– MVachelard, Nov 23 '18 at 11:27

Tags: python, pyspark, pyspark-sql

1 Answer

It is probably difficult to reproduce this behavior exactly, so here is a partial solution. subtract returns the rows of df that do not appear in df_ref:

df.show()
+----------+--------+-------+-------+
|     index|    name|   year|reports|
+----------+--------+-------+-------+
|   Cochice|NO_VALUE|   2012|      4|
|      Pima|   Molly|-999999|     24|
|Santa Cruz|    Tina|   2013|     31|
|  Maricopa|    Jake|   2014|      2|
|      Yuma|     Amy|   2014|      3|
+----------+--------+-------+-------+

df_ref.show()
+----------+-----+----+-------+
|     index| name|year|reports|
+----------+-----+----+-------+
|   Cochice| Jaso|2012|      4|
|      Pima|Molly|2012|     24|
|Santa Cruz| Tina|2013|     31|
|  Maricopa| Jake|2014|      2|
|      Yuma|  Amy|2014|      3|
+----------+-----+----+-------+

df.subtract(df_ref).show()
+-------+--------+-------+-------+
|  index|    name|   year|reports|
+-------+--------+-------+-------+
|   Pima|   Molly|-999999|     24|
|Cochice|NO_VALUE|   2012|      4|
+-------+--------+-------+-------+

Or you can use the slower, column-by-column version:

for column_name in df_ref.columns:
    # Values of this column that appear in df but not in df_ref
    out = df.select(column_name).subtract(df_ref.select(column_name))
    # first() returns None when the difference is empty
    if out.first():
        print(out.collect())

[Row(name=u'NO_VALUE')]
[Row(year=-999999)]

edited Nov 23 '18 at 14:09
answered Nov 23 '18 at 13:48 by Steven
