Translate function from python to pyspark












I would like to compare two pyspark dataframes and get the differences in a new table.



I tested a way to do it in plain Python with pandas:



My dataframe:

import numpy as np
import pandas as pd

data = {'name': ['NO_VALUE', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, -999999, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df3 = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df3


My reference dataframe:

data_ref = {'name': ['Jaso', 'Molly', 'Tina', 'Jake', 'Amy'],
            'year': [2012, 202, 2013, 2014, 2014],
            'reports': [4, 24, 31, 2, 3]}
df_ref3 = pd.DataFrame(data_ref, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df_ref3


Compare rows:



def compare_datasets(df, df_ref):
    # Boolean mask of differing cells, stacked into an (id, col) MultiIndex
    ne_stacked = (df != df_ref).stack()
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['id', 'col']
    # Row/column positions of the differing cells
    difference_locations = np.where(df != df_ref)
    changed_from = df.values[difference_locations]
    changed_to = df_ref.values[difference_locations]
    error_test = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    return error_test

compare_datasets(df3, df_ref3)


However, I would like to do the same thing with PySpark dataframes. Does anyone have an idea?



Thanks!










  • It seems you are using pandas dataframes, not PySpark! In PySpark you have to convert your function into a UDF!
    – Ali AzG
    Nov 23 '18 at 10:20










  • I know I have pandas dataframes. The point is that I now want to write the same function for PySpark dataframes.
    – MVachelard
    Nov 23 '18 at 11:27
















python pyspark pyspark-sql






edited Nov 23 '18 at 10:41
Ali AzG

asked Nov 23 '18 at 9:57
MVachelard












1 Answer
It is probably difficult to reproduce this behavior exactly.
Here is a partial solution: df.subtract(df_ref) returns the rows of df that have no identical row in df_ref.



df.show()
+----------+--------+-------+-------+
|     index|    name|   year|reports|
+----------+--------+-------+-------+
|   Cochice|NO_VALUE|   2012|      4|
|      Pima|   Molly|-999999|     24|
|Santa Cruz|    Tina|   2013|     31|
|  Maricopa|    Jake|   2014|      2|
|      Yuma|     Amy|   2014|      3|
+----------+--------+-------+-------+

df_ref.show()
+----------+-----+----+-------+
|     index| name|year|reports|
+----------+-----+----+-------+
|   Cochice| Jaso|2012|      4|
|      Pima|Molly|2012|     24|
|Santa Cruz| Tina|2013|     31|
|  Maricopa| Jake|2014|      2|
|      Yuma|  Amy|2014|      3|
+----------+-----+----+-------+

df.subtract(df_ref).show()
+-------+--------+-------+-------+
|  index|    name|   year|reports|
+-------+--------+-------+-------+
|   Pima|   Molly|-999999|     24|
|Cochice|NO_VALUE|   2012|      4|
+-------+--------+-------+-------+


Or you can use a slower, column-by-column approach:

for col in df_ref.columns:
    # Values of this column that appear in df but not in df_ref
    out = df.select(col).subtract(df_ref.select(col))
    if out.first():
        print(out.collect())

[Row(name=u'NO_VALUE')]
[Row(year=-999999)]





edited Nov 23 '18 at 14:09

answered Nov 23 '18 at 13:48
Steven





























