Python - Replace strings in a data frame

I'm trying to replace French addresses in a dataframe with a placeholder, using a list of street types and a regular expression.



def adresses(df):
    liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV', 'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau', 'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie', 'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']
    for i in liste_adresses:
        df['C'] = df['C'].str.replace(r'[0-9]+(,|\s+)i\s+\w+\s+(\w+)?(\s+)?(\w+)?(\s+)?([0-9]{5})?(\s+)?\w+?([0-9]{5})?','<address>')
    return df


My dataframe:

A        B          C
French   house      I live in 15 rue Louis Philippe 75001 Neuilly
English  house      my address: 101-102 bd Charles de Gaulle 75001 Paris
French   apartment  my name is Liam
French   house      Hello George!
English  apartment  This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it


In my output, nothing is replaced.

Expected output:

A        B          C
French   house      I live in <address>
English  house      my address: <address>
French   apartment  my name is Liam
French   house      Hello George!
English  apartment  This is wrong: <address> and I'm not happy with it









  • Nothing happens because the loop variable i, which holds the elements of liste_adresses, is never used: the i inside the pattern '[0-9]+(,|\s+)i\s+...' is the literal letter i, not the variable's value (for example 'allée'). It should be something like '[0-9]+(,|\s+)' + i + '\s+...'. Then something does happen, although it is not yet the expected output.
    – Ben.T, Nov 23 '18 at 14:35

  • In your full data, do the strings in column C always end with the address? In other words, could there be more characters after it, as in "This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it"?
    – Ben.T, Nov 23 '18 at 14:38

  • @Ben.T Not necessarily, I'll edit my dataframe. Thank you
    – marin, Nov 23 '18 at 14:39

  • OK, then it becomes a more difficult task, in my opinion. Do you have a list of cities, similar to liste_adresses, or are there too many cities in your data?
    – Ben.T, Nov 23 '18 at 14:52

  • Many cities in my data :(
    – marin, Nov 23 '18 at 14:54
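
To illustrate the point of the first comment, here is a minimal sketch that uses only the standard `re` module on one string from column C, rather than the full function from the question. The names `TAIL`, `literal_pattern` and `spliced_pattern` are made up for this illustration, and `re.escape` is an extra precaution that the comment does not mention.

```python
import re

# Tail of the pattern from the question, unchanged.
TAIL = r'\s+\w+\s+(\w+)?(\s+)?(\w+)?(\s+)?([0-9]{5})?(\s+)?\w+?([0-9]{5})?'

text = "I live in 15 rue Louis Philippe 75001 Neuilly"

# As posted: the letter "i" is part of the pattern, so the loop variable is never used.
literal_pattern = r'[0-9]+(,|\s+)i' + TAIL
unchanged = re.sub(literal_pattern, '<address>', text)

# As the comment suggests: splice the value of the loop variable into the pattern.
street = 'rue'  # one element of liste_adresses
spliced_pattern = r'[0-9]+(,|\s+)' + re.escape(street) + TAIL
changed = re.sub(spliced_pattern, '<address>', text)

print(unchanged == text)       # the literal pattern finds nothing to replace
print('<address>' in changed)  # the spliced pattern matches, though not cleanly
```

The spliced pattern still stops short of the expected output, which is consistent with the comment: the optional groups at the end do not reliably cover the postal code and the city.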
















python pandas dataframe






edited Nov 23 '18 at 14:43 by Malik Asad
asked Nov 23 '18 at 13:53 by marin








1 Answer

The following solution may not work in every case. Because the address ends with either the postal code or a city name that you don't know in advance, one way is to look for:

  1. a run of digits, [0-9]+: every address starts with a number

  2. any characters, (.*): for example to catch the -102 in 101-102

  3. any word from liste_adresses, built with '|'.join(liste_adresses)

  4. the street name followed by the 5-digit postal code, (.*)[0-9]{5}

  5. an optional city name, ([^.|\n]{0,2}[A-Z][a-z]*)*: here I assume that a dot or a newline after the postal code means the address is over. The group therefore matches between 0 and 2 characters that are not a dot or a newline, [^.|\n]{0,2} (the | inside a character class is a literal pipe, so pipes are excluded as well), then one uppercase letter [A-Z], then any lowercase letters [a-z]* up to the end of the word. The final * repeats the group to catch cities made of two words, like Saint-Denis.

Putting it together:



liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV',
                  'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau',
                  'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie',
                  'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']

reg = r'[0-9]+(.*)(' + '|'.join(liste_adresses) + r')(.*)[0-9]{5}([^.|\n]{0,2}[A-Z][a-z]*)*'

# regex=True is needed in pandas >= 2.0, where str.replace treats the pattern as a literal by default
print(df['C'].str.replace(reg, '<address>', regex=True))
0 I live in <address>
1 my address: <address>
2 my name is Liam
3 Hello George!
4 This is wrong: <address> and I'm not happy wit...
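
One of the "specific cases" where the pattern above can misfire is a short keyword such as 'av' matching inside an ordinary word like 'avec'. The sketch below is a possible variation, not part of the original answer: it wraps the street-type alternation in word boundaries and makes only that group case-insensitive, so the list no longer needs both 'rue' and 'Rue'. The names `street_types`, `address_re` and `mask_addresses` are hypothetical, and the scoped `(?i:...)` flag assumes Python 3.6 or later.

```python
import re

# Lowercase keywords only; the scoped (?i:...) flag below also accepts 'Rue', 'BD', ...
street_types = ['allée', 'rue', 'avenue', 'av', 'boulevard', 'bd', 'carreau',
                'carrefour', 'place', 'voie', 'villa', 'route', 'quai']

address_re = re.compile(
    r'[0-9]+(.*)'                                # street number, then e.g. '-102' or ', '
    r'\b(?i:' + '|'.join(street_types) + r')\b'  # street type as a whole word, any case
    r'(.*)[0-9]{5}'                              # street name up to the 5-digit postal code
    r'([^.|\n]{0,2}[A-Z][a-z]*)*'                # optional capitalised city name
)


def mask_addresses(text):
    """Replace every match of address_re in text with '<address>'."""
    return address_re.sub('<address>', text)


samples = [
    "I live in 15 rue Louis Philippe 75001 Neuilly",
    "This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it",
    "Ticket 12 avec le code 75001",  # 'av' occurs only inside 'avec'
]
for s in samples:
    print(mask_addresses(s))
```

The city group stays case-sensitive on purpose: if the whole pattern were case-insensitive, [A-Z] would also accept lowercase letters and the match would run on into the rest of the sentence. With pandas, the compiled pattern should be usable as `df['C'].str.replace(address_re, '<address>', regex=True)`.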





answered Nov 23 '18 at 16:05 by Ben.T





























