Python - Replace strings in a data frame

I'm trying to replace French addresses in a dataframe column, using a list of street-type words and a regular expression.

def adresses(df):
    liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV', 'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau', 'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie', 'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']

    for i in liste_adresses:
        df['C'] = df['C'].str.replace(r'[0-9]+(,|\s+)i\s+\w+\s+(\w+)?(\s+)?(\w+)?(\s+)?([0-9]{5})?(\s+)?\w+?([0-9]{5})?','<address>')

    return df

My dataframe:

A        B          C
French   house      I live in 15 rue Louis Philippe 75001 Neuilly
English  house      my address: 101-102 bd Charles de Gaulle 75001 Paris
French   apartment  my name is Liam
French   house      Hello George!
English  apartment  This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it

With my code, nothing is replaced: the output is identical to the input.

Expected output:

A        B          C
French   house      I live in <address>
English  house      my address: <address>
French   apartment  my name is Liam
French   house      Hello George!
English  apartment  This is wrong: <address> and I'm not happy with it

edited Nov 23 '18 at 14:43 by Malik Asad

asked Nov 23 '18 at 13:53 by marin

The reason nothing happens is that the loop variable i, which holds each element of liste_adresses, is written inside the regex string itself, '[0-9]+(,|\s+)i\s+...', so the pattern looks for the literal letter i rather than its value (for example 'allée'). It should be something like '[0-9]+(,|\s+)' + i + '\s+...'; then something does happen, although it is still not the expected output.

– Ben.T
Nov 23 '18 at 14:35
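To illustrate the point made in this comment, here is a minimal sketch using only the standard `re` module. The names `text`, `street_word`, `literal` and `interpolated` are made up for the illustration, and the pattern is shortened to its first few tokens; it is not the asker's full regex.

```python
import re

text = "I live in 15 rue Louis Philippe 75001 Neuilly"
street_word = "rue"  # one value the loop variable would take

# Writing the variable's name inside the string only adds the letter "i".
literal = r'[0-9]+(,|\s+)i\s+\w+'

# Concatenating the variable's value puts the street word into the pattern.
# re.escape is an extra precaution in case a word contains regex metacharacters.
interpolated = r'[0-9]+(,|\s+)' + re.escape(street_word) + r'\s+\w+'

literal_match = re.search(literal, text)
interpolated_match = re.search(interpolated, text)

print(literal_match)                  # no match
print(interpolated_match.group(0))    # the number, the street word and the next word
```

With the literal pattern there is no digit followed by whitespace and the letter i in the sample sentence, which would explain why the column comes back unchanged.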

In your full data, do the strings in column C always end with the address? In other words, could there be more characters after it, as in This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it?

– Ben.T
Nov 23 '18 at 14:38

@Ben.T Not necessarily, I'll edit my dataframe. Thank you

– marin
Nov 23 '18 at 14:39

OK, then it becomes a more difficult task in my opinion. Do you have a list of cities, similar to liste_adresses? Or are there too many cities in your data?

– Ben.T
Nov 23 '18 at 14:52

Many cities in my data :(

– marin
Nov 23 '18 at 14:54

Tags: python, pandas, dataframe


1 Answer

The following solution may not work in every case. Because the address ends with either the postal code or a city name that you don't know in advance, one approach is to look for:

a string that starts with digits, '[0-9]+': all the addresses start with a number

some characters, (.*): for example to catch -102

any word from liste_adresses, using '|'.join(liste_adresses), followed by some more characters (.*) for the street name

the 5-digit postal code, [0-9]{5}

the city name, if present, with ([^.|\n]{0,2}[A-Z][a-z]*)*: here I assume that a dot or a newline after the postal code means the address is over. So match between 0 and 2 characters that are not a dot or a newline, [^.|\n]{0,2}, then one uppercase letter, [A-Z], then any lowercase letters, [a-z]*, up to the end of the word. The trailing * repeats the group so that city names made of two words, like Saint-Denis, are caught.
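The last two items of the list can be checked in isolation. The sketch below uses only the standard `re` module; the sample strings and the names `tail`, `samples` and `matches` are made up for the illustration.

```python
import re

# Postal code followed by the optional, repeatable city-name group.
tail = re.compile(r'[0-9]{5}([^.|\n]{0,2}[A-Z][a-z]*)*')

samples = [
    "75001 Neuilly",             # one-word city after the postal code
    "93200 Saint-Denis. Merci",  # two-part city, then a dot ends the address
    "75014 and I'm not happy",   # lowercase word right after the postal code
]

matches = [tail.search(s).group(0) for s in samples]
for sample, matched in zip(samples, matches):
    print(repr(sample), "->", repr(matched))
```

In the third sample the group matches zero times, because no uppercase letter is found within two characters of the postal code. The same rule means a capitalised word that directly follows the postal code would be treated as a city, which is one of the cases where this approach can go wrong.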

Putting it all together:

liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV',
                  'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau',
                  'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie',
                  'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']

# number, anything, a street word, anything, 5-digit postal code, optional city
reg = r'[0-9]+(.*)(' + '|'.join(liste_adresses) + r')(.*)[0-9]{5}([^.|\n]{0,2}[A-Z][a-z]*)*'

# regex=True is needed on recent pandas versions, where str.replace is literal by default
print(df['C'].str.replace(reg, '<address>', regex=True))

0                                  I live in <address>
1                                my address: <address>
2                                      my name is Liam
3                                        Hello George!
4    This is wrong: <address> and I'm not happy wit...
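For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the sample dataframe from the question. It is a variation on the pattern above, not the exact regex of this answer: the word boundaries (`\b`), the lazy `.*?`, the case-insensitive group `(?i:...)` and the names `street_words` and `address_re` are assumptions added for the illustration. The scoped flag needs Python 3.6 or later.

```python
import re
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["French", "English", "French", "French", "English"],
    "B": ["house", "house", "apartment", "house", "apartment"],
    "C": [
        "I live in 15 rue Louis Philippe 75001 Neuilly",
        "my address: 101-102 bd Charles de Gaulle 75001 Paris",
        "my name is Liam",
        "Hello George!",
        "This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it",
    ],
})

# Lowercase only: case is handled by the scoped (?i:...) flag below.
street_words = ["allée", "rue", "avenue", "av", "boulevard", "bd", "carreau",
                "carrefour", "place", "voie", "villa", "route", "quai"]
alternation = "|".join(re.escape(word) for word in street_words)

address_re = re.compile(
    r"[0-9]+.*?"                       # street number, then anything (lazy)
    r"\b(?i:" + alternation + r")\b"   # a whole street word, any case
    r".*?[0-9]{5}"                     # street name up to the 5-digit postal code
    r"(?:[^.|\n]{0,2}[A-Z][a-z]*)*"    # optional capitalised city name
)

# A compiled pattern must be passed together with regex=True.
result = df["C"].str.replace(address_re, "<address>", regex=True)
print(result.tolist())
```

The word boundaries are there so that short entries such as av or bd are not matched inside longer words; the case-insensitive flag is limited to the street words because applying it to the whole pattern would let [A-Z] match lowercase letters and swallow the rest of the sentence.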

answered Nov 23 '18 at 16:05 by Ben.T
