Python - Replace strings in a data frame
I'm trying to replace French addresses in a dataframe column with a placeholder. I'm using a list of street types and a regular expression.
def adresses(df):
    liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV', 'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau', 'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie', 'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']
    for i in liste_adresses:
        df['C'] = df['C'].str.replace(r'[0-9]+(,|\s+)i\s+\w+\s+(\w+)?(\s+)?(\w+)?(\s+)?([0-9]{5})?(\s+)?\w+?([0-9]{5})?','<address>')
    return df
My dataframe:
A        B          C
French   house      I live in 15 rue Louis Philippe 75001 Neuilly
English  house      my address: 101-102 bd Charles de Gaulle 75001 Paris
French   apartment  my name is Liam
French   house      Hello George!
English  apartment  This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it
In my output, nothing changes.
Expected output:
A        B          C
French   house      I live in <address>
English  house      my address: <address>
French   apartment  my name is Liam
French   house      Hello George!
English  apartment  This is wrong: <address> and I'm not happy with it
python pandas dataframe
2
The reason nothing happens is that the variable i, which holds the elements of liste_adresses, is written as a literal inside the regex you define, '[0-9]+(,|\s+)i\s+...', so the pattern looks for the letter i and not its value (for example 'allée'). It should be more like '[0-9]+(,|\s+)' + i + '\s+...'; then something happens, although it is still not the expected output.
– Ben.T
Nov 23 '18 at 14:35
2
In your full data, do the strings in column C always end with the address? I mean, could there be more characters after it, as in "This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it"?
– Ben.T
Nov 23 '18 at 14:38
2
@Ben.T Not necessarily, I'll edit my dataframe. Thank you
– marin
Nov 23 '18 at 14:39
1
OK, then in my opinion it becomes a more difficult task. Do you have a list of cities, along the same lines as liste_adresses, or are there too many cities in your data?
– Ben.T
Nov 23 '18 at 14:52
Many cities in my data :(
– marin
Nov 23 '18 at 14:54
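To illustrate the point made in the first comment, here is a minimal sketch using Python's standard re module on one sample string from the question. The name street_type is an illustrative stand-in for the loop variable i, and the call to re.escape is an extra precaution that the thread does not mention.

```python
import re

text = "I live in 15 rue Louis Philippe 75001 Neuilly"
street_type = "rue"  # stand-in for the loop variable i (illustrative name)

# The letter "i" inside the string is just a character of the pattern:
# the regex expects a literal "i" after the house number.
literal = r'[0-9]+(,|\s+)i\s+\w+'

# Concatenating the variable's value builds the pattern that was intended.
interpolated = r'[0-9]+(,|\s+)' + re.escape(street_type) + r'\s+\w+'

print(re.search(literal, text))                  # no match is found
print(re.search(interpolated, text).group(0))    # the number, street type and next word
```

As the comment notes, fixing the interpolation only makes the pattern match something; it does not by itself produce the expected output.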
edited Nov 23 '18 at 14:43 by Malik Asad
asked Nov 23 '18 at 13:53 by marin
1 Answer
The following solution may not work in every case. Because the address ends with either the postal code or a city name that you don't know in advance, I think one way is to look for:
- a leading number, '[0-9]+': all addresses start with a number
- some characters, (.*): for example to catch -102
- any word from liste_adresses, using '|'.join(liste_adresses)
- some more characters, (.*): the street name, and the city when it comes before the postal code
- the 5-digit postal code, [0-9]{5}
- the city name, if present, with ([^.|\n]{0,2}[A-Z][a-z]*)*: here I assume that a dot or a new line after the postal code means the address is over, so match between 0 and 2 characters that are not a dot or a new line, [^.|\n]{0,2} (inside the brackets the | is a literal pipe, so it is excluded too), then one upper-case letter, [A-Z], then any lower-case letters, [a-z]*, until the end of the word. The extra * at the end catches cities composed of two words, like Saint-Denis.
Putting it all together:
liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV',
                  'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau',
                  'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie',
                  'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']
reg = r'[0-9]+(.*)(' + '|'.join(liste_adresses) + r')(.*)[0-9]{5}([^.|\n]{0,2}[A-Z][a-z]*)*'
# regex=True is needed on pandas 2.0 and later, where str.replace matches literally by default
print(df['C'].str.replace(reg, '<address>', regex=True))
0    I live in <address>
1    my address: <address>
2    my name is Liam
3    Hello George!
4    This is wrong: <address> and I'm not happy wit...
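For readers who want to run the answer end to end, the following is a self-contained sketch under a few assumptions: the dataframe is rebuilt by hand from column C of the question, the name masked is illustrative, and regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string unless told otherwise.

```python
import pandas as pd

# Street-type keywords, copied from the question.
liste_adresses = ['allée', 'Allée', 'rue', 'Rue', 'avenue', 'Avenue', 'av', 'AV',
                  'boulevard', 'Boulevard', 'bd', 'Bd', 'carreau', 'Carreau',
                  'carrefour', 'Carrefour', 'place', 'Place', 'voie', 'Voie',
                  'villa', 'Villa', 'route', 'Route', 'quai', 'Quai']

# Column C of the sample dataframe, rebuilt by hand (assumption: A and B are not needed here).
df = pd.DataFrame({'C': [
    "I live in 15 rue Louis Philippe 75001 Neuilly",
    "my address: 101-102 bd Charles de Gaulle 75001 Paris",
    "my name is Liam",
    "Hello George!",
    "This is wrong: 4, rue Ledion Paris 75014 and I'm not happy with it",
]})

# number, anything, a street type, anything, 5-digit postal code, optional capitalised city words
reg = (r'[0-9]+(.*)(' + '|'.join(liste_adresses)
       + r')(.*)[0-9]{5}([^.|\n]{0,2}[A-Z][a-z]*)*')

# Replace every match in each row with the placeholder.
masked = df['C'].str.replace(reg, '<address>', regex=True)
print(masked.tolist())
```

One limitation worth keeping in mind: [A-Z] and [a-z] cover ASCII letters only, so a city name with an accented letter after the postal code (Orléans, for example) would be replaced only up to the accent.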
answered Nov 23 '18 at 16:05 by Ben.T