Merge two dataframes on string column - compound string column
I am trying to merge two huge dataframes (4+ millions each) that have the following structure:
Dataframe A:
date Fruit a b c d
01 "apple" 0 3 5 1
03 "apple" 8 2 7 2
02 "banana" 1 4 3 5
04 "banana" 3 5 2 6
03 "pineapple" 2 6 4 6
05 "pineapple" 3 5 7 9
Dataframe B:
date Fruits x y z
01 "apple, pear, strawberry" a n q
02 "banana, apple, coconut" b m p
03 "pineapple, pear, banana" c s o
04 "banana, apple, coconut" d f v
05 "pineapple, pear, banana" r ñ t
What I am trying to achieve is a third dataframe with the following structure:
Dataframe C:
date Fruit a b c d x y z
01 "apple" 0 3 5 1 a n q
03 "apple" 0 3 5 1 0 0 0
02 "banana" 1 4 3 5 b m p
04 "banana" 1 4 3 5 d f v
03 "pineapple" 2 6 4 6 c s o
05 "pineapple" 2 6 4 6 r ñ t
...
I had already tried something like:
test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)
But my kernel breaks, I also had tried using a for cycle to expand the fruit column of B dataframe into a 'fruit-b' column, keeping the data from the other B columns and then merging between the date column and the 'fruit-B' columns, but the time of execution is too high.
Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?
Fruit and Fruits columns type is string.
python pandas dataframe merge
add a comment |
I am trying to merge two huge dataframes (4+ millions each) that have the following structure:
Dataframe A:
date Fruit a b c d
01 "apple" 0 3 5 1
03 "apple" 8 2 7 2
02 "banana" 1 4 3 5
04 "banana" 3 5 2 6
03 "pineapple" 2 6 4 6
05 "pineapple" 3 5 7 9
Dataframe B:
date Fruits x y z
01 "apple, pear, strawberry" a n q
02 "banana, apple, coconut" b m p
03 "pineapple, pear, banana" c s o
04 "banana, apple, coconut" d f v
05 "pineapple, pear, banana" r ñ t
What I am trying to achieve is a third dataframe with the following structure:
Dataframe C:
date Fruit a b c d x y z
01 "apple" 0 3 5 1 a n q
03 "apple" 0 3 5 1 0 0 0
02 "banana" 1 4 3 5 b m p
04 "banana" 1 4 3 5 d f v
03 "pineapple" 2 6 4 6 c s o
05 "pineapple" 2 6 4 6 r ñ t
...
I had already tried something like:
test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)
But my kernel breaks, I also had tried using a for cycle to expand the fruit column of B dataframe into a 'fruit-b' column, keeping the data from the other B columns and then merging between the date column and the 'fruit-B' columns, but the time of execution is too high.
Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?
Fruit and Fruits columns type is string.
python pandas dataframe merge
What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.
– smci
Nov 23 '18 at 1:03
1
df_B.Fruits
is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"
– smci
Nov 23 '18 at 1:13
@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.
– Santiago Hernàndez
Nov 23 '18 at 14:11
add a comment |
I am trying to merge two huge dataframes (4+ millions each) that have the following structure:
Dataframe A:
date Fruit a b c d
01 "apple" 0 3 5 1
03 "apple" 8 2 7 2
02 "banana" 1 4 3 5
04 "banana" 3 5 2 6
03 "pineapple" 2 6 4 6
05 "pineapple" 3 5 7 9
Dataframe B:
date Fruits x y z
01 "apple, pear, strawberry" a n q
02 "banana, apple, coconut" b m p
03 "pineapple, pear, banana" c s o
04 "banana, apple, coconut" d f v
05 "pineapple, pear, banana" r ñ t
What I am trying to achieve is a third dataframe with the following structure:
Dataframe C:
date Fruit a b c d x y z
01 "apple" 0 3 5 1 a n q
03 "apple" 0 3 5 1 0 0 0
02 "banana" 1 4 3 5 b m p
04 "banana" 1 4 3 5 d f v
03 "pineapple" 2 6 4 6 c s o
05 "pineapple" 2 6 4 6 r ñ t
...
I had already tried something like:
test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)
But my kernel breaks, I also had tried using a for cycle to expand the fruit column of B dataframe into a 'fruit-b' column, keeping the data from the other B columns and then merging between the date column and the 'fruit-B' columns, but the time of execution is too high.
Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?
Fruit and Fruits columns type is string.
python pandas dataframe merge
I am trying to merge two huge dataframes (4+ millions each) that have the following structure:
Dataframe A:
date Fruit a b c d
01 "apple" 0 3 5 1
03 "apple" 8 2 7 2
02 "banana" 1 4 3 5
04 "banana" 3 5 2 6
03 "pineapple" 2 6 4 6
05 "pineapple" 3 5 7 9
Dataframe B:
date Fruits x y z
01 "apple, pear, strawberry" a n q
02 "banana, apple, coconut" b m p
03 "pineapple, pear, banana" c s o
04 "banana, apple, coconut" d f v
05 "pineapple, pear, banana" r ñ t
What I am trying to achieve is a third dataframe with the following structure:
Dataframe C:
date Fruit a b c d x y z
01 "apple" 0 3 5 1 a n q
03 "apple" 0 3 5 1 0 0 0
02 "banana" 1 4 3 5 b m p
04 "banana" 1 4 3 5 d f v
03 "pineapple" 2 6 4 6 c s o
05 "pineapple" 2 6 4 6 r ñ t
...
I had already tried something like:
test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)
But my kernel breaks, I also had tried using a for cycle to expand the fruit column of B dataframe into a 'fruit-b' column, keeping the data from the other B columns and then merging between the date column and the 'fruit-B' columns, but the time of execution is too high.
Is there a way of obtaining dataframe C using dataframe A and B that does not consume a lot of time and memory?
Fruit and Fruits columns type is string.
python pandas dataframe merge
python pandas dataframe merge
edited Nov 23 '18 at 14:11
asked Nov 22 '18 at 23:13
Santiago Hernàndez
184
184
What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.
– smci
Nov 23 '18 at 1:03
1
df_B.Fruits
is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"
– smci
Nov 23 '18 at 1:13
@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.
– Santiago Hernàndez
Nov 23 '18 at 14:11
add a comment |
What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.
– smci
Nov 23 '18 at 1:03
1
df_B.Fruits
is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"
– smci
Nov 23 '18 at 1:13
@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.
– Santiago Hernàndez
Nov 23 '18 at 14:11
What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.
– smci
Nov 23 '18 at 1:03
What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.
– smci
Nov 23 '18 at 1:03
1
1
df_B.Fruits
is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"– smci
Nov 23 '18 at 1:13
df_B.Fruits
is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"– smci
Nov 23 '18 at 1:13
@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.
– Santiago Hernàndez
Nov 23 '18 at 14:11
@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.
– Santiago Hernàndez
Nov 23 '18 at 14:11
add a comment |
1 Answer
1
active
oldest
votes
Use:
print (df_A)
date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9
print (df_B)
date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t
import pandas as pd
import numpy as np
# Split the strings into list.
df_B.Fruits = df_B.Fruits.str.split(', ')
# reindex and repeat on length of list
temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)
temp['Fruit'] = np.concatenate(df_B.Fruits.values)
df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)
print (df_C)
date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t
1
Thanks it's working!
– Santiago Hernàndez
Nov 23 '18 at 20:15
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53438964%2fmerge-two-dataframes-on-string-column-compound-string-column%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Use:
print (df_A)
date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9
print (df_B)
date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t
import pandas as pd
import numpy as np
# Split the strings into list.
df_B.Fruits = df_B.Fruits.str.split(', ')
# reindex and repeat on length of list
temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)
temp['Fruit'] = np.concatenate(df_B.Fruits.values)
df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)
print (df_C)
date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t
1
Thanks it's working!
– Santiago Hernàndez
Nov 23 '18 at 20:15
add a comment |
Use:
print (df_A)
date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9
print (df_B)
date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t
import pandas as pd
import numpy as np
# Split the strings into list.
df_B.Fruits = df_B.Fruits.str.split(', ')
# reindex and repeat on length of list
temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)
temp['Fruit'] = np.concatenate(df_B.Fruits.values)
df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)
print (df_C)
date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t
1
Thanks it's working!
– Santiago Hernàndez
Nov 23 '18 at 20:15
add a comment |
Use:
print (df_A)
date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9
print (df_B)
date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t
import pandas as pd
import numpy as np
# Split the strings into list.
df_B.Fruits = df_B.Fruits.str.split(', ')
# reindex and repeat on length of list
temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)
temp['Fruit'] = np.concatenate(df_B.Fruits.values)
df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)
print (df_C)
date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t
Use:
print (df_A)
date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9
print (df_B)
date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t
import pandas as pd
import numpy as np
# Split the strings into list.
df_B.Fruits = df_B.Fruits.str.split(', ')
# reindex and repeat on length of list
temp = df_B.reindex(df_B.index.repeat(df_B.Fruits.str.len())).drop('Fruits',1)
temp['Fruit'] = np.concatenate(df_B.Fruits.values)
df_C = df_A.merge(temp, on=['date','Fruit'], how='left').fillna(0)
print (df_C)
date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t
edited Nov 23 '18 at 0:55
answered Nov 23 '18 at 0:33
Abhi
2,480320
2,480320
1
Thanks it's working!
– Santiago Hernàndez
Nov 23 '18 at 20:15
add a comment |
1
Thanks it's working!
– Santiago Hernàndez
Nov 23 '18 at 20:15
1
1
Thanks it's working!
– Santiago Hernàndez
Nov 23 '18 at 20:15
Thanks it's working!
– Santiago Hernàndez
Nov 23 '18 at 20:15
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53438964%2fmerge-two-dataframes-on-string-column-compound-string-column%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
What is the total number of unique fruits occurring across df_A and df_B? You could convert them to one-hot or Categorical, instead of storing as string.
– smci
Nov 23 '18 at 1:03
1
df_B.Fruits
is a compound column. I would retitle this "Merge two dataframes on string column/ compound string column"– smci
Nov 23 '18 at 1:13
@smci, thanks for your response, the unique fruits number should be around 5000, i'll rename the question as you suggested too.
– Santiago Hernàndez
Nov 23 '18 at 14:11