I'm trying to match two columns in two different dataframes using this:
res = mergedStuff = pd.merge(df1, df2, on=['text'])
However for some reason, it also returns rows that are not a match.
Ideally, I would return two new dataframes containing only the rows that match in each one.
An example would look like this:
df1 = text | feature1 | feature2 | feature3
bananas are great | 0 | 1 | 0
apples are better | 1 | 1 | 0
grapes are okay | 0 | 0 | 1
ice cream for the win | 1 | 0 | 1
df2 = text | feature1 | feature2 | feature3
bananas are great | 0 | 1 | 0
apples are better | 1 | 1 | 0
berries are yummy | 0 | 0 | 1
ice cream for the win | 0 | 1 | 1
Ideally, I would now return each dataframe but only with the rows that matched on the text
column.
Expected result:
df1 = text | feature1 | feature2 | feature3
bananas are great | 0 | 1 | 0
apples are better | 1 | 1 | 0
ice cream for the win | 1 | 0 | 1
df2 = text | feature1 | feature2 | feature3
bananas are great | 0 | 1 | 0
apples are better | 1 | 1 | 0
ice cream for the win | 0 | 1 | 1
CodePudding user response:
You could use set.intersection
; then filter the common texts:
common = set(df1['text']) & set(df2['text'])
df1 = df1[df1['text'].isin(common)]
df2 = df2[df2['text'].isin(common)]
Then df1
looks like:
text feature1 feature2 feature3
0 bananas are great 0 1 0
1 apples are better 1 1 0
3 ice cream for the win 1 0 1
and df2
looks like:
text feature1 feature2 feature3
0 bananas are great 0 1 0
1 apples are better 1 1 0
3 ice cream for the win 0 1 1