check if two columns containing sentences match and return only the matches for each dataframe-CodePudding

I'm trying to match two columns in two different dataframes using this:

res = mergedStuff = pd.merge(df1, df2, on=['text'])

However for some reason, it also returns rows that are not a match.

Ideally, I would return two new dataframes containing only the rows that match in each one.

An example would look like this:

df1 = text                 | feature1 | feature2 | feature3 
     bananas are great     | 0        |  1       | 0  
     apples are better     | 1        |  1       | 0   
     grapes are okay       | 0        |  0       | 1  
     ice cream for the win | 1        |  0       | 1   


 df2 =      text               | feature1 | feature2 | feature3 
         bananas are great     | 0        |  1       | 0  
         apples are better     | 1        |  1       | 0   
         berries are yummy     | 0        |  0       | 1  
         ice cream for the win | 0        |  1       | 1

Ideally, I would now return each dataframe but only with the rows that matched on the text column.

Expected result:

df1 =     text                 | feature1 | feature2 | feature3 
         bananas are great     | 0        |  1       | 0  
         apples are better     | 1        |  1       | 0   
         ice cream for the win | 1        |  0       | 1 




 df2 =      text               | feature1 | feature2 | feature3 
         bananas are great     | 0        |  1       | 0  
         apples are better     | 1        |  1       | 0   
         ice cream for the win | 0        |  1       | 1

CodePudding user response：

You could use set.intersection; then filter the common texts:

common = set(df1['text']) & set(df2['text'])
df1 = df1[df1['text'].isin(common)]
df2 = df2[df2['text'].isin(common)]

Then df1 looks like:

                    text  feature1  feature2  feature3 
0      bananas are great         0         1         0
1      apples are better         1         1         0
3  ice cream for the win         1         0         1

and df2 looks like:

                    text  feature1  feature2  feature3
0      bananas are great         0         1         0
1      apples are better         1         1         0
3  ice cream for the win         0         1         1