Home > Blockchain >  check if two columns containing sentences match and return only the matches for each dataframe
check if two columns containing sentences match and return only the matches for each dataframe

Time:04-05

I'm trying to match two columns in two different dataframes using this:

res = mergedStuff = pd.merge(df1, df2, on=['text'])

However for some reason, it also returns rows that are not a match.

Ideally, I would return two new dataframes containing only the rows that match in each one.

An example would look like this:

df1 = text                 | feature1 | feature2 | feature3 
     bananas are great     | 0        |  1       | 0  
     apples are better     | 1        |  1       | 0   
     grapes are okay       | 0        |  0       | 1  
     ice cream for the win | 1        |  0       | 1   


 df2 =      text               | feature1 | feature2 | feature3 
         bananas are great     | 0        |  1       | 0  
         apples are better     | 1        |  1       | 0   
         berries are yummy     | 0        |  0       | 1  
         ice cream for the win | 0        |  1       | 1  

Ideally, I would now return each dataframe but only with the rows that matched on the text column.

Expected result:

df1 =     text                 | feature1 | feature2 | feature3 
         bananas are great     | 0        |  1       | 0  
         apples are better     | 1        |  1       | 0   
         ice cream for the win | 1        |  0       | 1 




 df2 =      text               | feature1 | feature2 | feature3 
         bananas are great     | 0        |  1       | 0  
         apples are better     | 1        |  1       | 0   
         ice cream for the win | 0        |  1       | 1  

CodePudding user response:

You could use set.intersection; then filter the common texts:

common = set(df1['text']) & set(df2['text'])
df1 = df1[df1['text'].isin(common)]
df2 = df2[df2['text'].isin(common)]

Then df1 looks like:

                    text  feature1  feature2  feature3 
0      bananas are great         0         1         0
1      apples are better         1         1         0
3  ice cream for the win         1         0         1

and df2 looks like:

                    text  feature1  feature2  feature3
0      bananas are great         0         1         0
1      apples are better         1         1         0
3  ice cream for the win         0         1         1
  • Related