Home > Enterprise >  How to use contains with multiple words
How to use contains with multiple words

Time:09-20

I am just wondering Can i use contains with 2 DataFrames? like grep in R. When i use merge() or join(), the row makes duplicated. this is for the example

For example

df1 = pd.DataFrame({'id':[1,2,3],'name':['hayley','jack','may'],'class':['H','M','S']})

df1

id name class
1 hayley H
2 jack M
3 may S
df2 = pd.DataFrame({'number1':[4,5,6],'name1':['cat','chang','jason'],'class1':['H','H','S']})

df2

id1 name1 class1
4 cat H
5 chang H
6 jason S

and i want to match df1['class'] and df2['class1']. if the class, class1 rows has same word, concat. if not null.

sorry for my English. i don't really know how to explain it....

this is what i want to make it!!

id name class id1 name1 class1
1 hayley H
2 jack M
3 may S 6 json S

however, if use merge()

pd.merge(df1,df2,left_on='class', right_on='class1', how='left')
id name class id1 name1 class1
1 hayley H 4 cat H
1 hayley H 4 chang H
2 jack M
3 may S 6 json S

i want to know how to make that without duplicate. plese help :'(

CodePudding user response:

Use a boolean mask and then pandas.DataFrame.join to match only the rows with non duplicates values.

import pandas as pd

df1 = pd.DataFrame({'id':[1,2,3],'name':['hayley','jack','may'],'class':['H','M','S']})
df2 = pd.DataFrame({'number1':[4,5,6],'name1':['cat','chang','jason'],'class1':['H','H','S']})

m = df2.groupby('class1')['number1'].transform('count').eq(1)

out = df1.merge(df2.loc[m], left_on='class', right_on='class1', how='left')

>>> print(out)

   id    name class  number1  name1 class1
0   1  hayley     H      NaN    NaN    NaN
1   2    jack     M      NaN    NaN    NaN
2   3     may     S      6.0  jason      S

CodePudding user response:

Use DataFrame.drop_duplicates for remove rows with 2 or more class1 and then use left join:

df = df1.merge(df2.drop_duplicates('class1', keep=False),
               left_on='class',
               right_on='class1', 
               how='left')
print (df)
   id    name class  number1  name1 class1
0   1  hayley     H      NaN    NaN    NaN
1   2    jack     M      NaN    NaN    NaN
2   3     may     S      6.0  jason      S

Details:

print (df2.drop_duplicates('class1', keep=False))
   number1  name1 class1
2        6  jason      S

CodePudding user response:

Have you tried using drop_duplicates()? Try this code:

df1 = pd.DataFrame({'id':[1,2,3],'name':['hayley','jack','may'],'class':['H','M','S']})
df2 = pd.DataFrame({'number1':[4,5,6],'name1':['cat','chang','jason'],'class1':['H','H','Z']})
res = pd.concat([df1, df2]).drop_duplicates().reset_index()

print(res)

Result:

   index   id    name class  number1  name1 class1
0      0  1.0  hayley     H      NaN    NaN    NaN
1      1  2.0    jack     M      NaN    NaN    NaN
2      2  3.0     may     S      NaN    NaN    NaN
3      0  NaN     NaN   NaN      4.0    cat      H
4      1  NaN     NaN   NaN      5.0  chang      H
5      2  NaN     NaN   NaN      6.0  jason      Z
  • Related