How to use contains with multiple words-CodePudding

I am just wondering Can i use contains with 2 DataFrames? like grep in R. When i use merge() or join(), the row makes duplicated. this is for the example

For example

df1 = pd.DataFrame({'id':[1,2,3],'name':['hayley','jack','may'],'class':['H','M','S']})

df1

id	name	class
1	hayley	H
2	jack	M
3	may	S

df2 = pd.DataFrame({'number1':[4,5,6],'name1':['cat','chang','jason'],'class1':['H','H','S']})

df2

id1	name1	class1
4	cat	H
5	chang	H
6	jason	S

and i want to match df1['class'] and df2['class1']. if the class, class1 rows has same word, concat. if not null.

sorry for my English. i don't really know how to explain it....

this is what i want to make it!!

id	name	class	id1	name1	class1
1	hayley	H
2	jack	M
3	may	S	6	json	S

however, if use merge()

pd.merge(df1,df2,left_on='class', right_on='class1', how='left')

id	name	class	id1	name1	class1
1	hayley	H	4	cat	H
1	hayley	H	4	chang	H
2	jack	M
3	may	S	6	json	S

i want to know how to make that without duplicate. plese help :'(

CodePudding user response：

Use a boolean mask and then pandas.DataFrame.join to match only the rows with non duplicates values.

import pandas as pd

df1 = pd.DataFrame({'id':[1,2,3],'name':['hayley','jack','may'],'class':['H','M','S']})
df2 = pd.DataFrame({'number1':[4,5,6],'name1':['cat','chang','jason'],'class1':['H','H','S']})

m = df2.groupby('class1')['number1'].transform('count').eq(1)

out = df1.merge(df2.loc[m], left_on='class', right_on='class1', how='left')

`>>> print(out)`

   id    name class  number1  name1 class1
0   1  hayley     H      NaN    NaN    NaN
1   2    jack     M      NaN    NaN    NaN
2   3     may     S      6.0  jason      S

CodePudding user response：

Use DataFrame.drop_duplicates for remove rows with 2 or more class1 and then use left join:

df = df1.merge(df2.drop_duplicates('class1', keep=False),
               left_on='class',
               right_on='class1', 
               how='left')
print (df)
   id    name class  number1  name1 class1
0   1  hayley     H      NaN    NaN    NaN
1   2    jack     M      NaN    NaN    NaN
2   3     may     S      6.0  jason      S

Details:

print (df2.drop_duplicates('class1', keep=False))
   number1  name1 class1
2        6  jason      S

CodePudding user response：

Have you tried using drop_duplicates()? Try this code:

df1 = pd.DataFrame({'id':[1,2,3],'name':['hayley','jack','may'],'class':['H','M','S']})
df2 = pd.DataFrame({'number1':[4,5,6],'name1':['cat','chang','jason'],'class1':['H','H','Z']})
res = pd.concat([df1, df2]).drop_duplicates().reset_index()

print(res)

Result:

   index   id    name class  number1  name1 class1
0      0  1.0  hayley     H      NaN    NaN    NaN
1      1  2.0    jack     M      NaN    NaN    NaN
2      2  3.0     may     S      NaN    NaN    NaN
3      0  NaN     NaN   NaN      4.0    cat      H
4      1  NaN     NaN   NaN      5.0  chang      H
5      2  NaN     NaN   NaN      6.0  jason      Z