I am wondering whether I can use something like contains across two DataFrames, similar to grep in R. When I use merge() or join(), rows get duplicated.
For example:
df1 = pd.DataFrame({'id':[1,2,3],'name':['hayley','jack','may'],'class':['H','M','S']})
df1
id | name | class |
---|---|---|
1 | hayley | H |
2 | jack | M |
3 | may | S |
df2 = pd.DataFrame({'number1':[4,5,6],'name1':['cat','chang','jason'],'class1':['H','H','S']})
df2
number1 | name1 | class1 |
---|---|---|
4 | cat | H |
5 | chang | H |
6 | jason | S |
I want to match df1['class'] against df2['class1']. If class and class1 hold the same value, the rows should be concatenated; if not, the df2 columns should be null.
Sorry for my English, I don't really know how to explain it.
This is what I want to produce:
id | name | class | number1 | name1 | class1 |
---|---|---|---|---|---|
1 | hayley | H | | | |
2 | jack | M | | | |
3 | may | S | 6 | jason | S |
However, if I use merge():
pd.merge(df1,df2,left_on='class', right_on='class1', how='left')
id | name | class | number1 | name1 | class1 |
---|---|---|---|---|---|
1 | hayley | H | 4 | cat | H |
1 | hayley | H | 5 | chang | H |
2 | jack | M | | | |
3 | may | S | 6 | jason | S |
I want to know how to get this result without the duplicates. Please help :'(
CodePudding user response:
Use a boolean mask and then pandas.DataFrame.merge
to match only the rows whose class1 value is not duplicated.
import pandas as pd
df1 = pd.DataFrame({'id':[1,2,3],'name':['hayley','jack','may'],'class':['H','M','S']})
df2 = pd.DataFrame({'number1':[4,5,6],'name1':['cat','chang','jason'],'class1':['H','H','S']})
m = df2.groupby('class1')['number1'].transform('count').eq(1)
out = df1.merge(df2.loc[m], left_on='class', right_on='class1', how='left')
>>> print(out)
id name class number1 name1 class1
0 1 hayley H NaN NaN NaN
1 2 jack M NaN NaN NaN
2 3 may S 6.0 jason S
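To make the mask less opaque, here is a minimal sketch that prints the intermediate values. It uses the same df1 and df2 as above; the `counts` variable and the final cast to the nullable `Int64` dtype are additions for illustration and are not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['hayley', 'jack', 'may'], 'class': ['H', 'M', 'S']})
df2 = pd.DataFrame({'number1': [4, 5, 6], 'name1': ['cat', 'chang', 'jason'], 'class1': ['H', 'H', 'S']})

# Size of each class1 group, broadcast back onto df2's rows
counts = df2.groupby('class1')['number1'].transform('count')
print(counts.tolist())  # [2, 2, 1]

# True only where the class1 value occurs exactly once in df2
m = counts.eq(1)
print(m.tolist())  # [False, False, True]

out = df1.merge(df2.loc[m], left_on='class', right_on='class1', how='left')

# Optional (assumption: integer display is wanted): number1 became float
# because of the unmatched rows; Int64 keeps integers and allows <NA>.
out['number1'] = out['number1'].astype('Int64')
print(out)
```

Only the 'S' row of df2 survives the mask, so the left merge can attach at most one df2 row to each df1 row.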
CodePudding user response:
Use DataFrame.drop_duplicates with keep=False
to remove every row whose class1 value appears two or more times,
and then use a left join:
df = df1.merge(df2.drop_duplicates('class1', keep=False),
               left_on='class',
               right_on='class1',
               how='left')
print (df)
id name class number1 name1 class1
0 1 hayley H NaN NaN NaN
1 2 jack M NaN NaN NaN
2 3 may S 6.0 jason S
Details:
print (df2.drop_duplicates('class1', keep=False))
number1 name1 class1
2 6 jason S
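As a hedged variation on the same idea, the sketch below selects the same rows with `Series.duplicated(keep=False)` and passes `validate='many_to_one'` to `merge`, which makes pandas raise `MergeError` if the right-hand keys are not unique. The names `dup`, `unique_rows`, and `raised` are illustrative and do not appear in the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['hayley', 'jack', 'may'], 'class': ['H', 'M', 'S']})
df2 = pd.DataFrame({'number1': [4, 5, 6], 'name1': ['cat', 'chang', 'jason'], 'class1': ['H', 'H', 'S']})

# Flag every row whose class1 value appears more than once
dup = df2['class1'].duplicated(keep=False)
unique_rows = df2.loc[~dup]

# With unique right-hand keys, a many-to-one left join cannot duplicate df1 rows
df = df1.merge(unique_rows, left_on='class', right_on='class1',
               how='left', validate='many_to_one')
print(df)

# On the unfiltered df2 the same check fails, because 'H' appears twice
raised = False
try:
    df1.merge(df2, left_on='class', right_on='class1',
              how='left', validate='many_to_one')
except pd.errors.MergeError as exc:
    raised = True
    print('merge rejected:', exc)
```

The `validate` argument does not change the result; it only guards against silently getting duplicated rows when the input data changes.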
CodePudding user response:
Have you tried using drop_duplicates()? Try this code:
import pandas as pd
df1 = pd.DataFrame({'id':[1,2,3],'name':['hayley','jack','may'],'class':['H','M','S']})
df2 = pd.DataFrame({'number1':[4,5,6],'name1':['cat','chang','jason'],'class1':['H','H','Z']})
res = pd.concat([df1, df2]).drop_duplicates().reset_index()
print(res)
Result (pd.concat stacks the two frames row-wise, so no df1 row is matched against df2 on class):
index id name class number1 name1 class1
0 0 1.0 hayley H NaN NaN NaN
1 1 2.0 jack M NaN NaN NaN
2 2 3.0 may S NaN NaN NaN
3 0 NaN NaN NaN 4.0 cat H
4 1 NaN NaN NaN 5.0 chang H
5 2 NaN NaN NaN 6.0 jason Z