So I have 2 dataframes that I'm joining by their redefined index which is the number we use to identify the study, when I'm joining them they look like this:
df1 (contains all study numbers):
Index | State | PS |
---|---|---|
1001 | CA | 0 |
1002 | NY | 0 |
1003 | NJ | 1 |
df2 (does not contain all study numbers and contains duplicates):
Index | Study |
---|---|
1001 | Active |
1002 | Active |
1002 | Closed |
I currently have df1 = df1.join(df2) which outputs:
Index | State | PS | Study |
---|---|---|---|
1001 | CA | 0 | Active |
1002 | NY | 0 | Active |
1002 | NY | 0 | Closed |
1003 | NJ | 1 |
In this example df, I'd like only the first instance of 1002 in df2 to be merged with df1. Assuming it has something to do with 'how' or 'on', but I don't understand the documentation well enough as I am pretty new to Pandas. Thanks! Desired output is:
Index | State | PS | Study |
---|---|---|---|
1001 | CA | 0 | Active |
1002 | NY | 0 | Active |
1003 | NJ | 1 |
CodePudding user response:
Try using drop_duplicates
with keep="first"
since it is sorted from newest to oldest. Then you merge on the key Index
df2 = df2.drop_duplicates(subset="Index", keep="first")
df = pd.merge(df1, df2, on="Index", how="left")
CodePudding user response:
import pandas as pd
dict1 = {
'State': ['CA', 'NY', 'NJ'],
'PS': [0, 0, 1]
}
dict2 = {
'Study': ['Active', 'Active', 'Closed'],
}
df1 = pd.DataFrame(data=dict1, index=[1001, 1002, 1003])
df2 = pd.DataFrame(data=dict2, index=[1001, 1002, 1002])
print(df1)
print(df2)
answer = df1.join(df2).drop_duplicates(subset=['State', 'PS'], keep='first')
print(answer)
Crucially, the drop_duplicates method should be able to handle this special case