Pandas Join creates unwanted duplicate, only want first instance-CodePudding

So I have 2 dataframes that I'm joining by their redefined index which is the number we use to identify the study, when I'm joining them they look like this:

df1 (contains all study numbers):

Index	State	PS
1001	CA	0
1002	NY	0
1003	NJ	1

df2 (does not contain all study numbers and contains duplicates):

Index	Study
1001	Active
1002	Active
1002	Closed

I currently have df1 = df1.join(df2) which outputs:

Index	State	PS	Study
1001	CA	0	Active
1002	NY	0	Active
1002	NY	0	Closed
1003	NJ	1

In this example df, I'd like only the first instance of 1002 in df2 to be merged with df1. Assuming it has something to do with 'how' or 'on', but I don't understand the documentation well enough as I am pretty new to Pandas. Thanks! Desired output is:

Index	State	PS	Study
1001	CA	0	Active
1002	NY	0	Active
1003	NJ	1

CodePudding user response：

Try using drop_duplicates with keep="first" since it is sorted from newest to oldest. Then you merge on the key Index

df2 = df2.drop_duplicates(subset="Index", keep="first")
df = pd.merge(df1, df2, on="Index", how="left")

CodePudding user response：

import pandas as pd

dict1 = {
    'State': ['CA', 'NY', 'NJ'],
    'PS': [0, 0, 1]
}
dict2 = {
    'Study': ['Active', 'Active', 'Closed'],
}

df1 = pd.DataFrame(data=dict1, index=[1001, 1002, 1003])
df2 = pd.DataFrame(data=dict2, index=[1001, 1002, 1002])

print(df1)
print(df2)

answer = df1.join(df2).drop_duplicates(subset=['State', 'PS'], keep='first')

print(answer)

Crucially, the drop_duplicates method should be able to handle this special case