Home > front end >  Pandas Join creates unwanted duplicate, only want first instance
Pandas Join creates unwanted duplicate, only want first instance

Time:09-24

So I have 2 dataframes that I'm joining by their redefined index which is the number we use to identify the study, when I'm joining them they look like this:

df1 (contains all study numbers):

Index State PS
1001 CA 0
1002 NY 0
1003 NJ 1

df2 (does not contain all study numbers and contains duplicates):

Index Study
1001 Active
1002 Active
1002 Closed

I currently have df1 = df1.join(df2) which outputs:

Index State PS Study
1001 CA 0 Active
1002 NY 0 Active
1002 NY 0 Closed
1003 NJ 1

In this example df, I'd like only the first instance of 1002 in df2 to be merged with df1. Assuming it has something to do with 'how' or 'on', but I don't understand the documentation well enough as I am pretty new to Pandas. Thanks! Desired output is:

Index State PS Study
1001 CA 0 Active
1002 NY 0 Active
1003 NJ 1

CodePudding user response:

Try using drop_duplicates with keep="first" since it is sorted from newest to oldest. Then you merge on the key Index

df2 = df2.drop_duplicates(subset="Index", keep="first")
df = pd.merge(df1, df2, on="Index", how="left")

CodePudding user response:

import pandas as pd

dict1 = {
    'State': ['CA', 'NY', 'NJ'],
    'PS': [0, 0, 1]
}
dict2 = {
    'Study': ['Active', 'Active', 'Closed'],
}

df1 = pd.DataFrame(data=dict1, index=[1001, 1002, 1003])
df2 = pd.DataFrame(data=dict2, index=[1001, 1002, 1002])

print(df1)
print(df2)

answer = df1.join(df2).drop_duplicates(subset=['State', 'PS'], keep='first')

print(answer)

Crucially, the drop_duplicates method should be able to handle this special case

  • Related