I have two data frames (df1 and df2). After merging them, I find two duplicate rows (same values in the three columns ["ID", "City", "Year"]). I would like to keep the duplicate row that comes from df2.
import pandas as pd
data1 = {
    'ID': [7, 2],
    'City': ["Berlin", "Paris"],
    'Year': [2012, 2000],
    'Number': [62, 43],
}
data2 = {
    'ID': [7, 5],
    'City': ["Berlin", "London"],
    'Year': [2012, 2019],
    'Number': [60, 100],
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df_merged = df1.append(df2)
Is there a way to do this?
Expected output:
ID City Year Number
0 2 Paris 2000 43
1 7 Berlin 2012 60
2 5 London 2019 100
CodePudding user response:
new = (pd.concat([df1, df2])
.drop_duplicates(subset=["ID", "City", "Year"],
keep="last",
ignore_index=True))
append
is deprecated and will be removed in a future pandas release, so use pd.concat
there instead. Then call drop_duplicates
over the key columns with keep="last"
; since df2's rows come after df1's in the concatenation, the "last" occurrence of each duplicate is the one from df2:
In [376]: df1
Out[376]:
ID City Year Number
0 7 Berlin 2012 62
1 2 Paris 2000 43
In [377]: df2
Out[377]:
ID City Year Number
0 7 Berlin 2012 60
1 5 London 2019 100
In [378]: (pd.concat([df1, df2])
...: .drop_duplicates(subset=["ID", "City", "Year"],
...: keep="last",
...: ignore_index=True))
Out[378]:
ID City Year Number
0 2 Paris 2000 43
1 7 Berlin 2012 60
2 5 London 2019 100
ignore_index=True resets the index back to 0, 1, 2 after drop_duplicates
disturbs it.
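As an alternative sketch (not from the answer above, just another way to express "prefer df2 on conflict"): filter out the df1 rows whose key already appears in df2, then concatenate. This avoids relying on row order and keep="last":

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [7, 2],
    "City": ["Berlin", "Paris"],
    "Year": [2012, 2000],
    "Number": [62, 43],
})
df2 = pd.DataFrame({
    "ID": [7, 5],
    "City": ["Berlin", "London"],
    "Year": [2012, 2019],
    "Number": [60, 100],
})

key = ["ID", "City", "Year"]

# Keep only the df1 rows whose key combination is absent from df2,
# then append all of df2 — df2 always wins on conflicting keys.
mask = ~df1.set_index(key).index.isin(df2.set_index(key).index)
result = pd.concat([df1[mask], df2], ignore_index=True)
print(result)
```

This produces the same expected output (Paris from df1, Berlin and London from df2) and makes the precedence of df2 explicit rather than implicit in concatenation order.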