I have two datasets, one older than the other, and I would like to compare the differences between the two datasets, creating a new dataset that may include this information. Both the datasets have the same columns. For instance, dataset 1:
Cust1 Neigh_Cust2 Age Net_Value
Mickey Mouse Minnie 30 50000
Mickey Mouse Snoopy 30 24000
Mickey Mouse Duffy Duck 30 21000
Minnie Mickey Mouse 25 30000
Minnie Batman 25 12000
dataset 2 (most updated):
Cust1 Neigh_Cust2 Age Net_Value
Mickey Mouse Batman 30 23000
Mickey Mouse Superman 30 24000
Mickey Mouse Duffy Duck 30 21000
Minnie Mickey Mouse 25 30000
Minnie Batman 25 12000
Batman Mickey Mouse 48 53000
The length of the two datasets might be different. My expected output would be
Cust1 Neigh_Cust2 Age Net_Value New/Missing?
Mickey Mouse Batman 30 23000 New
Mickey Mouse Superman 30 24000 New
Batman Mickey Mouse 48 53000 New
Mickey Mouse Minnie 30 50000 Missing
Mickey Mouse Snoopy 30 24000 Missing
I have thought of using isin but I have two fields (Cust1
and Neigh_Cust2
) that I am interested in to see which observations are included or not in the other dataset so maybe this is not the best way to look at the problem and get the expected output.
CodePudding user response:
I guess you could use merge
with indicator=True
:
(df1.merge(df2, on=list(df1.columns), indicator=True, how='outer')
.query('_merge != "both"')
.replace({'_merge': {'left_only': 'Missing', 'right_only': 'New'}})
.rename(columns={'_merge': 'New/Missing?'})
)
NB. I am comparing here on all columns, but you can change this using the on
parameter
output:
Cust1 Neigh_Cust2 Age Net_Value New/Missing?
0 Mickey Mouse Minnie 30 50000 Missing
1 Mickey Mouse Snoopy 30 24000 Missing
5 Mickey Mouse Batman 30 23000 New
6 Mickey Mouse Superman 30 24000 New
7 Batman Mickey Mouse 48 53000 New
CodePudding user response:
For compare by some columns use:
a = df1.set_index(['Cust1','Neigh_Cust2']).index
b = df2.set_index(['Cust1','Neigh_Cust2']).index
df = pd.concat([df2[~b.isin(a)].assign(NewOrMissing = 'New'),
df1[~a.isin(b)].assign(NewOrMissing = 'Missing')])
print (df)
Cust1 Neigh_Cust2 Age Net_Value NewOrMissing
0 Mickey Mouse Batman 30 23000 New
1 Mickey Mouse Superman 30 24000 New
5 Batman Mickey Mouse 48 53000 New
0 Mickey Mouse Minnie 30 50000 Missing
1 Mickey Mouse Snoopy 30 24000 Missing
If no duplicates per Cust1, Neigh_Cust2
in original DataFrames, so duplicates are created only by join:
df = (pd.concat([df1.assign(NewOrMissing = 'Missing'),
df2.assign(NewOrMissing = 'New')])
.drop_duplicates(['Cust1','Neigh_Cust2'], keep=False))
print (df)
Cust1 Neigh_Cust2 Age Net_Value NewOrMissing
0 Mickey Mouse Minnie 30 50000 Missing
1 Mickey Mouse Snoopy 30 24000 Missing
0 Mickey Mouse Batman 30 23000 New
1 Mickey Mouse Superman 30 24000 New
5 Batman Mickey Mouse 48 53000 New