Home > database >  Finding differences in two datasets, creating a 'check' variable
Finding differences in two datasets, creating a 'check' variable

Time:11-09

I have two datasets, one older than the other, and I would like to compare the differences between the two datasets, creating a new dataset that may include this information. Both the datasets have the same columns. For instance, dataset 1:

Cust1              Neigh_Cust2  Age   Net_Value
Mickey Mouse       Minnie       30     50000
Mickey Mouse       Snoopy       30     24000
Mickey Mouse       Duffy Duck   30     21000
Minnie             Mickey Mouse 25     30000
Minnie             Batman       25     12000

dataset 2 (most updated):

Cust1              Neigh_Cust2  Age   Net_Value
Mickey Mouse       Batman       30     23000
Mickey Mouse       Superman     30     24000
Mickey Mouse       Duffy Duck   30     21000
Minnie             Mickey Mouse 25     30000
Minnie             Batman       25     12000
Batman             Mickey Mouse 48     53000

The length of the two datasets might be different. My expected output would be

Cust1              Neigh_Cust2  Age   Net_Value  New/Missing?
Mickey Mouse       Batman       30     23000        New
Mickey Mouse       Superman     30     24000        New
Batman             Mickey Mouse 48     53000        New
Mickey Mouse       Minnie       30     50000        Missing
Mickey Mouse       Snoopy       30     24000        Missing

I have thought of using isin but I have two fields (Cust1 and Neigh_Cust2) that I am interested in to see which observations are included or not in the other dataset so maybe this is not the best way to look at the problem and get the expected output.

CodePudding user response:

I guess you could use merge with indicator=True:

(df1.merge(df2, on=list(df1.columns), indicator=True, how='outer')
    .query('_merge != "both"')
    .replace({'_merge': {'left_only': 'Missing', 'right_only': 'New'}})
    .rename(columns={'_merge': 'New/Missing?'})
)

NB. I am comparing here on all columns, but you can change this using the on parameter

output:

          Cust1   Neigh_Cust2  Age  Net_Value New/Missing?
0  Mickey Mouse        Minnie   30      50000      Missing
1  Mickey Mouse        Snoopy   30      24000      Missing
5  Mickey Mouse        Batman   30      23000          New
6  Mickey Mouse      Superman   30      24000          New
7        Batman  Mickey Mouse   48      53000          New

CodePudding user response:

For compare by some columns use:

a = df1.set_index(['Cust1','Neigh_Cust2']).index
b = df2.set_index(['Cust1','Neigh_Cust2']).index

df = pd.concat([df2[~b.isin(a)].assign(NewOrMissing = 'New'),
                df1[~a.isin(b)].assign(NewOrMissing = 'Missing')])
print (df)
          Cust1   Neigh_Cust2  Age  Net_Value NewOrMissing
0  Mickey Mouse        Batman   30      23000          New
1  Mickey Mouse      Superman   30      24000          New
5        Batman  Mickey Mouse   48      53000          New
0  Mickey Mouse        Minnie   30      50000      Missing
1  Mickey Mouse        Snoopy   30      24000      Missing

If no duplicates per Cust1, Neigh_Cust2 in original DataFrames, so duplicates are created only by join:

df = (pd.concat([df1.assign(NewOrMissing = 'Missing'), 
                df2.assign(NewOrMissing = 'New')])
        .drop_duplicates(['Cust1','Neigh_Cust2'], keep=False))
print (df)
          Cust1   Neigh_Cust2  Age  Net_Value NewOrMissing
0  Mickey Mouse        Minnie   30      50000      Missing
1  Mickey Mouse        Snoopy   30      24000      Missing
0  Mickey Mouse        Batman   30      23000          New
1  Mickey Mouse      Superman   30      24000          New
5        Batman  Mickey Mouse   48      53000          New
  • Related