Compare per row values in four different columns


I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'DEC': ['Food','Kyl','Food','Fashion'],
                   'JAN': ['Food','Kyl','Kyl','Food'],
                   'FEB': ['Food','Kyl','Kyl','Food'],
                   'MAR': ['Food','Kyl','Kyl','Fashion'],
                   'COUNTS': [4988, 1976, 797, 613]})

DEC     JAN     FEB     MAR     COUNTS
Food    Food    Food    Food    4988
Kyl     Kyl     Kyl     Kyl     1976
Food    Kyl     Kyl     Kyl     797
Fashion Food    Food    Fashion 613

What I want to do is compare the four columns DEC-MAR row by row and create a new column that contains either the single value when all four are the same, or the two, three, or four distinct values joined together where they differ.

So, the desired output DataFrame is

DEC     JAN     FEB     MAR     COUNTS   COMPARISON
Food    Food    Food    Food    4988     Food
Kyl     Kyl     Kyl     Kyl     1976     Kyl
Food    Kyl     Kyl     Kyl     797      Food-Kyl
Fashion Food    Food    Fashion 613      Fashion-Food

I thought I could do it by defining a function that uses np.where to make the comparisons, but there are so many pairwise comparisons to go through that it wouldn't be efficient (i.e. Col1==Col2 or Col1==Col3 or Col1==Col4 or Col2==Col3, etc.).

This is what I started, but I gave up because of all the manual work, hoping for a better solution:

def group_combs(row):
    # all four months equal -> return the single shared value
    if (row['DEC'] == row['JAN']) and \
       (row['DEC'] == row['FEB']) and \
       (row['DEC'] == row['MAR']):
        val = row['DEC']
    elif row['FEB'] == row['MAR']:
        val = row['FEB']
    else:
        val = -1
    return val

Is there a better way to do this other than comparing all columns manually?

Thanks in advance!!

CodePudding user response:

Use a list comprehension that joins the deduplicated values per row, which improves performance:

df['COMPARISON'] = ['-'.join(dict.fromkeys(x)) for x in df.drop('COUNTS', axis=1).to_numpy()]
print(df)
       DEC   JAN   FEB      MAR  COUNTS    COMPARISON
0     Food  Food  Food     Food    4988          Food
1      Kyl   Kyl   Kyl      Kyl    1976           Kyl
2     Food   Kyl   Kyl      Kyl     797      Food-Kyl
3  Fashion  Food  Food  Fashion     613  Fashion-Food
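
This works because dict.fromkeys keeps only the first occurrence of each value while preserving insertion order (guaranteed since Python 3.7), and '-'.join iterates over the dict's keys. A minimal illustration:

# dict.fromkeys deduplicates while keeping first-seen order
print(list(dict.fromkeys(['Food', 'Kyl', 'Kyl', 'Kyl'])))      # ['Food', 'Kyl']
print('-'.join(dict.fromkeys(['Fashion', 'Food', 'Fashion']))) # Fashion-Food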

Timings:

#4k rows
df = pd.concat([df] * 1000, ignore_index=True)

In [135]: %timeit df['COMPARISON'] = (df.drop(columns='COUNTS').agg(lambda s: '-'.join(dict.fromkeys(s)), axis=1))
176 ms ± 6.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [136]: %timeit df['COMPARISON'] = ['-'.join(dict.fromkeys(x)) for x in df.drop('COUNTS', axis=1).to_numpy()]
15.9 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

#40k rows
df = pd.concat([df] * 10000, ignore_index=True)


In [138]: %timeit df['COMPARISON'] = (df.drop(columns='COUNTS').agg(lambda s: '-'.join(dict.fromkeys(s)), axis=1))
1.68 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [139]: %timeit df['COMPARISON'] = ['-'.join(dict.fromkeys(x)) for x in df.drop('COUNTS', axis=1).to_numpy()]
87.7 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
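
As a side note, here is a sketch of a variant that selects the month columns explicitly instead of dropping COUNTS, so that any columns added to the DataFrame later don't leak into the comparison (this assumes the months to compare are exactly these four):

# select the month columns by name rather than dropping COUNTS
months = ['DEC', 'JAN', 'FEB', 'MAR']
df['COMPARISON'] = ['-'.join(dict.fromkeys(x)) for x in df[months].to_numpy()]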

CodePudding user response:

Drop the unwanted column with drop, aggregate the unique values (in order) per row with agg along axis=1 and dict.fromkeys, then join them back:

df['COMPARISON'] = (df
                   .drop(columns='COUNTS')
                   .agg(lambda s: '-'.join(dict.fromkeys(s)), axis=1)
                   )

output:

       DEC   JAN   FEB      MAR  COUNTS    COMPARISON
0     Food  Food  Food     Food    4988          Food
1      Kyl   Kyl   Kyl      Kyl    1976           Kyl
2     Food   Kyl   Kyl      Kyl     797      Food-Kyl
3  Fashion  Food  Food  Fashion     613  Fashion-Food
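
To see what the lambda operates on: with axis=1, agg passes each row to the lambda as a Series, and dict.fromkeys deduplicates its values in order. A small sketch using row 2 from the example above:

import pandas as pd

# row 2 as agg(..., axis=1) would pass it to the lambda
s = pd.Series(['Food', 'Kyl', 'Kyl', 'Kyl'],
              index=['DEC', 'JAN', 'FEB', 'MAR'])
print('-'.join(dict.fromkeys(s)))  # Food-Kyl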