I have the following DataFrame:
import pandas as pd

df = pd.DataFrame({'DEC': ['Food', 'Kyl', 'Food', 'Fashion'],
                   'JAN': ['Food', 'Kyl', 'Kyl', 'Food'],
                   'FEB': ['Food', 'Kyl', 'Kyl', 'Food'],
                   'MAR': ['Food', 'Kyl', 'Kyl', 'Fashion'],
                   'COUNTS': [4988, 1976, 797, 613]})
    DEC   JAN   FEB      MAR  COUNTS
   Food  Food  Food     Food    4988
    Kyl   Kyl   Kyl      Kyl    1976
   Food   Kyl   Kyl      Kyl     797
Fashion  Food  Food  Fashion     613
What I want to do is a per-row comparison of the four columns DEC-MAR, creating a new column that holds either the single value when all four are the same, or the distinct values (two, three, or four of them) joined together when they differ.
So, the desired output DataFrame is
    DEC   JAN   FEB      MAR  COUNTS    COMPARISON
   Food  Food  Food     Food    4988          Food
    Kyl   Kyl   Kyl      Kyl    1976           Kyl
   Food   Kyl   Kyl      Kyl     797      Food-Kyl
Fashion  Food  Food  Fashion     613  Fashion-Food
I thought I could do it by defining a function that uses np.where
and makes the comparisons, but there are so many comparisons to go through that it wouldn't be efficient (i.e. Col1==Col2 or Col1==Col3 or Col1==Col4 or Col2==Col3, etc.).
This is what I started, but I gave up because of all the manual work; I'm hoping for a better solution:
def group_combs(row):
    # All four months equal: keep the single value
    if (row['DEC'] == row['JAN']) and \
       (row['DEC'] == row['FEB']) and \
       (row['DEC'] == row['MAR']):
        val = row['DEC']
    elif row['FEB'] == row['MAR']:
        val = row['FEB']
    else:
        val = -1
    return val
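For reference, this is how I would have wired it up (a row-wise apply, which is exactly the slow, manual approach I want to avoid):

# calls group_combs once per row
df['COMPARISON'] = df.apply(group_combs, axis=1)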
Is there a better way to do this other than comparing all columns manually?
Thanks in advance!!
CodePudding user response:
Use a list comprehension that joins the deduplicated values (via dict.fromkeys) for improved performance:
df['COMPARISON'] = ['-'.join(dict.fromkeys(x)) for x in df.drop('COUNTS', axis=1).to_numpy()]
print (df)
       DEC   JAN   FEB      MAR  COUNTS    COMPARISON
0     Food  Food  Food     Food    4988          Food
1      Kyl   Kyl   Kyl      Kyl    1976           Kyl
2     Food   Kyl   Kyl      Kyl     797      Food-Kyl
3  Fashion  Food  Food  Fashion     613  Fashion-Food
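The trick is that dict.fromkeys drops duplicates while keeping first-seen order (dicts preserve insertion order in Python 3.7+), and joining a dict joins its keys:

row = ['Food', 'Kyl', 'Kyl', 'Kyl']
print(list(dict.fromkeys(row)))      # ['Food', 'Kyl'] - duplicates removed, order kept
print('-'.join(dict.fromkeys(row)))  # Food-Kyl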
Timings (each scale-up starts from the original 4-row frame):
#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [135]: %timeit df['COMPARISON'] = (df.drop(columns='COUNTS').agg(lambda s: '-'.join(dict.fromkeys(s)), axis=1))
176 ms ± 6.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [136]: %timeit df['COMPARISON'] = ['-'.join(dict.fromkeys(x)) for x in df.drop('COUNTS', axis=1).to_numpy()]
15.9 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
#40k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [138]: %timeit df['COMPARISON'] = (df.drop(columns='COUNTS').agg(lambda s: '-'.join(dict.fromkeys(s)), axis=1))
1.68 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [139]: %timeit df['COMPARISON'] = ['-'.join(dict.fromkeys(x)) for x in df.drop('COUNTS', axis=1).to_numpy()]
87.7 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
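For anyone without IPython's %timeit, here is a minimal sketch of the same measurement using only the standard-library timeit module (the setup string rebuilds the sample frame at the 4k-row scale):

import timeit

setup = """
import pandas as pd
df = pd.DataFrame({'DEC': ['Food', 'Kyl', 'Food', 'Fashion'],
                   'JAN': ['Food', 'Kyl', 'Kyl', 'Food'],
                   'FEB': ['Food', 'Kyl', 'Kyl', 'Food'],
                   'MAR': ['Food', 'Kyl', 'Kyl', 'Fashion'],
                   'COUNTS': [4988, 1976, 797, 613]})
df = pd.concat([df] * 1000, ignore_index=True)  # 4k rows
"""
stmt = "['-'.join(dict.fromkeys(x)) for x in df.drop('COUNTS', axis=1).to_numpy()]"
# average seconds per run over 10 runs
print(timeit.timeit(stmt, setup=setup, number=10) / 10)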
CodePudding user response:
Drop the unwanted column with drop, aggregate the unique values (in order) with agg and dict.fromkeys, then join them back:
df['COMPARISON'] = (df
    .drop(columns='COUNTS')
    .agg(lambda s: '-'.join(dict.fromkeys(s)), axis=1)
)
output:
       DEC   JAN   FEB      MAR  COUNTS    COMPARISON
0     Food  Food  Food     Food    4988          Food
1      Kyl   Kyl   Kyl      Kyl    1976           Kyl
2     Food   Kyl   Kyl      Kyl     797      Food-Kyl
3  Fashion  Food  Food  Fashion     613  Fashion-Food
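This yields the same values as the list-comprehension approach above; as a quick sanity check (a sketch that selects the month columns explicitly, so it works even after COMPARISON has been added):

cols = ['DEC', 'JAN', 'FEB', 'MAR']
fast = ['-'.join(dict.fromkeys(x)) for x in df[cols].to_numpy()]
slow = df[cols].agg(lambda s: '-'.join(dict.fromkeys(s)), axis=1)
assert slow.eq(fast).all()  # identical element-wise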