Home > front end >  Counting unique values cumulatively by date in pandas and add a column
Counting unique values cumulatively by date in pandas and add a column

Time:04-19

I have the following data. I want to create a column which is a cumulative sum of number of unique values from both "from" and "to" columns by date.

Current dataframe

id  date         from   to      
A   01/01/2020  alice  bob        
A   01/01/2020  alice  carl        
A   01/02/2020  jim    peter
A   01/02/2020  jim    jen
B   01/02/2020  mary   john
B   01/03/2020  john   max
B   01/03/2020  mike   jane
B   01/03/2020  jane   jim
C   01/02/2020  xavier jim
C   01/02/2020  jim    alice
C   01/03/2020  alice  bob
C   01/03/2020  manny  lily
C   01/04/2020  rosa   bob
.       .         .      .
.
.
.
.
.

I want

id       date   cum_sum
A   01/01/2020  3
A   01/02/2020  6 
B   01/02/2020  2
B   01/03/2020  7
C   01/02/2020  3
C   01/03/2020  6
C   01/04/2020  7
.       .       .
.
.
.
.
.

I tried something like the following

cum_sum = df.groupby(['id', pd.Grouper(freq='D', key='date')])['from', 'to_'].nunique.cumsum()

but no success. any help will be appreciated!

UPDATE i have tried this which works but I can only add one column "from". Any idea how to add "to" as well??

df1 = (df.groupby(['id','date'])['from']
         .apply(list)
         .groupby(level=0)
         .apply(np.cumsum)
         .apply(lambda x: len(set(x)))
         .reset_index(name='cum_sum'))

CodePudding user response:

You could melt the DataFrame, groupby.agg to construct lists for each id-date pair, then use again groupby to apply numpy.cumsum to concatenate the lists. Then use apply yet again to get the length of unique sets for each id:

import numpy as np
id_vars = df.columns.difference(['from','to'])
cols = ['id','date']
out = (df.melt(id_vars=id_vars, value_name='cum_sum')
       .groupby(cols)['cum_sum'].agg(list)
       .groupby(level=0).apply(np.cumsum)
       .apply(lambda x: len(set(x))).reset_index())    

Output:

  id        date  cum_sum
0  A  01/01/2020        3
1  A  01/02/2020        6
2  B  01/02/2020        2
3  B  01/03/2020        6
4  C  01/02/2020        3
5  C  01/03/2020        6
6  C  01/04/2020        7
  • Related