I have the following data. I want to create a column which is a cumulative sum of number of unique values from both "from" and "to" columns by date.
Current dataframe
id date from to
A 01/01/2020 alice bob
A 01/01/2020 alice carl
A 01/02/2020 jim peter
A 01/02/2020 jim jen
B 01/02/2020 mary john
B 01/03/2020 john max
B 01/03/2020 mike jane
B 01/03/2020 jane jim
C 01/02/2020 xavier jim
C 01/02/2020 jim alice
C 01/03/2020 alice bob
C 01/03/2020 manny lily
C 01/04/2020 rosa bob
. . . .
.
.
.
.
.
I want
id date cum_sum
A 01/01/2020 3
A 01/02/2020 6
B 01/02/2020 2
B 01/03/2020 7
C 01/02/2020 3
C 01/03/2020 6
C 01/04/2020 7
. . .
.
.
.
.
.
I tried something like the following
cum_sum = df.groupby(['id', pd.Grouper(freq='D', key='date')])['from', 'to_'].nunique.cumsum()
but no success. any help will be appreciated!
UPDATE i have tried this which works but I can only add one column "from". Any idea how to add "to" as well??
df1 = (df.groupby(['id','date'])['from']
.apply(list)
.groupby(level=0)
.apply(np.cumsum)
.apply(lambda x: len(set(x)))
.reset_index(name='cum_sum'))
CodePudding user response:
You could melt
the DataFrame, groupby.agg
to construct lists for each id-date pair, then use again groupby
to apply numpy.cumsum
to concatenate the lists. Then use apply
yet again to get the length of unique sets for each id:
import numpy as np
id_vars = df.columns.difference(['from','to'])
cols = ['id','date']
out = (df.melt(id_vars=id_vars, value_name='cum_sum')
.groupby(cols)['cum_sum'].agg(list)
.groupby(level=0).apply(np.cumsum)
.apply(lambda x: len(set(x))).reset_index())
Output:
id date cum_sum
0 A 01/01/2020 3
1 A 01/02/2020 6
2 B 01/02/2020 2
3 B 01/03/2020 6
4 C 01/02/2020 3
5 C 01/03/2020 6
6 C 01/04/2020 7