I have a log of users and the category each entry was logged in. A user can appear in multiple categories, and I would like to determine which users are logged in more than one category. The log is fairly long, so ideally the result would be sorted with the users logged in the most categories on top.
user | category |
---|---|
1 | A |
1 | B |
2 | A |
3 | A |
3 | B |
Using groupby I'm currently only able to show the counts, but not the names of the categories. Most existing questions ask the reverse (e.g. how many users are in a category). I would like to do something like this:
user | categories | count |
---|---|---|
1 | A, B | 2 |
2 | A | 1 |
3 | A, B | 2 |
To make it more complicated: I would also like to know whether users are logged in multiple categories within a specific time window (e.g. 5 minutes):
timestamp | user | category |
---|---|---|
2021-12-12 13:00:00 | 1 | A |
2021-12-12 13:06:00 | 1 | B |
2021-12-12 13:08:00 | 2 | A |
2021-12-12 13:09:00 | 1 | B |
2021-12-12 13:14:00 | 3 | A |
2021-12-12 13:15:00 | 3 | B |
2021-12-12 13:15:00 | 3 | A |
2021-12-12 13:15:00 | 1 | B |
I would like something like the output below (but I'm open to different display methods), so the categories per user are binned per time window (5 minutes in this example):
timestamp | user | categories | count |
---|---|---|---|
2021-12-12 13:00:00 | 1 | A | 1 |
2021-12-12 13:05:00 | 1 | B | 1 |
2021-12-12 13:05:00 | 2 | A | 1 |
2021-12-12 13:05:00 | 1 | B | 1 |
2021-12-12 13:10:00 | 3 | A | 1 |
2021-12-12 13:15:00 | 3 | B, A | 2 |
2021-12-12 13:15:00 | 1 | B | 1 |
I tried all kinds of combinations of .resample, .groupby and .cumsum without any luck. I hope the examples make sense.
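For reference, the sample log above can be reproduced as a DataFrame (assuming the timestamps are parsed as datetimes, which the grouping below relies on):

```python
import pandas as pd

# Reproduce the timestamped sample log from the question
df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2021-12-12 13:00:00', '2021-12-12 13:06:00', '2021-12-12 13:08:00',
        '2021-12-12 13:09:00', '2021-12-12 13:14:00', '2021-12-12 13:15:00',
        '2021-12-12 13:15:00', '2021-12-12 13:15:00',
    ]),
    'user': [1, 1, 2, 1, 3, 3, 3, 1],
    'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
})
```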
CodePudding user response:
Use `GroupBy.agg` grouped by the `user` column:
df1 = (df.groupby('user', as_index=False)
.agg(categories=('category', ','.join), counts=('category', 'size')))
print(df1)
user categories counts
0 1 A,B 2
1 2 A 1
2 3 A,B 2
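Since the question also asked for the users logged in the most categories on top, the aggregated result can be sorted on the counts; filtering on `counts > 1` then isolates the multi-category users (a sketch, rebuilding the small sample so it runs standalone):

```python
import pandas as pd

# Sample log from the question
df = pd.DataFrame({'user': [1, 1, 2, 3, 3],
                   'category': ['A', 'B', 'A', 'A', 'B']})

df1 = (df.groupby('user', as_index=False)
         .agg(categories=('category', ','.join),
              counts=('category', 'size')))

# Users logged in the most categories first; keep only multi-category users
df1_sorted = df1.sort_values('counts', ascending=False, ignore_index=True)
multi = df1_sorted[df1_sorted['counts'] > 1]
print(multi)
```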
For the second part my output is different from yours; I used `pd.Grouper` with `freq='5min'`:
df2 = (df.groupby(['user', pd.Grouper(freq='5min', key='timestamp')])
.agg(categories=('category', ','.join), counts=('category', 'size'))
.reset_index()
.sort_values('timestamp', ignore_index=True))
print(df2)
user timestamp categories counts
0 1 2021-12-12 13:00:00 A 1
1 1 2021-12-12 13:05:00 B,B 2
2 2 2021-12-12 13:05:00 A 1
3 3 2021-12-12 13:10:00 A 1
4 1 2021-12-12 13:15:00 B 1
5 3 2021-12-12 13:15:00 B,A 2
EDIT: to count each category only once per window (note the duplicate `B,B` for user 1 in the 13:05 window above), join the set of categories and count with `nunique`:
df3 = (df.groupby(['user', pd.Grouper(freq='5min', key='timestamp')])
.agg(categories=('category', lambda x: ','.join(set(x))),
counts=('category', 'nunique'))
.reset_index()
.sort_values('timestamp', ignore_index=True))
print(df3)
user timestamp categories counts
0 1 2021-12-12 13:00:00 A 1
1 1 2021-12-12 13:05:00 B 1
2 2 2021-12-12 13:05:00 A 1
3 3 2021-12-12 13:10:00 A 1
4 1 2021-12-12 13:15:00 B 1
5 3 2021-12-12 13:15:00 B,A 2
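To get back to the original question (which users hit multiple categories within the same 5-minute window), the deduplicated result can simply be filtered on `counts > 1` (a sketch, rebuilding the sample so it runs standalone):

```python
import pandas as pd

# Timestamped sample log from the question
df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2021-12-12 13:00:00', '2021-12-12 13:06:00', '2021-12-12 13:08:00',
        '2021-12-12 13:09:00', '2021-12-12 13:14:00', '2021-12-12 13:15:00',
        '2021-12-12 13:15:00', '2021-12-12 13:15:00']),
    'user': [1, 1, 2, 1, 3, 3, 3, 1],
    'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})

df3 = (df.groupby(['user', pd.Grouper(freq='5min', key='timestamp')])
         .agg(categories=('category', lambda x: ','.join(set(x))),
              counts=('category', 'nunique'))
         .reset_index()
         .sort_values('timestamp', ignore_index=True))

# Keep only (user, window) pairs with more than one distinct category
flagged = df3[df3['counts'] > 1]
print(flagged)
```

On this sample, only user 3 in the 13:15 window is flagged.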