I have a log of users and the category each entry was logged in. A user can appear in multiple categories, and I would like to determine which users are logged in more than one category. The log is fairly long, so ideally the result would be sorted with the users logged in the most categories on top.
user | category |
---|---|
1 | A |
1 | B |
2 | A |
3 | A |
3 | B |
Using groupby I'm currently only able to show the counts, but not the names of the categories. Most existing questions ask the reverse (e.g. how many users are in a category). I would like to do something like this:
user | categories | count |
---|---|---|
1 | A, B | 2 |
2 | A | 1 |
3 | A, B | 2 |
To make it more complicated: I would also like to know whether users are logged in multiple categories within a specific time window (e.g. 5 minutes):
timestamp | user | category |
---|---|---|
2021-12-12 13:00:00 | 1 | A |
2021-12-12 13:06:00 | 1 | B |
2021-12-12 13:08:00 | 2 | A |
2021-12-12 13:09:00 | 1 | B |
2021-12-12 13:14:00 | 3 | A |
2021-12-12 13:15:00 | 3 | B |
2021-12-12 13:15:00 | 3 | A |
2021-12-12 13:15:00 | 1 | B |
I would like something like the output below (but I'm open to different display methods), so the categories per user are binned per time window (5 minutes in this example):
timestamp | user | categories | count |
---|---|---|---|
2021-12-12 13:00:00 | 1 | A | 1 |
2021-12-12 13:05:00 | 1 | B | 1 |
2021-12-12 13:05:00 | 2 | A | 1 |
2021-12-12 13:05:00 | 1 | B | 1 |
2021-12-12 13:10:00 | 3 | A | 1 |
2021-12-12 13:15:00 | 3 | B, A | 2 |
2021-12-12 13:15:00 | 1 | B | 1 |
I tried all kinds of combinations of .resample, .groupby and .cumsum without any luck. I hope the examples make sense.
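For reference, the sample log above can be reproduced as a DataFrame (assuming the timestamps are parsed as datetimes, which the grouping below relies on):

```python
import pandas as pd

# Reproduce the timestamped sample log from the question
df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2021-12-12 13:00:00', '2021-12-12 13:06:00', '2021-12-12 13:08:00',
        '2021-12-12 13:09:00', '2021-12-12 13:14:00', '2021-12-12 13:15:00',
        '2021-12-12 13:15:00', '2021-12-12 13:15:00',
    ]),
    'user': [1, 1, 2, 1, 3, 3, 3, 1],
    'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
})
```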
CodePudding user response:
Use `GroupBy.agg` grouped by the `user` column:
df1 = (df.groupby('user', as_index=False)
.agg(categories=('category', ','.join), counts=('category', 'size')))
print(df1)
user categories counts
0 1 A,B 2
1 2 A 1
2 3 A,B 2
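Since the question also asked for the users logged in the most categories on top, the aggregated result can be sorted on the counts; filtering on `counts > 1` then isolates the multi-category users (a sketch, rebuilding the small sample so it runs standalone):

```python
import pandas as pd

# Sample log from the question
df = pd.DataFrame({'user': [1, 1, 2, 3, 3],
                   'category': ['A', 'B', 'A', 'A', 'B']})

df1 = (df.groupby('user', as_index=False)
         .agg(categories=('category', ','.join),
              counts=('category', 'size')))

# Users logged in the most categories first; keep only multi-category users
df1_sorted = df1.sort_values('counts', ascending=False, ignore_index=True)
multi = df1_sorted[df1_sorted['counts'] > 1]
print(multi)
```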
For the second part my output is different from yours; I used `pd.Grouper` with `freq='5min'`:
df2 = (df.groupby(['user', pd.Grouper(freq='5min', key='timestamp')])
.agg(categories=('category', ','.join), counts=('category', 'size'))
.reset_index()
.sort_values('timestamp', ignore_index=True))
print(df2)
user timestamp categories counts
0 1 2021-12-12 13:00:00 A 1
1 1 2021-12-12 13:05:00 B,B 2
2 2 2021-12-12 13:05:00 A 1
3 3 2021-12-12 13:10:00 A 1
4 1 2021-12-12 13:15:00 B 1
5 3 2021-12-12 13:15:00 B,A 2
EDIT: to count each category only once per window (note the duplicate `B,B` for user 1 in the 13:05 window above), join the set of categories and count with `nunique`:
df3 = (df.groupby(['user', pd.Grouper(freq='5min', key='timestamp')])
.agg(categories=('category', lambda x: ','.join(set(x))),
counts=('category', 'nunique'))
.reset_index()
.sort_values('timestamp', ignore_index=True))
print(df3)
user timestamp categories counts
0 1 2021-12-12 13:00:00 A 1
1 1 2021-12-12 13:05:00 B 1
2 2 2021-12-12 13:05:00 A 1
3 3 2021-12-12 13:10:00 A 1
4 1 2021-12-12 13:15:00 B 1
5 3 2021-12-12 13:15:00 B,A 2
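To get back to the original question (which users hit multiple categories within the same 5-minute window), the deduplicated result can simply be filtered on `counts > 1` (a sketch, rebuilding the sample so it runs standalone):

```python
import pandas as pd

# Timestamped sample log from the question
df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2021-12-12 13:00:00', '2021-12-12 13:06:00', '2021-12-12 13:08:00',
        '2021-12-12 13:09:00', '2021-12-12 13:14:00', '2021-12-12 13:15:00',
        '2021-12-12 13:15:00', '2021-12-12 13:15:00']),
    'user': [1, 1, 2, 1, 3, 3, 3, 1],
    'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})

df3 = (df.groupby(['user', pd.Grouper(freq='5min', key='timestamp')])
         .agg(categories=('category', lambda x: ','.join(set(x))),
              counts=('category', 'nunique'))
         .reset_index()
         .sort_values('timestamp', ignore_index=True))

# Keep only (user, window) pairs with more than one distinct category
flagged = df3[df3['counts'] > 1]
print(flagged)
```

On this sample, only user 3 in the 13:15 window is flagged.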