Pandas groupby user and count number of events between 2 timestamps-CodePudding

I have a DF1 where each row represents an "event". Each event has the columns "user", and "time":

DF1:

"user","time"
user1,2022-11-14 00:00:04 
user2,2022-11-16 21:34:45
user1,2022-11-14 00:15:22
user3,2022-11-17 15:32:25
...

The "time" value is any timestamp in one week: from 2022-11-14 and 2022-11-20. There are 10k different users, and 27M events.

I have to divide the week in 8h time-slots (so 21 slots in total), and for each user, I need to look if that I can see any event of that user in each slot.

Then, I should create a DF2 (in which each row is a user) with 21 columns (one for each slot), with numbers 0 or 1: 0 if I have not seen the user in that slot, and 1 if I have seen the user in that slot.

DF2:

"user","slot1","slot2","slot3",...,"slot21"
user1,1,0,0,0,0,0,...,0
user2,0,0,1,1,1,0,...,0
user3,1,1,1,0,0,1,...,1
...

(After that, I will need to order DF2 and plot it as an spare matrix, but that is another story...)

I have managed to fill 1 single row of DF2, but it lasts 30 seconds for 1 user, in this way:

slots = pd.date_range('2022-11-14', periods=22, freq='8h')
row=[]
for i in np.arange(0,slots.value_counts().sum()-1):
    if DF1[(DF1.user=="user1")&(DF1.time.between(slots[i],slots[i 1]))].shape[0] >=1:
        row.append(1)
    else:
        row.append(0)
print(row)

So making this process for the 10k users would last almost 4 days...

Anyone have an idea on how can I achieve to create DF2 in a quicker way?? Maybe something like DF1.groupby('user').time and then what else? I can be done in pandas or with any other way, or even different languages, if I get the spare matrix in DF2!

Any help would be much appreciated!

CodePudding user response：

Use crosstab with cut for count values, if need 0,1 ouput compare for not equal 0 and cast to integers:

df = (pd.crosstab(df['user'], 
                 pd.cut(df['time'], bins=slots, labels=False))
        .ne(0)
        .astype(int)
        .add_prefix('slot'))