I have a DF1 where each row represents an "event". Each event has the columns "user", and "time":
DF1:
"user","time"
user1,2022-11-14 00:00:04
user2,2022-11-16 21:34:45
user1,2022-11-14 00:15:22
user3,2022-11-17 15:32:25
...
The "time" value is any timestamp in one week: from 2022-11-14 and 2022-11-20. There are 10k different users, and 27M events.
I have to divide the week in 8h time-slots (so 21 slots in total), and for each user, I need to look if that I can see any event of that user in each slot.
Then, I should create a DF2 (in which each row is a user) with 21 columns (one for each slot), with numbers 0 or 1: 0 if I have not seen the user in that slot, and 1 if I have seen the user in that slot.
DF2:
"user","slot1","slot2","slot3",...,"slot21"
user1,1,0,0,0,0,0,...,0
user2,0,0,1,1,1,0,...,0
user3,1,1,1,0,0,1,...,1
...
(After that, I will need to order DF2 and plot it as an spare matrix, but that is another story...)
I have managed to fill 1 single row of DF2, but it lasts 30 seconds for 1 user, in this way:
slots = pd.date_range('2022-11-14', periods=22, freq='8h')
row=[]
for i in np.arange(0,slots.value_counts().sum()-1):
if DF1[(DF1.user=="user1")&(DF1.time.between(slots[i],slots[i 1]))].shape[0] >=1:
row.append(1)
else:
row.append(0)
print(row)
So making this process for the 10k users would last almost 4 days...
Anyone have an idea on how can I achieve to create DF2 in a quicker way?? Maybe something like DF1.groupby('user').time and then what else? I can be done in pandas or with any other way, or even different languages, if I get the spare matrix in DF2!
Any help would be much appreciated!
CodePudding user response:
Use crosstab
with cut
for count values, if need 0,1
ouput compare for not equal 0
and cast to integers:
df = (pd.crosstab(df['user'],
pd.cut(df['time'], bins=slots, labels=False))
.ne(0)
.astype(int)
.add_prefix('slot'))