I have a multi-index df:
This df outlines somebodies path through a website, sid
is the session, vid
is the visitor id, pid
are the web pages and ts
is the time in which they landed on the site
pid ts
sid vid
1 A page1 t1
A page2 t2
A page3 t3
A page4 t4
A page5 t5
2 B page1 t4
3 C page1 t5
C page2 t6
Some users have ridiculously long pid paths (1000 ) which I imagine could be an error. However when I transpose/pivot this data, it takes ages to transpose because a few paths which are so long.
So I want to take the average pid
for every session (sid
) and impose some threshold where for every session after some number (lets say for example the average) it deletes those rows. So above the average number of rows per session is (5 1 2)/3 = 8/3
Now that I have the average, I can impose a threshold for the amount of rows which equals lets say 3, then the df would look like this:
pid ts
sid vid
1 A page1 t1
A page2 t2
A page3 t3
2 B page1 t4
3 C page1 t5
C page2 t6
Any idea on how to do this?
CodePudding user response:
Use head
:
thresh = int(np.ceil(df.groupby('sid').size().mean()))
df.groupby(['sid','vid']).head(thresh)
Output:
pid ts
sid vid
1 A page1 t1
A page2 t2
A page3 t3
2 B page1 t4
3 C page1 t5
C page2 t6