Find the average amount of rows in a multi-index and remove any after a certain threshold?-CodePudding

I have a multi-index df:

This df outlines somebodies path through a website, sid is the session, vid is the visitor id, pid are the web pages and ts is the time in which they landed on the site

           pid    ts
sid vid 
 1   A    page1    t1
     A    page2    t2
     A    page3    t3
     A    page4    t4
     A    page5    t5
 2   B    page1    t4
 3   C    page1    t5
     C    page2    t6

Some users have ridiculously long pid paths (1000 ) which I imagine could be an error. However when I transpose/pivot this data, it takes ages to transpose because a few paths which are so long.

So I want to take the average pid for every session (sid) and impose some threshold where for every session after some number (lets say for example the average) it deletes those rows. So above the average number of rows per session is (5 1 2)/3 = 8/3

Now that I have the average, I can impose a threshold for the amount of rows which equals lets say 3, then the df would look like this:

           pid    ts
sid vid 
 1   A    page1    t1
     A    page2    t2
     A    page3    t3
 2   B    page1    t4
 3   C    page1    t5
     C    page2    t6

Any idea on how to do this?

CodePudding user response：

Use head:

thresh = int(np.ceil(df.groupby('sid').size().mean()))

df.groupby(['sid','vid']).head(thresh)

Output:

           pid  ts
sid vid           
1   A    page1  t1
    A    page2  t2
    A    page3  t3
2   B    page1  t4
3   C    page1  t5
    C    page2  t6