I have a dataframe which, after transposing, looks like this:
            step1    step2    step3    step4    ...  stepn-1  stepn
session_id
1           page_id  page_id  NA       NA       ...  NA       NA
2           page_id  NA       NA       NA       ...  NA       NA
3           page_id  page_id  page_id  page_id  ...  page_id  page_id
4           page_id  page_id  page_id  NA       ...  NA       NA
For context, page_id is a URL substring, and the steps indicate a user's path through the tracked web pages.
Some sessions have an oddly large number of steps, perhaps 1000 pages, which I imagine is very uncommon relative to the average.
What I want to do is find the average number of non-NA steps per session.
So in the example above, if we exclude session 3, which we will define as an outlier, the average number of steps is (2 + 1 + 3) / 3 = 2.
Once I have the average, I will have an idea of which rows I can exclude so the table becomes more manageable.
So, overall:
find the average number of non-NA steps and use that number to delete the rows that contain more steps than the average.
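To make the goal concrete, here is a rough sketch of what I think I need; step5 and step6 stand in for stepn-1 and stepn, and names like step_counts and df_small are just placeholders for illustration:

import numpy as np
import pandas as pd

# the example frame from above, with step5/step6 standing in for stepn-1/stepn
df = pd.DataFrame(
    {
        "step1": ["page_id", "page_id", "page_id", "page_id"],
        "step2": ["page_id", np.nan, "page_id", "page_id"],
        "step3": [np.nan, np.nan, "page_id", "page_id"],
        "step4": [np.nan, np.nan, "page_id", np.nan],
        "step5": [np.nan, np.nan, "page_id", np.nan],
        "step6": [np.nan, np.nan, "page_id", np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="session_id"),
)

step_counts = df.notna().sum(axis=1)  # non-NA steps per session: 2, 1, 6, 3
avg_steps = step_counts.mean()        # plain mean over all sessions: (2 + 1 + 6 + 3) / 4 = 3.0

# drop the sessions that have more steps than the average (here, session 3)
df_small = df[step_counts <= avg_steps]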
CodePudding user response:
You can count the non-NA values per row with notna and sum, then take the mean. Finally, use dropna with this average as the threshold:
avg = df.notna().sum(axis=1).mean()  # mean number of non-NA steps per session
df.dropna(thresh=avg)                # keep only rows with at least `avg` non-NA values
output:
            step1    step2    step3    step4    stepn-1  stepn
session_id
3           page_id  page_id  page_id  page_id  page_id  page_id
4           page_id  page_id  page_id  NaN      NaN      NaN
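For reference, on the example frame built in the question (step5/step6 standing in for stepn-1/stepn), avg works out to 3.0, so thresh=avg keeps the two sessions with at least 3 non-NA steps, as shown above. If the intent is instead to drop the sessions with more steps than the average, a boolean mask is one way to flip the logic:

avg = df.notna().sum(axis=1).mean()               # 3.0 on the example data

kept_by_thresh = df.dropna(thresh=avg)            # sessions 3 and 4 (at least 3 non-NA values)
kept_by_mask = df[df.notna().sum(axis=1) <= avg]  # sessions 1, 2 and 4 (at most 3 non-NA values)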