I have a dataframe which, after transposing, looks like this:
            step1    step2    step3    step4    ...  stepn-1  stepn
session_id
1           page_id  page_id  NA       NA       ...  NA       NA
2           page_id  NA       NA       NA       ...  NA       NA
3           page_id  page_id  page_id  page_id  ...  page_id  page_id
4           page_id  page_id  page_id  NA       ...  NA       NA
For context, page_id is a URL substring, and the steps indicate a user's path through the tracked web pages.
Some sessions have an oddly large number of steps, perhaps 1000 pages, which I imagine is very uncommon relative to the average.
What I want to do is find the average number of non-NA steps per session.
So in the example above, if we exclude session 3, which we will define as an outlier, the average number of steps is (2 + 1 + 3) / 3 = 2.
Once I have the average, I will have an idea of which rows I can exclude so the table becomes more manageable.
So, overall:
find the average number of non-NA steps and use that number to delete the rows that contain more steps than the average.
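To make the goal concrete, here is a rough sketch of what I think I need; step5 and step6 stand in for stepn-1 and stepn, and names like step_counts and df_small are just placeholders for illustration:

import numpy as np
import pandas as pd

# the example frame from above, with step5/step6 standing in for stepn-1/stepn
df = pd.DataFrame(
    {
        "step1": ["page_id", "page_id", "page_id", "page_id"],
        "step2": ["page_id", np.nan, "page_id", "page_id"],
        "step3": [np.nan, np.nan, "page_id", "page_id"],
        "step4": [np.nan, np.nan, "page_id", np.nan],
        "step5": [np.nan, np.nan, "page_id", np.nan],
        "step6": [np.nan, np.nan, "page_id", np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="session_id"),
)

step_counts = df.notna().sum(axis=1)  # non-NA steps per session: 2, 1, 6, 3
avg_steps = step_counts.mean()        # plain mean over all sessions: (2 + 1 + 6 + 3) / 4 = 3.0

# drop the sessions that have more steps than the average (here, session 3)
df_small = df[step_counts <= avg_steps]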
CodePudding user response:
You can count the non-NA values per row with notna and sum, then take the mean. Finally, use dropna with this average as the threshold:
avg = df.notna().sum(axis=1).mean()  # mean number of non-NA steps per session
df.dropna(thresh=avg)                # keep only rows with at least `avg` non-NA values
output:
            step1    step2    step3    step4    stepn-1  stepn
session_id
3           page_id  page_id  page_id  page_id  page_id  page_id
4           page_id  page_id  page_id  NaN      NaN      NaN
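For reference, on the example frame built in the question (step5/step6 standing in for stepn-1/stepn), avg works out to 3.0, so thresh=avg keeps the two sessions with at least 3 non-NA steps, as shown above. If the intent is instead to drop the sessions with more steps than the average, a boolean mask is one way to flip the logic:

avg = df.notna().sum(axis=1).mean()               # 3.0 on the example data

kept_by_thresh = df.dropna(thresh=avg)            # sessions 3 and 4 (at least 3 non-NA values)
kept_by_mask = df[df.notna().sum(axis=1) <= avg]  # sessions 1, 2 and 4 (at most 3 non-NA values)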