Delete rows which have a certain amount of NA


I have a dataframe which after I finished transposing looks like this:

          step1      step2    step3     step4  ..... stepn-1     stepn
session_id 
  1       page_id   page_id    NA         NA             NA       NA
  2       page_id      NA      NA         NA             NA       NA
  3       page_id   page_id   page_id    page_id        page_id  page_id
  4       page_id   page_id   page_id     NA             NA       NA
  

For context, each page_id is a URL substring, and the steps trace a user's path through the tracked web pages.

Some sessions are oddly long (perhaps 1000 pages), which I imagine is very uncommon relative to the average.

What I want to do is find the average number of non-NA steps per session.

So in the example above, taking the first four step columns, the non-NA counts per session are 2, 1, 4 and 3, giving an average of (2 + 1 + 4 + 3) / 4 = 10/4 = 2.5 (session 3, with every step filled, is the kind of outlier I'd like to exclude).
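Those per-session counts can be computed with `notna().sum(axis=1)`. A minimal sketch on a toy frame mirroring the table above (the `p1`…`p4` page ids are placeholders):

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the example above (values are made up)
df = pd.DataFrame(
    {
        "step1": ["p1", "p1", "p1", "p1"],
        "step2": ["p2", np.nan, "p2", "p2"],
        "step3": [np.nan, np.nan, "p3", "p3"],
        "step4": [np.nan, np.nan, "p4", np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="session_id"),
)

# Count non-NA steps per session (row-wise)
steps = df.notna().sum(axis=1)
print(steps.tolist())  # [2, 1, 4, 3]
print(steps.mean())    # 2.5
```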

Once I have an average I will get an idea of which rows I can exclude so the table becomes more manageable.

So overall

Find the average number of non-NA steps, and use that number to delete rows which contain more than the average number of steps.
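A minimal sketch of that filter, dropping sessions with more than the average number of non-NA steps (assuming the transposed frame is called `df`; the toy data here is made up):

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the example table (values are made up)
df = pd.DataFrame(
    {
        "step1": ["p1", "p1", "p1", "p1"],
        "step2": ["p2", np.nan, "p2", "p2"],
        "step3": [np.nan, np.nan, "p3", "p3"],
        "step4": [np.nan, np.nan, "p4", np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="session_id"),
)

steps = df.notna().sum(axis=1)  # non-NA steps per session: 2, 1, 4, 3
avg = steps.mean()              # 2.5
trimmed = df[steps <= avg]      # drop sessions longer than average
print(trimmed.index.tolist())   # sessions 1 and 2 remain
```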

CodePudding user response:

You can count the non-NaN values per row with `notna` and `sum`, then take the mean. Finally, use `dropna` with this average as the threshold:

avg = df.notna().sum(axis=1).mean()  # average number of non-NA steps per row
df.dropna(thresh=avg)                # keep rows with at least `avg` non-NA values

output:

              step1    step2    step3    step4  stepn-1    stepn
session_id                                                      
3           page_id  page_id  page_id  page_id  page_id  page_id
4           page_id  page_id  page_id      NaN      NaN      NaN
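Note that `thresh` keeps rows with *at least* that many non-NA values, so this drops the shorter-than-average sessions rather than the long ones. A self-contained sketch on a toy frame (placeholder values) showing the same result:

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the example table (values are made up)
df = pd.DataFrame(
    {
        "step1": ["p1", "p1", "p1", "p1"],
        "step2": ["p2", np.nan, "p2", "p2"],
        "step3": [np.nan, np.nan, "p3", "p3"],
        "step4": [np.nan, np.nan, "p4", np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="session_id"),
)

avg = df.notna().sum(axis=1).mean()  # 2.5
kept = df.dropna(thresh=avg)         # rows need >= 2.5, i.e. >= 3 non-NA values
print(kept.index.tolist())           # sessions 3 and 4
```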