Home > Software engineering >  R Subsetting by Cumulative Percentage Up to AND INCLUDING a certain value
R Subsetting by Cumulative Percentage Up to AND INCLUDING a certain value

Time:07-07

I am trying to subset up to a percentage plus one more row for each grouped category.

When I filter by doing the following:

df <- df %>% 
   group_by(yyyymm)
df[df$cumperc <= .50, ]

I get:

ind yyyymm cumperc name
 1 202006   0.196 CHILD 
 2 202006   0.327 WOMAN 
 3 202006   0.401 MAN 
 4 202006   0.461 PET 
 5 202007   0.157 CHILD     
 6 202007   0.265 MAN
 7 202007   0.369 WOMAN
 8 202007   0.459 PET
 9 202007   0.494 FRIEND

I instead want:

ind yyyymm cumperc name
 1 202006   0.196 CHILD 
 2 202006   0.327 WOMAN 
 3 202006   0.401 MAN 
 4 202006   0.461 PET 
 5 202006   0.504 FRIEND
 6 202007   0.157 CHILD     
 7 202007   0.265 MAN
 8 202007   0.369 WOMAN
 9 202007   0.459 PET
 10 202007  0.494 FRIEND
 11 202007  0.519 ENEMY

which includes the values up to 0.50 and one more row.

Any advice how to systematically do this for all groups (in this instance the yyyymm column)?

Full (example) df for reference:

ind yyyymm cumperc name
 1 202006   0.196 CHILD 
 2 202006   0.327 WOMAN 
 3 202006   0.401 MAN 
 4 202006   0.461 PET 
 5 202006   0.504 FRIEND
 6 202006   0.604 ENEMY
 7 202006   0.845 PLACE
 8 202006   1.000 ITEM
 9 202007   0.157 CHILD     
 10 202007   0.265 MAN
 11 202007   0.369 WOMAN
 12 202007   0.459 PET
 13 202007  0.494 FRIEND
 14 202007  0.519 ENEMY
 15 202007  0.766 ITEM
 16 202007  1.000 PLACE

CodePudding user response:

You could use lag

# Sample data
df <- data.frame(
    grp = rep(letters[1:3], each = 5), cumperc  = rep(seq(0, 1, length.out = 5), 3))

df %>%
    group_by(grp) %>%
    filter(lag(cumperc, default = 0) < 0.5) %>%
    ungroup()
## A tibble: 9 x 2
#  grp   cumperc
#  <chr>   <dbl>
#1 a        0   
#2 a        0.25
#3 a        0.5 
#4 b        0   
#5 b        0.25
#6 b        0.5 
#7 c        0   
#8 c        0.25
#9 c        0.5 

This retains all rows where cumperc < 0.5 plus one additional row. Note this assumes that cumperc values are sorted by group, from smallest to largest.

CodePudding user response:

Here is a potential dplyr solution:

library(dplyr)

df <- read.table(text = "ind yyyymm cumperc name
 1 202006   0.196 CHILD 
 2 202006   0.327 WOMAN 
 3 202006   0.401 MAN 
 4 202006   0.461 PET 
 5 202006   0.504 FRIEND
 6 202006   0.604 ENEMY
 7 202006   0.845 PLACE
 8 202006   1.000 ITEM
 9 202007   0.157 CHILD     
 10 202007   0.265 MAN
 11 202007   0.369 WOMAN
 12 202007   0.459 PET
 13 202007  0.494 FRIEND
 14 202007  0.519 ENEMY
 15 202007  0.766 ITEM
 16 202007  1.000 PLACE",
 header = TRUE)

df %>%
  group_by(yyyymm) %>%
  filter(lag(cumperc, default = first(cumperc)) <= .50)
#> # A tibble: 11 × 4
#> # Groups:   yyyymm [2]
#>      ind yyyymm cumperc name  
#>    <int>  <int>   <dbl> <chr> 
#>  1     1 202006   0.196 CHILD 
#>  2     2 202006   0.327 WOMAN 
#>  3     3 202006   0.401 MAN   
#>  4     4 202006   0.461 PET   
#>  5     5 202006   0.504 FRIEND
#>  6     9 202007   0.157 CHILD 
#>  7    10 202007   0.265 MAN   
#>  8    11 202007   0.369 WOMAN 
#>  9    12 202007   0.459 PET   
#> 10    13 202007   0.494 FRIEND
#> 11    14 202007   0.519 ENEMY

Created on 2022-07-07 by the reprex package (v2.0.1)

Would that suit your use-case?

  • Related