I am trying to subset up to a percentage plus one more row for each grouped category.
When I filter by doing the following:
df <- df %>%
group_by(yyyymm)
df[df$cumperc <= .50, ]
I get:
ind yyyymm cumperc name
1 202006 0.196 CHILD
2 202006 0.327 WOMAN
3 202006 0.401 MAN
4 202006 0.461 PET
5 202007 0.157 CHILD
6 202007 0.265 MAN
7 202007 0.369 WOMAN
8 202007 0.459 PET
9 202007 0.494 FRIEND
I instead want:
ind yyyymm cumperc name
1 202006 0.196 CHILD
2 202006 0.327 WOMAN
3 202006 0.401 MAN
4 202006 0.461 PET
5 202006 0.504 FRIEND
6 202007 0.157 CHILD
7 202007 0.265 MAN
8 202007 0.369 WOMAN
9 202007 0.459 PET
10 202007 0.494 FRIEND
11 202007 0.519 ENEMY
which includes the values up to 0.50 and one more row.
Any advice how to systematically do this for all groups (in this instance the yyyymm column)?
Full (example) df for reference:
ind yyyymm cumperc name
1 202006 0.196 CHILD
2 202006 0.327 WOMAN
3 202006 0.401 MAN
4 202006 0.461 PET
5 202006 0.504 FRIEND
6 202006 0.604 ENEMY
7 202006 0.845 PLACE
8 202006 1.000 ITEM
9 202007 0.157 CHILD
10 202007 0.265 MAN
11 202007 0.369 WOMAN
12 202007 0.459 PET
13 202007 0.494 FRIEND
14 202007 0.519 ENEMY
15 202007 0.766 ITEM
16 202007 1.000 PLACE
CodePudding user response:
You could use lag
# Sample data
df <- data.frame(
grp = rep(letters[1:3], each = 5), cumperc = rep(seq(0, 1, length.out = 5), 3))
df %>%
group_by(grp) %>%
filter(lag(cumperc, default = 0) < 0.5) %>%
ungroup()
## A tibble: 9 x 2
# grp cumperc
# <chr> <dbl>
#1 a 0
#2 a 0.25
#3 a 0.5
#4 b 0
#5 b 0.25
#6 b 0.5
#7 c 0
#8 c 0.25
#9 c 0.5
This retains all rows where cumperc < 0.5
plus one additional row. Note this assumes that cumperc
values are sorted by group, from smallest to largest.
CodePudding user response:
Here is a potential dplyr solution:
library(dplyr)
df <- read.table(text = "ind yyyymm cumperc name
1 202006 0.196 CHILD
2 202006 0.327 WOMAN
3 202006 0.401 MAN
4 202006 0.461 PET
5 202006 0.504 FRIEND
6 202006 0.604 ENEMY
7 202006 0.845 PLACE
8 202006 1.000 ITEM
9 202007 0.157 CHILD
10 202007 0.265 MAN
11 202007 0.369 WOMAN
12 202007 0.459 PET
13 202007 0.494 FRIEND
14 202007 0.519 ENEMY
15 202007 0.766 ITEM
16 202007 1.000 PLACE",
header = TRUE)
df %>%
group_by(yyyymm) %>%
filter(lag(cumperc, default = first(cumperc)) <= .50)
#> # A tibble: 11 × 4
#> # Groups: yyyymm [2]
#> ind yyyymm cumperc name
#> <int> <int> <dbl> <chr>
#> 1 1 202006 0.196 CHILD
#> 2 2 202006 0.327 WOMAN
#> 3 3 202006 0.401 MAN
#> 4 4 202006 0.461 PET
#> 5 5 202006 0.504 FRIEND
#> 6 9 202007 0.157 CHILD
#> 7 10 202007 0.265 MAN
#> 8 11 202007 0.369 WOMAN
#> 9 12 202007 0.459 PET
#> 10 13 202007 0.494 FRIEND
#> 11 14 202007 0.519 ENEMY
Created on 2022-07-07 by the reprex package (v2.0.1)
Would that suit your use-case?