I have data like this:
data<-data.frame(id=c(1,1,1,1,2,2,2,3,3,3,4,4,4),
yearmonthweek=c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,2012052,
2012053,2012054,2012071,2012073,2012074),
event=c(0,1,1,0,0,1,0,0,0,0,0,0,0),
a=c(11,12,13,10,11,12,15,14,13,15,19,10,20))
id stands for a personal ID, and yearmonthweek encodes the year, month, and week. I want to clean the data by the following rules. First, find the ids that have at least one event. In this case, id = 1 and 2 have events, and id = 3 and 4 have no events. Second, pick one random row from each id that has events and one random row from each id that has no events. So the number of rows should be the same as the number of ids. My expected output looks like this:
data<-data.frame(id=c(1,2,3,4),
yearmonthweek=c(2012053,2013052,2012052,2012073),
event=c(1,1,0,0),
a=c(12,12,14,10))
Since I use random sampling, the values can differ from those above, but there should be 4 rows, one per id, like this.
CodePudding user response:
Here is an option:
library(dplyr)

set.seed(2022)
data %>%
  group_by(id) %>%
  mutate(has_event = any(event == 1)) %>%
  filter(if_else(has_event, event == 1, event == 0)) %>%
  slice_sample(n = 1) %>%
  select(-has_event) %>%
  ungroup()
## A tibble: 4 × 4
# id yearmonthweek event a
# <dbl> <dbl> <dbl> <dbl>
#1 1 2012061 1 13
#2 2 2013052 1 12
#3 3 2012053 0 13
#4 4 2012074 0 20
Explanation: Group by id and flag whether the group has at least one event. If it does, keep only the rows where event == 1; otherwise keep the rows where event == 0, which is all of them. Then select a single row per group uniformly at random using slice_sample.
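For comparison, the same flag, filter, and sample logic can be sketched in base R without dplyr. This is only an illustration under stated assumptions: pick_one_row is a hypothetical helper name, not part of any package, and the rows drawn will not match the dplyr output above even with the same seed.

```r
# Example data copied from the question.
data <- data.frame(
  id = c(1,1,1,1,2,2,2,3,3,3,4,4,4),
  yearmonthweek = c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,
                    2012052,2012053,2012054,2012071,2012073,2012074),
  event = c(0,1,1,0,0,1,0,0,0,0,0,0,0),
  a = c(11,12,13,10,11,12,15,14,13,15,19,10,20)
)

# pick_one_row() is a hypothetical helper: given the row indices of one id,
# return a single index, restricted to event rows when the id has any.
pick_one_row <- function(rows, event) {
  has_event <- any(event[rows] == 1)
  candidates <- if (has_event) rows[event[rows] == 1] else rows
  # Index via sample.int() because sample(x, 1) draws from 1:x
  # when x is a single number.
  candidates[sample.int(length(candidates), 1)]
}

set.seed(2022)
rows_by_id <- split(seq_len(nrow(data)), data$id)
picked <- vapply(rows_by_id, pick_one_row, integer(1), event = data$event)

result <- data[picked, ]
rownames(result) <- NULL
result
```

Sampling row indices rather than values sidesteps the case where an id has exactly one candidate row, which is easy to get wrong with a plain sample() call.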
CodePudding user response:
Here is a dplyr way in two steps.
data <- data.frame(id=c(1,1,1,1,2,2,2,3,3,3,4,4,4),
yearmonthweek=c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,2012052,
2012053,2012054,2012071,2012073,2012074),
event=c(0,1,1,0,0,1,0,0,0,0,0,0,0),
a=c(11,12,13,10,11,12,15,14,13,15,19,10,20))
suppressPackageStartupMessages(
library(dplyr)
)
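bind_rows(
  # Step 1: ids with events. Keep only the event rows, then draw one per id.
  data %>%
    filter(event != 0) %>%
    group_by(id) %>%
    sample_n(size = 1),
  # Step 2: ids without events. Overwrite event with a per-id logical flag
  # (TRUE if the id has any event), keep the FALSE groups, then draw one per id.
  data %>%
    group_by(id) %>%
    mutate(event = !all(event == 0)) %>%
    filter(!event) %>%
    sample_n(size = 1)
)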
#> # A tibble: 4 × 4
#> # Groups: id [4]
#> id yearmonthweek event a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2012061 1 13
#> 2 2 2013052 1 12
#> 3 3 2012054 0 15
#> 4 4 2012071 0 19
Created on 2022-10-21 with reprex v2.0.2
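Because the result is random, it can help to check a sampled data frame against the rules in the question instead of comparing it to a fixed output. The sketch below is one way to do that in base R. check_sample is a hypothetical helper name, and it assumes a sampled row from an id with events should itself be an event row, as in the expected output of the question.

```r
# Example data and expected output, both copied from the question.
data <- data.frame(
  id = c(1,1,1,1,2,2,2,3,3,3,4,4,4),
  yearmonthweek = c(2012052,2012053,2012061,2012062,2013031,2013052,2013053,
                    2012052,2012053,2012054,2012071,2012073,2012074),
  event = c(0,1,1,0,0,1,0,0,0,0,0,0,0),
  a = c(11,12,13,10,11,12,15,14,13,15,19,10,20)
)
expected <- data.frame(
  id = c(1,2,3,4),
  yearmonthweek = c(2012053,2013052,2012052,2012073),
  event = c(1,1,0,0),
  a = c(12,12,14,10)
)

# check_sample() is a hypothetical helper: TRUE when `sampled` has exactly
# one row per id and each row's event value is 1 for ids with events
# and 0 for ids without.
check_sample <- function(original, sampled) {
  ids_with_event <- unique(original$id[original$event == 1])
  one_row_per_id <- !anyDuplicated(sampled$id) &&
    setequal(sampled$id, original$id)
  expected_event <- as.numeric(sampled$id %in% ids_with_event)
  one_row_per_id && all(sampled$event == expected_event)
}

ok_expected <- check_sample(data, expected)            # question's expected output
ok_first_rows <- check_sample(data, data[c(1, 5, 8, 11), ])  # first row of each id
```

Here ok_expected is TRUE, while ok_first_rows is FALSE because the first rows of id 1 and id 2 are not event rows. The helper does not verify that the sampled rows actually occur in the original data.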