I have a dataset with over 800k obs in R. It looks something like this:
id <- c("58497484", "58544005", "74766653", "74764718", "62824455", "58497484", "58497484", "74766653", "58544005")
key <- c("5718368_09/06/1981_3_2014", "2077485_02/06/1977_8_2014", "2091585_23/10/1982_1_2014", "2077388_30/01/2000_11_2017", "2082225_02/07/1998_10_2017",
"2077450_04/05/2001_1_2016", "2077477_03/03/1978_8_2017",
"2077388_30/01/2020_11_2019", "5718368_08/06/1982_3_2012")
out <- c("2.1 - Reason 1", "1.2 - Reason 2", "1.2 - Reason 2", "1.2 - Reason 2", "1.2 - Reason 2", "1.3 - Reason 3" ,
"1.2 - Reason 2", "3.6 - Reason 8", "3.2 - Reason 10")
flag1 <- c("1", "0", "1", "1", "1", "1", "0", "1", "0")
flag2 <- c("0", "0", "0", "1", "0", "1", "0", "1", "0")
data <- as.data.frame(cbind(id, key, out, flag1, flag2))
Some of my ID numbers repeat themselves. What I need to do is group those IDs so I can see all obs that are grouped. I use this code to do so:
data <- data %>%
arrange(id) %>%
mutate(id = ifelse(duplicated(id), "", id))
When I do that, my df looks like this:
id key out flag1 flag2
58497484 5718368_09/06/1981_3_2014 2.1 - Reason 1 1 0
2077450_04/05/2001_1_2016 1.3 - Reason 3 1 1
2077477_03/03/1978_8_2017 1.2 - Reason 2 0 0
58544005 2077485_02/06/1977_8_2014 1.2 - Reason 2 0 0
5718368_08/06/1982_3_2012 3.2 - Reason 10 0 0
62824455 2082225_02/07/1998_10_2017 1.2 - Reason 2 1 0
74764718 2077388_30/01/2000_11_2017 1.2 - Reason 2 1 1
74766653 2091585_23/10/1982_1_2014 1.2 - Reason 2 1 0
2077388_30/01/2020_11_2019 3.6 - Reason 8 1 1
What I need to do is DELETE clusters using "out", "flag1" and "flag2" variables as conditions. It goes like this: if any rows from my grouped id begin with "out" 2 or 3 AND if both "flag1" and "flag2" are equal to zero (0). My desired output is the following:
id key out flag1 flag2
58497484 5718368_09/06/1981_3_2014 2.1 - Reason 1 1 0
2077450_04/05/2001_1_2016 1.3 - Reason 3 1 1
2077477_03/03/1978_8_2017 1.2 - Reason 2 0 0
62824455 2082225_02/07/1998_10_2017 1.2 - Reason 2 1 0
74764718 2077388_30/01/2000_11_2017 1.2 - Reason 2 1 1
74766653 2091585_23/10/1982_1_2014 1.2 - Reason 2 1 0
2077388_30/01/2020_11_2019 3.6 - Reason 8 1 1
Notice that the "58544005" id number is not present in the output because it shows both flags equal to zero and "out" reason 3.2. Thanks in advance.
CodePudding user response:
library(tidyverse)
data %>%
arrange(id)%>%
type.convert(as.is = TRUE)%>%
group_by(id) %>%
filter(!(if_all(flag1:flag2, ~sum(.x)==0)&
any(substr(out, 1, 1) %in% 2:3))) %>%
ungroup() %>%
mutate(id = replace(id, duplicated(id), ''))
# A tibble: 7 x 5
id key out flag1 flag2
<chr> <chr> <chr> <int> <int>
1 "58497484" 5718368_09/06/1981_3_2014 2.1 - Reason 1 1 0
2 "" 2077450_04/05/2001_1_2016 1.3 - Reason 3 1 1
3 "" 2077477_03/03/1978_8_2017 1.2 - Reason 2 0 0
4 "62824455" 2082225_02/07/1998_10_2017 1.2 - Reason 2 1 0
5 "74764718" 2077388_30/01/2000_11_2017 1.2 - Reason 2 1 1
6 "74766653" 2091585_23/10/1982_1_2014 1.2 - Reason 2 1 0
7 "" 2077388_30/01/2020_11_2019 3.6 - Reason 8 1 1
CodePudding user response:
Based on your clarification in the comments, it looks like you want to delete all records for id's where any record for that id satisfies the three conditions. Here's the strategy I will use:
For each individual row, check if the condition for deletion is satisfied.
For each id, check if any of its records were set to be deleted. If so, remove the id.
library(tidyverse)
data %>%
# Step 1: Create an indicator for deletion for each record
mutate(
delete_ind =
str_sub(out, 1, 1) %in% c("2", "3") &
flag1 == 0 &
flag2 == 0
) %>%
# Step 2: Filter out all id's that satisfied the condition at least once
group_by(id) %>%
filter(sum(delete_ind) == 0) %>%
ungroup()
# A tibble: 7 x 6
id key out flag1 flag2 delete_ind
<chr> <chr> <chr> <chr> <chr> <lgl>
1 58497484 5718368_09/06/1981_3_2014 2.1 - Reason 1 1 0 FALSE
2 74766653 2091585_23/10/1982_1_2014 1.2 - Reason 2 1 0 FALSE
3 74764718 2077388_30/01/2000_11_2017 1.2 - Reason 2 1 1 FALSE
4 62824455 2082225_02/07/1998_10_2017 1.2 - Reason 2 1 0 FALSE
5 58497484 2077450_04/05/2001_1_2016 1.3 - Reason 3 1 1 FALSE
6 58497484 2077477_03/03/1978_8_2017 1.2 - Reason 2 0 0 FALSE
7 74766653 2077388_30/01/2020_11_2019 3.6 - Reason 8 1 1 FALSE