I have a dataset with over 600k obs in R. It looks something like this:
id <- c("58497484", "58544005", "74766653", "74764718", "62824455", "58497484", "58497484", "74766653")
key <- c("5718368_09/06/1981_3_2014", "2077485_02/06/1977_8_2014", "2091585_23/10/1982_1_2014", "2077388_30/01/2000_11_2017", "2082225_02/07/1998_10_2017", "2077450_04/05/2001_1_2016", "2077477_03/03/1978_8_2017", "2077388_30/01/2020_11_2019")
out <- c("2.1 - Reason 1", "1.2 - Reason 2", "1.2 - Reason 2", "1.2 - Reason 2", "1.2 - Reason 2", "1.3 - Reason 3" , "1.2 - Reason 2", "3.6 - Reason 8")
data <- as.data.frame(cbind(id, key, out))
As you can notice, some of my ID numbers repeat themselves. What I need to do is group those IDs so I can see all obs that are grouped. I use this code to do so:
data <- data %>%
arrange(id) %>%
mutate(id = ifelse(duplicated(id), "", id))
When I do that, my df looks like this:
id key out
58497484 5718368_09/06/1981_3_2014 2.1 - Reason 1
2077450_04/05/2001_1_2016 1.3 - Reason 3
2077477_03/03/1978_8_2017 1.2 - Reason 2
58544005 2077485_02/06/1977_8_2014 1.2 - Reason 2
62824455 2082225_02/07/1998_10_2017 1.2 - Reason 2
74764718 2077388_30/01/2000_11_2017 1.2 - Reason 2
74766653 2091585_23/10/1982_1_2014 1.2 - Reason 2
2077388_30/01/2020_11_2019 3.6 - Reason 8
What I need to do now is filter my df using the variable "out", but I wanna keep all the registers grouped if "out" starts with 2 or 3, so my df would look like this:
id key out
58497484 5718368_09/06/1981_3_2014 2.1 - Reason 1
2077450_04/05/2001_1_2016 1.3 - Reason 3
2077477_03/03/1978_8_2017 1.2 - Reason 2
74766653 2091585_23/10/1982_1_2014 1.2 - Reason 2
2077388_30/01/2020_11_2019 3.6 - Reason 8
Is there any way I can do that? Thanks in advance.
CodePudding user response:
Instead of changing the duplicated
'id' to blank (''
), initially, do a group by filter
first and then do the change on the 'id'
library(dplyr)
library(stringr)
data %>%
group_by(id) %>%
dplyr::filter(any(str_detect(out, "^(2|3)"))) %>%
ungroup %>%
arrange(id) %>%
mutate(id = replace(id, duplicated(id), ""))
-output
# A tibble: 5 × 3
id key out
<chr> <chr> <chr>
1 "58497484" 5718368_09/06/1981_3_2014 2.1 - Reason 1
2 "" 2077450_04/05/2001_1_2016 1.3 - Reason 3
3 "" 2077477_03/03/1978_8_2017 1.2 - Reason 2
4 "74766653" 2091585_23/10/1982_1_2014 1.2 - Reason 2
5 "" 2077388_30/01/2020_11_2019 3.6 - Reason 8
CodePudding user response:
While akrun's is best approach I want to provide an alternative creating a helper column:
data %>%
group_by(id) %>%
summarise(key, out) %>%
mutate(x = as.integer(substr(out, 1,1))) %>%
filter(any(x %in% 2 | x %in% 3)) %>%
mutate(id = ifelse(duplicated(id), "", id)) %>%
select(-x)
id key out
<chr> <chr> <chr>
1 "58497484" 5718368_09/06/1981_3_2014 2.1 - Reason 1
2 "" 2077450_04/05/2001_1_2016 1.3 - Reason 3
3 "" 2077477_03/03/1978_8_2017 1.2 - Reason 2
4 "74766653" 2091585_23/10/1982_1_2014 1.2 - Reason 2
5 "" 2077388_30/01/2020_11_2019 3.6 - Reason 8