How to filter using conditions in R?-CodePudding

I have a dataset with over 600k obs in R. It looks something like this:

id <- c("58497484", "58544005", "74766653", "74764718", "62824455", "58497484", "58497484", "74766653")
key <- c("5718368_09/06/1981_3_2014", "2077485_02/06/1977_8_2014", "2091585_23/10/1982_1_2014", "2077388_30/01/2000_11_2017", "2082225_02/07/1998_10_2017", "2077450_04/05/2001_1_2016", "2077477_03/03/1978_8_2017", "2077388_30/01/2020_11_2019")
out <- c("2.1 - Reason 1", "1.2 - Reason 2", "1.2 - Reason 2", "1.2 - Reason 2", "1.2 - Reason 2", "1.3 - Reason 3" , "1.2 - Reason 2", "3.6 - Reason 8")

data <- as.data.frame(cbind(id, key, out))

As you can notice, some of my ID numbers repeat themselves. What I need to do is group those IDs so I can see all obs that are grouped. I use this code to do so:

data <- data %>% 
  arrange(id) %>% 
  mutate(id = ifelse(duplicated(id), "", id))

When I do that, my df looks like this:

     id                  key                         out
    58497484     5718368_09/06/1981_3_2014      2.1 - Reason 1
                 2077450_04/05/2001_1_2016      1.3 - Reason 3
                 2077477_03/03/1978_8_2017      1.2 - Reason 2
    
    58544005     2077485_02/06/1977_8_2014      1.2 - Reason 2
    62824455     2082225_02/07/1998_10_2017     1.2 - Reason 2
    74764718     2077388_30/01/2000_11_2017     1.2 - Reason 2

    74766653     2091585_23/10/1982_1_2014      1.2 - Reason 2
                 2077388_30/01/2020_11_2019     3.6 - Reason 8

What I need to do now is filter my df using the variable "out", but I wanna keep all the registers grouped if "out" starts with 2 or 3, so my df would look like this:

  id                  key                         out
58497484     5718368_09/06/1981_3_2014      2.1 - Reason 1
             2077450_04/05/2001_1_2016      1.3 - Reason 3
             2077477_03/03/1978_8_2017      1.2 - Reason 2       
74766653     2091585_23/10/1982_1_2014      1.2 - Reason 2
             2077388_30/01/2020_11_2019     3.6 - Reason 8

Is there any way I can do that? Thanks in advance.

CodePudding user response：

Instead of changing the duplicated 'id' to blank (''), initially, do a group by filter first and then do the change on the 'id'

library(dplyr)
library(stringr)
data %>% 
   group_by(id) %>% 
   dplyr::filter(any(str_detect(out, "^(2|3)"))) %>% 
   ungroup %>%
   arrange(id) %>%
   mutate(id = replace(id, duplicated(id), ""))

-output

# A tibble: 5 × 3
  id         key                        out           
  <chr>      <chr>                      <chr>         
1 "58497484" 5718368_09/06/1981_3_2014  2.1 - Reason 1
2 ""         2077450_04/05/2001_1_2016  1.3 - Reason 3
3 ""         2077477_03/03/1978_8_2017  1.2 - Reason 2
4 "74766653" 2091585_23/10/1982_1_2014  1.2 - Reason 2
5 ""         2077388_30/01/2020_11_2019 3.6 - Reason 8

CodePudding user response：

While akrun's is best approach I want to provide an alternative creating a helper column:

  data %>% 
    group_by(id) %>% 
    summarise(key, out) %>% 
    mutate(x = as.integer(substr(out, 1,1))) %>% 
    filter(any(x %in% 2 | x %in% 3)) %>%     
    mutate(id = ifelse(duplicated(id), "", id)) %>% 
    select(-x)

  id         key                        out           
  <chr>      <chr>                      <chr>         
1 "58497484" 5718368_09/06/1981_3_2014  2.1 - Reason 1
2 ""         2077450_04/05/2001_1_2016  1.3 - Reason 3
3 ""         2077477_03/03/1978_8_2017  1.2 - Reason 2
4 "74766653" 2091585_23/10/1982_1_2014  1.2 - Reason 2
5 ""         2077388_30/01/2020_11_2019 3.6 - Reason 8