Home > Enterprise >  R tidy - remove duplicate substrings in column
R tidy - remove duplicate substrings in column

Time:11-24

I have one column that has important identifiers, but they are not uniform, so I am stuck with string splitting. I guess an example is best.

Code example, "df" is the tibble as is, result the result

a <- c('Text1','Text1','Text1','Text1_sub1','Text1_sub2','Text2','Text2','Text2_sub1_sub2','Text2_sub3')
b <- c("0"    ,"0"    ,"0"    ,NA          ,NA          ,"0"    ,"0"    ,NA               ,NA)
df <- tibble(a,b)


a <- c('Text1','Text1','Text1','Text1'     ,'Text1'     ,'Text2','Text2','Text2'          ,'Text2')
b <- c("0"    ,"0"    ,"0"    ,"sub1"      ,"sub2"      ,"0"    ,"0"    ,"sub1_sub2"      ,"sub3")
result <- tibble(a,b)

In Different formating:

> df
# A tibble: 9 x 2
  a               b    
  <chr>           <chr>
1 Text1           0    
2 Text1           0    
3 Text1           0    
4 Text1_sub1      NA   
5 Text1_sub2      NA   
6 Text2           0    
7 Text2           0    
8 Text2_sub1_sub2 NA   
9 Text2_sub3      NA   

> result
# A tibble: 9 x 2
  a     b        
  <chr> <chr>    
1 Text1 0        
2 Text1 0        
3 Text1 0        
4 Text1 sub1     
5 Text1 sub2     
6 Text2 0        
7 Text2 0        
8 Text2 sub1_sub2
9 Text2 sub3     

I want to use column as identifier, therefore I would like to take the unique strings in column a where b == 0, and then substract those from column a where b != 0, and transfer it to b. Any help highly appreciated, tidy solution would be nice, but any pointer will do, thanks!

CodePudding user response:

We can use separate:

df %>%
  separate(a, 
           into = c("a", "b_new"), 
           sep = "_",
           extra = "merge") %>%
  mutate(b = coalesce(b, b_new)) %>%
  select(-b_new)


# A tibble: 9 × 2
  a     b        
  <chr> <chr>    
1 Text1 0        
2 Text1 0        
3 Text1 0        
4 Text1 sub1     
5 Text1 sub2     
6 Text2 0        
7 Text2 0        
8 Text2 sub1_sub2
9 Text2 sub3  

CodePudding user response:

How about this:

library(dplyr)
library(glue)
a <- c('Text1','Text1','Text1','Text1_sub1','Text1_sub2','Text2','Text2','Text2_sub1_sub2','Text2_sub3')
b <- c("0"    ,"0"    ,"0"    ,NA          ,NA          ,"0"    ,"0"    ,NA               ,NA)
df <- tibble(a,b)
unstr <- df %>% 
  filter(b == "0") %>% 
  select(a) %>% 
  distinct() %>% 
  pull %>% 
  glue_collapse(sep="|")
result <- df %>% 
  mutate(b = case_when(is.na(b) ~ gsub(glue("{unstr}_(.*)"), "\\1", a), 
                       TRUE ~ b), 
         a = gsub(glue("({unstr})_.*"), "\\1", a))
result
#> # A tibble: 9 × 2
#>   a     b        
#>   <chr> <chr>    
#> 1 Text1 0        
#> 2 Text1 0        
#> 3 Text1 0        
#> 4 Text1 _sub1    
#> 5 Text1 _sub2    
#> 6 Text2 0        
#> 7 Text2 0        
#> 8 Text2 sub1_sub2
#> 9 Text2 sub3

Created on 2022-11-23 by the reprex package (v2.0.1)

  • Related