Rowwise comparison of the length of a string against a list of string lengths

Consider the following data frame with two columns of strings of variable length:

library("tidyverse")

df <- tibble(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
             ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))

# # A tibble: 4 × 2
# REF                               ALT                                                                
# <chr>                             <chr>                                                              
# 1 TTG                               T                                                                  
# 2 CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
# 3 T                                 TTG                                                                
# 4 TTGTGTGTGTGTGTGTGTGTGT            TTGTGTGTGTGTGTGTGTGTGTGT

Unlike column REF, column ALT sometimes contains several strings separated by commas (e.g. row 2).

I want to compare the length of strings in REF (REF_LEN) and ALT (ALT_LEN), and generate a TYPE column with values:

"SNM" when REF_LEN = ALT_LEN
"INS" when REF_LEN < ALT_LEN
"DEL" when REF_LEN > ALT_LEN

When ALT contains several strings, the new TYPE column should contain one type per string, separated by commas. The expected output here would be:

"DEL"     "INS,DEL" "INS"     "INS"

So far, I know how to get the lengths of the values in ALT, but I fail at collapsing them row by row: the output contains the lengths of all ALTs in the table (i.e. 1,35,31,3,24) rather than only those of the current row:

df %>%
  dplyr::mutate(REF_LEN = str_length(REF),
                ALT_LEN = str_split(ALT, ","),
                ALT_LEN = purrr::map(ALT_LEN, str_length) %>% unlist() %>% paste(collapse = ","))

The code above is incomplete, as you can see. I also tried a different direction, using a helper function to build the TYPE column. This returns many errors and I am not sure why, since it seems to work nicely with individual values from ALT_LEN:

name <- function(alt_lens, ref_len) {
  alt_lens <- unlist(alt_lens)
  ifelse(alt_lens < ref_len, "DEL", ifelse(alt_lens > ref_len, "INS", "SNM"))
}

df %>%
  dplyr::mutate(REF_LEN = str_length(REF),
                ALT_LEN = str_split(ALT, ","),
                TYPE = purrr::map(ALT_LEN, str_length) %>% name(REF_LEN))

Any ideas? Thanks!

CodePudding user response：

Here's a codegolf-ish base R solution:

df <- data.frame(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
             ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))

df$TYPE <- mapply(
  # sign() gives -1/0/1; adding 2 turns it into an index into the label vector
  function(x, y) paste(c("INS", "SNM", "DEL")[2 + sign(nchar(x) - nchar(y))], collapse = ","),
  df$REF, strsplit(df$ALT, ","), USE.NAMES = FALSE)

df$TYPE
#> [1] "DEL"     "INS,DEL" "INS"     "INS"

Created on 2022-04-20 by the reprex package (v2.0.1)
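For readers who find the one-liner above too dense, here is an unrolled sketch of the same idea. The helper name classify_row is made up for illustration, and plain vectors stand in for the data frame columns. The trick is that sign() of the length difference is -1, 0 or 1, so adding 2 gives a valid index into c("INS", "SNM", "DEL").

```r
# Same toy data as in the question, as plain character vectors (base R only).
ref <- c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT")
alt <- c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT",
         "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT")

# classify_row() is a hypothetical helper name, not part of any package.
classify_row <- function(ref_allele, alt_alleles) {
  # One value per ALT allele: -1 if REF is shorter, 0 if equal, 1 if REF is longer
  direction <- sign(nchar(ref_allele) - nchar(alt_alleles))
  # Shift -1/0/1 to 1/2/3 so it can index the label vector
  labels <- c("INS", "SNM", "DEL")[2 + direction]
  paste(labels, collapse = ",")
}

# strsplit() returns one character vector per row; mapply() pairs each with its REF
type <- mapply(classify_row, ref, strsplit(alt, ","), USE.NAMES = FALSE)
type
#> [1] "DEL"     "INS,DEL" "INS"     "INS"
```

Because mapply() walks REF and the split ALT list in parallel, each REF is only ever compared against the ALT strings of its own row.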

CodePudding user response：

Update: Removed first answer. Thanks to akrun for pointing me there! The concept is the same, using nchar with case_when; the difference is the use of separate_rows from the tidyr package:

library(dplyr)
library(tidyr)

df %>% 
  mutate(id = row_number()) %>%            # remember the original row
  separate_rows(ALT, sep = ",") %>%        # one row per ALT string
  mutate(TYPE = case_when(nchar(REF) == nchar(ALT) ~ "SNM",
                          nchar(REF) <  nchar(ALT) ~ "INS",
                          nchar(REF) >  nchar(ALT) ~ "DEL",
                          TRUE ~ NA_character_)) %>% 
  group_by(id) %>% 
  mutate(TYPE = toString(TYPE)) %>%        # toString() joins with ", "
  slice(1)                                 # keep the first row of each id

  REF                               ALT                                    id TYPE    
  <chr>                             <chr>                               <int> <chr>   
1 TTG                               T                                       1 DEL     
2 CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT     2 INS, DEL
3 T                                 TTG                                     3 INS     
4 TTGTGTGTGTGTGTGTGTGTGT            TTGTGTGTGTGTGTGTGTGTGTGT                4 INS
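
If the exact "INS,DEL" format (no space) matters and the original ALT value should be kept intact, a row-wise variant closer to the question's own attempt might look like the sketch below. It assumes the dplyr, purrr, stringr and tibble packages are installed; classify_alts is a hypothetical helper name. In the question's attempt, unlist() flattens the per-row lengths into a single vector of five values, which no longer lines up with the four REF_LEN values; map2_chr() avoids that by pairing each REF with the ALT strings of its own row.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

df <- tibble(
  REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
  ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT",
          "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT")
)

# classify_alts() is a hypothetical helper name used only for this sketch.
# ref is a single string; alts is a character vector of that row's ALT strings.
classify_alts <- function(ref, alts) {
  ref_len <- str_length(ref)
  alt_len <- str_length(alts)
  types <- case_when(
    ref_len == alt_len ~ "SNM",
    ref_len <  alt_len ~ "INS",
    ref_len >  alt_len ~ "DEL"
  )
  paste(types, collapse = ",")   # comma without a space, as in the question
}

res <- df %>%
  mutate(TYPE = map2_chr(REF, str_split(ALT, ","), classify_alts))

res$TYPE
#> [1] "DEL"     "INS,DEL" "INS"     "INS"
```

Compared with separate_rows() followed by slice(1), this keeps one row per input row throughout, so ALT still shows every comma-separated string.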