Consider the following data frame with two columns of strings of variable length:
library("tidyverse")
df <- tibble(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
             ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))
# # A tibble: 4 × 2
#   REF                               ALT
#   <chr>                             <chr>
# 1 TTG                               T
# 2 CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
# 3 T                                 TTG
# 4 TTGTGTGTGTGTGTGTGTGTGT            TTGTGTGTGTGTGTGTGTGTGTGT
Unlike column REF, column ALT sometimes contains several strings joined by commas (e.g. row 2).
I want to compare the length of the strings in REF (REF_LEN) and ALT (ALT_LEN), and generate a TYPE column with these values:
- "SNM" when REF_LEN = ALT_LEN
- "INS" when REF_LEN < ALT_LEN
- "DEL" when REF_LEN > ALT_LEN
When ALT contains several strings, the new TYPE column should contain one type per string, also separated by commas. That is, the expected output here would be:
"DEL" "INS,DEL" "INS" "INS"
So far, I know how to get the lengths of the values in ALT, but I fail at collapsing them row by row: the output contains the lengths from every ALT in the table (i.e. 1,35,31,3,24) rather than only those of the current row:
df %>%
  dplyr::mutate(REF_LEN = str_length(REF),
                ALT_LEN = str_split(ALT, ","),
                ALT_LEN = purrr::map(ALT_LEN, str_length) %>% unlist() %>% paste(collapse = ","))
The code above is incomplete, as you can see. I also tried a different approach, using a helper function that builds the TYPE column, but it does not work either. It returns many errors and I am not sure why, since it seems to work fine on individual values of ALT_LEN:
name <- function(alt_lens, ref_len) {
  alt_lens <- unlist(alt_lens)
  ifelse(alt_lens < ref_len, "DEL", ifelse(alt_lens > ref_len, "INS", "SNM"))
}
df %>%
  dplyr::mutate(REF_LEN = str_length(REF),
                ALT_LEN = str_split(ALT, ","),
                TYPE = purrr::map(ALT_LEN, str_length) %>% name(REF_LEN))
Any ideas? Thanks!
CodePudding user response:
Here's a codegolf-ish base R solution:
df <- data.frame(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
                 ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))
# sign() of the length difference is -1, 0 or 1; adding 2 turns it into an index 1, 2 or 3
df$TYPE <- mapply(
  function(x, y) paste(c("INS", "SNM", "DEL")[2 + sign(nchar(x) - nchar(y))], collapse = ","),
  df$REF, strsplit(df$ALT, ","), USE.NAMES = FALSE)
df$TYPE
#> [1] "DEL" "INS,DEL" "INS" "INS"
Created on 2022-04-20 by the reprex package (v2.0.1)
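To unpack the indexing trick: `sign()` of the length difference returns -1, 0 or 1, and adding 2 maps that onto positions 1, 2 or 3 of `c("INS", "SNM", "DEL")`. A minimal sketch, where `classify()` is a hypothetical helper named only for illustration and not part of the answer above:

```r
# Hypothetical helper (illustration only): classify each ALT against one REF
classify <- function(ref, alts) {
  # -1 when REF is shorter, 0 when the lengths match, 1 when REF is longer
  idx <- 2 + sign(nchar(ref) - nchar(alts))
  c("INS", "SNM", "DEL")[idx]
}

classify("TTG", "T")          # REF longer than ALT
classify("T", c("TTG", "T"))  # one longer ALT, one ALT of equal length
```

Because `nchar()` and `sign()` are vectorised, a single REF is compared against every ALT from the same row at once, which is why `mapply()` only needs to iterate over rows.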
CodePudding user response:
Update: Removed first answer. Thanks to akrun for pointing me there! The concept is the same: nchar combined with case_when. The difference is the use of separate_rows from the tidyr package:
library(dplyr)
library(tidyr)
df %>%
  mutate(id = row_number()) %>%
  separate_rows(ALT, sep = ",") %>%
  mutate(TYPE = case_when(nchar(REF) == nchar(ALT) ~ "SNM",
                          nchar(REF) < nchar(ALT) ~ "INS",
                          nchar(REF) > nchar(ALT) ~ "DEL",
                          TRUE ~ NA_character_)) %>%
  group_by(id) %>%
  mutate(TYPE = toString(TYPE)) %>%
  slice(1)
  REF                               ALT                                    id TYPE
  <chr>                             <chr>                               <int> <chr>
1 TTG                               T                                       1 DEL
2 CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT     2 INS, DEL
3 T                                 TTG                                     3 INS
4 TTGTGTGTGTGTGTGTGTGTGT            TTGTGTGTGTGTGTGTGTGTGTGT                4 INS
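If you would rather stay close to the attempt in the question and keep ALT intact, a row-wise version with `purrr::map2_chr` might also work. The likely reason the original attempts misbehave is that `unlist()` inside `mutate()` flattens the lengths of all rows into one vector of length 5, which no longer lines up with the 4 values of REF_LEN. This is only a sketch: the result name `res` and the anonymous function are my own choices, and `tibble()` is assumed to be available through dplyr's re-export.

```r
library(dplyr)
library(purrr)
library(stringr)

df <- tibble(REF = c("TTG", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "T", "TTGTGTGTGTGTGTGTGTGTGT"),
             ALT = c("T", "CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT,CGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT", "TTG", "TTGTGTGTGTGTGTGTGTGTGTGT"))

res <- df %>%
  mutate(
    REF_LEN = str_length(REF),
    # list column: one integer vector of ALT lengths per row
    ALT_LEN = map(str_split(ALT, ","), str_length),
    # pair each row's ALT lengths with that row's REF length only
    TYPE = map2_chr(ALT_LEN, REF_LEN, function(alt, ref) {
      paste(ifelse(alt < ref, "DEL", ifelse(alt > ref, "INS", "SNM")), collapse = ",")
    })
  )

res$TYPE
#> [1] "DEL"     "INS,DEL" "INS"     "INS"
```

Note that `paste(collapse = ",")` gives "INS,DEL" without the space that `toString()` inserts, which matches the expected output in the question exactly.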