Get difference between column strings in R dataframe

I have a basic question in R:

Suppose I have a data frame in which each column holds the set of nucleotide mutations for one of two samples, 'major' and 'minor':

major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")

library(tibble)

df <- data.frame(major, minor)
as_tibble(df)

# A tibble: 1 x 2
  major          minor               
  <chr>          <chr>               
1 T2A,C26T,G652A T2A,C26T,G652A,C725T

And I want to identify the mutations present in 'minor' that aren't in 'major'.

I know that if the 'major' and 'minor' mutations were stored as vectors, I could use setdiff to get this difference. However, the data I received stores each set as one long string with the mutations separated by commas, and I don't know how to turn these string columns into vector columns in the data frame to compute the difference (I tried without success).

Using setdiff directly on the columns:

setdiff(df$minor, df$major)
# I got
[1] "T2A,C26T,G652A,C725T"

The expected result was:

C725T

Could anyone help me?

Best,

CodePudding user response：

This works on a multi-row data frame, doing comparisons by row:

library(dplyr)
major <- c("T2A,C26T,G652A", "world")
minor <- c("T2A,C26T,G652A,C725T", "hello,world")

df <- data.frame(major,minor)

df %>%
  mutate(
    across(c(major, minor), function(x) strsplit(x, split = ","))
  ) %>%
  mutate(
    diff = mapply(setdiff, minor, major)
  )
#              major                   minor  diff
# 1 T2A, C26T, G652A T2A, C26T, G652A, C725T C725T
# 2            world            hello, world hello

Note that it does modify the major and minor columns, turning them into list columns containing character vectors within each row. You can use the .names argument to across if you need to keep the originals.
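
As a rough sketch of the .names option mentioned above: the split values can be written to new list columns so the original strings stay intact. The `_split` suffix and the `result` name are only illustrative choices, and `SIMPLIFY = FALSE` is added here so that `diff` stays a list column even when a row has zero or several differing mutations.

```r
library(dplyr)

major <- c("T2A,C26T,G652A", "world")
minor <- c("T2A,C26T,G652A,C725T", "hello,world")
df <- data.frame(major, minor)

result <- df %>%
  mutate(
    # Write the split values to new list columns (major_split, minor_split)
    # so the original comma-separated strings are left untouched.
    across(
      c(major, minor),
      function(x) strsplit(x, split = ","),
      .names = "{.col}_split"
    ),
    # SIMPLIFY = FALSE keeps one character vector per row.
    diff = mapply(setdiff, minor_split, major_split, SIMPLIFY = FALSE)
  )

result$diff
```

With this layout, `result$major` and `result$minor` still hold the original strings, and `result$diff[[i]]` is the character vector of mutations found only in `minor` for row `i`.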

CodePudding user response：

The easiest way to do this is to define major and minor as character vectors:

major <- c("T2A", "C26T", "G652A")

and

minor <- c("T2A", "C26T", "G652A", "C725T")

then compare the vectors directly (they have different lengths, so they cannot be combined into one tibble):

setdiff(minor, major)
#> [1] "C725T"

If major and minor cannot be supplied as separate character vectors, you can use the stringr package to split the strings.

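library(stringr)
library(tibble)

major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")

# simplify = TRUE returns a one-row character matrix per string,
# so each tibble column is a matrix column with a single row
df <- tibble(
  major = str_split(major, pattern = ",", simplify = TRUE),
  minor = str_split(minor, pattern = ",", simplify = TRUE)
)

setdiff(df$minor, df$major)
#> [1] "C725T"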
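
The snippet above handles a single row. For a data frame with several rows, one possible base R sketch (no extra packages) splits each string and compares row by row. The two-row example data and the `diff` column name are assumptions for illustration, not part of the original question.

```r
major <- c("T2A,C26T,G652A", "world")
minor <- c("T2A,C26T,G652A,C725T", "hello,world")
df <- data.frame(major, minor, stringsAsFactors = FALSE)

# strsplit() returns a list with one character vector per row.
major_split <- strsplit(df$major, split = ",", fixed = TRUE)
minor_split <- strsplit(df$minor, split = ",", fixed = TRUE)

# Compare row by row; Map() always returns a list, so a row with no
# difference yields character(0) rather than changing the result's shape.
diff_list <- Map(setdiff, minor_split, major_split)

# Collapse each result back into a comma-separated string for display.
df$diff <- vapply(diff_list, paste, character(1), collapse = ",")
df$diff
#> [1] "C725T" "hello"
```

Rows where `minor` has no extra mutations end up with an empty string in `diff`.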