I have a basic question about R.
I have a data frame in which each column holds the set of nucleotide mutations found in one of two samples, 'major' and 'minor':
major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")
df <- data.frame(major,minor)
library(tibble)
tibble(df)
# A tibble: 1 x 2
  major          minor
  <chr>          <chr>
1 T2A,C26T,G652A T2A,C26T,G652A,C725T
I want to identify the mutations present in 'minor' that aren't in 'major'.
I know that if the 'major' and 'minor' mutations were stored as vectors, I could use setdiff to get this difference. However, the data I received stores each set as one long string, with the mutations separated by commas, and I don't know how to turn these string columns into vector columns in the data frame so that I can take the difference (I tried without success).
Using setdiff directly on the columns:
setdiff(df$minor, df$major)
# I got
[1] "T2A,C26T,G652A,C725T"
The expected result was:
C725T
Could anyone help me?
Best,
CodePudding user response:
This works on a multi-row data frame, doing comparisons by row:
library(dplyr)
major <- c("T2A,C26T,G652A", "world")
minor <- c("T2A,C26T,G652A,C725T", "hello,world")
df <- data.frame(major,minor)
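df %>%
  mutate(
    # split each comma-separated string into a character vector (list column)
    across(c(major, minor), \(x) strsplit(x, split = ","))
  ) %>%
  mutate(
    # row-wise set difference: elements of minor that are not in major
    diff = mapply(setdiff, minor, major)
  )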
# major minor diff
# 1 T2A, C26T, G652A T2A, C26T, G652A, C725T C725T
# 2 world hello, world hello
Note that this does modify the major and minor columns, turning them into list columns that hold a character vector in each row. You can use the .names argument to across if you need to keep the originals.
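As a rough sketch of that last point, the split vectors can go into new columns so that major and minor stay as the original strings. The "_split" suffix and the final step that collapses each difference back into a comma-separated string are assumptions made for illustration, not part of the answer above. The \(x) lambda syntax needs R 4.1 or later.

```r
library(dplyr)

df <- data.frame(
  major = c("T2A,C26T,G652A", "world"),
  minor = c("T2A,C26T,G652A,C725T", "hello,world")
)

result <- df %>%
  mutate(
    # .names writes the split vectors to major_split / minor_split
    # ("_split" is an arbitrary suffix chosen for this sketch)
    across(c(major, minor), \(x) strsplit(x, split = ","), .names = "{.col}_split")
  ) %>%
  mutate(
    # SIMPLIFY = FALSE always returns a list, one character vector per row
    diff = mapply(setdiff, minor_split, major_split, SIMPLIFY = FALSE),
    # collapse each difference back into a single comma-separated string
    diff = vapply(diff, paste, character(1), collapse = ",")
  )

result[, c("major", "minor", "diff")]
```

Using SIMPLIFY = FALSE is a deliberate choice here: without it, mapply may return a vector, a matrix, or a list depending on how many mutations each row's difference contains.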
CodePudding user response:
The easiest way to do this is to define major and minor as character vectors:
major <- c("T2A", "C26T", "G652A")
and
minor <- c("T2A", "C26T", "G652A", "C725T")
then compare the vectors directly (they differ in length, so tibble(major, minor) would fail with a size error):
setdiff(minor, major)
#> [1] "C725T"
If major and minor can only be supplied as comma-separated strings, you can use the stringr package to split them.
library(stringr)
library(tibble)
major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")
# simplify = TRUE makes str_split return a one-row character matrix
df <- tibble(
  major = str_split(major, pattern = ",", simplify = TRUE),
  minor = str_split(minor, pattern = ",", simplify = TRUE)
)
setdiff(df$minor, df$major)
#> [1] "C725T"
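The same idea can also be sketched in base R without any packages, which may be convenient when the data frame has more than one row. The helper name split_mutations and the diff column are hypothetical names introduced for this example, and the two-row data is the one used in the other answer.

```r
major <- c("T2A,C26T,G652A", "world")
minor <- c("T2A,C26T,G652A,C725T", "hello,world")
df <- data.frame(major, minor)

# hypothetical helper: split each comma-separated string into a character vector
split_mutations <- function(x) strsplit(x, split = ",", fixed = TRUE)

# Map() pairs the rows up and applies setdiff to each pair;
# vapply() then pastes each result back into one string per row
df$diff <- vapply(
  Map(setdiff, split_mutations(df$minor), split_mutations(df$major)),
  paste, character(1), collapse = ","
)

df$diff
```

A row where minor contains nothing new ends up with an empty string in diff, because pasting an empty character vector with collapse returns "".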