I have the following problem:
I want to create a new column in a data frame based on the difference between two columns, where each row holds a vector of strings.
My code:
library(dplyr) # v.1.0.7
seqs <- c("seq1","seq2","seq3","seq4","seq5")
expect_mut <- c("S:T20N,S:D614G","S:T20N,S:D614G","S:T20N,N:G204R,N:G80R", "N:G204R, S:D614G", "N:G204R, S:D614G")
observed_mut <- c("S:T20N","S:D164G","S:T20N, N:G204R","S:D614G,N:G204R","S:D164G,S:T19I")
data_frame <- data.frame(seqs, expect_mut, observed_mut)
data_frame <- data_frame %>%
  mutate(expect_mut = strsplit(as.character(expect_mut), ","),
         observed_mut = strsplit(as.character(observed_mut), ",")) %>%
  group_by(seqs) %>%
  mutate(diff_mut = setdiff(observed_mut, expect_mut))
What I expect:
| seqs | expect_mut | observed_mut | diff_mut |
| ----- | ---------------------------------- | ----------------------- | ------------ |
| seq1 | c("S:T20N", "S:D614G") | S:T20N | |
| seq2 | c("S:T20N", "S:D614G") | S:D164G | S:D164G |
| seq3 | c("S:T20N", "N:G204R", "N:G80R") | c("S:T20N", " N:G204R") | |
| seq4 | c("N:G204R", "S:D614G") | c("N:G204R", "S:D614G") | |
| seq5 | c("N:G204R", "S:D614G") | c("S:D164G", "S:T19I") | c("S:D164G", "S:T19I") |
What returns:
| seqs | expect_mut | observed_mut | diff_mut |
| ----- | ---------------------------------- | ----------------------- | ------------ |
| seq1 | c("S:T20N", "S:D614G") | S:T20N | S:T20N |
| seq2 | c("S:T20N", "S:D614G") | S:D164G | S:D164G |
| seq3 | c("S:T20N", "N:G204R", "N:G80R") | c("S:T20N", " N:G204R") | c("S:T20N", " N:G204R") |
| seq4 | c("N:G204R", "S:D614G") | c("N:G204R", "S:D614G") | c("N:G204R", "S:D614G") |
| seq5 | c("N:G204R", "S:D614G") | c("S:D164G", "S:T19I") | c("S:D164G", "S:T19I") |
Basically, it is returning the same value of observed_mut in the diff_mut column...
CodePudding user response:
As both columns are `list` columns after the `strsplit`, use `map2` to loop over the corresponding `list` elements:
library(dplyr)
library(purrr)
data_frame %>%
  mutate(expect_mut = strsplit(as.character(expect_mut), ","),
         observed_mut = strsplit(as.character(observed_mut), ",")) %>%
  mutate(diff_mut = map2(observed_mut, expect_mut, setdiff)) %>%
  as_tibble()
-output
# A tibble: 5 × 4
seqs expect_mut observed_mut diff_mut
<chr> <list> <list> <list>
1 seq1 <chr [2]> <chr [1]> <chr [0]>
2 seq2 <chr [2]> <chr [1]> <chr [1]>
3 seq3 <chr [3]> <chr [2]> <chr [1]>
4 seq4 <chr [2]> <chr [2]> <chr [1]>
5 seq5 <chr [2]> <chr [2]> <chr [2]>
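Note that seq3 and seq4 still report one differing element in the output above, because some entries in the sample data have a space after the comma (" N:G204R" is not the same string as "N:G204R"). A possible sketch, assuming the intent is to ignore that whitespace, is to split on a comma followed by optional spaces. The `",\\s*"` pattern and the `result` name are assumptions for illustration, not part of the original code:

```r
library(dplyr)
library(purrr)

seqs <- c("seq1", "seq2", "seq3", "seq4", "seq5")
expect_mut <- c("S:T20N,S:D614G", "S:T20N,S:D614G", "S:T20N,N:G204R,N:G80R",
                "N:G204R, S:D614G", "N:G204R, S:D614G")
observed_mut <- c("S:T20N", "S:D164G", "S:T20N, N:G204R",
                  "S:D614G,N:G204R", "S:D164G,S:T19I")
data_frame <- data.frame(seqs, expect_mut, observed_mut)

result <- data_frame %>%
  # split on a comma plus any following whitespace (assumed delimiter)
  mutate(expect_mut = strsplit(as.character(expect_mut), ",\\s*"),
         observed_mut = strsplit(as.character(observed_mut), ",\\s*")) %>%
  # compare each row's observed vector against its expected vector
  mutate(diff_mut = map2(observed_mut, expect_mut, setdiff)) %>%
  as_tibble()

# number of unexpected mutations per row
lengths(result$diff_mut)
#> [1] 0 1 0 0 2
```

With the whitespace handled this way, only seq2 and seq5 keep a non-empty diff_mut, which matches the expected table in the question.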
Or, if we use the `group_by` approach (assuming all elements in 'seqs' are distinct), extract the first `list` element with `[[`:
data_frame %>%
  mutate(expect_mut = strsplit(as.character(expect_mut), ","),
         observed_mut = strsplit(as.character(observed_mut), ",")) %>%
  group_by(seqs) %>%
  mutate(diff_mut = list(setdiff(observed_mut[[1]], expect_mut[[1]]))) %>%
  ungroup()
-output
# A tibble: 5 × 4
seqs expect_mut observed_mut diff_mut
<chr> <list> <list> <list>
1 seq1 <chr [2]> <chr [1]> <chr [0]>
2 seq2 <chr [2]> <chr [1]> <chr [1]>
3 seq3 <chr [3]> <chr [2]> <chr [1]>
4 seq4 <chr [2]> <chr [2]> <chr [1]>
5 seq5 <chr [2]> <chr [2]> <chr [2]>
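The `[[1]]` matters because, inside each group, observed_mut and expect_mut are still lists of length one, and `setdiff` on lists compares each element as a whole object rather than looking inside it. That is most likely why the code in the question returned observed_mut unchanged. A small base R illustration, where the variable names are only for demonstration:

```r
# one row's worth of data, as it looks inside a group: lists of length one
observed <- list(c("S:T20N"))
expected <- list(c("S:T20N", "S:D614G"))

# list vs list: the single element of `observed` is not identical to the
# single element of `expected`, so it is kept as-is
whole <- setdiff(observed, expected)

# vector vs vector: the strings themselves are compared
inner <- setdiff(observed[[1]], expected[[1]])

identical(whole, observed)  # the list came back unchanged
length(inner)               # no unexpected mutation left
```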
NOTE: `rowwise` is safer than `group_by` here, because it still works when 'seqs' contains duplicates.
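A minimal sketch of the `rowwise` variant, using a small made-up data frame (`df_dup`, with a deliberately duplicated 'seqs' value) rather than the data from the question. Under `rowwise`, a list column is seen one element at a time, so no `[[1]]` is needed:

```r
library(dplyr)

# hypothetical data: "seq1" appears twice on purpose
df_dup <- data.frame(
  seqs = c("seq1", "seq1", "seq2"),
  expect_mut = c("S:T20N,S:D614G", "S:T20N,S:D614G", "N:G204R,S:D614G"),
  observed_mut = c("S:T20N", "S:D164G", "S:D614G,N:G204R")
)

res_rowwise <- df_dup %>%
  mutate(expect_mut = strsplit(as.character(expect_mut), ","),
         observed_mut = strsplit(as.character(observed_mut), ",")) %>%
  # treat every row as its own group, regardless of the values in 'seqs'
  rowwise() %>%
  # wrap in list() so each row stores a character vector of any length
  mutate(diff_mut = list(setdiff(observed_mut, expect_mut))) %>%
  ungroup()

# number of unexpected mutations per row
lengths(res_rowwise$diff_mut)
#> [1] 0 1 0
```

The `group_by(seqs)` version would put both "seq1" rows in one group, so `observed_mut[[1]]` would only look at the first of them.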