Data updated!
I have an example data set:
Target | Start | sequence |
---|---|---|
A | y1 | ccc |
A | y2 | cct |
A | y3 | aag |
A | y3 | act |
B | y1 | aaa |
B | y4 | aat |
and I am trying to get a data set like this in R:
Target | Start | Start | sequence |
---|---|---|---|
A | y1 | y2 | ccc,cct |
A | y1 | y3 | ccc,aag,act |
A | y2 | y3 | cct,aag,act |
B | y1 | y4 | aaa,aat |
Every Start value always has a Target. For each Target, I am looking for every combination of two different Start values, without any repeated pairs, together with the list of sequences that belong to those two Start values. I have tried to work this out with mutate() and combn(), with help from the following link: link, but did not get the result I wanted.
Can anyone help me and give me a chance to learn further?
CodePudding user response:
You may achieve this by using combn() for each group.
library(dplyr)
library(tidyr)
df %>%
  group_by(Target) %>%
  summarise(Start = combn(Start, 2, function(x)
              list(setNames(x, c('start', 'end')))),
            Sequence = combn(sequence, 2, toString), .groups = 'drop') %>%
  unnest_wider(Start)
#  Target start end   Sequence
#  <chr>  <chr> <chr> <chr>
#1 A      y1    y2    ccc, cct
#2 A      y1    y3    ccc, aag
#3 A      y2    y3    cct, aag
#4 B      y1    y4    aaa, aat
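Note that combn() pairs individual rows, so with the updated data in the question (where y3 appears twice for Target A) it would also pair the two y3 rows with each other. A minimal base R sketch of one way around that, assuming the sequences of a repeated Start should first be collapsed into one row: the data frame is rebuilt by hand from the question, and collapsed, pair_starts and result are names made up for this illustration.

```r
# Example data rebuilt by hand from the question (an assumption about its layout)
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE
)

# Step 1: one row per Target/Start, joining the sequences of repeated Start values
collapsed <- aggregate(sequence ~ Target + Start, data = df,
                       FUN = paste, collapse = ",")

# Step 2: hypothetical helper that pairs every two rows of a single Target
pair_starts <- function(d) {
  if (nrow(d) < 2) return(NULL)   # a single Start cannot form a pair
  idx <- combn(nrow(d), 2)        # 2 x k matrix of row-index pairs
  data.frame(
    Target   = d$Target[idx[1, ]],
    start    = d$Start[idx[1, ]],
    end      = d$Start[idx[2, ]],
    sequence = paste(d$sequence[idx[1, ]], d$sequence[idx[2, ]], sep = ","),
    stringsAsFactors = FALSE
  )
}

# Step 3: apply the helper per Target and stack the pieces
result <- do.call(rbind, lapply(split(collapsed, collapsed$Target), pair_starts))
rownames(result) <- NULL
print(result)
```

On this data the sketch should give the four rows asked for in the question. Start values are ordered as text here, so labels such as y10 would sort before y2; extracting the number first would be needed in that case.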
CodePudding user response:
Here is another tidyverse approach without the use of combn().
- group_by(Target, Start) so that any sequence with the same Target and Start can be collapsed to a single row
- Drop the Start column in group_by()
- Change the Start column into numeric, so that we can directly compare the Start values
- Create a Start2 column containing the Start values greater than itself, and extract the corresponding sequence strings and store them in a sequence2 column
- Expand the dataframe based on Start2 and sequence2 (since there would be multiple outputs per row from sapply)
- group_by(Target, Start, Start2) so that we can paste sequence with sequence2
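The comparison step in the list above can be looked at in isolation. This is only a small base R sketch with made-up vectors (start, start_num and later are illustrative names, not part of the answer's code); it uses lapply so the result is always a list, one element per Start value.

```r
# Made-up Start labels for a single Target
start <- c("y1", "y2", "y3")

# Strip the non-digit prefix and convert to numeric so the values can be compared
start_num <- as.numeric(sub("\\D+", "", start))

# For each value, keep the Start labels whose number is strictly greater
later <- lapply(start_num, function(x) start[start_num > x])
print(later)
```

The largest Start has no later partner, so its element is empty; that is why the expanded dataframe ends up with no row starting at the last Start of each Target.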
library(tidyverse)
df %>%
  # collapse sequences that share the same Target and Start into one row
  group_by(Target, Start) %>%
  summarize(sequence = paste0(sequence, collapse = ","), .groups = "drop_last") %>%
  # numeric part of Start, compared by value within each Target
  mutate(Start_num = as.numeric(str_extract(Start, "\\d+")),
         Start2 = sapply(Start_num, function(x) Start[which(Start_num > x)]),
         sequence2 = sapply(Start_num, function(x) sequence[which(Start_num > x)])) %>%
  unnest(cols = c(Start2, sequence2)) %>%
  group_by(Target, Start, Start2) %>%
  summarize(sequence = paste0(c(sequence, sequence2), collapse = ","), .groups = "drop")
# A tibble: 4 × 4
  Target Start Start2 sequence
  <chr>  <chr> <chr>  <chr>
1 A      y1    y2     ccc,cct
2 A      y1    y3     ccc,aag,act
3 A      y2    y3     cct,aag,act
4 B      y1    y4     aaa,aat