Edit: combination of one column without overlap and common variable

Data updated!

I have an example data set:

Target	Start	sequence
A	y1	ccc
A	y2	cct
A	y3	aag
A	y3	act
B	y1	aaa
B	y4	aat

and I am trying to get a data set like this in R:

Target	Start	Start	sequence
A	y1	y2	ccc,cct
A	y1	y3	ccc,aag,act
A	y2	y3	cct,aag,act
B	y1	y4	aaa,aat

Every Start value always has a Target. Within each Target, I am looking for every combination of two different Start values (no pair repeated, no value paired with itself), together with the list of sequences belonging to both. I have tried to do this with mutate() and combn(), following a linked post, but did not get the result I wanted.

Can anyone help me and give me a chance to learn further?

CodePudding user response:

You can achieve this by applying combn() within each group.
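As a quick illustration of what combn() returns for a single group, here is a minimal base R sketch. The vectors `starts` and `seqs` are illustrative values made up for this example (one sequence per Start), not objects from the question.

```r
# Illustrative values for one Target, one sequence per Start
starts <- c("y1", "y2", "y3")
seqs   <- c("ccc", "cct", "aag")

# Without FUN, combn() returns a matrix with one combination per column
pairs <- combn(starts, 2)

# With FUN, the function is applied to each combination;
# toString() joins the two sequences with ", "
joined <- combn(seqs, 2, toString)

print(pairs)
print(joined)
```

Because both calls enumerate the combinations in the same order, column i of `pairs` lines up with element i of `joined`.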

library(dplyr)
library(tidyr)

df %>%
  group_by(Target) %>%
  summarise(Start = combn(Start, 2, function(x) 
                           list(setNames(x, c('start', 'end')))), 
            Sequence = combn(sequence, 2, toString), .groups = 'drop') %>%
  unnest_wider(Start)

# Target start end   Sequence
#  <chr>  <chr> <chr> <chr>   
#1 A      y1    y2    ccc, cct
#2 A      y1    y3    ccc, aag
#3 A      y2    y3    cct, aag
#4 B      y1    y4    aaa, aat
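The output above lists only one sequence per Start, so it appears to assume one row per Target/Start pair. In the updated data, A/y3 occurs twice, so the duplicates probably need to be collapsed before pairing. The following is a minimal base R sketch of that idea, not the answer's dplyr pipeline; the helper name `pair_group` and the column name `Start2` are hypothetical, and Start labels are ordered as plain text (so "y10" would sort before "y2").

```r
# Updated example data from the question
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build all Start pairs for one Target
pair_group <- function(g) {
  # One entry per distinct Start; its sequences are joined with ","
  seqs <- vapply(split(g$sequence, g$Start), paste, character(1), collapse = ",")
  if (length(seqs) < 2) return(NULL)   # nothing to pair
  idx <- combn(length(seqs), 2)        # index pairs, one per column
  data.frame(
    Target   = g$Target[1],
    Start    = names(seqs)[idx[1, ]],
    Start2   = names(seqs)[idx[2, ]],
    sequence = paste(seqs[idx[1, ]], seqs[idx[2, ]], sep = ","),
    stringsAsFactors = FALSE
  )
}

result <- do.call(rbind, lapply(split(df, df$Target), pair_group))
rownames(result) <- NULL
print(result)
```

Collapsing first means y3 is treated as a single Start whose sequences are "aag,act", so no y3/y3 pair is generated.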

CodePudding user response:

Here is another tidyverse approach without the use of combn().

group_by(Target, Start) so that all sequences with the same Target and Start are collapsed into a single row
Drop Start from the grouping (.groups = "drop_last"), so the data stays grouped by Target only
Extract the numeric part of Start, so that Start values can be compared directly
Create a Start2 column holding, for each row, the Start values greater than that row's own Start, and store the corresponding sequence strings in a sequence2 column
Expand the data frame on Start2 and sequence2 (sapply can return several values per row)
group_by(Target, Start, Start2) so that sequence can be pasted together with sequence2
library(tidyverse)

df %>% 
  group_by(Target, Start) %>% 
  summarize(sequence = paste0(sequence, collapse = ","), .groups = "drop_last") %>% 
  # numeric part of Start, e.g. "y3" -> 3
  mutate(Start_num = as.numeric(str_extract(Start, "\\d+")),
         # for each row, keep the entries whose Start number is larger than x
         Start2 = sapply(Start_num, function(x) Start[which(Start_num > x)]),
         sequence2 = sapply(Start_num, function(x) sequence[which(Start_num > x)])) %>% 
  unnest(cols = c(Start2, sequence2)) %>% 
  group_by(Target, Start, Start2) %>% 
  summarize(sequence = paste0(c(sequence, sequence2), collapse = ","), .groups = "drop")

# A tibble: 4 × 4
  Target Start Start2 sequence   
  <chr>  <chr> <chr>  <chr>      
1 A      y1    y2     ccc,cct    
2 A      y1    y3     ccc,aag,act
3 A      y2    y3     cct,aag,act
4 B      y1    y4     aaa,aat