Edit : combination of one column without overlap and common variable

Time:03-08

Data updated!

I have an example data set:

Target Start sequence
A y1 ccc
A y2 cct
A y3 aag
A y3 act
B y1 aaa
B y4 aat

and I am trying to get a dataset like this in R:

Target Start Start sequence
A y1 y2 ccc,cct
A y1 y3 ccc,aag,act
A y2 y3 cct,aag,act
B y1 y4 aaa,aat

Each Start value always belongs to a Target. For each Target, I am looking for every combination of two different Start values, without duplicates, together with the list of sequences belonging to both. I have tried mutate() and combn() with the help of a linked post, but did not get the result I wanted.

Can anyone help me and give me a chance to learn further?

CodePudding user response:

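You can do this by applying combn() within each Target group. Note that combn() pairs rows, so the output below corresponds to data with a single row per Target and Start; it does not merge the two y3 rows of the updated data.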

library(dplyr)
library(tidyr)

df %>%
  group_by(Target) %>%
  summarise(Start = combn(Start, 2, function(x) 
                           list(setNames(x, c('start', 'end')))), 
            Sequence = combn(sequence, 2, toString), .groups = 'drop') %>%
  unnest_wider(Start)

# Target start end   Sequence
#  <chr>  <chr> <chr> <chr>   
#1 A      y1    y2    ccc, cct
#2 A      y1    y3    ccc, aag
#3 A      y2    y3    cct, aag
#4 B      y1    y4    aaa, aat
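
For the updated data, where y3 occurs twice under Target A, one option is to collapse the sequences per Target and Start before calling combn(). The following is a minimal sketch rather than part of the original answer. It assumes the data frame is named df (rebuilt here so the snippet runs on its own), that dplyr >= 1.1.0 is installed for reframe(), and that every Target has at least two distinct Start values, since combn() fails otherwise. The column names Start1 and Start2 are chosen for illustration.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # One row per Target/Start, so each Start enters combn() only once
  group_by(Target, Start) %>%
  summarise(sequence = paste(sequence, collapse = ","), .groups = "drop") %>%
  group_by(Target) %>%
  # reframe() allows several output rows per group
  reframe(
    Start1   = combn(Start, 2)[1, ],
    Start2   = combn(Start, 2)[2, ],
    sequence = combn(sequence, 2, paste, collapse = ",")
  )

result
# Target Start1 Start2 sequence
# A      y1     y2     ccc,cct
# A      y1     y3     ccc,aag,act
# A      y2     y3     cct,aag,act
# B      y1     y4     aaa,aat
```

Pairs follow the sorted order of the Start labels, which is alphabetical here, so labels such as y10 would sort before y2.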

CodePudding user response:

Here is another tidyverse approach without the use of combn().

  1. group_by(Target, Start) so that all sequences with the same Target and Start are collapsed into a single row
  2. Drop the Start grouping level (.groups = "drop_last") so that the following steps run within each Target
  3. Convert the Start labels to numbers, so that the Start values can be compared directly
  4. Create a Start2 column holding, for each row, the Start values greater than its own, and store the corresponding sequence strings in a sequence2 column
  5. Expand the data frame on Start2 and sequence2 (sapply() can return several values per row)
  6. group_by(Target, Start, Start2) so that sequence can be pasted together with sequence2

library(tidyverse)

df %>% 
  group_by(Target, Start) %>% 
  summarize(sequence = paste0(sequence, collapse = ","), .groups = "drop_last") %>% 
  mutate(Start_num = as.numeric(str_extract(Start, "\\d+")),
         # Compare against the value x itself; Start_num[x] would use x as a position
         Start2 = sapply(Start_num, function(x) Start[which(Start_num > x)]),
         sequence2 = sapply(Start_num, function(x) sequence[which(Start_num > x)])) %>% 
  unnest(cols = c(Start2, sequence2)) %>% 
  group_by(Target, Start, Start2) %>% 
  summarize(sequence = paste0(c(sequence, sequence2), collapse = ","), .groups = "drop")

# A tibble: 4 × 4
  Target Start Start2 sequence   
  <chr>  <chr> <chr>  <chr>      
1 A      y1    y2     ccc,cct    
2 A      y1    y3     ccc,aag,act
3 A      y2    y3     cct,aag,act
4 B      y1    y4     aaa,aat     
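
The same pairing can also be written as a self-join, which avoids list columns. This is only a sketch under a few assumptions: the data frame is named df (rebuilt here so the snippet runs on its own), Start labels consist of a letter prefix followed by a number, and dplyr >= 1.1.0 is available for the relationship argument. The names collapsed and pairs are introduced for illustration.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE
)

collapsed <- df %>%
  group_by(Target, Start) %>%
  summarise(sequence = paste(sequence, collapse = ","), .groups = "drop") %>%
  # Numeric part of the label, so that "y10" compares as greater than "y2"
  mutate(Start_num = as.numeric(sub("\\D+", "", Start)))

pairs <- collapsed %>%
  # Pair every Start with every other Start of the same Target
  inner_join(collapsed, by = "Target", suffix = c("", "2"),
             relationship = "many-to-many") %>%
  # Keep each unordered pair once and drop self-pairs
  filter(Start_num < Start_num2) %>%
  transmute(Target, Start, Start2,
            sequence = paste(sequence, sequence2, sep = ","))

pairs
```

For the example data this yields the same four rows as the output above.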