In R, concatenate a conditional vector of strings

I am struggling to combine information from two data frames in R. I want to check whether a string of characters in one data frame is present in another data frame. If it is, I want to record the name associated with that string and append it to a new data frame.

Here's what I am working with:

df_repeats

  sequence      promoter_numbers promotors  
1 AAAAAAAAAAAA  715              NA       
2 AAAAAAAAAAAC  61               NA       
3 AAAAAAAAAAAG  184              NA

df_promotors

  gene    promotor_coordinates    sequence                                       
1 Xkr4_1  range=chr1:3671549-36…  GAGCTAGTTCTCTTTTCCCTGGTTACTAGCCATGTCCCTCCTCCCA…
2 Rp1_2   range=chr1:4360255-43…  CACACACACACACACACACACACACATGTAACAATGAAACAAAAAG…
3 Rp1_1   range=chr1:4409254-44…  AGGTATAACTTGGTAAAGACTTTGAAGTAAACAAGAACAAACAGCT…

I am trying to see which repeat sequences in df_repeats are present in the sequence column of df_promotors. My goal is to create a new data frame that I can use for some visualizations. I've been trying to create something like the following (just as an example):

df_repeat_occurances

  sequence      promotor_numbers   in_genes              
1 AAAAAAAAAAAA  715                Rp1_2
2 AAAAAAAAAAAC  61                 Xkr4_1, Rp1_2
3 AAAAAAAAAAAG  184                Xkr4_1

I tried to write a nested loop that searches through both data frames and, if there's a match, writes the gene name into df_repeats in place of the NA, with the idea of changing the column names later. I am completely lost on how to do this, and I don't know whether it is even a good way to combine the information from the two tables into one. Here's what I tried and could not get to work.

for (i in 1:nrow(df_repeats)) {
  x = df_repeats$sequence[i]
  for (j in 1:nrow(df_promotors)) {
    if (grepl(x, df_promotors$sequence[j])) {
      y = df_promotors$gene[j]
      df_repeats$sequence[i] = c(df_repeats$sequence[i], " ", y)
    }
  }
}

First time ever posting and asking for help, so any guidance or pointers would be greatly appreciated!!!

CodePudding user response：

Welcome to SO. In the future, please include a reproducible example as I did below, with meaningful names such as "result". Also, mark any acceptable answer as "accepted".

The best approach is to separate the different computation steps.

#First, define a reproducible example
sequences <- c("AAA", "BBB", "CCC", 'DDD')
promnb <- 1:4
result <- data.frame(sequences, promnb)

genes_names <- paste0("gene_", letters[1:4]) 
sequence <- c('BBB', 'ABC', 'AAA', 'AAA')
df_proms <- data.frame(genes_names, sequence)
# genes_names sequence
# 1      gene_a      BBB
# 2      gene_b      ABC
# 3      gene_c      AAA
# 4      gene_d      AAA

# 1: check in which genes each sequence is present using grepl
# sapply over a character vector applies the function to each element; the results
# are bound into a matrix with one column per sequence and one row per gene:
in_genes <- sapply(result$sequences, function(x) grepl(x, df_proms$sequence))

# AAA   BBB   CCC   DDD
# [1,] FALSE  TRUE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE
# [3,]  TRUE FALSE FALSE FALSE
# [4,]  TRUE FALSE FALSE FALSE

# 2: replace TRUE by the name of the gene in that row, and FALSE by an empty string
in_genes_names <- data.frame(ifelse(in_genes, genes_names, ""))

# 3: finally, paste each column of the last df to get all the names of the genes
# that contain this sequence

result$in_genes <- sapply(in_genes_names, paste, collapse = " ")
result$in_genes <- trimws(result$in_genes)

# By the way, you'd probably want to keep a list of the matches
# you can also include this list as a column of the result df
result$in_genes_list <- sapply(in_genes_names, list)
result
# sequences promnb      in_genes      in_genes_list
# 1       AAA      1 gene_c gene_d , , gene_c, gene_d
# 2       BBB      2        gene_a       gene_a, , , 
# 3       CCC      3                           , , , 
# 4       DDD      4                           , , ,
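
Steps 2 and 3 can also be collapsed into a single pass over the columns of the logical matrix, which avoids the empty-string placeholders and the stray spaces they can leave between names. A minimal sketch on the same toy data; `hits` and `in_genes_compact` are names made up for this illustration, and `fixed = TRUE` is an assumption that the sequences should be matched literally rather than as regular expressions:

```r
# Toy data repeated so the block runs on its own
sequences <- c("AAA", "BBB", "CCC", "DDD")
result <- data.frame(sequences, promnb = 1:4, stringsAsFactors = FALSE)
df_proms <- data.frame(
  genes_names = paste0("gene_", letters[1:4]),
  sequence = c("BBB", "ABC", "AAA", "AAA"),
  stringsAsFactors = FALSE
)

# Logical matrix: one row per gene, one column per searched sequence
hits <- sapply(result$sequences, function(x) {
  grepl(x, df_proms$sequence, fixed = TRUE)
})

# For each column, keep only the gene names whose row is TRUE and paste them
in_genes_compact <- apply(hits, 2, function(hit) {
  paste(df_proms$genes_names[hit], collapse = " ")
})
result$in_genes <- unname(in_genes_compact)
result
```

Sequences with no match end up with an empty string, as in the version above.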

CodePudding user response：

We could define a custom function f to search needles in several haystacks, and subset haystacks based on indices of matches.

library(dplyr)
library(tidyr)  # provides unnest(), used further down
needles <- tribble(
  ~pattern, ~pn,
  "AAA", 715,
  "AAC", 61,
  "AAG", 184
)

haystacks <- tribble(
  ~gene, ~sequence,
  "Xkr4_1", "abcdefgAAGhijAAC",
  "Rp1_2", "AACzyxAAAwv",
  "Rp1_1", "nomatchinhere"
)

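# f selects the first column of y (the gene names) for the rows whose column `str`
# matches ptn; Vectorize lets it take a whole vector of patterns.
# The \(x) lambda syntax requires R >= 4.1.
f = Vectorize(\(y, str, ptn) y[which(grepl(ptn, y[[str]])),1], "ptn")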

# if you want the results as a list of vectors:
needles %>%
  mutate(in_genes = f(haystacks, 2, pattern))
  pattern    pn in_genes    
  <chr>   <dbl> <named list>
1 AAA       715 <chr [1]>   
2 AAC        61 <chr [2]>   
3 AAG       184 <chr [1]>

Then if needed, we can pivot longer by using unnest. The resulting structure is probably easier to work with when visualizing your data later.

needles %>%
  mutate(in_genes = f(haystacks, 2, pattern)) %>%
  unnest(in_genes)
  pattern    pn in_genes
  <chr>   <dbl> <chr>   
1 AAA       715 Rp1_2   
2 AAC        61 Xkr4_1  
3 AAC        61 Rp1_2   
4 AAG       184 Xkr4_1

This relies on dplyr coercing to list where needed.
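
If you would rather not depend on the tidyverse, roughly the same long format can be built in base R. This is only a sketch on the toy data above; `matches` and `long` are names introduced for the illustration, and `fixed = TRUE` assumes the patterns are literal strings:

```r
needles <- data.frame(
  pattern = c("AAA", "AAC", "AAG"),
  pn = c(715, 61, 184),
  stringsAsFactors = FALSE
)
haystacks <- data.frame(
  gene = c("Xkr4_1", "Rp1_2", "Rp1_1"),
  sequence = c("abcdefgAAGhijAAC", "AACzyxAAAwv", "nomatchinhere"),
  stringsAsFactors = FALSE
)

# One list element per pattern: the genes whose sequence contains it
matches <- lapply(needles$pattern, function(p) {
  haystacks$gene[grepl(p, haystacks$sequence, fixed = TRUE)]
})

# Repeat each needle row once per matching gene, then attach the gene names
long <- needles[rep(seq_len(nrow(needles)), lengths(matches)), ]
long$in_genes <- unlist(matches)
rownames(long) <- NULL
long
```

As with `unnest` in its default settings, patterns that match no gene simply drop out of the long table.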

CodePudding user response：

You may try the following sapply loop:

df_repeats$in_genes <- sapply(df_repeats$sequence, function(x) 
                   toString(df_promotors$gene[grepl(x, df_promotors$sequence)]))
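
For comparison, the nested loop from the question can be repaired along the same lines. The original attempt assigned a vector of length three to a single cell and wrote into the sequence column instead of promotors. The sketch below collects the matching genes first and stores them as one string. The data frames are made-up stand-ins with shortened sequences, `found` is a helper name introduced here, and `fixed = TRUE` assumes literal matching:

```r
# Made-up stand-ins for the question's data frames (shortened sequences)
df_repeats <- data.frame(
  sequence = c("AAA", "AAC", "AAG"),
  promoter_numbers = c(715, 61, 184),
  promotors = NA_character_,
  stringsAsFactors = FALSE
)
df_promotors <- data.frame(
  gene = c("Xkr4_1", "Rp1_2", "Rp1_1"),
  sequence = c("GGAAGTTAAC", "AACTTAAATT", "GGTTCCGG"),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(df_repeats))) {
  found <- character(0)  # genes matching repeat i
  for (j in seq_len(nrow(df_promotors))) {
    if (grepl(df_repeats$sequence[i], df_promotors$sequence[j], fixed = TRUE)) {
      found <- c(found, df_promotors$gene[j])
    }
  }
  # Collapse to a single string so it fits in one cell; NA stays if nothing matched
  if (length(found) > 0) df_repeats$promotors[i] <- toString(found)
}
df_repeats
```

The sapply version does the same work in one line, so the loop is mainly useful for seeing what happens step by step.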