I have two tables of information in R and am struggling to relate them. I want to check whether a string of characters in one data frame is present in another data frame. If it is, I want to record the name associated with that string and append it to a new data frame.
Here's what I am working with:
df_repeats
sequence promoter_numbers promotors
1 AAAAAAAAAAAA 715 NA
2 AAAAAAAAAAAC 61 NA
3 AAAAAAAAAAAG 184 NA
df_promotors
gene promotor_coordinates sequence
1 Xkr4_1 range=chr1:3671549-36… GAGCTAGTTCTCTTTTCCCTGGTTACTAGCCATGTCCCTCCTCCCA…
2 Rp1_2 range=chr1:4360255-43… CACACACACACACACACACACACACATGTAACAATGAAACAAAAAG…
3 Rp1_1 range=chr1:4409254-44… AGGTATAACTTGGTAAAGACTTTGAAGTAAACAAGAACAAACAGCT…
I am trying to see which gene repeat sequences in df_repeats are present in the sequence column in df_promotors. My goal is to create a new data frame to be able to perform some visualizations. So I've been struggling to create something like the below (just as an example)
df_repeat_occurances
sequence promotor_numbers in_genes
1 AAAAAAAAAAAA 715 Rp1_2
2 AAAAAAAAAAAC 61 Xkr4_1, Rp1_2
3 AAAAAAAAAAAG 184 Xkr4_1
I tried to write a nested loop that searches for matches and, when one is found, appends it to df_repeats in place of the NA, with the intention of changing the row names later. However, I am completely lost on how to do this, or whether it is even an ideal way to combine the information from the two tables into one. Here's what I tried and could not work through.
for (i in 1:nrow(df_repeats)) {
x = df_repeats$sequence[i]
for (j in 1:nrow(df_promotors)) {
if (grepl(x, df_promotors$sequence[j])) {
y = df_promotors$gene[j]
df_repeats$sequence[i] = c(df_repeats$sequence[i], " ", y)
}
}
}
First time ever posting and asking for help, so any guidance or pointers would be greatly appreciated!!!
CodePudding user response:
Welcome to SO. In the future, please include a reproducible example as I did below, with meaningful names such as "result". Also, mark any acceptable answer as "accepted".
The best approach is to separate the different computation steps.
#First, define a reproducible example
sequences <- c("AAA", "BBB", "CCC", 'DDD')
promnb <- 1:4
result <- data.frame(sequences, promnb)
genes_names <- paste0("gene_", letters[1:4])
sequence <- c('BBB', 'ABC', 'AAA', 'AAA')
df_proms <- data.frame(genes_names, sequence)
# genes_names sequence
# 1 gene_a BBB
# 2 gene_b ABC
# 3 gene_c AAA
# 4 gene_d AAA
# 1: check in which genes each sequence is present using grepl
# sapply applied to a vector calls the function once per element; here each call returns
# one logical per gene, so the result is a matrix with one column per sequence:
in_genes <- sapply(result$sequences, function(x) grepl(x, df_proms$sequence))
# AAA BBB CCC DDD
# [1,] FALSE TRUE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE
# [3,] TRUE FALSE FALSE FALSE
# [4,] TRUE FALSE FALSE FALSE
#2: replace TRUE or FALSE by the names of the genes
in_genes_names <- data.frame(ifelse(in_genes, paste0(genes_names), ""))
#3: finally, paste each column of the last df to get all the names of the genes
# that contain this sequence
result$in_genes <- sapply(in_genes_names, paste, collapse = " ")
result$in_genes <- trimws(result$in_genes)
# By the way, you'd probably want to keep a list of the matches
# you can also include this list as a column of the result df
result$in_genes_list <- sapply(in_genes_names, list)
result
# sequences promnb in_genes in_genes_list
# 1 AAA 1 gene_c gene_d , , gene_c, gene_d
# 2 BBB 2 gene_a gene_a, , ,
# 3 CCC 3 , , ,
# 4 DDD 4 , , ,
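As a possible shortcut for steps 2 and 3, the logical matrix can be used to index the gene names directly, one column at a time. This is only a sketch on the same toy data: promoter_seqs and in_genes_chr are names made up here, and fixed = TRUE is an added assumption that the sequences should be matched literally rather than as regular expressions.

```r
# Same toy data as above; promoter_seqs is a hypothetical name for df_proms$sequence
sequences <- c("AAA", "BBB", "CCC", "DDD")
genes_names <- paste0("gene_", letters[1:4])
promoter_seqs <- c("BBB", "ABC", "AAA", "AAA")

# Logical matrix: one row per gene, one column per sequence
in_genes <- sapply(sequences, function(x) grepl(x, promoter_seqs, fixed = TRUE))

# For each column, keep the names of the genes flagged TRUE and
# collapse them into one comma-separated string ("" when nothing matches)
in_genes_chr <- apply(in_genes, 2, function(hit) toString(genes_names[hit]))
in_genes_chr
```

Because only the matching names are kept, no blank placeholders are produced and trimws is not needed.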
CodePudding user response:
We could define a custom function f to search needles in several haystacks, and subset haystacks based on indices of matches.
library(dplyr)
library(tidyr) # provides unnest(), used further down
needles <- tribble(
~pattern, ~pn,
"AAA", 715,
"AAC", 61,
"AAG", 184
)
haystacks <- tribble(
~gene, ~sequence,
"Xkr4_1", "abcdefgAAGhijAAC",
"Rp1_2", "AACzyxAAAwv",
"Rp1_1", "nomatchinhere"
)
f = Vectorize(\(y, str, ptn) y[which(grepl(ptn, y[[str]])),1], "ptn")
# if you want the results as a list of vectors:
needles %>%
mutate(in_genes = f(haystacks, 2, pattern))
pattern pn in_genes
<chr> <dbl> <named list>
1 AAA 715 <chr [1]>
2 AAC 61 <chr [2]>
3 AAG 184 <chr [1]>
Then, if needed, we can pivot longer by using unnest (from the tidyr package). The resulting structure is probably easier to work with when visualizing your data later.
needles %>%
mutate(in_genes = f(haystacks, 2, pattern)) %>%
unnest(in_genes)
pattern pn in_genes
<chr> <dbl> <chr>
1 AAA 715 Rp1_2
2 AAC 61 Xkr4_1
3 AAC 61 Rp1_2
4 AAG 184 Xkr4_1
This relies on dplyr coercing to list where needed.
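If you would rather not depend on that coercion, the same long format can be built explicitly. The following is a minimal base R sketch on the same toy data, not a replacement for the approach above; matches and long are names made up here, and fixed = TRUE assumes the patterns are literal strings. Like unnest with its defaults, it drops patterns that match no gene.

```r
needles <- data.frame(
  pattern = c("AAA", "AAC", "AAG"),
  pn = c(715, 61, 184),
  stringsAsFactors = FALSE
)
haystacks <- data.frame(
  gene = c("Xkr4_1", "Rp1_2", "Rp1_1"),
  sequence = c("abcdefgAAGhijAAC", "AACzyxAAAwv", "nomatchinhere"),
  stringsAsFactors = FALSE
)

# One character vector of matching gene names per pattern
matches <- lapply(needles$pattern, function(p) {
  haystacks$gene[grepl(p, haystacks$sequence, fixed = TRUE)]
})

# Repeat each needle row once per match, then attach the gene names
long <- needles[rep(seq_len(nrow(needles)), lengths(matches)), ]
long$in_genes <- unlist(matches)
rownames(long) <- NULL
long
```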
CodePudding user response:
You may try the following sapply loop -
df_repeats$in_genes <- sapply(df_repeats$sequence, function(x)
toString(df_promotors$gene[grepl(x, df_promotors$sequence)]))
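To try this out end to end, here is a self-contained sketch. The promoter sequences are short made-up strings, since the real ones in the question are truncated. It also swaps in vapply with USE.NAMES = FALSE so the new column is a plain character vector, and adds fixed = TRUE on the assumption that the repeats are literal strings rather than regular expressions.

```r
# Toy data (made up for illustration; not the real sequences from the question)
df_repeats <- data.frame(
  sequence = c("AAA", "AAC", "AAG"),
  promoter_numbers = c(715, 61, 184),
  stringsAsFactors = FALSE
)
df_promotors <- data.frame(
  gene = c("Xkr4_1", "Rp1_2", "Rp1_1"),
  sequence = c("GGAAGTTAAC", "AACTTAAATT", "CCCCTTTT"),
  stringsAsFactors = FALSE
)

# For each repeat, collect the genes whose promoter contains it
# and join their names into one comma-separated string
df_repeats$in_genes <- vapply(
  df_repeats$sequence,
  function(x) toString(df_promotors$gene[grepl(x, df_promotors$sequence, fixed = TRUE)]),
  character(1),
  USE.NAMES = FALSE
)
df_repeats
```

A repeat that occurs in no promoter gets an empty string in in_genes.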