I want to query a database
using a character vector query_list
, and return a dataframe query_output
. In this two-column output dataframe, each row corresponds to a single string from the query vector. First of the two columns in the output dataframe (called term
) names this string, and the second column (called enzyme
) lists all rows of the database
in which the query string was found, defined by the column enzyme
from the database).
My query and database look as follows:
query_list <- c("term(A)", "term(B)", "term(C)", "term(D15)")
database <- data.frame(enzyme = c("A1", "B1", "C1", "D1", "E1")
,term = c("term(A);term(K);term(Y);term(G);term(F);"
,"term(A);term(K);term(Y);term(G);term(F);"
,"term(H);term(K);term(Y);term(C);term(F);"
,"term(H);term(B);term(Y);term(C);term(F);"
,"term(H);term(K);term(D15);term(G);term(F);"))
the database looks like this:
enzyme term
1 A1 term(A);term(K);term(Y);term(G);term(F);
2 B1 term(A);term(K);term(Y);term(G);term(F);
3 C1 term(H);term(K);term(Y);term(C);term(F);
4 D1 term(H);term(B);term(Y);term(C);term(F);
5 E1 term(H);term(K);term(D15);term(G);term(F);
The resulting dataframe query_output
:
> query_output
term enzyme
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
Optimally, the solution would be pipeable, and not a loop (although anything will be appreciated). I don't say what I have tried because I don't really know how to go about it in a concise way.
CodePudding user response:
Try this:
library(tidyr)
database %>%
mutate(term = sub(";$", "", term)) %>%
separate_rows(term, sep = ";") %>%
filter(term %in% query_list) %>%
group_by(term) %>%
summarise(enzyme = toString(enzyme))
# A tibble: 4 × 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
CodePudding user response:
Using separate_rows()
from tidyr
package you can separate the values in term
. Then, just filter by your query_list
, group by term
and use paste0(..., collapse=';')
to collapse all values for each term
in the same row.
database %>%
tidyr::separate_rows(term,sep=";") %>%
filter(term %in% query_list) %>%
group_by(term) %>%
summarise(enzyme = paste0(enzyme,collapse=', '))
Output:
# A tibble: 4 x 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
CodePudding user response:
You can transform your database to your desired format
library(dplyr)
library(tidyr)
transformed_database <- database %>%
separate_rows(term, sep = ';') %>%
filter(term != '') %>%
group_by(term) %>%
summarise(enzyme = paste0(enzyme, collapse = ', '))
transformed_database
#> # A tibble: 9 × 2
#> term enzyme
#> <chr> <chr>
#> 1 term(A) A1, B1
#> 2 term(B) D1
#> 3 term(C) C1, D1
#> 4 term(D15) E1
#> 5 term(F) A1, B1, C1, D1, E1
#> 6 term(G) A1, B1, E1
#> 7 term(H) C1, D1, E1
#> 8 term(K) A1, B1, C1, E1
#> 9 term(Y) A1, B1, C1, D1
Then, querying it is as simple as
transformed_database %>%
filter(term %in% query_list)
#> # A tibble: 4 × 2
#> term enzyme
#> <chr> <chr>
#> 1 term(A) A1, B1
#> 2 term(B) D1
#> 3 term(C) C1, D1
#> 4 term(D15) E1
CodePudding user response:
We can iterate through the query_list
in base R, and use enflame
from tibble
to make it a dataframe.
library(tibble)
enframe(sapply(query_list, function(x)
paste(database[grepl(x, strsplit(database$term, ";"), fixed = T), 1], collapse = ", ")),
name = "term",
value = "enzyme")
# A tibble: 4 × 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
CodePudding user response:
My solution is:
database %>%
separate_rows(term, sep = ";") %>%
filter(term != "") %>%
filter(term %in% query_list) %>%
print() %>%
group_by(term) %>%
summarise(enzyme = str_c(enzyme, collapse = ", ")) %>%
ungroup()
Which results in
# A tibble: 4 × 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
CodePudding user response:
Other answers already satisfy the OP's condition. An alternative approach, rethinking the input and output, might be to use environments or hashtab
and a memoised function. This takes some setting up and wrangling.
x <- unlist(strsplit(database$term, ";"))
keys <- unique(x)
value <- vector("list", length(keys))
ind <- rep(paste0(LETTERS[1:5], "1"), each = 5)
for (i in seq_along(keys)) {
value[[i]] = ind[which(x == keys[i])]
}
db <- value |> setNames(keys) |> list2env()
grab <- memoise::memoise(Vectorize(function(db, key) db[[key]], c("key")))
The outcome is a pipeable function that will scale quite well for similar queries.
db |>
grab(query_list)
#> $`term(A)`
#> [1] "A1" "B1"
#>
#> $`term(B)`
#> [1] "D1"
#>
#> $`term(C)`
#> [1] "C1" "D1"
#>
#> $`term(D15)`
#> [1] "E1"
For example, what if there were a million items in the query?
large_query <- rep(query_list, 1e6)
bench::mark(
x = grab(db, large_query)
)
#> expression median mem_alloc n_itr total_time
#> <bch:expr> <bch:tm> <bch:byt> <int> <bch:tm>
#> 1 x 160ms 140MB 4 643ms