I am working with a data frame in R in which a column contains gene IDs separated by slashes, like the following:
> geneIDs <- c("100/1000/100008586","1277/63923/8516","1133/1132/1956/8516")
> geneIDs
[1] "100/1000/100008586" "1277/63923/8516" "1133/1132/1956/8516"
I need to convert each gene ID to its gene symbol using a data.frame that contains, in each row, a gene ID and its corresponding gene symbol, as shown below:
> head(gene_symbols)
   ENTREZID  SYMBOL
1         1    A1BG
2        10    NAT2
3       100     ADA
4      1000    CDH2
5     10000    AKT3
6 100008586 GAGE12F
Using the first element from the geneIDs as an example, my expected outcome would look like:
> geneIDs
[1] "ADA/CDH2/GAGE12F"
Thank you very much in advance!
CodePudding user response:
Possible solution:
library(dplyr)
library(tidyr)

geneIDs <- c("100/1000/100008586","1277/63923/8516","1133/1132/1956/8516")

# Lookup table; ENTREZID is converted to character so it can be joined
# against the IDs obtained by splitting the strings
lookupTable <- structure(list(ENTREZID = c(1L, 10L, 100L, 1000L, 10000L, 100008586L
), SYMBOL = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
)) %>%
  mutate(ENTREZID = as.character(ENTREZID))

# Split each string at "/", put one ID per row, then join the symbols
as_tibble(x = geneIDs) %>%
  mutate(value = strsplit(geneIDs, split = "/")) %>%
  unnest_longer(value) %>%
  left_join(lookupTable, by = c("value" = "ENTREZID"))
Which gives:
# A tibble: 10 × 2
   value     SYMBOL
   <chr>     <chr>
 1 100       ADA
 2 1000      CDH2
 3 100008586 GAGE12F
 4 1277      NA
 5 63923     NA
 6 8516      NA
 7 1133      NA
 8 1132      NA
 9 1956      NA
10 8516      NA
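Or, to return exactly the string you specified (IDs without a match are dropped and all matched symbols are collapsed into a single string):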
geneString <- as_tibble(x = geneIDs) %>%
  mutate(value = strsplit(geneIDs, split = "/")) %>%
  unnest_longer(value) %>%
  left_join(lookupTable, by = c("value" = "ENTREZID")) %>%
  filter(!is.na(SYMBOL)) %>%
  pull(SYMBOL)
paste(geneString, collapse = "/")
[1] "ADA/CDH2/GAGE12F"
CodePudding user response:
You could split the strings at the /, and match each to the ENTREZID column to look up the SYMBOL. Replace any non-matches with the original string fragment, and paste the result together, collapsing with "/":
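sapply(strsplit(geneIDs, '/'), function(x) {
  # look up the symbol for each ID; NA where there is no match
  y <- gene_symbols$SYMBOL[match(x, gene_symbols$ENTREZID)]
  # fall back to the original ID for non-matches
  y[is.na(y)] <- x[is.na(y)]
  paste0(y, collapse = '/')
})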
#> [1] "ADA/CDH2/GAGE12F" "1277/63923/8516" "1133/1132/1956/8516"
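Since the question mentions that the strings live in a data frame column, the same idea can be wrapped in a small helper and applied to that column. This is only a sketch: ids_to_symbols, the data frame dat, and its columns geneID and geneSymbol are hypothetical names that do not appear in the question.

```r
gene_symbols <- data.frame(
  ENTREZID = c(1L, 10L, 100L, 1000L, 10000L, 100008586L),
  SYMBOL   = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "/"-separated IDs to symbols,
# leaving IDs without a match unchanged
ids_to_symbols <- function(ids, lookup) {
  vapply(strsplit(ids, "/", fixed = TRUE), function(x) {
    y <- lookup$SYMBOL[match(x, lookup$ENTREZID)]
    y[is.na(y)] <- x[is.na(y)]
    paste(y, collapse = "/")
  }, character(1))
}

# Hypothetical data frame holding the strings in a column
dat <- data.frame(
  geneID = c("100/1000/100008586", "1277/63923/8516", "1133/1132/1956/8516"),
  stringsAsFactors = FALSE
)

dat$geneSymbol <- ids_to_symbols(dat$geneID, gene_symbols)
dat$geneSymbol
#> [1] "ADA/CDH2/GAGE12F"    "1277/63923/8516"     "1133/1132/1956/8516"
```

vapply is used instead of sapply here so that the result is guaranteed to be a character vector with one element per row.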
CodePudding user response:
You can do this:
library(tidyverse)

# gene_symbols is the lookup table from the question
geneIDs %>%
  map(~ {
    vec <- gene_symbols$SYMBOL[gene_symbols$ENTREZID %in% unlist(str_split(.x, '/'))]
    if (length(vec) > 0) {
      paste(vec, collapse = '/')
    }
  }) %>%
  keep(~ length(.x) > 0)
[[1]]
[1] "ADA/CDH2/GAGE12F"
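One thing to be aware of with %in%: it filters the rows of the lookup table, so the symbols come back in table order rather than in the order of the IDs within the string, and IDs without a match are silently dropped. A small sketch of the difference, using a made-up input string "1000/100" (not from the question) and match() for comparison:

```r
gene_symbols <- data.frame(
  ENTREZID = c("1", "10", "100", "1000", "10000", "100008586"),
  SYMBOL   = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F"),
  stringsAsFactors = FALSE
)

# Made-up example: the IDs appear in the opposite order to the table
ids <- strsplit("1000/100", "/", fixed = TRUE)[[1]]

# %in% selects table rows, so the result follows table order
in_order <- paste(gene_symbols$SYMBOL[gene_symbols$ENTREZID %in% ids],
                  collapse = "/")
in_order
#> [1] "ADA/CDH2"

# match() looks up each ID in turn, so the result follows input order
match_order <- paste(gene_symbols$SYMBOL[match(ids, gene_symbols$ENTREZID)],
                     collapse = "/")
match_order
#> [1] "CDH2/ADA"
```

For the strings in the question the two happen to agree, because the matched IDs are already in the same order as the table.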
CodePudding user response:
Perhaps gsubfn can be used here:
library(gsubfn)
library(tibble)
gsubfn("\\d+", as.list(deframe(gene_symbols)), geneIDs)
[1] "ADA/CDH2/GAGE12F" "1277/63923/8516" "1133/1132/1956/8516"
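If adding a dependency on gsubfn is not desirable, a similar in-place replacement can be sketched in base R with gregexpr() and regmatches<-. This is an alternative technique, not a description of what gsubfn does internally, and the names lookup, m, and converted are made up for illustration.

```r
gene_symbols <- data.frame(
  ENTREZID = c(1L, 10L, 100L, 1000L, 10000L, 100008586L),
  SYMBOL   = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F"),
  stringsAsFactors = FALSE
)
geneIDs <- c("100/1000/100008586", "1277/63923/8516", "1133/1132/1956/8516")

# Named character vector: names are the IDs, values are the symbols
lookup <- setNames(gene_symbols$SYMBOL, gene_symbols$ENTREZID)

# Locate every run of digits in each string
m <- gregexpr("[0-9]+", geneIDs)

# Replace each matched ID with its symbol, keeping IDs that have no match
converted <- geneIDs
regmatches(converted, m) <- lapply(regmatches(geneIDs, m), function(ids) {
  sym <- unname(lookup[ids])
  ifelse(is.na(sym), ids, sym)
})

converted
#> [1] "ADA/CDH2/GAGE12F"    "1277/63923/8516"     "1133/1132/1956/8516"
```

Because the replacement happens at the matched positions, the "/" separators are left untouched and no splitting or pasting is needed.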
data
gene_symbols <- structure(list(ENTREZID = c(1L, 10L, 100L, 1000L,
10000L, 100008586L
), SYMBOL = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F"
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6"))
geneIDs <- c("100/1000/100008586","1277/63923/8516","1133/1132/1956/8516")