How to replace strings separated by "/" for their match in another data.frame using R?

Time:09-10

I am working with a data frame in R in which one column contains gene IDs separated by slashes, like the following:

> geneIDs <- c("100/1000/100008586","1277/63923/8516","1133/1132/1956/8516")
> geneIDs
[1] "100/1000/100008586" "1277/63923/8516"      "1133/1132/1956/8516" 

I need to convert each of the gene IDs to its gene symbol, based on a data.frame in which each row contains a gene ID and its corresponding gene symbol, as shown below:

> head(gene_symbols)
   ENTREZID  SYMBOL
1         1    A1BG
2        10    NAT2
3       100     ADA
4      1000    CDH2
5     10000    AKT3
6 100008586 GAGE12F

Using the first element from the geneIDs as an example, my expected outcome would look like:

> geneIDs
[1] "ADA/CDH2/GAGE12F"

Thank you very much in advance!

CodePudding user response:

Possible solution:

library(dplyr)
library(tidyr)

geneIDs <- c("100/1000/100008586","1277/63923/8516","1133/1132/1956/8516")

# Lookup table from the question; ENTREZID is converted to character
# so that it can be joined against the split string fragments.
lookupTable <- structure(list(ENTREZID = c(1L, 10L, 100L, 1000L, 10000L, 100008586L
), SYMBOL = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F"
)), row.names = c(NA, -6L), class = "data.frame") %>% 
  mutate(ENTREZID = as.character(ENTREZID))


as_tibble(x = geneIDs) %>% 
  mutate(value = strsplit(geneIDs, split = "/")) %>% 
  unnest_longer(value) %>% 
  left_join(lookupTable, by = c("value" = "ENTREZID"))

Which gives:

# A tibble: 10 × 2
   value     SYMBOL 
   <chr>     <chr>  
 1 100       ADA    
 2 1000      CDH2   
 3 100008586 GAGE12F
 4 1277      NA     
 5 63923     NA     
 6 8516      NA     
 7 1133      NA     
 8 1132      NA     
 9 1956      NA     
10 8516      NA 

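Or, to return exactly what you specified for the first element (IDs without a match are dropped, and all matched symbols are collapsed into a single string):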

geneString <- as_tibble(x = geneIDs) %>% 
  mutate(value = strsplit(geneIDs, split = "/")) %>% 
  unnest_longer(value) %>% 
  left_join(lookupTable, by = c("value" = "ENTREZID")) %>% 
  filter(!is.na(SYMBOL)) %>% 
  pull(SYMBOL)

paste(geneString, collapse = "/")
"ADA/CDH2/GAGE12F"

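If you need one converted string per element of geneIDs rather than a single collapsed string, the same join can be grouped by input position. This is only a sketch under a few assumptions: the `row` column and the `perString` name are made up for illustration, the lookup table is rebuilt with `tibble()` from the values in the question, and IDs without a match are kept as-is via `coalesce()` instead of being dropped.

```r
library(dplyr)
library(tidyr)
library(tibble)

geneIDs <- c("100/1000/100008586", "1277/63923/8516", "1133/1132/1956/8516")

# Lookup table from the question, with ENTREZID stored as character
# so that it joins against the split string fragments.
lookupTable <- tibble(
  ENTREZID = c("1", "10", "100", "1000", "10000", "100008586"),
  SYMBOL   = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F")
)

perString <- tibble(row = seq_along(geneIDs), value = geneIDs) %>%
  # One row per ID, remembering which input string it came from
  mutate(value = strsplit(value, split = "/", fixed = TRUE)) %>%
  unnest_longer(value) %>%
  left_join(lookupTable, by = c("value" = "ENTREZID")) %>%
  # Assumption: keep the original ID when there is no symbol for it
  mutate(SYMBOL = coalesce(SYMBOL, value)) %>%
  # Paste the pieces back together, one string per input element
  group_by(row) %>%
  summarise(symbols = paste(SYMBOL, collapse = "/"), .groups = "drop") %>%
  pull(symbols)

perString
```

Grouping by the position index rather than by the original string keeps duplicated input strings separate and preserves the input order.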
CodePudding user response:

You could split the strings at each "/" and match each fragment against the ENTREZID column to look up the SYMBOL. Replace any non-matches with the original string fragment, then paste the result back together, collapsing with "/".

sapply(strsplit(geneIDs, '/'), function(x) {
  y <- gene_symbols$SYMBOL[match(x, gene_symbols$ENTREZID)]
  y[is.na(y)] <- x[is.na(y)]
  paste0(y, collapse = '/')
})

#> [1] "ADA/CDH2/GAGE12F"    "1277/63923/8516"     "1133/1132/1956/8516"
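
Since the question mentions that the IDs live in a column of a data frame, the same split / match / paste steps could be wrapped in a small helper and assigned to a new column. This is a sketch only: `ids_to_symbols`, `results` and the column names `geneID` and `symbols` are hypothetical names, not something from the question.

```r
# Lookup table from the question
gene_symbols <- data.frame(
  ENTREZID = c(1L, 10L, 100L, 1000L, 10000L, 100008586L),
  SYMBOL   = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: converts each "/"-separated ID string to symbols,
# keeping the original ID wherever the lookup has no match.
ids_to_symbols <- function(ids, lookup) {
  vapply(strsplit(ids, "/", fixed = TRUE), function(x) {
    y <- lookup$SYMBOL[match(x, lookup$ENTREZID)]
    missing <- is.na(y)
    y[missing] <- x[missing]
    paste(y, collapse = "/")
  }, character(1))
}

# Hypothetical data frame holding the ID strings in a column
results <- data.frame(
  geneID = c("100/1000/100008586", "1277/63923/8516", "1133/1132/1956/8516"),
  stringsAsFactors = FALSE
)
results$symbols <- ids_to_symbols(results$geneID, gene_symbols)
results
```

`vapply()` is used instead of `sapply()` here so that the result is always a character vector of the same length as the column, even when the input is empty.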

CodePudding user response:

You can do this:

library(tidyverse)

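geneIDs %>%
  map(~ {
    # gene_symbols is the lookup data.frame from the question.
    # Note: %in% returns the symbols in lookup-table order, not in the order of the IDs in the string.
    vec <- gene_symbols$SYMBOL[gene_symbols$ENTREZID %in% unlist(str_split(.x, '/'))]
    if (length(vec) > 0) paste(vec, collapse = '/')
  }) %>%
  keep(~ length(.x) > 0)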


[[1]]
[1] "ADA/CDH2/GAGE12F"

CodePudding user response:

Perhaps gsubfn can be used here

library(gsubfn)
library(tibble)
gsubfn("\\d+", as.list(deframe(gene_symbols)), geneIDs)
[1] "ADA/CDH2/GAGE12F"    "1277/63923/8516"     "1133/1132/1956/8516"

data

gene_symbols <- structure(list(ENTREZID = c(1L, 10L, 100L, 1000L, 
10000L, 100008586L
), SYMBOL = c("A1BG", "NAT2", "ADA", "CDH2", "AKT3", "GAGE12F"
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", 
"6"))
geneIDs <- c("100/1000/100008586","1277/63923/8516","1133/1132/1956/8516")