What gsub function can I use in R to get the gene name and the id number from a vector which looks like this?

head(colnames(cn), 20)

 [1] "A1BG (1)"               "NAT2 (10)"              "ADA (100)"              "CDH2 (1000)"            "AKT3 (10000)"           "GAGE12F (100008586)"   
 [7] "RNA5-8SN5 (100008587)"  "RNA18SN5 (100008588)"   "RNA28SN5 (100008589)"   "LINC02584 (100009613)"  "POU5F1P5 (100009667)"   "ZBTB11-AS1 (100009676)"
[13] "MED6 (10001)"           "NR2E3 (10002)"          "NAALAD2 (10003)"        "DUXB (100033411)"       "SNORD116-1 (100033413)" "SNORD116-2 (100033414)"
[19] "SNORD116-3 (100033415)" "SNORD116-4 (100033416)"

CodePudding user response：

1) Assuming the input is the character vector s defined in the Note at the end, we can use read.table, specifying that fields are separated by ( and that ) is a comment character, so the closing parenthesis and anything after it are dropped. We also strip white space around the fields and give the columns meaningful names. No packages are used.

DF <- read.table(text = s, sep = "(", comment.char = ")", 
  strip.white = TRUE, col.names = c("Gene", "Id"))
DF

giving this data frame, in which DF$Gene holds the gene names and DF$Id holds the ids:

         Gene        Id
1        A1BG         1
2        NAT2        10
3         ADA       100
4        CDH2      1000
5        AKT3     10000
6     GAGE12F 100008586
7   RNA5-8SN5 100008587
8    RNA18SN5 100008588
9    RNA28SN5 100008589
10  LINC02584 100009613
11   POU5F1P5 100009667
12 ZBTB11-AS1 100009676
13       MED6     10001
14      NR2E3     10002
15    NAALAD2     10003
16       DUXB 100033411
17 SNORD116-1 100033413
18 SNORD116-2 100033414
19 SNORD116-3 100033415
20 SNORD116-4 100033416

2) A variation of the above is to first replace the parentheses with spaces and then read the result in, which gives the same data frame. Note that the second argument of chartr contains two spaces, so each parenthesis is translated to a space.

read.table(text = chartr("()", "  ", s), col.names = c("Gene", "Id"))
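To see why this variation works, it can help to look at the intermediate strings. The sketch below uses a small hypothetical vector x (not the full s from the Note) and the illustrative names cleaned and DF2; it assumes every element has the form "NAME (ID)".

```r
# Hypothetical two-element input in the same "NAME (ID)" shape
x <- c("A1BG (1)", "NAT2 (10)")

# chartr maps characters one-for-one: "(" -> " " and ")" -> " "
cleaned <- chartr("()", "  ", x)
cleaned

# read.table's default separator is any run of white space,
# so each cleaned string splits into exactly two fields
DF2 <- read.table(text = cleaned, col.names = c("Gene", "Id"))
DF2
```

Because the translation leaves only white space between the name and the id, no sep or comment.char argument is needed here.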

Note

Lines <-  '[1] "A1BG (1)"               "NAT2 (10)"              "ADA (100)"              "CDH2 (1000)"            "AKT3 (10000)"           "GAGE12F (100008586)"   
 [7] "RNA5-8SN5 (100008587)"  "RNA18SN5 (100008588)"   "RNA28SN5 (100008589)"   "LINC02584 (100009613)"  "POU5F1P5 (100009667)"   "ZBTB11-AS1 (100009676)"
[13] "MED6 (10001)"           "NR2E3 (10002)"          "NAALAD2 (10003)"        "DUXB (100033411)"       "SNORD116-1 (100033413)" "SNORD116-2 (100033414)"
[19] "SNORD116-3 (100033415)" "SNORD116-4 (100033416)" '

L <- Lines |>
  textConnection() |>
  readLines() |>
  gsub(pattern = "\\[\\d+\\]", replacement = "")
s <- scan(text = L, what = "")

so s looks like this:

> dput(s)
c("A1BG (1)", "NAT2 (10)", "ADA (100)", "CDH2 (1000)", "AKT3 (10000)", 
"GAGE12F (100008586)", "RNA5-8SN5 (100008587)", "RNA18SN5 (100008588)", 
"RNA28SN5 (100008589)", "LINC02584 (100009613)", "POU5F1P5 (100009667)", 
"ZBTB11-AS1 (100009676)", "MED6 (10001)", "NR2E3 (10002)", "NAALAD2 (10003)", 
"DUXB (100033411)", "SNORD116-1 (100033413)", "SNORD116-2 (100033414)", 
"SNORD116-3 (100033415)", "SNORD116-4 (100033416)")

CodePudding user response：

First, in the future please share your data using the dput() command, so that others can paste it straight into R.

Second, here is one solution for extracting the parts you need:

library(tidyverse)

g <- c("A1BG (1)", "NAT2 (10)", "ADA (100)", "RNA18SN5 (100008588)", "RNA28SN5 (100008589)")

# id number: the text preceded by "(" and followed by ")", without the parentheses
gnumber <- stringr::str_extract(g, "(?<=\\().*?(?=\\))")
gnumber

# leading run of letters only, e.g. "A" for "A1BG"
gname <- stringr::str_extract(g, "[:alpha:]+")
gname

# or, to get the whole first word:
gname<-stringr::word(g,1,1)
gname
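Since the question asks specifically about gsub, here is a minimal base R sketch using sub, with no packages. It assumes every element has the form "NAME (digits)" and that the ids fit in R's integer range; the names x, gene and id are illustrative only.

```r
# Example input in the same "NAME (ID)" shape as the column names in the question
x <- c("A1BG (1)", "RNA5-8SN5 (100008587)", "SNORD116-1 (100033413)")

# Gene name: drop the trailing white space and parenthesised id
gene <- sub("\\s*\\(\\d+\\)$", "", x)

# Id: keep only the digits captured between the parentheses
id <- as.integer(sub(".*\\((\\d+)\\)$", "\\1", x))

gene
id
```

sub is enough here because each string contains a single match; gsub with the same patterns would return the same result.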