how to edit rownames in R using sub or gsub command

Time:07-02

I have a gene expression file whose row names look like this: GTEX.1117F.3226.SM.5N9CT. I want to edit the row names to look like this:

GTEX-1117F and so on.

I used these commands:

row.names(gene_exp_transpose) <- data
gsub(".","-",row.names(gene_exp_transpose)) #this just gives ----- to all the rownames data 
row.names(gene_exp) substr(data, 0,5) ## but for the last rows, it has 4 character instead of 5.

CodePudding user response:

We could do it this way:

  1. Move the row names into a column with rownames_to_column from the tibble package.

  2. Use a regular expression: sub('^([^.]+.[^.]+).*', '\\1', X) removes everything from the second dot onward.

  3. Replace the remaining . with -.

  4. Convert the column back to row names with column_to_rownames.

library(tibble)
library(dplyr)

df %>% 
  rownames_to_column("X") %>% 
  mutate(X = sub('^([^.]+.[^.]+).*', '\\1', X),
         X = sub('\\.', '-', X)) %>% 
  column_to_rownames("X")

output:

           ENSG00000223972.5 ENSG00000227232.5 ENSG00000278267.1 ENSG00000243485.5
GTEX-1117F         1.0705061         319.01082         0.0000000         0.0000000
GTEX-111FC         0.0000000         137.62750         0.8192113         1.6384227
GTEX-1128S         0.9312597          98.71353         0.0000000         0.9312597
GTEX-117XS         0.0000000         140.96666         0.0000000         0.7661232
GTEX-1192X         0.9374262         139.67650         0.0000000         0.9374262

data:

structure(list(ENSG00000223972.5 = c(1.0705061, 0, 0.9312597, 
0, 0.9374262), ENSG00000227232.5 = c(319.01082, 137.6275, 98.71353, 
140.96666, 139.6765), ENSG00000278267.1 = c(0, 0.8192113, 0, 
0, 0), ENSG00000243485.5 = c(0, 1.6384227, 0.9312597, 0.7661232, 
0.9374262)), class = "data.frame", row.names = c("GTEX.1117F.3226.SM.5N9CT", 
"GTEX.111FC.3126.SM.5GZZ2", "GTEX.1128S.2726.SM.5H12C", "GTEX.117XS.3026.SM.5N9CA", 
"GTEX.1192X.3126.SM.5N9BY"))
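
As a side note on the attempt in the question: gsub(".", "-", ...) returns only dashes because an unescaped . in a regular expression matches any character, not just a literal period. A minimal sketch of the difference, using two of the row names above as a plain character vector (the variable names ids, all_dashes, escaped and literal are only for illustration):

```r
ids <- c("GTEX.1117F.3226.SM.5N9CT", "GTEX.111FC.3126.SM.5GZZ2")

# An unescaped "." is a regex wildcard, so every character is replaced
all_dashes <- gsub(".", "-", ids)

# Escaping the dot replaces only literal periods
escaped <- gsub("\\.", "-", ids)

# fixed = TRUE treats the pattern as a plain string, with the same effect
literal <- gsub(".", "-", ids, fixed = TRUE)

print(all_dashes)
print(escaped)
```

Either the escaped pattern or fixed = TRUE should work here; which one to use is mostly a matter of taste.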

CodePudding user response:

A base R solution. Data borrowed from TarJae's answer.

In the first instruction, the regex is almost identical to TarJae's, with two differences:

  1. The first period to be matched is escaped;
  2. the end of string is made explicit.

Then the only remaining period is replaced by a dash, "-".

row.names(df) <- sub('^([^.]+\\.[^.]+).*$', '\\1', row.names(df))
row.names(df) <- sub('\\.', '-', row.names(df))
row.names(df)
#> [1] "GTEX-1117F" "GTEX-111FC" "GTEX-1128S" "GTEX-117XS" "GTEX-1192X"

Created on 2022-07-02 by the reprex package (v2.0.1)
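
If preferred, the two sub calls can probably be folded into one by capturing the two fields separately and writing the dash in the replacement. This is only a sketch of that variant, not part of the answer above. The second ID in the example is made up to show that the pattern does not depend on the field being five characters long, which was the problem with substr in the question.

```r
# First ID is from the data above; the second is a hypothetical shorter one
ids <- c("GTEX.1117F.3226.SM.5N9CT", "GTEX.ZAB4.0011.R1a")

# Group 1: text before the first period; group 2: text between the first
# and second periods; everything after that is matched by .* and dropped
short_ids <- sub('^([^.]+)\\.([^.]+).*$', '\\1-\\2', ids)

print(short_ids)
#> [1] "GTEX-1117F" "GTEX-ZAB4"
```

The same call can be applied to row.names(df) directly, as in the two-step version.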
