Remove a part of variable names in R column

Time:02-23

I want to clean up a column in an R data frame so that it contains only the species names. To do that, I would like to remove everything from the 2nd "_" onward in each value.

This is my table:

col1 Col2
Pelagodinium_beii_RCC1491_SRR1300503_MMETSP1338c20 4
Acanthoeca_10tr_SRR1294413_MMETSP0105_2c10003_g1_i1 5
Rhodosorus_marinus_UTEX-LB-2760_SRR1296985_MMETSP 5
Vannella_sp._CB-2014_DIVA3-518-3-11-1-6_SRR1296762_M 3
Florenciella_parvula_CCMP2471_SRR1294437_MMETSP134 5

I would like to have:

col1 Col2
Pelagodinium_beii 4
Acanthoeca_10tr 5
Rhodosorus_marinus 5
Vannella_sp. 3
Florenciella_parvula 5

I'm not very experienced with R and haven't found the right method.

CodePudding user response:

# Keep the first two underscore-separated fields and drop the rest
df$col1 <- sub("^([^_]+_[^_]+)_.*", "\\1", df$col1, perl = TRUE)
df
                  col1 Col2
1    Pelagodinium_beii    4
2      Acanthoeca_10tr    5
3   Rhodosorus_marinus    5
4         Vannella_sp.    3
5 Florenciella_parvula    5

With df as follows:

df <- read.table(
  text =
'col1   Col2
Pelagodinium_beii_RCC1491_SRR1300503_MMETSP1338c20  4
Acanthoeca_10tr_SRR1294413_MMETSP0105_2c10003_g1_i1 5
Rhodosorus_marinus_UTEX-LB-2760_SRR1296985_MMETSP   5
Vannella_sp._CB-2014_DIVA3-518-3-11-1-6_SRR1296762_M    3
Florenciella_parvula_CCMP2471_SRR1294437_MMETSP134  5
',
  header = TRUE
)
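
To make the pattern easier to read, here is a minimal sketch that wraps the same substitution in a helper. The function name `keep_first_two` and the placeholder value "Genus_species" are made up for illustration and are not part of the original answer. It also shows what I would expect for a value with no third "_": the pattern does not match, so `sub` returns the string unchanged.

```r
# Hypothetical helper: keep only the first two "_"-separated fields.
keep_first_two <- function(x) {
  # ^([^_]+_[^_]+)  captures field 1, an underscore, and field 2
  # _.*             matches the next underscore and everything after it
  # "\\1"           replaces the whole match with the captured part
  sub("^([^_]+_[^_]+)_.*", "\\1", x)
}

x <- c("Pelagodinium_beii_RCC1491_SRR1300503_MMETSP1338c20",
       "Vannella_sp._CB-2014_DIVA3-518-3-11-1-6_SRR1296762_M",
       "Genus_species")  # placeholder with nothing to strip

res <- keep_first_two(x)
print(res)
```

Because `sub` is vectorised, the helper can be applied directly to the column, e.g. `df$col1 <- keep_first_two(df$col1)`.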

CodePudding user response:

An option with strsplit:

df$col1 <- sapply(df$col1, function(i) paste0(strsplit(i, "_")[[1]][1:2], collapse = '_'))


#                   col1 Col2
# 1    Pelagodinium_beii    4
# 2      Acanthoeca_10tr    5
# 3   Rhodosorus_marinus    5
# 4         Vannella_sp.    3
# 5 Florenciella_parvula    5
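
One thing to watch with the `[1:2]` indexing: if a value has no "_" at all, the second element is `NA` and the result ends in "_NA". A possible variant, sketched below under the assumption that such values should simply be left as they are, uses `head(p, 2)` instead. The function name `first_two_fields` and the input "Genus" are hypothetical.

```r
# Hypothetical variant of the strsplit approach that tolerates short values.
first_two_fields <- function(x, sep = "_") {
  # fixed = TRUE treats sep as a literal string rather than a regex
  parts <- strsplit(x, sep, fixed = TRUE)
  # head(p, 2) returns at most two fields and never pads with NA
  vapply(parts, function(p) paste(head(p, 2), collapse = sep), character(1))
}

x <- c("Acanthoeca_10tr_SRR1294413_MMETSP0105_2c10003_g1_i1",
       "Rhodosorus_marinus_UTEX-LB-2760_SRR1296985_MMETSP",
       "Genus")  # placeholder value without any underscore

res <- first_two_fields(x)
print(res)
```

`vapply` with `character(1)` also guarantees a plain character vector, so the result can be assigned straight back with `df$col1 <- first_two_fields(df$col1)`.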

Another way would be to use word() from the stringr package, which extracts the 1st through 2nd "_"-separated pieces:

library(stringr)
df$col1 <- word(df$col1, 1, 2, sep = "_")