I want to clean up a column of an R data frame so that it keeps only the species names. I would like to remove everything from the 2nd "_" onward.
This is my table:
col1 | Col2 |
---|---|
Pelagodinium_beii_RCC1491_SRR1300503_MMETSP1338c20 | 4 |
Acanthoeca_10tr_SRR1294413_MMETSP0105_2c10003_g1_i1 | 5 |
Rhodosorus_marinus_UTEX-LB-2760_SRR1296985_MMETSP | 5 |
Vannella_sp._CB-2014_DIVA3-518-3-11-1-6_SRR1296762_M | 3 |
Florenciella_parvula_CCMP2471_SRR1294437_MMETSP134 | 5 |
I would like to have:
col1 | Col2 |
---|---|
Pelagodinium_beii | 4 |
Acanthoeca_10tr | 5 |
Rhodosorus_marinus | 5 |
Vannella_sp. | 3 |
Florenciella_parvula | 5 |
I'm not really used to R and I didn't find the right method.
CodePudding user response:
df$col1 <- sub("^([^_]+_[^_]+)_.*", "\\1", df$col1, perl = TRUE)
df
col1 Col2
1 Pelagodinium_beii 4
2 Acanthoeca_10tr 5
3 Rhodosorus_marinus 5
4 Vannella_sp. 3
5 Florenciella_parvula 5
With df as follows:
df <- read.table(
text =
'col1 Col2
Pelagodinium_beii_RCC1491_SRR1300503_MMETSP1338c20 4
Acanthoeca_10tr_SRR1294413_MMETSP0105_2c10003_g1_i1 5
Rhodosorus_marinus_UTEX-LB-2760_SRR1296985_MMETSP 5
Vannella_sp._CB-2014_DIVA3-518-3-11-1-6_SRR1296762_M 3
Florenciella_parvula_CCMP2471_SRR1294437_MMETSP134 5
',
header = TRUE
)
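To make the regex easier to follow, here is a minimal sketch of how a pattern of the shape `^([^_]+_[^_]+)_.*` can be read, piece by piece. The vector `x` and the result name `res` are only illustrative; the third value is an assumed extra case, not from the question, showing that a string without a second "_" is left unchanged.

```r
# Illustrative input: two values from the question's table, plus one
# assumed case that has only two fields.
x <- c(
  "Pelagodinium_beii_RCC1491_SRR1300503_MMETSP1338c20",
  "Vannella_sp._CB-2014_DIVA3-518-3-11-1-6_SRR1296762_M",
  "Acanthoeca_10tr"
)

# ^         anchor at the start of the string
# ([^_]+    first field: one or more characters that are not "_"
# _[^_]+)   the first "_" and the second field; the group captures both fields
# _.*       the second "_" and everything after it (matched, then dropped)
# "\\1"     replace the whole match with the captured group only
res <- sub("^([^_]+_[^_]+)_.*", "\\1", x)
print(res)
```

If a value has no second "_", the pattern does not match and sub() returns that value as it was.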
CodePudding user response:
An option with strsplit:
df$col1 <- sapply(df$col1, function(i) paste0(strsplit(i, "_")[[1]][1:2], collapse = '_'))
# col1 Col2
# 1 Pelagodinium_beii 4
# 2 Acanthoeca_10tr 5
# 3 Rhodosorus_marinus 5
# 4 Vannella_sp. 3
# 5 Florenciella_parvula 5
Another way would be to use word from the stringr package:
library(stringr)
df$col1 <- word(df$col1, 1, 2, sep = "_")