I want to filter a name in the columns of a data frame in R

I have a table whose column names are strings of this type:

d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Paludibacteraceae;g__uncultured;s__uncultured_bacterium

I would like each column name to keep only the name that follows "p__". For example, the string above should become: Bacteroidota. I have been using the following code to simplify the names; however, it does not leave only the name after "p__".

nivel7_especie <- as.data.frame(read_csv("/Users/lorenzo/Documents/FIL - Lab ECyN/Proyecto FATZEIMER/Microbiota/Vegan_Diversity/Tablas/nivel7-especie_con_grupos.csv"))

# Simplify the names

colnames(nivel7_especie) <- gsub(colnames(nivel7_especie),pattern = '.*p__', replacement = "")

Thanks!

CodePudding user response：

If you are trying to rename the column(s) that start with p__, then you can do this:

colnames(nivel7_especie) <- gsub("^p__","",colnames(nivel7_especie))

If you are trying to retain only the columns that start with p__, then you can do this:

nivel7_especie[, grepl("^p__", colnames(nivel7_especie)), drop = FALSE]
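
To make the difference between the two calls concrete, here is a minimal sketch on a made-up data frame. The name `df` and its columns are hypothetical and not taken from the question's CSV; the sketch also assumes the column names really do begin with p__.

```r
# Hypothetical toy data: two columns start with "p__", one does not
df <- data.frame(
  p__Bacteroidota = c(1, 2),
  p__Firmicutes   = c(3, 4),
  sample_id       = c("a", "b")
)

# Option 1: strip a leading "p__" from every column name
renamed <- gsub("^p__", "", colnames(df))
renamed
# "Bacteroidota" "Firmicutes" "sample_id"

# Option 2: keep only the columns whose names start with "p__"
kept <- df[, grepl("^p__", colnames(df)), drop = FALSE]
colnames(kept)
# "p__Bacteroidota" "p__Firmicutes"
```

Because of the ^ anchor, neither call changes a name in which p__ appears in the middle of a longer string.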

CodePudding user response：

If I understand you correctly, you want to reduce certain strings, such as the one below, to only the alphanumeric string that follows "p__".

Data:

x <- "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Paludibacteraceae;g__uncultured;s__uncultured_bacterium"

If this is correct, you can do it by defining p__ as a positive lookbehind (?<=p__) and matching one or more word characters \\w+ occurring right after it:

library(tidyverse)
data.frame(x) %>%
  mutate(p = str_extract(x, "(?<=p__)\\w+"))
                                                                                                                       x
1 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Paludibacteraceae;g__uncultured;s__uncultured_bacterium
             p
1 Bacteroidota
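
Since the question is about column names rather than values, the same idea might be applied to colnames() directly. The sketch below uses a capture group with base R sub() instead of the lookbehind, so it needs no packages. The data frame `df` and its two shortened taxonomy strings are made up for illustration.

```r
# Hypothetical data frame whose column names are taxonomy strings
df <- data.frame(a = 1:2, b = 3:4)
colnames(df) <- c(
  "d__Bacteria;p__Bacteroidota;c__Bacteroidia",
  "d__Bacteria;p__Firmicutes;c__Bacilli"
)

# Capture the run of word characters after "p__" and drop everything else;
# names that contain no "p__" are left unchanged by sub()
colnames(df) <- sub(".*p__(\\w+).*", "\\1", colnames(df), perl = TRUE)
colnames(df)
# "Bacteroidota" "Firmicutes"
```

If several columns belong to the same phylum, the resulting names will be duplicated; make.unique() is one way to tell them apart.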