How to extract part of a string in a column of a dataframe using R?

(Sorry if I use the wrong terms and format horribly, this is my first post)

I am trying to extract a specific part of a string from a dataframe. This is what the entire cell looks like:

{temperature:6.689724,geographic location (longitude):-159.0224,collection date:2011-10-05/2011-10-06,environment (biome):marine biome (ENVO:00000447),environment (feature):mesopelagic zone (ENVO:00000213),environment (material):particulate matter, including plankton (ENVO:xxxxxxxx),environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,geographic location (latitude):31.528,instrument model:Illumina Genome Analyzer IIx}

I want to extract one part of it, the "sample collection device or method:..." entry that sits between "water," and "salinity", and remove everything before and after it. I want to repeat this for every cell in that column. My original plan was to use str_extract() to remove the string before and including "water," and then use str_extract() again to remove the string after and including "salinity". Below is my attempt; the output was that everything under Column1 got removed and replaced with NA.

df$Column1 <- str_extract(df$Column1, "(?<=water, )(\\w )")

Thank you in advance and sorry again for the formatting...

CodePudding user response：

Here's a base R approach that splits the strings at the commas and then selects the element in the resulting vector that starts with "sample collection device". Assume that x is a single string in that column.


grep("^sample collection device", unlist(strsplit(x, ",")), value = TRUE, perl = TRUE)

[1] "sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles"
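To repeat this for every cell in the column, the same split-and-match idea could be wrapped in a small helper and applied with vapply(). This is only a sketch: the helper name extract_field, the new column Device, and the shortened example strings are made up for illustration, and Column1 is the column name from the question. It uses startsWith() instead of grep(), which does the same prefix test without a regular expression.

```r
# Hypothetical two-row data frame with shortened strings; Column1 is the name used in the question
df <- data.frame(
  Column1 = c(
    "{environmental package:water,sample collection device or method:ROSETTE sampler,salinity:34.000507}",
    "{environmental package:water,salinity:35.1}"
  ),
  stringsAsFactors = FALSE
)

# Return the comma-separated field that starts with `prefix`, or NA if there is none
extract_field <- function(x, prefix = "sample collection device") {
  parts <- unlist(strsplit(x, ",", fixed = TRUE))
  hit <- parts[startsWith(parts, prefix)]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Apply the helper to every cell; vapply guarantees one string per row
df$Device <- vapply(df$Column1, extract_field, character(1), USE.NAMES = FALSE)
```

Rows that lack the field end up as NA instead of breaking the column assignment.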

CodePudding user response：

If every string has the same length and the part you want sits at the same character positions in every row of the column:

library(stringr)

stringyouwant <- str_sub(df$column, startingpositionofstringyouwant, endingpositionofstringyouwant)
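If the positions differ from row to row, they could be looked up per row instead of being hard-coded. The sketch below swaps str_sub() for base R's regexpr() and substring(), and it assumes the wanted field always starts at "sample collection device" and is followed by ",salinity"; the example strings are made up.

```r
# Made-up example strings; the wanted field starts at a different position in each
x <- c(
  "{environmental package:water,sample collection device or method:ROSETTE sampler,salinity:34.0}",
  "{environmental package:sediment,sample collection device or method:grab sampler,salinity:35.1}"
)

# Start of the wanted field in each string (-1 where the label is absent)
start <- as.integer(regexpr("sample collection device", x, fixed = TRUE))
# Last character before the ",salinity" delimiter
end <- as.integer(regexpr(",salinity", x, fixed = TRUE)) - 1L

# substring() is vectorised over start and end; rows missing either marker become NA
device <- ifelse(start > 0 & end >= start, substring(x, start, end), NA_character_)
```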

CodePudding user response：

I would do it with data.table so you can do all rows easily.

library(data.table)

string<- "{temperature:6.689724,geographic location (longitude):-159.0224,collection date:2011-10-05/2011-10-06,environment (biome):marine biome (ENVO:00000447),environment (feature):mesopelagic zone (ENVO:00000213),environment (material):particulate matter, including plankton (ENVO:xxxxxxxx),environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,geographic location (latitude):31.528,instrument model:Illumina Genome Analyzer IIx}}"

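# Two-row example; stringsAsFactors = FALSE keeps String as character, which strsplit() needs in R < 4.0
dat <- data.frame(String = c(string, string), stringsAsFactors = FALSE)
# Split each row on commas and keep the 9th piece, which is the wanted entry in this example string
dat$Substring <- unlist(lapply(dat$String, function(x) data.table::transpose(strsplit(x, ","))[9]))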
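Picking the 9th piece depends on every row having the same number of commas before the wanted entry. A regex that anchors on the surrounding text, closer to what the question attempted, avoids that. As posted, the attempt's lookbehind "(?<=water, )" requires a space after the comma, and the data has none there, so nothing can match and every row becomes NA. Below is a minimal base R sketch, assuming the wanted text always sits between "water," and ",salinity"; the example vector is made up, and the same pattern should also work in stringr::str_extract().

```r
# Made-up example vector: one string with the wanted field, one without
x <- c(
  "{environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507}",
  "{temperature:6.6}"
)

# Lazily match everything after "water," and before ",salinity" (no space after the comma)
pattern <- "(?<=water,).*?(?=,salinity)"
m <- regexpr(pattern, x, perl = TRUE)

# regmatches() returns only the successful matches, so fill them into an NA vector
result <- rep(NA_character_, length(x))
result[as.integer(m) > 0] <- regmatches(x, m)
```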