(Sorry if I use the wrong terms and format horribly, this is my first post)
I am trying to extract a specific part of a string from a dataframe. This is what the entire cell looks like:
{temperature:6.689724,geographic location (longitude):-159.0224,collection date:2011-10-05/2011-10-06,environment (biome):marine biome (ENVO:00000447),environment (feature):mesopelagic zone (ENVO:00000213),environment (material):particulate matter, including plankton (ENVO:xxxxxxxx),environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,geographic location (latitude):31.528,instrument model:Illumina Genome Analyzer IIx}
I want to extract the bolded part (the entry that starts with "sample collection device or method:") and remove everything before and after it, and I want to repeat this for every cell in that column. My original plan was to use str_extract() to remove everything up to and including "water,", then use str_extract() again to remove everything from "salinity" onward. Below is my attempt; the result was that every value in Column1 was replaced with NA.
df$Column1 <- str_extract(df$Column1, "(?<=water, )(\\w )")
Thank you in advance and sorry again for the formatting...
CodePudding user response:
Here's a base R approach that splits the string at the commas and then selects the element of the resulting vector that starts with "sample collection device". Assume that x is a single string from that column.
grep("^sample collection device", unlist(strsplit(x, ",")), value = TRUE, perl = TRUE)
[1] "sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles"
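To apply the same idea to every cell rather than a single string, one option is to wrap the split-and-grep step in a small helper and map it over the column. This is only a sketch: the data frame `df`, the output column `Device`, and the helper `extract_device` are hypothetical names, and the example strings are shortened stand-ins for the real cells.

```r
# Hypothetical two-row data frame standing in for the real data
df <- data.frame(
  Column1 = c(
    "{environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507}",
    "{temperature:6.7,salinity:34.0}"
  ),
  stringsAsFactors = FALSE
)

# Return the first comma-separated piece that starts with the key,
# or NA if no piece does
extract_device <- function(x) {
  pieces <- unlist(strsplit(x, ",", fixed = TRUE))
  hit <- grep("^sample collection device", pieces, value = TRUE)
  if (length(hit) == 0) NA_character_ else hit[1]
}

# vapply guarantees exactly one character value per row
df$Device <- vapply(df$Column1, extract_device, character(1), USE.NAMES = FALSE)
```

Rows without the entry get NA instead of raising an error. Note that a value that itself contains a comma would be cut off at that comma, and an entry in first position would start with "{" and so would not match the "^" anchor.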
CodePudding user response:
If every string has the same length and the substring you want sits at the same position in each row of the column, you can extract it by position:
library(stringr)
stringyouwant <- str_sub(df$column, startingpositionofstringyouwant, endingpositionofstringyouwant)
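If the positions vary from row to row, fixed offsets will not work, and a pattern is probably safer. The sketch below swaps in a different technique (regex lookarounds with str_extract()) and assumes, as in the question, that the wanted entry always sits between "water," and ",salinity". The variable names are illustrative only. It also suggests why the original attempt returned NA: the lookbehind there was "water, " with a trailing space, but the data has no space after the comma, so nothing matched.

```r
library(stringr)

# Shortened stand-in for one cell of the column
x <- "{environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,geographic location (latitude):31.528}"

# Lookbehind starts the match right after "water,";
# the lazy .*? stops at the first ",salinity" found by the lookahead
between <- str_extract(x, "(?<=water,).*?(?=,salinity)")

# Alternative that does not depend on the neighbouring entries:
# match the key itself, then everything up to the next comma
by_key <- str_extract(x, "sample collection device or method:[^,]*")
```

Because str_extract() is vectorized, passing the whole column (for example df$Column1) in place of x returns one result per row, with NA where the pattern does not match.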
CodePudding user response:
I would do it with data.table so you can do all rows easily.
library(data.table)
string<- "{temperature:6.689724,geographic location (longitude):-159.0224,collection date:2011-10-05/2011-10-06,environment (biome):marine biome (ENVO:00000447),environment (feature):mesopelagic zone (ENVO:00000213),environment (material):particulate matter, including plankton (ENVO:xxxxxxxx),environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,geographic location (latitude):31.528,instrument model:Illumina Genome Analyzer IIx}}"
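# Build a two-row example data frame with the string in a column named String
dat <- data.frame(String = c(string, string), stringsAsFactors = FALSE)
# Split each string at the commas and keep the 9th piece (the comma inside "particulate matter, including plankton" counts as a split, so the wanted entry is piece 9)
dat$Substring <- unlist(lapply(dat$String, function(x) data.table::transpose(strsplit(x, ","))[9]))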