How to extract part of a string in a column of a data frame using R?

Time:06-24

(Sorry if I use the wrong terms and format horribly, this is my first post)

I am trying to extract a specific part of a string from a dataframe. This is what the entire cell looks like:

{temperature:6.689724,geographic location (longitude):-159.0224,collection date:2011-10-05/2011-10-06,environment (biome):marine biome (ENVO:00000447),environment (feature):mesopelagic zone (ENVO:00000213),environment (material):particulate matter, including plankton (ENVO:xxxxxxxx),environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,geographic location (latitude):31.528,instrument model:Illumina Genome Analyzer IIx}

I want to extract the bolded part (the "sample collection device or method:..." entry) and remove everything before and after it, and I want to repeat this for every cell in that column. My original plan was to use str_extract() to remove the string before and including "water,", then use str_extract() again to remove the string after and including "salinity". Below is my attempt; the result was that every value in Column1 was replaced with NA.

df$Column1 <- str_extract(df$Column1, "(?<=water, )(\\w )")

Thank you in advance and sorry again for the formatting...

CodePudding user response:

Here's a base R approach that splits the strings at the commas and then selects the element in the resulting vector that starts with "sample collection device". Assume that x is a single string in that column.


grep("^sample collection device", unlist(strsplit(x, ",")), value = TRUE, perl = TRUE)

[1] "sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles"
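To run the same idea over a whole column rather than a single string, one option is to wrap it in a small helper and call it with vapply(). This is only a sketch: the data frame below is a shortened, made-up stand-in for the real data, and extract_field() is a hypothetical helper name, not something from a package.

```r
# Hypothetical two-row stand-in for the real data (shortened cells)
df <- data.frame(
  Column1 = c(
    "{environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507}",
    "{environmental package:water,sample collection device or method:Niskin bottle,salinity:35.1}"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one cell at the commas and keep the piece
# that starts with the given prefix; return NA if no piece matches
extract_field <- function(x, prefix = "sample collection device") {
  parts <- unlist(strsplit(x, ",", fixed = TRUE))
  hit <- grep(paste0("^", prefix), parts, value = TRUE)
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Apply the helper to every cell; vapply() guarantees one string per row
df$Column1 <- vapply(df$Column1, extract_field, character(1), USE.NAMES = FALSE)
```

One caveat: splitting at every comma cuts a value short if the value itself contains a comma, as the "particulate matter, including plankton" entry does. That does not affect the sampling-device field in the example, but it could in other rows.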

CodePudding user response:

If every string has the same length and the part you want sits at the same character positions in every row of the column:

library(stringr)

stringyouwant <- str_sub(df$column, startingpositionofstringyouwant, endingpositionofstringyouwant)
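If the positions differ from row to row, fixed offsets will not work, and a pattern is probably the safer route. The attempt in the question returns NA because its lookbehind expects "water, " with a space after the comma, while the data has no space there. A possible fix, sketched on a shortened made-up cell (the variable names are placeholders), is to anchor on the text just before and just after the wanted field:

```r
library(stringr)

# Hypothetical shortened cell with the same layout as the question's data
x <- "{environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,instrument model:Illumina Genome Analyzer IIx}"

# Lookbehind: text must follow "environmental package:water,"
# Lazy match:  take as little as possible ...
# Lookahead:   ... up to, but not including, ",salinity:"
pattern <- "(?<=environmental package:water,).*?(?=,salinity:)"

result <- str_extract(x, pattern)
result
```

str_extract() is vectorised, so the same call should work on the whole column, for example df$Column1 <- str_extract(df$Column1, pattern). Rows that do not contain both anchors come back as NA.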

CodePudding user response:

I would do it with data.table so that you can process all rows easily.

library(data.table)

# Example cell copied from the question
string <- "{temperature:6.689724,geographic location (longitude):-159.0224,collection date:2011-10-05/2011-10-06,environment (biome):marine biome (ENVO:00000447),environment (feature):mesopelagic zone (ENVO:00000213),environment (material):particulate matter, including plankton (ENVO:xxxxxxxx),environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,geographic location (latitude):31.528,instrument model:Illumina Genome Analyzer IIx}"

# Two-row example data frame; keep the column as character so strsplit() works
dat <- data.frame(String = c(string, string), stringsAsFactors = FALSE)
# Split each row at the commas and keep the 9th piece (the sampling-device field)
dat$Substring <- unlist(lapply(dat$String, function(x) data.table::transpose(strsplit(x, ","))[9]))
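A slightly shorter variant of the same position-based idea, as a sketch: data.table's tstrsplit() splits the whole column in one call, and its keep argument returns only the requested pieces, so the lapply() is not needed. The example data below just repeats the cell from the question twice.

```r
library(data.table)

# The example cell from the question, repeated as two identical rows
string <- "{temperature:6.689724,geographic location (longitude):-159.0224,collection date:2011-10-05/2011-10-06,environment (biome):marine biome (ENVO:00000447),environment (feature):mesopelagic zone (ENVO:00000213),environment (material):particulate matter, including plankton (ENVO:xxxxxxxx),environmental package:water,sample collection device or method:ROSETTE sampler with CTD (sbe9C) and 10 Niskin bottles,salinity:34.000507,geographic location (latitude):31.528,instrument model:Illumina Genome Analyzer IIx}"
dat <- data.frame(String = c(string, string), stringsAsFactors = FALSE)

# Split every row at the commas and keep only the 9th piece;
# tstrsplit() returns a list, so [[1]] pulls out the character vector
dat$Substring <- tstrsplit(dat$String, ",", fixed = TRUE, keep = 9L)[[1]]
```

Note that the wanted field is the 9th piece only because "particulate matter, including plankton" itself contains a comma. If other rows have more or fewer commas before that field, a fixed position will pick the wrong piece, and matching on the field name is likely more reliable.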