Extract string up to a different word in each row - R

Time:10-14

I have a data frame with a column of words, and a separate character vector of strings (not the same length as the data frame). I'd like to add a new column that, for each word, finds a string containing it and keeps only the part of that string up to and including the word.

For example, I have this table:

words
apple
plant
banana
animal
fly
ecoli

and these strings of words:

stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana", "eukaryote;animal;dog", "eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly", "prokaryote;bacterium;ecoli")

and I'd like to get this:

words new_words
apple eukaryote;plant;apple
plant eukaryote;plant
banana eukaryote;plant;banana
animal eukaryote;animal
fly eukaryote;insect;fly
ecoli prokaryote;bacterium;ecoli

I've tried something along the lines of:

df$words <- c("apple", "plant", "banana", "animal", "fly", "ecoli")
df$new_words<- sub(df$words, "", stringlist)

CodePudding user response:

Loop over the 'words' column and use grep to get the first 'stringlist' value that contains each word. Then use sub with a capture group that matches everything up to and including the word, and replace the whole string with the backreference (\\1) to that group.

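df$new_words <- sapply(df$words, function(x) {
    # first string in 'stringlist' that contains the word
    hit <- grep(x, stringlist, value = TRUE)[1]
    # keep everything up to and including the word
    sub(sprintf("(.*%s).*", x), "\\1", hit)
})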

-output

> df
   words                  new_words
1  apple      eukaryote;plant;apple
2  plant            eukaryote;plant
3 banana     eukaryote;plant;banana
4 animal           eukaryote;animal
5    fly       eukaryote;insect;fly
6  ecoli prokaryote;bacterium;ecoli
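
To see what each step contributes, here is a small walk-through for the single word "plant". It is only a sketch for illustration; the intermediate names (first_match, pattern, result) are not part of the answer above.

```r
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
                "eukaryote;animal;dog", "eukaryote;plant;orange",
                "eukaryote;animal;cat", "eukaryote;insect;fly",
                "prokaryote;bacterium;ecoli")

x <- "plant"

# Step 1: keep the strings that contain the word, then take the first one
first_match <- grep(x, stringlist, value = TRUE)[1]
first_match
# [1] "eukaryote;plant;apple"

# Step 2: build the pattern; the parentheses capture everything
# up to and including the word, the trailing .* matches the rest
pattern <- sprintf("(.*%s).*", x)
pattern
# [1] "(.*plant).*"

# Step 3: replace the whole string with the captured group
result <- sub(pattern, "\\1", first_match)
result
# [1] "eukaryote;plant"
```

Because only the first match is used, a word that appears in several strings (like "plant") takes its prefix from whichever string comes first in 'stringlist'.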

data

df <- structure(list(words = c("apple", "plant", "banana", "animal", 
"fly", "ecoli")), class = "data.frame", row.names = c(NA, -6L
))

stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana", 
"eukaryote;animal;dog", 
"eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly", 
"prokaryote;bacterium;ecoli")
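
One thing to be aware of: grep and sub match substrings, so a word such as "ant" would also match inside "plant", and a word with no match at all gives NA. If that matters for your data, a possible variant is to split each string on ";" and compare whole fields instead. The sketch below assumes ";" is always the separator; extract_up_to is a hypothetical helper name, not something from base R.

```r
df <- data.frame(words = c("apple", "plant", "banana", "animal",
                           "fly", "ecoli"),
                 stringsAsFactors = FALSE)

stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
                "eukaryote;animal;dog", "eukaryote;plant;orange",
                "eukaryote;animal;cat", "eukaryote;insect;fly",
                "prokaryote;bacterium;ecoli")

# Hypothetical helper: return the prefix of the first string in 'strings'
# that has 'word' as a complete field, or NA if no string has it
extract_up_to <- function(word, strings, sep = ";") {
  for (s in strings) {
    parts <- strsplit(s, sep, fixed = TRUE)[[1]]
    pos <- match(word, parts)  # position of the exact field, or NA
    if (!is.na(pos)) {
      return(paste(parts[seq_len(pos)], collapse = sep))
    }
  }
  NA_character_
}

df$new_words <- vapply(df$words, extract_up_to, character(1),
                       strings = stringlist, USE.NAMES = FALSE)
```

With the example data this should give the same new_words column as the regex version, while extract_up_to("ant", stringlist) returns NA rather than matching inside "plant".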