I have a dataframe with a column containing various words. I also have a separate vector of strings (not the same length as the df), and I'd like to create a new column in the dataframe that matches each word to a string containing it, keeping only the part of the string up to and including that word.
For example, I have this table:
words |
---|
apple |
plant |
banana |
animal |
fly |
ecoli |
and these strings of words:
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana", "eukaryote;animal;dog", "eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly", "prokaryote;bacterium;ecoli")
and I'd like to get this:
words | new_words |
---|---|
apple | eukaryote;plant;apple |
plant | eukaryote;plant |
banana | eukaryote;plant;banana |
animal | eukaryote;animal |
fly | eukaryote;insect;fly |
ecoli | prokaryote;bacterium;ecoli |
I've tried something along the lines of:
df$words <- c("apple", "plant", "banana", "animal", "fly", "ecoli")
df$new_words<- sub(df$words, "", stringlist)
CodePudding user response:
Loop over the 'words' column and get the matching 'stringlist' values with grep, keeping the first match ([1]). Then use sub to capture the characters up to and including the word, and replace the whole string with the backreference (\\1) of the captured group.
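# For each word: take the first string that contains it,
# then keep everything up to and including the word
df$new_words <- sapply(df$words, function(x)
  sub(sprintf("(.*%s).*", x), "\\1",
      grep(x, stringlist, value = TRUE)[1]))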
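To see what the one-liner above does, the same steps can be run for a single word. This is only a sketch: the names `word`, `hit`, `pattern` and `result` are made up for illustration, and `stringlist` is the vector from the question.

```r
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
                "eukaryote;animal;dog", "eukaryote;plant;orange",
                "eukaryote;animal;cat", "eukaryote;insect;fly",
                "prokaryote;bacterium;ecoli")

word <- "plant"

# grep(value = TRUE) returns every string containing the word;
# [1] keeps only the first of them
hit <- grep(word, stringlist, value = TRUE)[1]
# hit is "eukaryote;plant;apple"

# Build the pattern "(.*plant).*": group 1 is everything up to
# and including the word, the trailing .* is whatever follows it
pattern <- sprintf("(.*%s).*", word)

# Replace the whole string with group 1, dropping the rest
result <- sub(pattern, "\\1", hit)
# result is "eukaryote;plant"
```

Because `.*` is greedy, group 1 extends to the last occurrence of the word if it appears more than once in a string.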
-output
> df
words new_words
1 apple eukaryote;plant;apple
2 plant eukaryote;plant
3 banana eukaryote;plant;banana
4 animal eukaryote;animal
5 fly eukaryote;insect;fly
6 ecoli prokaryote;bacterium;ecoli
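One caveat: grep and sub treat each word as a regular expression, so a word containing characters such as `.` or `+` could match more than intended. A possible variant, sketched here with a hypothetical helper `truncate_at` that is not part of the answer above, matches the word literally and returns NA when no string contains it. It assumes the same example data as the question.

```r
# Hypothetical helper: cut the first string containing `word`
# right after the word
truncate_at <- function(word, strings) {
  # fixed = TRUE matches the word literally, not as a regex;
  # regexpr gives the start position per string, or -1 if absent
  pos <- regexpr(word, strings, fixed = TRUE)
  idx <- which(pos > 0)[1]                # first string containing the word
  if (is.na(idx)) return(NA_character_)   # no string contains the word
  substr(strings[idx], 1, pos[idx] + nchar(word) - 1)
}

df <- data.frame(words = c("apple", "plant", "banana", "animal", "fly", "ecoli"),
                 stringsAsFactors = FALSE)
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
                "eukaryote;animal;dog", "eukaryote;plant;orange",
                "eukaryote;animal;cat", "eukaryote;insect;fly",
                "prokaryote;bacterium;ecoli")

# vapply guarantees one character value per word
df$new_words <- vapply(df$words, truncate_at, character(1),
                       strings = stringlist, USE.NAMES = FALSE)
```

Unlike the greedy regex, this version cuts at the first occurrence of the word within the string; for the example data both give the same result.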
data
df <- structure(list(words = c("apple", "plant", "banana", "animal",
"fly", "ecoli")), class = "data.frame", row.names = c(NA, -6L
))
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
"eukaryote;animal;dog",
"eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly",
"prokaryote;bacterium;ecoli")