I am struggling with subsetting strings from the column of a dataframe. I am dealing with language data. In my dataframe, I have a 1st column with the verb stem, and a 2nd column with a full sentence containing several words, including one which is the conjugated verb. I would like to create a 3rd column with only the conjugated verb (therefore removing the other words) that contains the same verb stem as in column 1 within the same row. I cannot simply use a list of all verb stems for this, because some sentences contain 2 verbs, and I only want the verb with the same stem as in column 1 in that row.
This is what my data looks like now:
   Verb_stem Full_sentence
1. copt      to coptu to
2. puns      punse kanchina
3. khag      basana na lo khagunse nan
And this is the output that I would like:
   Verb_stem Full_sentence             Conjugated_verb
1. copt      to coptu to               coptu
2. puns      punse kanchina            punse
3. khag      basana na lo khagunse nan khagunse
After doing some research, I tried the following formula:
Df$Conjugated_verb <- lapply(strsplit(Df$Full_sentence, " "), grep, pattern = Df$Verb_stem, value = TRUE)
The problem I am facing is that the formula seems to look for the verb stem from the 1st row in every sentence, instead of switching to a new verb stem at each row. Here is the output that I get:
   Verb_stem Full_sentence             Conjugated_verb
1. copt      to coptu to               coptu
2. puns      punse kanchina            character(0)
3. khag      basana na lo khagunse nan character(0)
I tried many things, and I have been looking for a solution for days, but I really cannot figure out how to do it. If someone had an idea, I would be super grateful! Thanks in advance!
CodePudding user response:
You can use mapply() to process Verb_stem and Full_sentence pairwise, so that each stem is matched only against the sentence in its own row.
within(df, {
  # Split each sentence on whitespace, then keep the word(s) containing that row's stem
  Conjugated_verb <- mapply(\(x, y) { z <- strsplit(y, "\\s+")[[1]]; z[grepl(x, z)] },
                            Verb_stem, Full_sentence)
})
or
within(df, {
  Conjugated_verb <- mapply(\(x, y) sub(sprintf(".*(\\w*%s\\w*).*", x), "\\1", y),
                            Verb_stem, Full_sentence)
})
Output:
#   Verb_stem Full_sentence             Conjugated_verb
# 1 copt      to coptu to               coptu
# 2 puns      punse kanchina            punse
# 3 khag      basana na lo khagunse nan khagunse
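For context on why the lapply() attempt in the question only matched the first row: grep() takes a single pattern, so when it is handed the whole Verb_stem column it uses only the first element (and warns about it). Below is a minimal sketch of that behaviour next to a pairwise alternative, assuming the three example rows from the question. The names words, first_only and paired are illustrative only. Map() is the list-returning wrapper around mapply(), and fixed = TRUE assumes the stems are literal text rather than regular expressions.

```r
# Toy data mirroring the question (assumed, not the asker's real data)
Df <- data.frame(
  Verb_stem = c("copt", "puns", "khag"),
  Full_sentence = c("to coptu to", "punse kanchina", "basana na lo khagunse nan"),
  stringsAsFactors = FALSE
)

# One character vector of words per sentence
words <- strsplit(Df$Full_sentence, " ")

# Original approach: every sentence is searched with the same pattern vector,
# and grep() silently falls back to its first element ("copt")
first_only <- suppressWarnings(
  lapply(words, grep, pattern = Df$Verb_stem, value = TRUE)
)

# Pairwise approach: the i-th stem is matched against the i-th sentence only
paired <- Map(function(stem, w) grep(stem, w, value = TRUE, fixed = TRUE),
              Df$Verb_stem, words)

# Flatten to a plain column; this assumes exactly one match per row
Df$Conjugated_verb <- unname(unlist(paired))
```

If a sentence could contain zero or several words with the stem, the unlist() step would no longer give one value per row, so keeping the result as a list column may be the safer choice in that case.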
CodePudding user response:
We may use vectorized str_extract:
library(dplyr)
library(stringr)
df1 %>%
  mutate(Conjugated = str_extract(Full_sentence, str_c(Verb_stem, "\\S*")))
-output
   Verb_stem Full_sentence             Conjugated
1. copt      to coptu to               coptu
2. puns      punse kanchina            punse
3. khag      basana na lo khagunse nan khagunse
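If loading packages is not an option, the same idea (the stem followed by any run of non-space characters) can be sketched in base R. This is only an illustration under some assumptions: extract_conjugated is a hypothetical helper name, the stems are assumed to contain no regex metacharacters, and only the first match in each sentence is kept, with NA returned when there is none.

```r
# Same example rows as in this answer, rebuilt here so the sketch is self-contained
df1 <- data.frame(
  Verb_stem = c("copt", "puns", "khag"),
  Full_sentence = c("to coptu to", "punse kanchina", "basana na lo khagunse nan"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first occurrence of "<stem><non-space chars>" in one sentence
extract_conjugated <- function(stem, sentence) {
  m <- regexpr(paste0(stem, "\\S*"), sentence)
  if (m[1] == -1) {
    NA_character_            # no word in the sentence starts a match with this stem
  } else {
    regmatches(sentence, m)  # the matched substring
  }
}

# regexpr() is not vectorized over the pattern, so pair stems and sentences with mapply()
df1$Conjugated <- unname(
  mapply(extract_conjugated, df1$Verb_stem, df1$Full_sentence)
)
```

Like the pattern in the str_extract() call, this starts the match at the stem itself, so any prefix attached before the stem would be dropped.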
data
df1 <- structure(list(Verb_stem = c("copt", "puns", "khag"),
                      Full_sentence = c("to coptu to", "punse kanchina",
                                        "basana na lo khagunse nan")),
                 class = "data.frame",
                 row.names = c("1.", "2.", "3."))