Subset a string within a dataframe based on the value of another column

I am struggling with subsetting strings from the column of a dataframe. I am dealing with language data. In my dataframe, I have a 1st column with the verb stem, and a 2nd column with a full sentence containing several words, including one which is the conjugated verb. I would like to create a 3rd column with only the conjugated verb (therefore removing the other words) that contains the same verb stem as in column 1 within the same row. I cannot simply use a list of all verb stems for this, because some sentences contain 2 verbs, and I only want the verb with the same stem as in column 1 in that row.

This is what my data looks like now:

   Verb_stem       Full_sentence 
1. copt            to coptu to 
2. puns            punse kanchina 
3. khag            basana na lo khagunse nan

And this is the output that I would like:

   Verb_stem       Full_sentence              Conjugated_verb
1. copt            to coptu to                coptu
2. puns            punse kanchina             punse
3. khag            basana na lo khagunse nan  khagunse

After doing some research, I tried the following formula:

Df$Conjugated_verb <- lapply(strsplit(Df$Full_sentence, " "), grep, pattern = Df$Verb_stem, value = TRUE)

The problem I am facing is that the formula seems to look only for the verb stem from the 1st row in all sentences, instead of switching to a new verb stem at each row. Here is the output that I get:

   Verb_stem       Full_sentence               Conjugated_verb 
1. copt            to coptu to                 coptu
2. puns            punse kanchina              character(0)
3. khag            basana na lo khagunse nan   character(0)

I tried many things, and I have been looking for a solution for days, but I really cannot figure out how to do it. If someone had an idea, I would be super grateful! Thanks in advance!

CodePudding user response:

You can use mapply() to process Verb_stem and Full_sentence pairwise, row by row.

# Split each sentence on whitespace and keep the word(s) containing that row's stem
within(df, {
  Conjugated_verb <- mapply(\(x, y) { z <- strsplit(y, "\\s+")[[1]] ; z[grepl(x, z)] },
                            Verb_stem, Full_sentence)
})

# Alternatively, capture the word containing the stem with a regular expression
within(df, {
  Conjugated_verb <- mapply(\(x, y) sub(sprintf(".*(\\w*%s\\w*).*", x), "\\1", y),
                            Verb_stem, Full_sentence)
})

Output:

#   Verb_stem             Full_sentence Conjugated_verb
# 1      copt               to coptu to           coptu
# 2      puns            punse kanchina           punse
# 3      khag basana na lo khagunse nan        khagunse
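
As for why the attempt in the question only works for row 1: grep() takes a single pattern, so when the whole Verb_stem column is passed as pattern, only its first element ("copt") is used for every sentence (R normally warns about this). Below is a minimal base R sketch of the row-by-row idea that also returns NA instead of character(0) when nothing matches, so the new column stays a plain character vector. The helper name extract_verb is hypothetical, and fixed = TRUE assumes the stems should be matched literally rather than as regular expressions.

```r
# Data from the question
df <- data.frame(
  Verb_stem = c("copt", "puns", "khag"),
  Full_sentence = c("to coptu to", "punse kanchina", "basana na lo khagunse nan"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first word of `sentence` that contains `stem`, or NA
extract_verb <- function(stem, sentence) {
  words <- strsplit(sentence, "\\s+")[[1]]          # split on runs of whitespace
  hits <- words[grepl(stem, words, fixed = TRUE)]   # literal (non-regex) match
  if (length(hits) == 0) NA_character_ else hits[1]
}

# mapply() pairs the i-th stem with the i-th sentence;
# unname() drops the names that mapply() copies from Verb_stem
df$Conjugated_verb <- unname(mapply(extract_verb, df$Verb_stem, df$Full_sentence))
print(df)
```

If a sentence could contain the stem in more than one word, this sketch keeps only the first hit; returning all hits would turn the column into a list.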

CodePudding user response:

We can use str_extract(), which is vectorized over both the string and the pattern:

library(dplyr)
library(stringr)
df1 %>%
    mutate(Conjugated = str_extract(Full_sentence, str_c(Verb_stem, "\\S*")))

Output:

   Verb_stem             Full_sentence Conjugated
1.      copt               to coptu to      coptu
2.      puns            punse kanchina      punse
3.      khag basana na lo khagunse nan   khagunse

Data:

df1 <- structure(list(Verb_stem = c("copt", "puns", "khag"), 
Full_sentence = c("to coptu to", 
"punse kanchina", "basana na lo khagunse nan")), 
class = "data.frame", row.names = c("1.", 
"2.", "3."))
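
For readers without stringr, here is a rough base R sketch of the same stem-plus-rest-of-word idea. It is only an illustration: the helper name extract_first is hypothetical, the fourth row is made up from words in the question to mimic a sentence containing two verbs, and the \\b anchor assumes the stem always sits at the start of the conjugated verb.

```r
# Question data plus one made-up row (row 4) containing two verbs
df1 <- data.frame(
  Verb_stem = c("copt", "puns", "khag", "puns"),
  Full_sentence = c("to coptu to", "punse kanchina",
                    "basana na lo khagunse nan", "coptu punse nan"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first match of "<stem><rest of word>" in `sentence`, or NA
extract_first <- function(stem, sentence) {
  # \\b anchors the stem at a word boundary; \\S* takes the rest of the word
  m <- regexpr(paste0("\\b", stem, "\\S*"), sentence, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(sentence, m)
}

# One pattern per row, built from that row's own stem
df1$Conjugated <- unname(mapply(extract_first, df1$Verb_stem, df1$Full_sentence))
print(df1)
```

Because the pattern is built per row, the made-up fourth row picks out "punse" even though "coptu" appears in the same sentence.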