Create a column value based on a matching regular expression

Time:04-07

I have the following string in a column called "Sentences" of a data frame (df):

I like an apple

I would like to create a second column, called Type, whose values are determined by matching strings. I would like to take the regular expression \bapple\b, match it with the sentence and if it matches, add the value Fruit_apple in the Type column.

In the long run I'd like to do this with several other strings and types.

Is there an easy way to do this using a function?

dataset (survey_1):

structure(list(slider_8.response = c(1L, 1L, 3L, 7L, 7L, 7L, 
1L, 3L, 2L, 1L, 1L, 7L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 7L, 
7L, 7L, 1L, 1L, 7L, 6L, 6L, 1L, 1L, 7L, 1L, 7L, 7L, 1L, 7L, 7L, 
7L, 7L, 7L, 6L, 7L, 7L, 7L, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 7L, 2L
), Sentences = c("He might could do it.", "I ever see the film.", 
"I may manage to come visit soon.", "She’ll never be forgotten.", 
"They might find something special.", "It might not be a good buy.", 
"Maybe my pain will went away.", "Stephen maybe should fix your bicycle.", 
"It used to didnʼt matter if you walked in late.", "He’d could climb the stairs.", 
"Only Graeme would might notice that.", "I used to cycle a lot. ", 
"Your dad belongs to disagree with this. ", "We can were pleased to see her.", 
"He may should take us to the city.", "I could never forgot his deep voice.", 
"I should can turn this thing over to Ann.", "They must knew who they really are.", 
"We used to runs down three flights.", "I don’t care what he may be up to. ", 
"That’s something I ain’t know about.", "That must be quite a skill.", 
"We must be able to invite Jim.", "She used to play with a trolley.", 
"He is done gone. ", "You might can check this before making a decision.", 
"It would have a positive effect on the team. ", "Ruth can maybe look for it later.", 
"You should tag along at the dance.", "They’re finna leave town.", 
"A poem should looks like that.", "I can tell you didn’t do your homework. ", 
"I can driving now.", "They should be able to put a blanket over it.", 
"We could scarcely see each other.", "I might says I was never good at maths.", 
"The next dance will be a quickstep. ", "I might be able to find myself a seat in this place.", 
"Andrew thinks we shouldn’t do it.", "Jack could give a hand.", 
"She’ll be able to come to the event.", "She’d maybe keep the car the way it is.", 
"Sarah used to be able to agree with this proposal.", "I’d like to see your lights working. ", 
"I’d be able to get a little bit more sleep.", "John may has a second name.", 
"You must can apply for this job.", "I maybe could wait till the 8 o’clock train.", 
"She used to could go if she finished early.", "That would meaned something else, eh?", 
"You’ll can enjoy your holiday.", "We liketa drowned that day. ", 
"I must say it’s a nice feeling.", "I eaten my lunch."), construct = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA)), row.names = c(NA, 54L), class = "data.frame")

type_list:

list("DM_will_can"=c("ll can","will can"), "DM_would_could"=c("d could","would could"),
                  "DM_might_can"="might can","DM_might_could"="might could","DM_used_to_could"="used to could",
                  "DM_should_can"="should can","DM_would_might"=c("d might", "would might"),"DM_may_should"="may should",
                  "DM_must_can"="must can", "SP_will_be_able"=c("ll be able","will be able"),
                  "SP_would_be_able"=c("d be able","would be able"),"SP_might_be_able"="might be able",
                  "SP_maybe_could"="maybe could","SP_used_to_be_able"="used to be able","SP_should_be_able"=
                    "should be able","SP_would_maybe"=c("d maybe", "would maybe"), "SP_maybe_should"="maybe should",
                  "SP_must_be_able"="must be able", "Filler_will_a"="quickstep","Filler_will_b"="forgotten",
                  "Filler_would_a"="lights working","Filler_would_b"="positive effect","Filler_can_a"="homework",
                  "Filler_can_b"="Ruth","Filler_could_a"="scarcely","Filler_could_b"="Jack", "Filler_may_a"="may be up to",
                  "Filler_may_b"="visit soon", "Filler_might_a"="good buy","Filler_might_be"="something special",
                  "Filler_should_a"="tag along","Filler_should_b"="Andrew","Filler_used_to_a"="trolley",
                  "Filler_used_to_b"="cycle a lot","Filler_must_a"="quite a skill","Filler_must_b"="nice feeling",
                  "Dist_gram_will_went"="will went","Dist_gram_meaned"="meaned","Dist_gram_can_were"="can were",
                  "Dist_gram_forgot"="never forgot", "Dist_gram_may_has"="may has", 
                  "Dist_gram_might_says"="might says","Dist_gram_used_to_runs"="used to runs",
                  "Dist_gram_should_looks"="should looks","Dist_gram_must_knew"="must knew","Dist_dial_liketa"="liketa",
                  "Dist_dial_belongs"="belongs to disagree","Dist_dial_finna"="finna","Dist_dial_used_to_didnt"="used to didn't matter",
                  "Dist_dial_eaten"="I eaten", "Dist_dial_can_driving"="can driving","Dist_dial_aint_know"="That's something",
                  "Dist_dial_ever_see"="ever see the film","Dist_dial_done_gone"="done gone")

Answer:

I would normally do this with a Python dictionary, but since this is R, I have more or less translated the approach using a named list. There is probably a more idiomatic way to do this in R than two nested for loops, but this should work:

# Define data
df <- data.frame(
    id = c(1:5),
    sentences = c("I like apples", "I like dogs", "I have cats", "Dogs are cute", "I like fish")
)

#   id     sentences
# 1  1 I like apples
# 2  2   I like dogs
# 3  3   I have cats
# 4  4 Dogs are cute
# 5  5   I like fish

type_list <- list(
    "fruit" = c("apples", "oranges"),
    "animals" = c("dogs", "cats")
)

types <- names(type_list)

df$type <- NA
df$item <- NA

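# For every pattern of every type, tag the rows whose sentence matches.
# Later matches overwrite earlier ones.
for (type in types) {
    for (item in type_list[[type]]) {
        matches <- grep(item, df$sentences, ignore.case = TRUE)
        df[matches, "type"] <- type
        df[matches, "item"] <- item
    }
}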


# Output:
#   id     sentences    type   item
# 1  1 I like apples   fruit apples
# 2  2   I like dogs animals   dogs
# 3  3   I have cats animals   cats
# 4  4 Dogs are cute animals   dogs
# 5  5   I like fish    <NA>   <NA>
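
On the "more idiomatic" point, here is a minimal sketch of one possible variant, not necessarily the best one. It collapses each type's items into a single alternation pattern wrapped in \b word boundaries (as in the \bapple\b example from the question), so only one loop over the types remains. It assumes the items contain no regex metacharacters, and it only fills the type column, not item. The name patterns is just a helper introduced for this illustration.

```r
# Toy data mirroring the example above
df <- data.frame(
    id = 1:5,
    sentences = c("I like apples", "I like dogs", "I have cats",
                  "Dogs are cute", "I like fish"),
    stringsAsFactors = FALSE
)

type_list <- list(
    fruit   = c("apples", "oranges"),
    animals = c("dogs", "cats")
)

# Build one pattern per type, e.g. "\\b(apples|oranges)\\b".
# Assumption: items are plain words/phrases without regex metacharacters.
patterns <- vapply(
    type_list,
    function(items) paste0("\\b(", paste(items, collapse = "|"), ")\\b"),
    character(1)
)

df$type <- NA_character_
for (type in names(patterns)) {
    # Logical vector: TRUE where the sentence contains any item of this type
    hit <- grepl(patterns[[type]], df$sentences, ignore.case = TRUE, perl = TRUE)
    df$type[hit] <- type
}
```

With the word boundaries, a pattern such as "dogs" matches only as a whole word rather than as part of a longer word, which may or may not be what you want for your own patterns.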

EDIT

Added after the data was posted: if I read in your data as df and your type list as type_list, the following works:


types <- names(type_list)

df$type <- NA
df$item <- NA

for (type in types) {
    for (item in type_list[[type]]) {
        matches <- grep(item, df$Sentences, ignore.case = TRUE)
        df[matches, "type"] <- type
        df[matches, "item"] <- item
    }
}

This is exactly the same as my previous code, except Sentences has an upper case S in your data frame.
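
One caveat, as far as I can tell from the posted data: the sentences use typographic apostrophes (’, and what looks like ʼ in "didnʼt"), while the patterns "used to didn't matter" and "That's something" in type_list use a plain ASCII apostrophe, so those two patterns would not match as written. A minimal sketch of one way around this is to normalize the apostrophes before matching. The two sentences are copied from your data, and the code points U+2019 and U+02BC are my reading of the characters in the post.

```r
# Two sentences from the posted data, with typographic apostrophes
# (written as \u escapes; U+2019 and U+02BC are assumed from the post)
sentences <- c("It used to didn\u02BCt matter if you walked in late.",
               "That\u2019s something I ain\u2019t know about.")

# The corresponding patterns from type_list, with a plain ASCII apostrophe
patterns <- c("used to didn't matter", "That's something")

# Without normalization, neither pattern finds its sentence
before <- c(grepl(patterns[1], sentences[1], ignore.case = TRUE),
            grepl(patterns[2], sentences[2], ignore.case = TRUE))

# Replace both apostrophe variants with "'" (fixed = TRUE: literal replacement)
normalized <- gsub("\u2019", "'", sentences, fixed = TRUE)
normalized <- gsub("\u02BC", "'", normalized, fixed = TRUE)

after <- c(grepl(patterns[1], normalized[1], ignore.case = TRUE),
           grepl(patterns[2], normalized[2], ignore.case = TRUE))
```

In the full data you could match against a normalized copy of df$Sentences inside the loop and leave the original column untouched.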
