Using keywords to label a new column in R

Time:10-16

I need to add a new column "Group" with mutate, based on some keywords. I tried using %in% but did not get the data I expected.

I want to create an extra column named 'Group' in my df data frame. In this column, I want to label every row using some keywords (from the keyword vectors, or maybe from a separate keywords data frame). For example:

library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns", 
                  "Deepak Nirula: The man who brought burgers and pizzas to India",
                  "Phil Foden: Manchester City midfielder signs new deal with club until 2027",
                  "The Danish tradition we all need now",
                  "Slovakia LGBT attack"),
        Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
        Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
                 "For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
                 The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
                 "Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
                 City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
                 "Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
                 "The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
          )


keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots")
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")

I need to mutate a new column "Group" based on those keywords: if a row matches keyword1, label it "Politics"; if it matches keyword2, "Food"; keyword3, "Sport"; keyword4, "Travel"; keyword5, "LGBT".

Can the matching also ignore case (as with ignore.case)?

Below is the expected output:

Title Text Group
Iran: How.. Iranian... Politics
Deepak Nir.. For any... Food
Phil Foden.. Stockpo... Sport
The Danish.. Norwegi... Travel
Slovakia L.. The men... LGBT


Thanks to everyone for taking the time.

CodePudding user response:

You could try this:

library(dplyr)

df %>%
  rowwise %>%
  mutate(
    ## add column with words found in title or text (splitting by non-word character):
    words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
    group = {
      categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
      ## i indexes those items (=keyword vectors) of list 'categories'
      ## which share at least one word with column Title or Text (so that length > 0)
      i <- categories %>% lapply(\(category) length(intersect(unlist(words), category))) %>% as.logical
      ## pick group name via index; join with ',' if more than one category applies
      c('Politics', 'Food', 'Sport', 'Travel', 'LGBT')[i] %>% paste(collapse = ',')
    }
  )

output:

## # A tibble: 5 x 4
## # Rowwise: 
##   Title                                                        Text  words group
##   <chr>                                                        <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns      "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food 
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now                         "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack                                         "The~ <chr> LGBT 
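
The word-intersection idea is case-sensitive as written, so a keyword like "Riots" would not match "riots". Below is a minimal base-R sketch of a case-insensitive variant. The helper name label_by_words and the named list categories are hypothetical and introduced only for illustration; the keyword vectors are copied from the question.

```r
# Keyword vectors from the question, stored in a named list (names = labels).
categories <- list(
  Politics = c("authorities", "Iranian", "Iraq", "control", "Riots"),
  Food     = c("McDonald's", "KFC", "McCafé", "fast food"),
  Sport    = c("caps", "trophies", "season", "seasons")
)

# Hypothetical helper: label one string by case-insensitive word overlap.
label_by_words <- function(text, categories) {
  # lower-case the text, then split it on runs of non-word characters
  words <- unlist(strsplit(tolower(text), "\\W+"))
  # a category applies if any of its lower-cased keywords is among the words
  hit <- vapply(categories, function(kw) any(tolower(kw) %in% words), logical(1))
  if (!any(hit)) return(NA_character_)
  # join the labels with ',' if more than one category applies
  paste(names(categories)[hit], collapse = ",")
}

label_by_words("RIOTS reported in the capital", categories)
label_by_words("He won 11 trophies before lunch at kfc", categories)
```

One limitation carries over from splitting on non-word characters: multi-word keywords such as "fast food", and keywords containing an apostrophe such as "McDonald's", are never equal to a single token, so they cannot match with this approach.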

CodePudding user response:

Check this out. The basic idea is to turn each keyword* vector into a case-insensitive alternation pattern: (?i) makes the match case-insensitive, | joins the keywords into alternatives, and \\b before and after the alternatives adds word boundaries, so that "caps" is matched but, for example, "capsize" is not. Nested ifelse statements then assign the Group labels:

library(tidyverse)
df %>%
  mutate(
    # sep = " " keeps the last word of Title from fusing with the first word of Text
    All = str_c(Title, Text, sep = " "),
    # rows matching none of keyword1 to keyword4 fall through to "LGBT"
    Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
            ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
            ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
            ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
  ) %>%
  select(Group)
# A tibble: 5 × 1
  Group   
  <chr>   
1 Politics
2 Food    
3 Sport   
4 Travel  
5 LGBT   
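
As a possible alternative to nesting ifelse, the same pattern idea can be sketched with dplyr::case_when, which tests each keyword vector explicitly and returns NA when nothing matches, instead of treating the last label as a default. The helper make_pattern and the small demo tibble are hypothetical and used only for illustration; the sketch assumes the keywords contain no regex metacharacters.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: one case-insensitive, word-bounded alternation
# per keyword vector (assumes keywords contain no regex metacharacters).
make_pattern <- function(keywords) {
  regex(str_c("\\b(", str_c(keywords, collapse = "|"), ")\\b"), ignore_case = TRUE)
}

keyword3 <- c("caps", "trophies", "season", "seasons")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual", "Transgender")

# Made-up demo rows, not the data from the question.
demo <- tibble(All = c("Foden has 18 CAPS for England",
                       "She raised the rainbow flag",
                       "The boat might capsize"))

labelled <- demo %>%
  mutate(Group = case_when(
    str_detect(All, make_pattern(keyword3)) ~ "Sport",
    str_detect(All, make_pattern(keyword5)) ~ "LGBT",
    TRUE ~ NA_character_   # no keyword matched
  ))

labelled
```

case_when uses the first condition that is TRUE, so the order of the branches decides the label when a row matches more than one keyword vector, just as the nesting order does with ifelse.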