Given a data frame of topics and keywords like so:
topic | keyword |
---|---|
cheese | cheddar |
meat | beef |
meat | chicken |
cheese | swiss |
bread | focaccia |
bread | sourdough |
cheese | gouda |
My aim is to make a set of dynamic regexes based on the topic, but I don't know how to build the variable names from the topics. I can do this individually like so:
fn_get_topic_regex <- function(targettopic,df)
{
filter_df <- df |>
filter(topic == targettopic)
regex <- paste(filter_df$keyword, collapse = "|")
}
and do things like:
cheese_regex <- fn_get_topic_regex("cheese",df)
But what I'd like to be able to do is build all these regexes automatically without having to define each one.
The intended output would be something like:
cheese_regex: "cheddar|swiss|gouda"
bread_regex: "focaccia|sourdough"
meat_regex: "beef|chicken"
Where the start of the variable name is the distinct topic.
What's the best way to do that without defining each regex individually by hand?
CodePudding user response:
Here is a base R solution with your intended output in a named list.
df <- structure(list(topic = c("cheese", "meat", "meat", "cheese", "bread", "bread", "cheese"),
keyword = c("cheddar", "beef", "chicken", "swiss", "focaccia", "sourdough", "gouda")),
class = "data.frame", row.names = c(NA, -7L))
#split into a list per topic
topics <- split(df, df$topic)
#collapse the keyword column
topics <- lapply(topics, function(t) {
paste(t$keyword, collapse = "|")
})
#rename
names(topics)<- paste0(names(topics), "_regex")
topics
$bread_regex
[1] "focaccia|sourdough"
$cheese_regex
[1] "cheddar|swiss|gouda"
$meat_regex
[1] "beef|chicken"
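If the goal is really to look a regex up by its topic name, or to end up with standalone objects such as cheese_regex, the named list can be used roughly as follows. This is only a sketch: it rebuilds the list from the same example data, and regex_env is a hypothetical name for an environment that holds the generated objects.

```r
# Same example data as in the question
df <- data.frame(
  topic   = c("cheese", "meat", "meat", "cheese", "bread", "bread", "cheese"),
  keyword = c("cheddar", "beef", "chicken", "swiss", "focaccia", "sourdough", "gouda"),
  stringsAsFactors = FALSE
)

# One regex per topic, stored as a named list
topics <- lapply(split(df$keyword, df$topic), paste, collapse = "|")
names(topics) <- paste0(names(topics), "_regex")

# Option 1: look the regex up by name in the list
grepl(topics[["cheese_regex"]], c("aged cheddar", "rye loaf"))

# Option 2 (assumption: separate objects are really wanted):
# copy every list element into an environment as its own object
regex_env <- list2env(topics, envir = new.env())
get("meat_regex", envir = regex_env)
```

Passing envir = globalenv() to list2env() instead would create cheese_regex, bread_regex and meat_regex directly in the workspace, although keeping them together in a list or environment is usually easier to manage.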
CodePudding user response:
We could do something like this:
- After grouping, we could use summarise() together with paste() and collapse to get our regexes.
- Then, when a regex is needed, we could refer to it by indexing, as in the example below:
library(dplyr)
library(stringr) #str_detect
my_regex <- df %>%
group_by(topic) %>%
summarise(regex = paste(keyword, collapse = "|"))
df %>%
mutate(new_col = ifelse(str_detect(keyword, my_regex$regex[1]), "it is bread", "it is not bread"))
# A tibble: 3 × 2
topic regex
<chr> <chr>
1 bread focaccia|sourdough
2 cheese cheddar|swiss|gouda
3 meat beef|chicken
> df %>%
mutate(new_col = ifelse(str_detect(keyword, my_regex$regex[1]), "it is bread", "it is not bread"))
topic keyword new_col
1 cheese cheddar it is not bread
2 meat beef it is not bread
3 meat chicken it is not bread
4 cheese swiss it is not bread
5 bread focaccia it is bread
6 bread sourdough it is bread
7 cheese gouda it is not bread
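Indexing by position (regex[1]) relies on bread sorting first. One possible alternative, sketched here in base R, is to look the regex up by topic name. The my_regex data frame below is a stand-in for the summarise() result above, and regex_by_topic and keywords are hypothetical names used only for illustration.

```r
# Stand-in for the summarise() result shown above (same topics and regexes)
my_regex <- data.frame(
  topic = c("bread", "cheese", "meat"),
  regex = c("focaccia|sourdough", "cheddar|swiss|gouda", "beef|chicken"),
  stringsAsFactors = FALSE
)

# Named character vector: names are the topics, values are the regexes
regex_by_topic <- setNames(my_regex$regex, my_regex$topic)

# Look the regex up by topic name rather than by row position
keywords <- c("cheddar", "beef", "focaccia")
grepl(regex_by_topic[["bread"]], keywords)
```

With this lookup, the result does not depend on the order in which the topics happen to be sorted.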
CodePudding user response:
You can use dplyr's group_by() and summarise():
df %>%
group_by(topic) %>%
summarise(regex = paste(keyword, collapse = "|"))
# A tibble: 3 × 2
topic regex
<chr> <chr>
1 bread focaccia|sourdough
2 cheese cheddar|swiss|gouda
3 meat beef|chicken
Or you can apply your function to every unique value in df$topic:
library(purrr) # map_chr
map_chr(unique(df$topic) %>% setNames(paste0(., "_regex")),
fn_get_topic_regex, df = df)
cheese_regex meat_regex bread_regex
"cheddar|swiss|gouda" "beef|chicken" "focaccia|sourdough"
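Just remember to add return(regex) to the end of your function, or not to assign the last line to a variable at all; as written, the value is returned invisibly, so it only appears when you assign or print the result. I would even put everything in a single pipe chain: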
fn_get_topic_regex <- function(targettopic,df)
{
df |>
filter(topic == targettopic) |>
pull(keyword) |>
paste(collapse = "|")
}