I'm looking to add an unknown number of new columns to a data frame based on the number of unique values found in a column of another data frame. I have the following two data frames:
user | text.string |
---|---|
Bob | I like yellow submarines |
Jane | I like red cars |
my.base.df <- data.frame(
  "user" = c("Bob", "Jane"),
  "text.string" = c("I like yellow submarines", "I like red cars")
)
theme | term |
---|---|
colours | yellow |
colours | red |
colours | blue |
cars | ford |
cars | toyota |
cars | fiat |
my.theme.df <- data.frame(
  "theme" = c(rep("cars", 3), rep("colours", 3)),
  "term" = c("ford", "toyota", "fiat", "red", "yellow", "blue")
)
And I want to flag the themes found in each text.string, to end up with something like this:
user | text.string | cars | colours |
---|---|---|---|
Bob | I like yellow submarines | 0 | 1 |
Jane | I like red cars | 1 | 1 |
I think I can match the terms to the text.string with a for loop, but I'm worried that won't scale beyond this toy example. The bit I'm really stuck on is that I can't create the "cars" or "colours" columns in my.base.df dynamically from the result of levels(my.theme.df$theme).
In the real world, my.theme.df$theme could have up to twenty levels, with over one hundred my.theme.df$term values matching a single my.theme.df$theme. Similarly, my.base.df could contain up to one thousand observations, so I'm worried about efficiency too.
Any help or pointers would be appreciated.
Thank you,
Jamie
CodePudding user response:
Split the theme data frame by theme, then search each text string for any of that theme's terms (plus the theme name itself) using a single regular expression per theme.
my.base.df <- data.frame(
"user" = c("Bob", "Jane"),
"text.string" = c("I like yellow submarines", "I like red cars")
)
my.theme.df <- data.frame(
"theme" = c(rep("cars", 3), rep("colours", 3)),
"term" = c("ford", "toyota", "fiat", "red", "yellow", "blue")
)
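# one sub-data-frame per theme, named by theme
theme_split <- split(my.theme.df, my.theme.df$theme)
for (x in names(theme_split)) {
  # build "term1|term2|...|theme" so the theme name itself is also searched for
  terms <- paste(c(theme_split[[x]][["term"]], x), collapse = "|")
  # create (or overwrite) a column named after the theme
  my.base.df[[x]] <- grepl(terms, my.base.df$text.string)
}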
my.base.df
#>   user              text.string  cars colours
#> 1  Bob I like yellow submarines FALSE    TRUE
#> 2 Jane          I like red cars  TRUE    TRUE
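The question asked for 0/1 flags rather than TRUE/FALSE, so here is a minimal sketch of the same idea that returns integers and avoids the explicit loop. The names `terms_by_theme`, `patterns` and `flags` are my own and not from the original post, and the `as.character()` call is only a precaution for R versions before 4.0, where `data.frame()` turns strings into factors by default.

```r
my.base.df <- data.frame(
  user = c("Bob", "Jane"),
  text.string = c("I like yellow submarines", "I like red cars")
)
my.theme.df <- data.frame(
  theme = c(rep("cars", 3), rep("colours", 3)),
  term = c("ford", "toyota", "fiat", "red", "yellow", "blue")
)

# Named list: one character vector of terms per theme
terms_by_theme <- split(as.character(my.theme.df$term), my.theme.df$theme)

# One regex alternation per theme, with the theme name itself appended
patterns <- vapply(
  names(terms_by_theme),
  function(theme) paste(c(terms_by_theme[[theme]], theme), collapse = "|"),
  character(1)
)

# One integer (0/1) vector per theme; the list keeps the theme names
flags <- lapply(
  patterns,
  function(p) as.integer(grepl(p, my.base.df$text.string))
)

# Add every theme as a column in a single assignment
my.base.df[names(flags)] <- flags
my.base.df
```

Note that `grepl()` matches substrings here, so a term such as "red" would also flag a string containing "bored". If that matters for your data, wrapping each pattern in word boundaries (for example `paste0("\\b(", p, ")\\b")`) is one option, and terms containing regex metacharacters would need escaping first.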