I want to concatenate the text entries within each factor category (module_component). Eventually, I need word frequencies/n-grams for the text of each module_component in my dataset, so I first want to concatenate all text entries in each factor level.
Data I have:
Row | module_component | Long_Text |
---|---|---|
1 | Computer123 | Computer retuned due to a battery issue |
2 | Computer 123 | The computer did not power on |
3 | Laptop42 | Screen Broken |
4 | Laptop42 | Keyboard unresponsive |
5 | Lapop62 | Battery chord issues |
Data I want: list of categories (module_components) with the concatenated text fields (Long_Text)
module_component | concatenated_Long_Text |
---|---|
Computer123 | Computer retuned due to a battery issue The computer did not power on |
Laptop42 | Screen Broken Keyboard unresponsive |
Lapop62 | Battery chord issues |
Code I have tried
df_split <- split(df, paste0(df$module_component))
list_by_modules <- lapply(df_split, FUN = paste(df_split$LongText)) #**STUCK HERE**
I am unsure about the function for the concatenation piece. The paste(Long_Text) call is not working.
I am open to any other methods to get this done. Thank you
CodePudding user response:
A possible solution, based on stringr (the mutate step removes the space in Computer 123, which, I guess, is a typo):
library(tidyverse)
df <- data.frame(
  Row = c(1L, 2L, 3L, 4L, 5L),
  module_component = c("Computer123",
    "Computer 123","Laptop42","Laptop42","Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
    "The computer did not power on","Screen Broken","Keyboard unresponsive",
    "Battery chord issues")
)
df %>%
  mutate(module_component = str_remove_all(module_component,"\\s")) %>%
  group_by(module_component) %>%
  summarise(Long_Text = str_c(Long_Text, collapse = " "))
#> # A tibble: 3 × 2
#> module_component Long_Text
#> <chr> <chr>
#> 1 Computer123 Computer retuned due to a battery issue The computer did not…
#> 2 Lapop62 Battery chord issues
#> 3 Laptop42 Screen Broken Keyboard unresponsive
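As a side note on why the paste(Long_Text) attempt in the question does not concatenate: paste() is vectorised, so without the collapse argument it returns one string per input element instead of joining them. A minimal base R sketch (the vector x and the variable names are made up for illustration):

```r
# Two text entries belonging to the same (hypothetical) group
x <- c("Screen Broken", "Keyboard unresponsive")

# Without collapse, paste() returns one string per element
no_collapse <- paste(x)
length(no_collapse)

# With collapse, the elements are joined into a single string
collapsed <- paste(x, collapse = " ")
collapsed
```

This is the same reason str_c() above is called with collapse = " ".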
CodePudding user response:
Using toString in aggregate.
aggregate(Long_Text ~ module_component, dat, toString)
# module_component Long_Text
# 1 Computer123 Computer retuned due to a battery issue, The computer did not power on
# 2 Lapop62 Battery chord issues
# 3 Laptop42 Screen Broken, Keyboard unresponsive
Or paste.
aggregate(Long_Text ~ module_component, dat, paste, collapse=' ')
# module_component Long_Text
# 1 Computer123 Computer retuned due to a battery issue The computer did not power on
# 2 Lapop62 Battery chord issues
# 3 Laptop42 Screen Broken Keyboard unresponsive
I would prefer toString though.
Data:
dat <- structure(list(Row = 1:5, module_component = c("Computer123",
"Computer123", "Laptop42", "Laptop42", "Lapop62"), Long_Text = c("Computer retuned due to a battery issue",
"The computer did not power on", "Screen Broken", "Keyboard unresponsive",
"Battery chord issues")), class = "data.frame", row.names = c(NA,
-5L))
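If you would rather stay close to the split()/lapply() idea from the question, something along these lines should also work. It is only a sketch: it assumes the module names are already cleaned (no stray space in Computer 123), and the names text_by_module, concatenated and result are placeholders I chose for illustration.

```r
# Same example data as above, rebuilt here so the snippet runs on its own
dat <- data.frame(
  module_component = c("Computer123", "Computer123", "Laptop42",
                       "Laptop42", "Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on", "Screen Broken",
                "Keyboard unresponsive", "Battery chord issues"),
  stringsAsFactors = FALSE
)

# Split only the text column by module, giving a list of character vectors
text_by_module <- split(dat$Long_Text, dat$module_component)

# Collapse each group's texts into one string; extra arguments to sapply()
# are passed on to paste(), so collapse = " " applies within each group
concatenated <- sapply(text_by_module, paste, collapse = " ")

# Turn the named character vector back into a data frame
result <- data.frame(
  module_component = names(concatenated),
  Long_Text = unname(concatenated),
  stringsAsFactors = FALSE
)
result
```

The key difference from the attempt in the question is that paste is applied to each group's Long_Text separately and is given collapse = " ".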