loop or lapply() - string concatenation by factors

I want to concatenate the text entries within each level of a factor (module_component). Eventually, I need word frequencies/n-grams of the text for each level, so as a first step I want to concatenate all text entries per level.

Data I have:

Row	module_component	Long_Text
1	Computer123	Computer retuned due to a battery issue
2	Computer 123	The computer did not power on
3	Laptop42	Screen Broken
4	Laptop42	Keyboard unresponsive
5	Lapop62	Battery chord issues

Data I want: list of categories (module_components) with the concatenated text fields (Long_Text)

module_component	concatenated_Long_Text
Computer123	Computer retuned due to a battery issue The computer did not power on
Laptop42	Screen Broken Keyboard unresponsive
Lapop62	Battery chord issues

Code I have tried

df_split <- split(df, paste0(df$module_component))
list_by_modules <- lapply(df_split, FUN = paste(df_split$LongText)) #**STUCK HERE**

I am unsure which function to pass to lapply() for the concatenation step. Passing paste(df_split$LongText) does not work.

I am open to any other methods to get this done. Thank you

CodePudding user response：

A possible solution, based on dplyr and stringr (the mutate step removes the space in "Computer 123", which I assume is a typo):

library(tidyverse)

df <- data.frame(
  Row = c(1L, 2L, 3L, 4L, 5L),
  module_component = c("Computer123",
                       "Computer 123","Laptop42","Laptop42","Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on","Screen Broken","Keyboard unresponsive",
                "Battery chord issues")
)

df %>% 
  mutate(module_component = str_remove_all(module_component,"\\s")) %>% 
  group_by(module_component) %>% 
  summarise(Long_Text = str_c(Long_Text, collapse = " "))

#> # A tibble: 3 × 2
#>   module_component Long_Text                                                    
#>   <chr>            <chr>                                                        
#> 1 Computer123      Computer retuned due to a battery issue The computer did not…
#> 2 Lapop62          Battery chord issues                                         
#> 3 Laptop42         Screen Broken Keyboard unresponsive

CodePudding user response：

Using toString in aggregate (dat is defined under "Data:" below). toString joins the entries of each group with ", ".

aggregate(Long_Text ~ module_component, dat, toString)
#   module_component                                                              Long_Text
# 1      Computer123 Computer retuned due to a battery issue, The computer did not power on
# 2          Lapop62                                                   Battery chord issues
# 3         Laptop42                                   Screen Broken, Keyboard unresponsive

Or paste.

aggregate(Long_Text ~ module_component, dat, paste, collapse=' ')
#   module_component                                                             Long_Text
# 1      Computer123 Computer retuned due to a battery issue The computer did not power on
# 2          Lapop62                                                  Battery chord issues
# 3         Laptop42                                   Screen Broken Keyboard unresponsive

I would prefer toString though.
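
The lapply() attempt in the question fails because paste(df_split$LongText) is evaluated once, up front, so FUN receives the result of paste() (a character vector) rather than a function; the column is also called Long_Text, not LongText. A minimal base R sketch that stays close to that approach, assuming the space in "Computer 123" has already been removed (text_by_module and result are just illustrative names):

```r
# Same data as in the question, assuming the "Computer 123" typo is already fixed.
dat <- data.frame(
  module_component = c("Computer123", "Computer123", "Laptop42", "Laptop42", "Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on", "Screen Broken",
                "Keyboard unresponsive", "Battery chord issues")
)

# Split only the text column by group, then collapse each group into one string.
# vapply() passes `collapse = " "` on to paste() and guarantees one string per group.
text_by_module <- vapply(split(dat$Long_Text, dat$module_component),
                         paste, character(1), collapse = " ")

# Reshape the named character vector into the two-column layout asked for.
result <- data.frame(module_component = names(text_by_module),
                     concatenated_Long_Text = unname(text_by_module))
result
```

Splitting the column directly, instead of splitting the whole data frame, avoids having to reach into each piece with `$` inside the function.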

Data:

dat <- structure(list(Row = 1:5, module_component = c("Computer123", 
"Computer123", "Laptop42", "Laptop42", "Lapop62"), Long_Text = c("Computer retuned due to a battery issue", 
"The computer did not power on", "Screen Broken", "Keyboard unresponsive", 
"Battery chord issues")), class = "data.frame", row.names = c(NA, 
-5L))
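
Since the stated end goal is word frequencies per module, here is a rough base R sketch of that next step. The tokenisation rule (lower-case, split on anything that is not a letter) is an assumption, and tokenize and word_freq are hypothetical names; a dedicated text-mining package may be a better fit for real n-gram work.

```r
# Concatenated text per module, as produced by the paste() variant above.
text_by_module <- c(
  Computer123 = "Computer retuned due to a battery issue The computer did not power on",
  Laptop42    = "Screen Broken Keyboard unresponsive",
  Lapop62     = "Battery chord issues"
)

# Assumed tokenisation: lower-case everything and split on runs of non-letters.
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z]+")[[1]]
  words[nzchar(words)]  # drop empty strings left by leading separators
}

# One frequency table per module, most frequent words first.
word_freq <- lapply(text_by_module,
                    function(x) sort(table(tokenize(x)), decreasing = TRUE))

# Count of "computer" in the Computer123 text.
word_freq$Computer123[["computer"]]
```

Because the split pattern treats every non-letter as a separator, the same code also works on the toString output, where the entries are joined with ", ".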