loop or lapply() - string concatenation by factors

Time:03-26

I want to concatenate the text entries by factor category (module_component). Eventually, I need word frequencies/n-grams of the text for each level of module_component in my dataset, so I first want to concatenate all text entries within each factor level.

Data I have:

Row module_component Long_Text
1 Computer123 Computer retuned due to a battery issue
2 Computer 123 The computer did not power on
3 Laptop42 Screen Broken
4 Laptop42 Keyboard unresponsive
5 Lapop62 Battery chord issues

Data I want: list of categories (module_components) with the concatenated text fields (Long_Text)

module_component concatenated_Long_Text
Computer123 Computer retuned due to a battery issue The computer did not power on
Laptop42 Screen Broken Keyboard unresponsive
Lapop62 Battery chord issues

Code I have tried

df_split <- split(df, paste0(df$module_component))
list_by_modules <- lapply(df_split, FUN = paste(df_split$LongText)) #**STUCK HERE**
              

I am unsure about the function for the concatenation piece. The paste(Long_Text) is not working.

I am open to any other methods to get this done. Thank you

CodePudding user response:

A possible solution based on dplyr and stringr (the mutate step removes the space in "Computer 123", which I assume is a typo):

library(tidyverse)

df <- data.frame(
  Row = c(1L, 2L, 3L, 4L, 5L),
  module_component = c("Computer123",
                       "Computer 123","Laptop42","Laptop42","Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on","Screen Broken","Keyboard unresponsive",
                "Battery chord issues")
)

df %>% 
  mutate(module_component = str_remove_all(module_component,"\\s")) %>% 
  group_by(module_component) %>% 
  summarise(Long_Text = str_c(Long_Text, collapse = " "))

#> # A tibble: 3 × 2
#>   module_component Long_Text                                                    
#>   <chr>            <chr>                                                        
#> 1 Computer123      Computer retuned due to a battery issue The computer did not…
#> 2 Lapop62          Battery chord issues                                         
#> 3 Laptop42         Screen Broken Keyboard unresponsive
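
Since the stated end goal is word frequencies per module, the concatenated strings can be split back into words and tabulated. The following is only a minimal base R sketch: the `concatenated` vector is a hypothetical stand-in for the summarised column, and the tokenization (lower-casing, splitting on anything that is not a letter) is an assumption that may not suit real data.

```r
# Hypothetical input: one concatenated string per module, named by module
concatenated <- c(
  Computer123 = "Computer retuned due to a battery issue The computer did not power on",
  Laptop42    = "Screen Broken Keyboard unresponsive"
)

# Assumed tokenization: lower-case, then split on runs of non-letter characters
word_freq <- lapply(concatenated, function(txt) {
  words <- strsplit(tolower(txt), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]          # drop empty tokens, if any
  sort(table(words), decreasing = TRUE)  # frequency table, most common first
})

# Count of "computer" in the Computer123 group
word_freq$Computer123[["computer"]]
```

The result is a named list with one frequency table per module, so `word_freq$Laptop42` can be inspected the same way.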

CodePudding user response:

Using toString in aggregate.

aggregate(Long_Text ~ module_component, dat, toString)
#   module_component                                                              Long_Text
# 1      Computer123 Computer retuned due to a battery issue, The computer did not power on
# 2          Lapop62                                                   Battery chord issues
# 3         Laptop42                                   Screen Broken, Keyboard unresponsive

Or paste.

aggregate(Long_Text ~ module_component, dat, paste, collapse=' ')
#   module_component                                                             Long_Text
# 1      Computer123 Computer retuned due to a battery issue The computer did not power on
# 2          Lapop62                                                  Battery chord issues
# 3         Laptop42                                   Screen Broken Keyboard unresponsive

I would prefer toString though.


Data:

dat <- structure(list(Row = 1:5, module_component = c("Computer123", 
"Computer123", "Laptop42", "Laptop42", "Lapop62"), Long_Text = c("Computer retuned due to a battery issue", 
"The computer did not power on", "Screen Broken", "Keyboard unresponsive", 
"Battery chord issues")), class = "data.frame", row.names = c(NA, 
-5L))
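
The split()/lapply() attempt in the question can also be made to work. It fails there because `FUN = paste(df_split$LongText)` passes the result of a call rather than a function, and because the column is named `Long_Text`, not `LongText`. A minimal sketch, assuming the same data as above with the module names already cleaned:

```r
# Same example data as above (values copied verbatim, including original spellings)
dat <- data.frame(
  module_component = c("Computer123", "Computer123", "Laptop42", "Laptop42", "Lapop62"),
  Long_Text = c("Computer retuned due to a battery issue",
                "The computer did not power on", "Screen Broken",
                "Keyboard unresponsive", "Battery chord issues")
)

# Split only the text column by the grouping column:
# the result is a named list of character vectors, one per module
text_split <- split(dat$Long_Text, dat$module_component)

# FUN must be a function; extra arguments such as collapse
# are passed through to it by lapply()
list_by_modules <- lapply(text_split, paste, collapse = " ")

list_by_modules$Laptop42
#> [1] "Screen Broken Keyboard unresponsive"
```

This returns a named list rather than a data frame, which may be convenient if each module's text is processed separately afterwards.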