Grouping text data in a corpus by a data frame variable

I have a data frame in R with a column on which I need to do basic text analysis. I can do this by adapting the code from this source as needed. However, I now need to run the same analysis on groups of data. I've included the dput of a small sample here.

structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", 
"LEE", "LEE", "LEE"), Message = c("pump maint", "PUMP MAINT", "Pump Maintenance", 
"waiting on wireline", 
"seating the ball", "Waiting on wireline")), row.names = 11:16, class = "data.frame")

I want to group by the variable Pad.Name. I've tried the corpus_group function from quanteda as well as the corpus function from the same package, setting the parameters docid_field = dat$Pad.Name and text_field = dat$Message. Neither approach seems to work.

My desired output is the most frequent words, say the top 10, and a count of each of those words, for each unique Pad.Name. Something similar to the following, with whatever the true counts work out to be:

edit: the table option never seems to work here, so here is a dput and data frame of my desired output

structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "LEE", 
"LEE"), Word = c("pump", "maint", "waiting", "wireline"), Count = c(3, 
2, 2, 2)), class = "data.frame", row.names = c(NA, -4L))

output <- data.frame(Pad.Name = c("MISSOURI W", "MISSOURI W", "LEE", "LEE"), Word = c("pump", "maint", "waiting", "wireline"), Count = c(3,2,2,2))

CodePudding user response：

Would dplyr and tidytext do?

library(tidytext)
library(dplyr)

as_tibble(data) %>% 
  # split to words
  unnest_tokens(word,Message) %>% 
  # filter out stopwords
  anti_join(get_stopwords()) %>% 
  # count by (Pad.Name, word) groups 
  count(Pad.Name, word, name = "Count", sort = TRUE) %>%
  # output is sorted by Count, no grouping, keep top-4
  slice_head(n = 4) %>% 
  arrange(Pad.Name, desc(Count))
#> Joining, by = "word"
#> # A tibble: 4 × 3
#>   Pad.Name   word     Count
#>   <chr>      <chr>    <int>
#> 1 LEE        waiting      2
#> 2 LEE        wireline     2
#> 3 MISSOURI W pump         3
#> 4 MISSOURI W maint        2
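
The pipeline above keeps the four most frequent rows overall rather than per group. If the goal is the top n words within each Pad.Name, one possible variant groups before slicing. This is only a sketch: it assumes dplyr 1.0.0 or later (for slice_max) and the stopwords package that get_stopwords() relies on, and the name top_per_pad is illustrative, not part of the original answer.

```r
library(dplyr)
library(tidytext)

# Same sample data as in the question, repeated so the block is self-contained
data <- data.frame(
  Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", "LEE", "LEE", "LEE"),
  Message = c("pump maint", "PUMP MAINT", "Pump Maintenance",
              "waiting on wireline", "seating the ball", "Waiting on wireline")
)

top_per_pad <- data %>%
  # one row per word; unnest_tokens lower-cases by default
  unnest_tokens(word, Message) %>%
  # drop stopwords; naming the join column avoids the "Joining" message
  anti_join(get_stopwords(), by = "word") %>%
  # count each word within each Pad.Name
  count(Pad.Name, word, name = "Count") %>%
  # keep at most 10 rows per Pad.Name, highest counts first
  group_by(Pad.Name) %>%
  slice_max(Count, n = 10, with_ties = FALSE) %>%
  ungroup()

top_per_pad
```

With with_ties = FALSE each group is capped at exactly n rows; leaving it at the default TRUE would keep every word tied with the n-th one.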

Input:

data <- structure(list(Pad.Name = c(
  "MISSOURI W", "MISSOURI W", "MISSOURI W",
  "LEE", "LEE", "LEE"
), Message = c(
  "pump maint", "PUMP MAINT", "Pump Maintenance",
  "waiting on wireline",
  "seating the ball", "Waiting on wireline"
)), row.names = 11:16, class = "data.frame")

Created on 2023-01-26 with reprex v2.0.2

CodePudding user response：

You can split the data frame by Pad.Name, split each message into words with strsplit, and count the words with table.

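# One data frame per Pad.Name
. <- split(dat, dat$Pad.Name)
# Within each group: lower-case, split on spaces, tabulate the words
. <- lapply(., \(s) data.frame(row.names = NULL, s["Pad.Name"],
  setNames(stack(table(unlist(strsplit(tolower(s$Message), " "))))[2:1],
           c("Word", "Count") )))
# Recombine the groups into a single data frame
. <- do.call(rbind, unname(.))
# Ten most frequent rows overall (not per group)
head(.[order(.$Count, .$Word, decreasing = TRUE),], 10)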
#    Pad.Name        Word Count
#9 MISSOURI W        pump     3
#7 MISSOURI W       maint     2
#6        LEE    wireline     2
#5        LEE     waiting     2
#2        LEE          on     2
#8 MISSOURI W maintenance     1
#4        LEE         the     1
#3        LEE     seating     1
#1        LEE        ball     1
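
The head() call above returns the ten most frequent rows across all pads. For a cutoff applied within each Pad.Name instead, a base R sketch along the same lines might look like the following. The names top_n, words, long, counts and result are illustrative, and the cutoff is set to 2 only so that the truncation is visible on this small sample (the question asks for 10).

```r
# Same sample data as in the question, repeated so the block is self-contained
dat <- data.frame(
  Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", "LEE", "LEE", "LEE"),
  Message = c("pump maint", "PUMP MAINT", "Pump Maintenance",
              "waiting on wireline", "seating the ball", "Waiting on wireline")
)

top_n <- 2  # hypothetical cutoff for illustration

# Lower-case each message and split it on single spaces
words <- strsplit(tolower(dat$Message), " ", fixed = TRUE)

# Long format: one row per (Pad.Name, word) occurrence
long <- data.frame(
  Pad.Name = rep(dat$Pad.Name, lengths(words)),
  Word = unlist(words),
  Count = 1L
)

# Sum the occurrences of each word within each pad
counts <- aggregate(Count ~ Pad.Name + Word, data = long, FUN = sum)

# Sort by pad, then descending count (ties broken alphabetically by word)
counts <- counts[order(counts$Pad.Name, -counts$Count, counts$Word), ]

# Keep the first top_n rows of each pad and recombine
result <- do.call(rbind, lapply(split(counts, counts$Pad.Name), head, top_n))
rownames(result) <- NULL
result
```

No stopwords are removed here, so common words such as "on" can appear in the result; filtering long$Word against a stopword vector before aggregating would be one way to handle that.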

Data

dat <- structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", 
"LEE", "LEE", "LEE"), Message = c("pump maint", "PUMP MAINT", "Pump Maintenance", 
"waiting on wireline", 
"seating the ball", "Waiting on wireline")), row.names = 11:16, class = "data.frame")