I have a data frame in R with a column that I need to do basic text analysis on. I am able to do this by modifying the code from this source as needed. However, I now need to do the same analysis for groups of data. I've included the dput of a small sample here:
structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W",
"LEE", "LEE", "LEE"), Message = c("pump maint", "PUMP MAINT", "Pump Maintenance",
"waiting on wireline",
"seating the ball", "Waiting on wireline")), row.names = 11:16, class = "data.frame")
I want to group by the variable Pad.Name. I've tried the corpus_group function from quanteda, as well as the corpus function from the same package with the parameters docid_field = dat$Pad.Name and text_field = dat$Message, but neither of these seems to work.
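In outline, my attempts looked something like this (a sketch from memory, not my exact code):

library(quanteda)

# build a corpus from the data frame, keeping the message text
corp <- corpus(dat, text_field = "Message")

# then try to combine the documents by pad
corp_grouped <- corpus_group(corp, groups = dat$Pad.Name)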
My desired output is the most frequent words, say the top 10, and a count of those words for each unique Pad.Name. Something similar to the following, though the true counts would obviously differ:
Edit: the table formatting never seems to work here, so here is a dput and a data frame of my desired output:
structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "LEE",
"LEE"), Word = c("pump", "maint", "waiting", "wireline"), Count = c(3,
2, 2, 2)), class = "data.frame", row.names = c(NA, -4L))
output <- data.frame(Pad.Name = c("MISSOURI W", "MISSOURI W", "LEE", "LEE"), Word = c("pump", "maint", "waiting", "wireline"), Count = c(3,2,2,2))
CodePudding user response:
Would dplyr and tidytext do?
library(tidytext)
library(dplyr)
as_tibble(data) %>%
  # split to words
  unnest_tokens(word, Message) %>%
  # filter out stopwords
  anti_join(get_stopwords()) %>%
  # count by (Pad.Name, word) groups
  count(Pad.Name, word, name = "Count", sort = TRUE) %>%
  # output is sorted by Count, no grouping, so keep the top 4 overall
  slice_head(n = 4) %>%
  arrange(Pad.Name, desc(Count))
#> Joining, by = "word"
#> # A tibble: 4 × 3
#>   Pad.Name   word     Count
#>   <chr>      <chr>    <int>
#> 1 LEE        waiting      2
#> 2 LEE        wireline     2
#> 3 MISSOURI W pump         3
#> 4 MISSOURI W maint        2
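If you need the top 10 within each Pad.Name (as asked) rather than the top 4 overall, a small variation of the same pipeline should do it; this is only a sketch, reusing the same data object and grouping before slicing:

as_tibble(data) %>%
  unnest_tokens(word, Message) %>%
  anti_join(get_stopwords()) %>%
  count(Pad.Name, word, name = "Count") %>%
  # keep the 10 most frequent words within each pad
  group_by(Pad.Name) %>%
  slice_max(Count, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(Pad.Name, desc(Count))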
Input:
data <- structure(list(Pad.Name = c(
"MISSOURI W", "MISSOURI W", "MISSOURI W",
"LEE", "LEE", "LEE"
), Message = c(
"pump maint", "PUMP MAINT", "Pump Maintenance",
"waiting on wireline",
"seating the ball", "Waiting on wireline"
)), row.names = 11:16, class = "data.frame")
Created on 2023-01-26 with reprex v2.0.2
CodePudding user response:
You can split by Pad.Name, strsplit the strings, and count the words using table.
# split the data by Pad.Name
. <- split(dat, dat$Pad.Name)
# for each group: lowercase and split the messages into words, count them
# with table, reshape into Word/Count columns, and attach the group's Pad.Name
. <- lapply(., \(s) data.frame(row.names = NULL, s["Pad.Name"],
  setNames(stack(table(unlist(strsplit(tolower(s$Message), " "))))[2:1],
           c("Word", "Count"))))
# combine the per-group results and show the 10 most frequent words
. <- do.call(rbind, unname(.))
head(.[order(.$Count, .$Word, decreasing = TRUE), ], 10)
#     Pad.Name        Word Count
#9  MISSOURI W        pump     3
#7  MISSOURI W       maint     2
#6         LEE    wireline     2
#5         LEE     waiting     2
#2         LEE          on     2
#8  MISSOURI W maintenance     1
#4         LEE         the     1
#3         LEE     seating     1
#1         LEE        ball     1
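If you instead want the 10 most frequent words within each Pad.Name rather than overall, one variation on the same idea (just a sketch; it also relies on the R 4.1+ lambda syntax) takes the head of each group before combining:

. <- split(dat, dat$Pad.Name)
. <- lapply(., \(s) {
  counts <- table(unlist(strsplit(tolower(s$Message), " ")))
  d <- data.frame(Pad.Name = s$Pad.Name[1],
                  Word = names(counts),
                  Count = as.integer(counts))
  # top 10 within this pad
  head(d[order(d$Count, decreasing = TRUE), ], 10)
})
do.call(rbind, unname(.))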
Data
dat <- structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W",
"LEE", "LEE", "LEE"), Message = c("pump maint", "PUMP MAINT", "Pump Maintenance",
"waiting on wireline",
"seating the ball", "Waiting on wireline")), row.names = 11:16, class = "data.frame")