How to extract and plot the count of top 20 most common words (uni, bi, tri, ngram) from a column us

Time:06-25

This is a reproducible dataframe that I'm working...

covid <- structure(list(Refid = c(32740925L, 32891569L, 2007846266L, 2007846378L, 
2007856056L, 2007858108L, 2007863577L, 2007872004L, 2007872013L, 
2007915036L, 2007915277L, 2007916087L, 2007916147L, 2007916184L, 
2007916258L, 2007916285L, 2007916333L, 2007916710L, 2007917006L, 
2007918143L, 2007920589L, 2007921553L, 2007967876L, 2007967891L, 
2007967904L, 2007968097L, 2007968362L, 2007968557L, 2007968993L, 
2008010059L, 2008010956L, 2008010970L, 2008011456L, 2008011614L, 
2008011632L), Title = c("Telemedicine in Otolaryngology in the COVID-19 Era: Initial Lessons Learned.", 
"Paracervical blocks facilitate timely brachytherapy amidst COVID-19.", 
"The Perils of Covid-19 for Otorhinolaryngologists: An Overview.", 
"Air care: an 'aerography' of breath, buildings and bugs in the cystic fibrosis clinic.", 
"Breath analysis for detection of viral infection, the current position of the field.", 
"An epidemiological study to assess the prevalence of diabetic peripheral neuropathic pain among adults with diabetes attending private and institutional outpatient clinics in South Africa.", 
"Concerns and strategies for wastewater treatment during COVID-19 pandemic to stop plausible transmission", 
"A Chemoenzymatic Synthesis of the (RP)-Isomer of the Antiviral Prodrug Remdesivir", 
"Clinical characteristics and outcome of hemodialysis patients with COVID-19: a large cohort study in a single Chinese center", 
"Impact of COVID-19 pandemic on waste management.", "Long-Lasting, Patient-Controlled, Procedure-Free Contraception: A Review of Annovera with a Pharmacist Perspective.", 
"Bacillus Calmette-Guerin (BCG) vaccine generates immunoregulatory cells in the cervical lymph nodes in guinea pigs injected intra dermally.", 
"Comparison analysis of different swabs and transport mediums suitable for SARS-CoV-2 testing following shortages.", 
"Coronaviruses widespread on nonliving surfaces: important questions and promising answers.", 
"A Surface Coating that Rapidly Inactivates SARS-CoV-2.", "Plexiglas barrier box to improve ERCP safety during the COVID-19 pandemic.", 
"COVID-19 Pandemic Repercussions on the Use and Management of Plastics.", 
"Simple, Low-Cost and Long-Lasting Film for Virus Inactivation Using Avian Coronavirus Model as Challenge.", 
"Cytokine storm intervention in the early stages of COVID-19 pneumonia.", 
"In vitro measurement of the permeability of endovascular coils deployed in cerebral aneurysms.", 
"A new system of microwave ablation at 2450 MHz: preliminary experience.", 
"Endovascular treatment of 404 intracranial aneurysms treated with nexus detachable coils: short-term and mid-term results from a prospective, consecutive, European multicenter study.", 
"Environmentally friendly non-medical mask: An attempt to reduce the environmental impact from used masks during COVID 19 pandemic", 
"Plastic residues produced with confirmatory testing for COVID-19: Classification, quantification, fate, and impacts on human health", 
"What we need to know about PPE associated with the COVID-19 pandemic in the marine environment", 
"Peroral endoscopy during the COVID-19 pandemic: Efficacy of the acrylic box (Endo-Splash Protective (ESP) box) for preventing droplet transmission", 
"Disinfection of gloved hands during the COVID-19 pandemic", 
"Cysteine focused covalent inhibitors against the main protease of SARS-CoV-2", 
"Just the Facts: Recommendations on point-of-care ultrasound use and machine infection control during the coronavirus disease 2019 pandemic", 
"A comprehensive risk assessment of toxic elements in international brands of face foundation powders", 
"Pediatric E.N.T. emergencies during COVID-19 pandemic: our experience.", 
"Collective aeromedical transport of COVID-19 critically ill patients in Europe: A retrospective study.", 
"Severity of COVID-19 at elevated exposure to perfluorinated alkylates.", 
"Assessment of water and sanitation systems at Palestinian healthcare facilities: pre- and post-COVID-19.", 
"Drinking water pollutants may affect the immune system: concerns regarding COVID-19 health effects."
)), class = "data.frame", row.names = c(NA, -35L))

I tried the solution from https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html, but the first few rows of the result look like this:

library(udpipe)   # txt_freq() comes from udpipe
library(lattice)
stats <- txt_freq(covid$Title)

structure(list(key = structure(35:30, .Label = c("Drinking water pollutants may affect the immune system: concerns regarding COVID-19 health effects.", 
"Assessment of water and sanitation systems at Palestinian healthcare facilities: pre- and post-COVID-19.", 
"Severity of COVID-19 at elevated exposure to perfluorinated alkylates.", 
"Collective aeromedical transport of COVID-19 critically ill patients in Europe: A retrospective study.", 
"Pediatric E.N.T. emergencies during COVID-19 pandemic: our experience.", 
"A comprehensive risk assessment of toxic elements in international brands of face foundation powders", 
"Just the Facts: Recommendations on point-of-care ultrasound use and machine infection control during the coronavirus disease 2019 pandemic", 
"Cysteine focused covalent inhibitors against the main protease of SARS-CoV-2", 
"Disinfection of gloved hands during the COVID-19 pandemic", 
"Peroral endoscopy during the COVID-19 pandemic: Efficacy of the acrylic box (Endo-Splash Protective (ESP) box) for preventing droplet transmission", 
"What we need to know about PPE associated with the COVID-19 pandemic in the marine environment", 
"Plastic residues produced with confirmatory testing for COVID-19: Classification, quantification, fate, and impacts on human health", 
"Environmentally friendly non-medical mask: An attempt to reduce the environmental impact from used masks during COVID 19 pandemic", 
"Endovascular treatment of 404 intracranial aneurysms treated with nexus detachable coils: short-term and mid-term results from a prospective, consecutive, European multicenter study.", 
"A new system of microwave ablation at 2450 MHz: preliminary experience.", 
"In vitro measurement of the permeability of endovascular coils deployed in cerebral aneurysms.", 
"Cytokine storm intervention in the early stages of COVID-19 pneumonia.", 
"Simple, Low-Cost and Long-Lasting Film for Virus Inactivation Using Avian Coronavirus Model as Challenge.", 
"COVID-19 Pandemic Repercussions on the Use and Management of Plastics.", 
"Plexiglas barrier box to improve ERCP safety during the COVID-19 pandemic.", 
"A Surface Coating that Rapidly Inactivates SARS-CoV-2.", "Coronaviruses widespread on nonliving surfaces: important questions and promising answers.", 
"Comparison analysis of different swabs and transport mediums suitable for SARS-CoV-2 testing following shortages.", 
"Bacillus Calmette-Guerin (BCG) vaccine generates immunoregulatory cells in the cervical lymph nodes in guinea pigs injected intra dermally.", 
"Long-Lasting, Patient-Controlled, Procedure-Free Contraception: A Review of Annovera with a Pharmacist Perspective.", 
"Impact of COVID-19 pandemic on waste management.", "Clinical characteristics and outcome of hemodialysis patients with COVID-19: a large cohort study in a single Chinese center", 
"A Chemoenzymatic Synthesis of the (RP)-Isomer of the Antiviral Prodrug Remdesivir", 
"Concerns and strategies for wastewater treatment during COVID-19 pandemic to stop plausible transmission", 
"An epidemiological study to assess the prevalence of diabetic peripheral neuropathic pain among adults with diabetes attending private and institutional outpatient clinics in South Africa.", 
"Breath analysis for detection of viral infection, the current position of the field.", 
"Air care: an 'aerography' of breath, buildings and bugs in the cystic fibrosis clinic.", 
"The Perils of Covid-19 for Otorhinolaryngologists: An Overview.", 
"Paracervical blocks facilitate timely brachytherapy amidst COVID-19.", 
"Telemedicine in Otolaryngology in the COVID-19 Era: Initial Lessons Learned."
), class = "factor"), freq = c(1L, 1L, 1L, 1L, 1L, 1L), freq_pct = c(2.85714285714286, 
2.85714285714286, 2.85714285714286, 2.85714285714286, 2.85714285714286, 
2.85714285714286)), row.names = c(NA, 6L), class = "data.frame")

It is not concatenating all the titles in the column to find the most common words; it seems to treat each title (row) as a single unique value. I then tried the solution from "R extract most common word(s) / ngrams in a column by group", but I got stuck at the very beginning with this error: "Error in group_by(., group) : object 'topic_modelling' not found".

Can someone give me some advice?

CodePudding user response:

library(tidyverse)
library(tidytext)
tibble(covid) %>%
  unnest_tokens(words, Title)%>%
  count(words, sort = TRUE)%>%
  slice_max(n = 20, order_by = n)

# A tibble: 20 × 2
   words        n
   <chr>    <int>
 1 of          26
 2 the         23
 3 19          19
 4 covid       19
 5 and         13
 6 in          13
 7 a           10
 8 pandemic    10
 9 during       7
10 for          7
11 to           6
12 with         6
13 on           5
14 an           4
15 study        4
16 2            3
17 at           3
18 box          3
19 cov          3
20 sars         3

Looking at your list, you most likely want to remove stop words:

library(tm)

token_counts <- tibble(covid) %>%
  mutate(Title = tm::removeWords(Title, tm::stopwords()),
         Title = str_squish(str_trim(Title)))%>%
  tidytext::unnest_tokens(words, Title)%>%
  count(words, sort = TRUE)%>%
  slice_max(n = 20, order_by = n, with_ties = FALSE)

# A tibble: 20 × 2
   words            n
   <chr>        <int>
 1 19              19
 2 covid           19
 3 pandemic        10
 4 a                6
 5 study            4
 6 2                3
 7 an               3
 8 box              3
 9 cov              3
10 sars             3
11 analysis         2
12 aneurysms        2
13 assessment       2
14 breath           2
15 care             2
16 coils            2
17 concerns         2
18 coronavirus      2
19 endovascular     2
20 experience       2
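
Note that a and an still show up in this output. tm::removeWords() matches case-sensitively, so a capitalised "A" or "An" at the start of a title is not removed and only gets lowercased later by unnest_tokens(). One possible workaround, sketched here as an assumption rather than the only fix, is to tokenise first and then anti_join() against tidytext's stop_words table (a different stop word list from tm::stopwords(), so the counts will not match the table above exactly). The three titles are copied from the question's data so the snippet runs on its own; token_counts_clean is an illustrative name.

```r
library(dplyr)
library(tidytext)

# Three titles copied from the question's data; swap in tibble(covid)
# to run this on the full data frame.
titles <- tibble(Title = c(
  "A Surface Coating that Rapidly Inactivates SARS-CoV-2.",
  "Impact of COVID-19 pandemic on waste management.",
  "The Perils of Covid-19 for Otorhinolaryngologists: An Overview."
))

token_counts_clean <- titles %>%
  unnest_tokens(words, Title) %>%                       # lowercases and strips punctuation
  anti_join(stop_words, by = c("words" = "word")) %>%   # drop stop words after lowercasing
  count(words, sort = TRUE)

token_counts_clean
```

Lowercasing Title with tolower() before calling tm::removeWords() should have a similar effect if you prefer to keep the tm stop word list.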

You can then plot token_counts in a variety of ways, for example:

token_counts %>%
  ggplot(aes(y = reorder(words, n), x = n)) +
  geom_col() +
  theme_bw() +
  labs(x = "Word count", y = NULL)

For n-grams, either change the unnest_tokens() call's arguments to token = "ngrams", n = 2, or just use unnest_ngrams():

ngram_counts <- tibble(covid) %>%
  tidytext::unnest_ngrams(bigrams, Title, n = 2)%>%
  count(bigrams, sort = TRUE)%>%
  slice_max(n = 20, order_by = n, with_ties = FALSE)

# A tibble: 20 × 2
   bigrams               n
   <chr>             <int>
 1 covid 19             19
 2 19 pandemic           9
 3 in the                5
 4 of covid              5
 5 of the                5
 6 the covid             5
 7 during the            4
 8 cov 2                 3
 9 during covid          3

Output clipped for brevity. You may also want to remove stop words from the bigrams, create your own stop word list, or run other cleaning steps.
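
As a rough sketch of that clean-up, the helper below counts n-grams of any size (so it also covers trigrams) and discards every n-gram that contains at least one stop word. top_ngrams() is a hypothetical helper written for this answer, not a tidytext function, and it assumes tidytext's stop_words list is an acceptable stop word list for your data. The three titles are copied from the question so the snippet runs on its own.

```r
library(dplyr)
library(tidytext)

# Three titles copied from the question's data; swap in tibble(covid)
# to run this on the full data frame.
titles <- tibble(Title = c(
  "Impact of COVID-19 pandemic on waste management.",
  "Disinfection of gloved hands during the COVID-19 pandemic",
  "Plexiglas barrier box to improve ERCP safety during the COVID-19 pandemic."
))

# Hypothetical helper: the `top` most frequent n-grams with no stop words in them
top_ngrams <- function(df, n_words, top = 20) {
  df %>%
    unnest_tokens(ngram, Title, token = "ngrams", n = n_words) %>%
    filter(!is.na(ngram)) %>%                  # titles shorter than n_words give NA
    filter(!vapply(strsplit(ngram, " "),       # split each n-gram into its words
                   function(w) any(w %in% stop_words$word),
                   logical(1))) %>%            # TRUE when any word is a stop word
    count(ngram, sort = TRUE) %>%
    slice_max(order_by = n, n = top, with_ties = FALSE)
}

bigrams  <- top_ngrams(titles, n_words = 2)
trigrams <- top_ngrams(titles, n_words = 3)

bigrams
trigrams
```

Either result can be plotted the same way as the single-word counts, using ngram in place of words in aes().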
