Most commonly mentioned countries in the corpus; extracting country names from abstracts R

Time:10-08

I have a corpus of a couple of thousand documents and I'm trying to find the most commonly mentioned countries in the abstracts.

The library countrycode seems to have a comprehensive list of country names I can match against:

# country.name.alt shows multiple potential namings for 'Congo' (yay!):
install.packages("countrycode")
library(dplyr)  # provides filter()
countrycode::countryname_dict |> filter(grepl('congo', tolower(country.name.alt)))
# Also seems to work for ones like "China"/"People's Republic of China"

A reprex of the data looks something like this:

df <- data.frame(entry_number = 1:5,
                 text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
                          "More text that might contain myanmar or burma, as well as thailand",
                          "sentences that do not contain a country name can be returned as NA",
                          "some variant of U.S or the united states",
                          "something with an accent samóoa"))

I want to reduce each entry in the column "text" to contain only a country name. Ideally something like this (note the repeat entry number):

desired_df <- data.frame(entry_number = c(1, 2, 2, 3, 4, 5),
                     text = c("congo",
                              "myanmar",
                              "thailand",
                              NA,
                              "united states",
                              "samoa"))

I've tried str_extract and various other approaches without success. The corpus is in English, but the international alphabets included in countrycode::countryname_dict$country.name.alt throw regex errors. countrycode::countryname_dict$country.name.alt contains all the alternatives that countrycode::countryname_dict$country.name.en does not...
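One possible cause of those regex errors (a guess, since the error message isn't shown) is that some names contain regex metacharacters such as parentheses or dots. A minimal sketch of escaping names before collapsing them into a pattern; `escape_regex()` is a hypothetical helper and the sample names are made up, not taken from the dictionary:

```r
# Hypothetical helper: backslash-escape regex metacharacters so that
# each name is matched literally.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Made-up names standing in for countrycode::countryname_dict$country.name.alt
names_raw  <- c("congo (kinshasa)", "u.s.", "thailand")
names_safe <- escape_regex(names_raw)
pattern    <- paste(names_safe, collapse = "|")

hit  <- grepl(pattern, "a visit to congo (kinshasa) last year", perl = TRUE)
miss <- grepl(pattern, "nothing relevant here", perl = TRUE)
```

Without the escaping step, "u.s." would be read as "u, any character, s, any character", and an unbalanced parenthesis in a name would make the whole pattern fail to compile.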

I'm open to any approach (dplyr, data.table...) that answers the initial question of how many times each country is mentioned in the corpus. The only requirement is that it is as robust as possible to different potential country names, accents and any other hidden catches!
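For the accent part, one option is to fold both the text and the dictionary to plain ASCII before matching. A minimal sketch, assuming the stringi package is installed; `normalize_text()` is a hypothetical helper name and the sample strings are illustrative:

```r
library(stringi)

# Hypothetical helper: lower-case the text, then strip diacritics
# using ICU's "Latin-ASCII" transliterator.
normalize_text <- function(x) {
  stri_trans_general(stri_trans_tolower(x), "Latin-ASCII")
}

folded <- normalize_text(c("Côte d'Ivoire", "São Tomé", "Curaçao"))
```

Folding only removes diacritics. It turns "samóoa" into "samooa", which still is not "samoa", so misspellings like that one would need some form of fuzzy matching on top.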

Thanks community!

P.S. I have reviewed the following questions but had no luck with my own example:

CodePudding user response:

This seems to work well on the example data.

library(tidyverse)

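# Keep only the alternative names that contain Latin letters, lower-cased
all_country <- countrycode::countryname_dict %>% 
                  filter(grepl('[A-Za-z]', country.name.alt)) %>%
                  pull(country.name.alt) %>% 
                  tolower()
# One big alternation pattern: name1|name2|...
pattern <- str_c(all_country, collapse = '|')

# Extract every match per entry, one row per match (NA when nothing matches)
df %>%
  mutate(country = str_extract_all(tolower(text), pattern)) %>%
  select(-text) %>%
  unnest(country, keep_empty = TRUE)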

#  entry_number country                     
#         <int> <chr>                       
#1            1 congo                       
#2            1 democratic republic of congo
#3            2 myanma                      
#4            2 burma                       
#5            2 thailand                    
#6            3 NA                          
#7            4 united states               
#8            5 samóoa                 
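Two things stand out in that output: "myanma" is a partial match inside "myanmar", and the alternatives are not mapped back to one name per country, so they cannot be counted directly. A possible refinement, sketched here with a small hypothetical stand-in for countrycode::countryname_dict (same column names, illustrative rows whose values may differ from the real dictionary) and a made-up `docs` data frame:

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical miniature dictionary; the real one is countrycode::countryname_dict.
dict <- data.frame(
  country.name.en  = c("Congo - Kinshasa", "Congo - Brazzaville", "Myanmar (Burma)",
                       "Myanmar (Burma)", "Myanmar (Burma)", "Thailand", "United States"),
  country.name.alt = c("democratic republic of congo", "congo", "myanmar",
                       "myanma", "burma", "thailand", "united states")
)

# Longest alternatives first, and \\b word boundaries so that "myanma"
# cannot match inside "myanmar".
alts    <- dict$country.name.alt[order(-nchar(dict$country.name.alt))]
pattern <- paste0("\\b(", paste(alts, collapse = "|"), ")\\b")

docs <- data.frame(
  entry_number = 1:3,
  text = c("contains congo or democratic republic of congo",
           "myanmar or burma, as well as thailand",
           "no country name here")
)

# One row per (entry, canonical country); NA when nothing matched.
mentions <- docs %>%
  mutate(country.name.alt = str_extract_all(tolower(text), pattern)) %>%
  select(-text) %>%
  unnest(country.name.alt, keep_empty = TRUE) %>%
  left_join(dict, by = "country.name.alt") %>%
  distinct(entry_number, country.name.en)

# Number of entries mentioning each country.
counts <- mentions %>%
  filter(!is.na(country.name.en)) %>%
  count(country.name.en, sort = TRUE)
```

Because of distinct(), this counts entries that mention a country rather than individual mentions, so "myanmar" and "burma" in the same abstract count once. Dropping the distinct() step would count every mention instead. With the full dictionary, names containing regex metacharacters would need escaping before being pasted into the pattern.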