Counting the occurrences of complex strings

Time:05-23

I have a data frame containing country and year information entered by surveyors. The problem is that the form allows multiple countries to be entered, and people have done so in different formats (e.g. separated by a space, a comma or "and"). I have a predefined list of countries. I would like to count the number of times each country occurs in each year by recognising the strings from the predefined list. Here is what my data frames look like:

Countries_predefined <- as.data.frame(c("Brazil", "Chile", "United States", "United Kingdom"))
colnames(Countries_predefined) <- "Country"

Surveyor_form <- as.data.frame(c("Brazil", "Brazil and United States", "United States United Kingdom", "Brazil, United Kingdom, United States"))
colnames(Surveyor_form) <- "Country"
Surveyor_form$Year <- c("1999", "1999", "2000", "2000")

I would like the final output to look like:

Country           1999    2000
Brazil              2       1
Chile               0       0
United States       1       2
United Kingdom      0       2

I read in the question "R dplyr: Filter data by multiple Regex expressions defined by vector" how to filter a data frame by a certain string, and I have had success using the regex.escape function below to recognise whether certain words are present. However, I can't think how to apply this to count the number of countries per year.

regex.escape <- function(string) {
  gsub("([][{}() *^$|\\\\?.])", "\\\\\\1", string)
}

CodePudding user response:

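You can try the following. For each predefined country, grep() finds the matching rows of Surveyor_form, Map() collects the years of those rows, and stack() with table() turns the result into counts per country and year.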

t(table(stack(Map(function(x) Surveyor_form$Year[grep(x,Surveyor_form$Country)],
  Countries_predefined$Country))))
#                values
#ind              1999 2000
#  Brazil            2    1
#  Chile             0    0
#  United States     1    2
#  United Kingdom    0    2
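
As a possible variation, here is a minimal base R sketch that matches the country names literally with fixed = TRUE, so no regex escaping is needed. It rebuilds the sample data from the question under the names countries and surveyor_form, which are my own choices for this illustration and not part of the original code.

```r
# Sample data from the question (object names are illustrative).
countries <- c("Brazil", "Chile", "United States", "United Kingdom")
surveyor_form <- data.frame(
  Country = c("Brazil", "Brazil and United States",
              "United States United Kingdom",
              "Brazil, United Kingdom, United States"),
  Year = c("1999", "1999", "2000", "2000"),
  stringsAsFactors = FALSE
)

# One logical column per country: TRUE where the name occurs in the entry.
# fixed = TRUE treats the name as a literal string, not a regular expression.
hits <- sapply(countries, function(x) {
  grepl(x, surveyor_form$Country, fixed = TRUE)
})

# Sum the hits within each year (one row per year), then transpose
# so that countries are rows and years are columns.
counts <- t(rowsum(hits * 1L, group = surveyor_form$Year))
counts
```

Note that this is plain substring matching, just like the grep() approach: a name that is contained in another name (for example "Niger" inside "Nigeria") would be counted for both, so a list with such overlaps would need word-boundary handling.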