Removing rows from data frame containing strictly uppercase letters (in a specified column) using R?

Time:02-16

I have a very large and messy dataset containing both country names and regions in a column named 'country.' I need to eliminate the regions, but leave the countries. Fortunately, the regions are written in all uppercase letters, so they can be distinguished from the countries, which only have one uppercase letter at the beginning.

How can I remove the rows whose data$country entry is written entirely in uppercase letters?

Here is an example of my dataset:

data <- data.frame(year=c(1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990,
                             1990),
                   country = c('SUB-SAHARAN AFRICA',
                               'Eastern Africa',
                               'Burundi',
                               'Comoros',
                               'Djibouti',
                               'Eritrea',
                               'Ethiopia',
                               'Kenya',
                               'Madagascar',
                               'Malawi',
                               'Mauritius',
                               'Mayotte',
                               'Mozambique',
                               'Réunion',
                               'Rwanda',
                               'Seychelles',
                               'Somalia',
                               'South Sudan',
                               'Uganda',
                               'United Republic of Tanzania',
                               'Zambia',
                               'Zimbabwe',
                               'Middle Africa',
                               'Angola',
                               'Cameroon',
                               'Central African Republic',
                               'Chad',
                               'Congo',
                               'Democratic Republic of the Congo',
                               'Equatorial Guinea',
                               'Gabon',
                               'Sao Tome and Principe',
                               'Southern Africa',
                               'Botswana',
                               'Eswatini',
                               'Lesotho',
                               'Namibia',
                               'South Africa',
                               'Western Africa',
                               'Benin',
                               'Burkina Faso',
                               'CAPITAL FOR EXAMPLE SAKE',
                               'CAPITAL FOR EXAMPLE SAKE',
                               'CAPITAL FOR EXAMPLE SAKE'),
                   entry = c(123,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0,
                             123,
                             0,
                             0,
                             0,
                             64,
                             59,
                             0,
                             0,
                             0,
                             0,
                             0,
                             0))

I tried using the grepl function, as this post advised...

dropped <- data[!grepl("^[A-Z ]+$", data$country), drop = TRUE]

...however, I get the following error:

Error in `[.data.frame`(data, !grepl("^[A-Z ]+$", data$country), drop = TRUE) : 
  undefined columns selected
In addition: Warning message:
In `[.data.frame`(data, !grepl("^[A-Z ]+$", data$country), drop = TRUE) :
  'drop' argument will be ignored

How can I remove these rows?

CodePudding user response:

Use grepl and take a subset:

data <- data[!grepl("^[A-Z]+(?:[ -][A-Z]+)*$", data$country), ]
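
As far as I can tell, the error in the question comes from the subsetting call rather than from grepl itself: `drop = TRUE` is passed as a named argument, so only one index reaches `[.data.frame`, and a single index is treated as a column selector. A 44-element logical vector against 3 columns then gives "undefined columns selected". Writing the row index, then a comma, then an empty column index avoids that. Here is a minimal sketch on a shortened stand-in for the data; the names `is_region`, `countries`, `is_upper` and `countries_alt` are made up for illustration, and the toupper() variant is my own alternative, not part of the answer above.

```r
# Shortened stand-in for the question's data frame (illustrative values only).
data <- data.frame(
  year    = rep(1990, 6),
  country = c("SUB-SAHARAN AFRICA", "Eastern Africa", "Burundi",
              "Sao Tome and Principe", "South Sudan",
              "CAPITAL FOR EXAMPLE SAKE"),
  entry   = c(123, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# TRUE where the name is made up only of uppercase ASCII words
# joined by single spaces or hyphens.
is_region <- grepl("^[A-Z]+(?:[ -][A-Z]+)*$", data$country, perl = TRUE)

# Row index, comma, empty column index: keep every column, drop region rows.
countries <- data[!is_region, ]

# Alternative sketch (an assumption, not from the answer): treat a name as
# a region if it contains a letter and is unchanged by toupper().
is_upper <- data$country == toupper(data$country) &
  grepl("[[:alpha:]]", data$country)
countries_alt <- data[!is_upper, ]
```

The regex version only recognises A to Z, so an all-caps name containing an accented letter would slip through; the toupper() comparison may catch those as well, depending on the locale.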