I have a very large and messy dataset containing both country names and regions in a column named 'country.' I need to eliminate the regions, but leave the countries. Fortunately, the regions are written in all uppercase letters, so they can be distinguished from the countries, which only have one uppercase letter at the beginning.
How can I remove the rows whose data$country entries are written entirely in uppercase letters?
Here is an example of my dataset:
data <- data.frame(year = rep(1990, 44),
country = c('SUB-SAHARAN AFRICA',
'Eastern Africa',
'Burundi',
'Comoros',
'Djibouti',
'Eritrea',
'Ethiopia',
'Kenya',
'Madagascar',
'Malawi',
'Mauritius',
'Mayotte',
'Mozambique',
'Réunion',
'Rwanda',
'Seychelles',
'Somalia',
'South Sudan',
'Uganda',
'United Republic of Tanzania',
'Zambia',
'Zimbabwe',
'Middle Africa',
'Angola',
'Cameroon',
'Central African Republic',
'Chad',
'Congo',
'Democratic Republic of the Congo',
'Equatorial Guinea',
'Gabon',
'Sao Tome and Principe',
'Southern Africa',
'Botswana',
'Eswatini',
'Lesotho',
'Namibia',
'South Africa',
'Western Africa',
'Benin',
'Burkina Faso',
'CAPITAL FOR EXAMPLE SAKE',
'CAPITAL FOR EXAMPLE SAKE',
'CAPITAL FOR EXAMPLE SAKE'),
entry = c(123, rep(0, 31), 123, rep(0, 3), 64, 59, rep(0, 6)))
I tried using the grepl function, as this post advised...
dropped <- data[!grepl("^[A-Z ]+$", data$country), drop = TRUE]
...however, I get the following error:
Error in `[.data.frame`(data, !grepl("^[A-Z ]+$", data$country), drop = TRUE) :
  undefined columns selected
In addition: Warning message:
In `[.data.frame`(data, !grepl("^[A-Z ]+$", data$country), drop = TRUE) :
  'drop' argument will be ignored
How can I remove these rows?
Answer:
Use grepl and take a subset:
data <- data[!grepl("^[A-Z]+(?:[ -][A-Z]+)*$", data$country), ]
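A note on why the attempt in the question fails: in data[rows, drop = TRUE] there is no comma after the row index, so R treats the logical vector as a column selector, which is what produces "undefined columns selected". As a minimal sketch of a regex-free alternative (not the approach from the answer above), an entry can be treated as a region when upper-casing it changes nothing. The small data frame and the names is_region and countries below are stand-ins chosen for illustration, with values copied from the question's example.

```r
# Small stand-in for the question's data (values copied from the example).
data <- data.frame(
  year    = rep(1990, 5),
  country = c("SUB-SAHARAN AFRICA", "Eastern Africa", "Burundi",
              "Réunion", "CAPITAL FOR EXAMPLE SAKE"),
  entry   = c(123, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Assumption: an entry is a region if it is already entirely uppercase,
# i.e. converting it to uppercase leaves it unchanged.
is_region <- data$country == toupper(data$country)

# Row subsetting needs the trailing comma: data[rows, ].
countries <- data[!is_region, ]
```

On this sample, the two all-uppercase rows are dropped and the three mixed-case rows remain. Because the comparison does not depend on a character class, hyphens, spaces, and accented letters need no special handling. Note that it would also flag entries with no letters at all (for example an empty string), since those are unchanged by toupper as well.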