I have a dataframe with strings in a column. How can I extract only uppercase substrings before a number and add them to another column? An example is for DE, but there are many more country abbreviations and they always appear before a number.
TD<-data.frame(a=c("WHATEVERDE 11111","","Whatever DE 11111","DE 11111",""),
b=c("","What DE EverDE 1111","","",""),
c=c("Whatever","","","","WhateverDE 11111"))
And I would like to create another column as follows:
> TD
a b c result
1 WHATEVERDE 11111 Whatever DE
2 What DE EverDE 1111 DE
3 Whatever DE 11111 DE
4 DE 11111 DE
5 WhateverDE 11111 DE
I tried to apply the solution:
sub("^([[:alpha:]]*).*", "\\1", "DE 11111") but is not universal.
Vector with abbreviations:
names<-c('AT','BE','DE','BG','CZ','DK','FR','GR','ES','NL','HU','GB','IT')
CodePudding user response:
We loop across
the columns, extract the 2 letter uppercase countrycode substring that precedes zero or more spaces and one or more digits, coalesce
the output so that it returns the first non-NA extracted element per row
library(dplyr)
library(stringr)
library(purrr)
library(countrycode)
pat <- countrycode::codelist %>%
pull(iso2c) %>%
na.omit %>%
str_c(collapse = "|") %>%
sprintf(fmt = "(%s)(?=\\s*\\d )")
TD %>%
mutate(result = invoke(coalesce,
across(everything(), ~ str_extract(., pat))))
-output
a b c result
1 WHATEVERDE 11111 Whatever DE
2 What DE EverDE 1111 DE
3 Whatever DE 11111 DE
4 DE 11111 DE
5 WhateverDE 11111 DE