I'd like to join two dataframes that list different countries. However, the countries' names are not consistent. One df has a single column for each countries' name, and the other has multiple columns with different names for that country. e.g.
df1 <-
tibble(
country = c("United Kingdom", "USA", "China", "India", "Russia"),
value = c(1:5))
df2 <-
tibble(
admin_full = c(
"United Kingdom",
"United States of America",
"People's Republic of China",
"India",
"Russian Federation"),
admin_short = c("UK", "United States", "China", "India", "Russia"),
admin_abbreviated = c("UK", "USA", "PRC", "India", "Russia"),
group = c("High", "High", "Medium", "Low", "Low"))
I would like to join the dfs where df1 matches any of the names in df2. So
- UK should match on admin_full
- USA should match on admin_abbreviated
- China should match on admin_short
- India should match on admin_full, admin_short and/or admin_abbreviated
- Russia should match on admin_short and/or admin_abbreviated
I'm working with a list of ~200 countries so I don't want to have to go through an manually pick out which one of a dozen columns has the correct format to match that particular country on. I want to match where the name in df1 matches any one of multiple columns in df2.
CodePudding user response:
It sounds like any of the three admin_
columns could match, so my approach here is to make an extra column from the admin_full
(a key for that country) and bring the three variations into different rows in a country
column. Then we can join the two tables, and use distinct
to keep only one match for each.
library(dplyr)
left_join(df1,
df2 %>%
mutate(full_name = admin_full) %>%
pivot_longer(-c(full_name, group), values_to ="country")
) %>%
distinct(country, value, group, full_name)
Result
# A tibble: 5 × 4
country value group full_name
<chr> <int> <chr> <chr>
1 United Kingdom 1 High United Kingdom
2 USA 2 High United States of America
3 China 3 Medium People's Republic of China
4 India 4 Low India
5 Russia 5 Low Russian Federation
Or if you don't need a "canonical" name, you could shorten:
left_join(df1, pivot_longer(df2, -group, values_to ="country")) %>%
distinct(country, value, group)
# A tibble: 5 × 3
country value group
<chr> <int> <chr>
1 United Kingdom 1 High
2 USA 2 High
3 China 3 Medium
4 India 4 Low
5 Russia 5 Low