How can I add the country name to a dataset based on city name and population?

Time:07-13

I have a dataset containing information on a range of cities, but there is no column which says what country the city is located in. In order to perform the analysis, I need to add an extra column which has the name of the country.

population    city
500,000       Oslo
750,000       Bristol
500,000       Liverpool
1,000,000     Dublin

I expect the output to look like this:

population    city          country
500,000       Oslo          Norway
750,000       Bristol       England
500,000       Liverpool     England
1,000,000     Dublin        Ireland 

How can I add a column of country names based on the city and population to a large dataset in R?

CodePudding user response:

I am adapting Tom Hoel's answer, as suggested by Ian Campbell. If this is selected I am happy to mark it as community wiki.


library(maps)
library(dplyr)
data("world.cities")

df <- readr::read_table("population    city
500,000       Oslo
750,000       Bristol
500,000       Liverpool
1,000,000     Dublin")


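# Join every world.cities entry that shares the city name, then keep,
# for each city, the candidate whose population is closest to ours.
df |>
  inner_join(
    select(world.cities, name, country.etc, pop),
    by = c("city" = "name")
  ) |>
  group_by(city) |>
  filter(
    abs(pop - population) == min(abs(pop - population))
  )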
        
# A tibble: 4 x 4
# Groups:   city [4]
#   population city      country.etc     pop
#        <dbl> <chr>     <chr>         <int>
# 1     500000 Oslo      Norway       821445
# 2     750000 Bristol   UK           432967
# 3     500000 Liverpool UK           468584
# 4    1000000 Dublin    Ireland     1030431
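
If you want exactly the three columns from the question, the same idea can be finished off as below. This is only a sketch: it swaps `filter()` for `slice_min()` with `with_ties = FALSE` so that each city keeps a single row even when two candidates are equally close, and it builds `df` by hand instead of parsing text. The name `result` is my own. Note that `world.cities` labels Bristol and Liverpool as UK, as the output above shows, so you will not get "England" from this dataset.

```r
library(maps)
library(dplyr)
data("world.cities")

# Same data as in the question, built directly so no text parsing is involved
df <- data.frame(
  population = c(500000, 750000, 500000, 1000000),
  city = c("Oslo", "Bristol", "Liverpool", "Dublin")
)

result <- df |>
  inner_join(
    select(world.cities, name, country.etc, pop),
    by = c("city" = "name")
  ) |>
  group_by(city) |>
  # keep only the candidate with the smallest population difference;
  # with_ties = FALSE guarantees one row per city
  slice_min(abs(pop - population), n = 1, with_ties = FALSE) |>
  ungroup() |>
  # drop the helper column and rename to match the requested layout
  select(population, city, country = country.etc)

result
```

Cities that have no match in `world.cities` are dropped by `inner_join()`; use `left_join()` instead if you would rather keep them with an `NA` country.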

CodePudding user response:

As others have pointed out, these city names also exist in other countries, so joining on the name alone returns several candidates for some cities.

library(tidyverse)
library(maps)

data("world.cities")

df <- read_table("population    city
500,000       Oslo
750,000       Bristol
500,000       Liverpool
1,000,000     Dublin")

df %>% 
  merge(., world.cities %>%
          select(name, country.etc), 
        by.x = "city", 
        by.y = "name") 

# A tibble: 7 × 3
#   city      population country.etc
#   <chr>          <dbl> <chr>
# 1 Bristol       750000 UK
# 2 Bristol       750000 USA
# 3 Dublin       1000000 USA
# 4 Dublin       1000000 Ireland
# 5 Liverpool     500000 UK
# 6 Liverpool     500000 Canada
# 7 Oslo          500000 Norway
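
To see how widespread the problem is before deciding how to resolve it, you could count the distinct countries per city name. This is a rough sketch; `cities`, `ambiguity`, `n_countries` and `ambiguous` are names I made up for illustration.

```r
library(maps)
library(dplyr)
data("world.cities")

# City names taken from the question
cities <- c("Oslo", "Bristol", "Liverpool", "Dublin")

ambiguity <- world.cities |>
  filter(name %in% cities) |>
  group_by(name) |>
  # number of different countries that contain a city with this name
  summarise(n_countries = n_distinct(country.etc)) |>
  # a name is ambiguous when it maps to more than one country
  mutate(ambiguous = n_countries > 1)

ambiguity
```

Names flagged as ambiguous need an extra tie-breaker such as population, while the rest can be joined on the name alone.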

CodePudding user response:

I think your best bet would be to add a new column called country to your dataset and fill it in yourself. This falls under the data preparation phase of the CRISP-DM process, so it is not uncommon. If that does not answer your question, please let me know and I will do my best to help.
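
For a handful of cities, filling the column in by hand could look something like this in base R. It is a minimal sketch: `country_lookup` is a hypothetical hand-maintained vector, and its values are copied from the expected output in the question.

```r
df <- data.frame(
  population = c(500000, 750000, 500000, 1000000),
  city = c("Oslo", "Bristol", "Liverpool", "Dublin")
)

# Hand-maintained lookup: names are cities, values are countries
# (values copied from the expected output in the question)
country_lookup <- c(
  Oslo      = "Norway",
  Bristol   = "England",
  Liverpool = "England",
  Dublin    = "Ireland"
)

# Index the lookup by city name; cities missing from it become NA
df$country <- unname(country_lookup[as.character(df$city)])

df
```

This gives you full control over the labels (for example "England" rather than "UK"), but it does not scale well to a large dataset unless the lookup covers every city.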
