How can I change some of the names of variables within columns?


I have a dataset with a column (Continent), and I wish to rename some of the values within this column. How would I do this? Data and example below.

These are currently the continents for the countries in my dataset: Africa, Americas, Eastern Mediterranean, Europe, South East Asia, Western Pacific. I wish to rename them so that Australia would take Oceana instead of Western Pacific, and Afghanistan would take Asia instead of Eastern Mediterranean.

Part of my dataset here; head(all_data,3)

     Country Year             Continent Life_Expectancy 
1 Afghanistan 2010 Eastern Mediterranean        61.17996                 
2 Afghanistan 2011 Eastern Mediterranean        61.72234       
3 Afghanistan 2012 Eastern Mediterranean        62.20652        


      Country Year Continent Life_Expectancy 
4705 Zimbabwe 2010    Africa        52.91785          

CodePudding user response:

With case_when you can add more conditions as needed:

library(dplyr)

# Recode Continent for specific countries; every other row keeps its value
df %>% 
  mutate(Continent = case_when(Country == "Afghanistan" ~ "Asia",
                               Country == "Australia" ~ "Oceana",
                               TRUE ~ Continent))

      Country Year Continent Life_Expectancy
1 Afghanistan 2010      Asia        61.17996
2 Afghanistan 2011      Asia        61.72234
3 Afghanistan 2012      Asia        62.20652
4   Australia 2012    Oceana        43.22200
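
When the list of corrections grows, the chain of conditions gets long. A minimal sketch of one alternative, assuming base R only: keep the corrections in a named character vector and look each country up in it. The names `fixes`, `toy` and `hit` and the toy rows are made up for illustration and are not part of the original data.

```r
# Hypothetical lookup: names are countries, values are the new continent labels
fixes <- c(Afghanistan = "Asia", Australia = "Oceana")

# Small made-up data set standing in for the real one
toy <- data.frame(
  Country   = c("Afghanistan", "Australia", "Zimbabwe"),
  Continent = c("Eastern Mediterranean", "Western Pacific", "Africa"),
  stringsAsFactors = FALSE
)

# Rows whose Country appears in the lookup get the new label;
# all other rows keep their existing Continent
hit <- toy$Country %in% names(fixes)
toy$Continent[hit] <- unname(fixes[toy$Country[hit]])

toy
```

Adding another correction then only means adding one more entry to `fixes`, rather than another condition.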


df <- structure(list(Country = c("Afghanistan", "Afghanistan", "Afghanistan", 
"Australia"), Year = c(2010L, 2011L, 2012L, 2012L), Continent = c("Eastern Mediterranean", 
"Eastern Mediterranean", "Eastern Mediterranean", "Western pacific"
), Life_Expectancy = c(61.17996, 61.72234, 62.20652, 43.222)), class = "data.frame", row.names = c("1", 
"2", "3", "4"))

CodePudding user response:



library(data.table)
setDT(df)  # convert the data.frame to a data.table by reference
df[Country == 'Afghanistan', Continent := 'Asia'
   ][Country == 'Australia', Continent := 'Oceana']

Any Country not matched by the conditions above keeps its original Continent value. Also note that later statements take precedence over earlier ones.
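
Both points can be checked on a small example. This is only a sketch on made-up rows (the name `toy` and its values are hypothetical), in which the first statement deliberately matches Australia as well:

```r
library(data.table)

# Toy data (hypothetical values) to show the two behaviours described above
toy <- data.table(Country   = c("Afghanistan", "Australia", "Zimbabwe"),
                  Continent = c("Eastern Mediterranean", "Western Pacific", "Africa"))

# Both statements match Australia; the second one runs last, so it wins
toy[Country %in% c("Afghanistan", "Australia"), Continent := "Asia"
    ][Country == "Australia", Continent := "Oceana"]

toy[]  # Zimbabwe is untouched, Australia ends up as "Oceana"
```

This is the reverse of case_when, where the first matching condition wins, so the order of statements matters when translating between the two styles.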

The advantage of this method is speed (scalability). In our benchmark with 20 million rows:

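# packages used by the benchmark
library(data.table)
library(dplyr)
library(microbenchmark)

# dummy data: 2 values repeated 1e7 times each = 20 million rows
x <- 1e7

df <- data.table(Country = rep(c('Afghanistan', 'Australia'), x)
                 , Continent = rep(c('x', 'y'), x)
                 ); df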

# benchmark

xx <-
microbenchmark(dplyr_case = {df %>%
                                mutate(Continent = case_when(Country == "Afghanistan" ~ "Asia"
                                                             , Country == "Australia" ~ "Oceana"
                                                             , TRUE ~ Continent
                                                             ))}
               , dt_subset = {df[Country == 'Afghanistan', Continent := 'Asia'
                                 ][Country == 'Australia', Continent := 'Oceana']}
               , times = 10
               )

# plot


  •  Tags:  
  • r