I have a dataset with a column (Continent), and I wish to rename some of the values within this column. How would I do this? Data and an example are below.
These are currently the continents for the countries in my dataset: Africa, Americas, Eastern Mediterranean, Europe, South East Asia, Western Pacific. I wish to rename them so that Australia takes Oceana instead of Western Pacific, and Afghanistan takes Asia instead of Eastern Mediterranean.
Part of my dataset here:
head(all_data, 3)
      Country Year             Continent Life_Expectancy
1 Afghanistan 2010 Eastern Mediterranean        61.17996
2 Afghanistan 2011 Eastern Mediterranean        61.72234
3 Afghanistan 2012 Eastern Mediterranean        62.20652
tail(all_data, 1)
      Country Year Continent Life_Expectancy
4705 Zimbabwe 2010    Africa        52.91785
CodePudding user response:
With case_when, which you can extend with more conditions:
library(dplyr)
df %>%
  mutate(Continent = case_when(Country == "Afghanistan" ~ "Asia",
                               Country == "Australia" ~ "Oceana",
                               TRUE ~ Continent))
      Country Year Continent Life_Expectancy
1 Afghanistan 2010      Asia        61.17996
2 Afghanistan 2011      Asia        61.72234
3 Afghanistan 2012      Asia        62.20652
4   Australia 2012    Oceana        43.22200
data:
df <- structure(list(Country = c("Afghanistan", "Afghanistan", "Afghanistan",
"Australia"), Year = c(2010L, 2011L, 2012L, 2012L), Continent = c("Eastern Mediterranean",
"Eastern Mediterranean", "Eastern Mediterranean", "Western pacific"
), Life_Expectancy = c(61.17996, 61.72234, 62.20652, 43.222)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
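As a rough sketch of how the case_when approach extends to more conditions, several countries can be matched in a single condition with %in%. The toy data frame and the extra countries (Pakistan, New Zealand) are assumptions made up for illustration; they are not taken from the original dataset.

```r
library(dplyr)

# Hypothetical toy data: the Pakistan and New Zealand rows are invented for this sketch
toy <- data.frame(
  Country = c("Afghanistan", "Pakistan", "Australia", "New Zealand", "Zimbabwe"),
  Continent = c("Eastern Mediterranean", "Eastern Mediterranean",
                "Western Pacific", "Western Pacific", "Africa"),
  stringsAsFactors = FALSE
)

toy_renamed <- toy %>%
  mutate(Continent = case_when(
    # one condition can cover several countries
    Country %in% c("Afghanistan", "Pakistan") ~ "Asia",
    Country %in% c("Australia", "New Zealand") ~ "Oceana",
    # every other country keeps its current value
    TRUE ~ Continent
  ))

toy_renamed
```

Conditions are evaluated in order, and the first one that matches a row decides its value, so the TRUE fallback belongs last.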
CodePudding user response:
library(data.table)
setDT(df)
df[Country == 'Afghanistan', Continent := 'Asia'
][Country == 'Australia', Continent := 'Oceana'
]
Any Country not covered by the logic above keeps its original Continent value. Also note that later statements take precedence over earlier ones.
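The precedence rule can be illustrated with a small sketch, assuming a made-up three-row table (toy is a hypothetical name). The first update deliberately covers both countries, and the second then overwrites Australia only.

```r
library(data.table)

# Hypothetical toy data for illustration only
toy <- data.table(
  Country = c("Afghanistan", "Australia", "Zimbabwe"),
  Continent = c("Eastern Mediterranean", "Western Pacific", "Africa")
)

# First update: both countries are set to "Asia" (by reference, no copy)
toy[Country %in% c("Afghanistan", "Australia"), Continent := "Asia"]

# Second update: the later statement overwrites the earlier value for Australia
toy[Country == "Australia", Continent := "Oceana"]

# Zimbabwe matched neither filter, so it keeps "Africa"
toy
```

Because := updates by reference, toy itself is changed; use copy(toy) first if the original values are still needed.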
The advantage of this method is speed (scalability). A benchmark with 20 million rows:
# dummy data
x <- 1e7
df <- data.table(Country = rep(c('Afghanistan', 'Australia'), x)
                 , Continent = rep(c('x', 'y'), x)
                 ); df
# benchmark
library(dplyr)
library(data.table)
library(microbenchmark)
library(ggplot2)
xx <-
  microbenchmark(dplyr_case = {df %>%
                     mutate(Continent = case_when(Country == "Afghanistan" ~ "Asia"
                                                  , Country == "Australia" ~ "Oceana"
                                                  , TRUE ~ Continent
                                                  )
                            )
                   }
                 , dt_subset = {df[Country == 'Afghanistan', Continent := 'Asia'
                                   ][Country == 'Australia', Continent := 'Oceana'
                                     ]
                   }
                 , times = 10
                 )
# plot
autoplot(xx)