I have a dafaframe called "CERS":
CERS = data.frame(IDs = c(seq(1, 10, by = 1)),
city = c("Vienna" , "Bratislava" , "Prague", "Budapest", "Amsterdam", "Turin", "Barcelona", "Bratislava", "Budapest", "Prague" ),
state = c("AT" , "SK" , "CZ", "HU", "NL", "IT", "ES", "SK", "HU", "CZ"),
year = c(seq(2011, 2020, by = 1)))
I would like to count the number of distinct cities and states, within the increasing intervals of years, that is c(2011, 2011:2012, 2011:2013, 2011:2014, 2011:2015, ... 2011:2020) and save the results as a table.
How can I expand the code below to do the job? Answers outside of dplyr are also useful.
library(dplyr)
CERS %>%
filter(year>=2011 & year <= 2015) %>%
summarise(n_dist=n_distinct(city), n_dist_country = n_distinct(state))
CodePudding user response:
You could do:
CERS %>%
arrange(year) %>%
mutate(across(c(city, state), ~ !duplicated(.), .names = "n_dist_{.col}")) %>%
group_by(year) %>%
summarise(across(starts_with("n_dist"), sum)) %>%
ungroup() %>%
mutate(across(starts_with("n_dist"), cumsum))
year n_dist_city n_dist_state
<dbl> <int> <int>
1 2011 1 1
2 2012 2 2
3 2013 3 3
4 2014 4 4
5 2015 5 5
6 2016 6 6
7 2017 7 7
8 2018 7 7
9 2019 7 7
10 2020 7 7
CodePudding user response:
Count the number of distinct city/state combinations in the 2011 to y range, where y varies from 2011 to 2020:
data.frame(
year = 2011:2020,
n = sapply(2011:2020, \(y) CERS %>% filter(between(year,2011,y)) %>% with(n_distinct(city,state)))
)
Output:
year n
1 2011 1
2 2012 2
3 2013 3
4 2014 4
5 2015 5
6 2016 6
7 2017 7
8 2018 7
9 2019 7
10 2020 7