How to use summarise and n_distinct over cumulatively increasing subset of dataframe-CodePudding

I have a dafaframe called "CERS":

CERS = data.frame(IDs = c(seq(1, 10, by = 1)),
                 city = c("Vienna" , "Bratislava" , "Prague", "Budapest", "Amsterdam", "Turin", "Barcelona", "Bratislava", "Budapest", "Prague" ),
                 state = c("AT" , "SK" , "CZ", "HU", "NL", "IT", "ES", "SK", "HU", "CZ"), 
                 year = c(seq(2011, 2020, by = 1)))

I would like to count the number of distinct cities and states, within the increasing intervals of years, that is c(2011, 2011:2012, 2011:2013, 2011:2014, 2011:2015, ... 2011:2020) and save the results as a table.

How can I expand the code below to do the job? Answers outside of dplyr are also useful.

library(dplyr)
CERS %>% 
  filter(year>=2011 & year <= 2015) %>% 
  summarise(n_dist=n_distinct(city), n_dist_country = n_distinct(state))

CodePudding user response：

You could do:

CERS %>%
 arrange(year) %>%
 mutate(across(c(city, state), ~ !duplicated(.), .names = "n_dist_{.col}")) %>%
 group_by(year) %>%
 summarise(across(starts_with("n_dist"), sum)) %>%
 ungroup() %>%
 mutate(across(starts_with("n_dist"), cumsum))

    year n_dist_city n_dist_state
   <dbl>       <int>        <int>
 1  2011           1            1
 2  2012           2            2
 3  2013           3            3
 4  2014           4            4
 5  2015           5            5
 6  2016           6            6
 7  2017           7            7
 8  2018           7            7
 9  2019           7            7
10  2020           7            7

CodePudding user response：

Count the number of distinct city/state combinations in the 2011 to y range, where y varies from 2011 to 2020:

data.frame(
  year = 2011:2020,
  n = sapply(2011:2020, \(y) CERS %>% filter(between(year,2011,y)) %>% with(n_distinct(city,state)))
)

Output: