Home > database >  dplyr group_by retaining extra columns after summarise
dplyr group_by retaining extra columns after summarise

Time:06-21

I am at a total loss for this one - I am playing with the "pedestrian" dataset from tsibble. I want to get total counts for each month/year. I started by adding a month_year column, then summarise the data with sum, like so:

library("tidyverse")
library("tsibble")

df1 <- pedestrian
df1$month_year <- format(as.Date(df1$Date), "%Y-%m")

count_all <- df1 %>%  
  dplyr::group_by(month_year) %>% 
  dplyr::summarise(total = sum(Count))

A summary of count_all looks like this:

  month_year          Date_Time                         total      
 Length:17542       Min.   :2015-01-01 00:00:00.0   Min.   :   12  
 Class :character   1st Qu.:2015-07-02 17:15:00.0   1st Qu.:  349  
 Mode  :character   Median :2016-01-01 11:30:00.0   Median : 2090  
                    Mean   :2016-01-01 11:44:40.2   Mean   : 2593  
                    3rd Qu.:2016-07-02 04:45:00.0   3rd Qu.: 4455  
                    Max.   :2016-12-31 23:00:00.0   Max.   :15990

Why is Date_Time being retained? And how can I prevent it form impacting the summary (as in prevent it from giving me 17,542 rows instead of the expected 24). If I remove the column before the summary like so:

df1$Date_Time <- NULL

Then it works fine, and a summary of the result looks like this:

  month_year            total        
 Length:24          Min.   :1148276  
 Class :character   1st Qu.:1756898  
 Mode  :character   Median :1927154  
                    Mean   :1895161  
                    3rd Qu.:2066043  
                    Max.   :2393675  

This solution is fine, but I would like to know what the cause of the issue is so that I can avoid it in future (it was easy to catch the problem this time, but may not always be so straight forward).

Thanks in advance for the help!

CodePudding user response:

The dataset pedestrian is a tsibble with sensor as a key and Date_Time as the index. Any operation you do on the tsibble will retain the index. You can remove the index by converting back to a tibble.

pedestrian %>%
  as_tibble() %>% 
  mutate(ym = yearmonth(Date)) %>% 
  dplyr::group_by(ym) %>% 
  dplyr::summarise(total = sum(Count))
  • Related