I am at a total loss for this one - I am playing with the "pedestrian" dataset from tsibble
. I want to get total counts for each month/year. I started by adding a month_year
column, then summarise the data with sum, like so:
library("tidyverse")
library("tsibble")
df1 <- pedestrian
df1$month_year <- format(as.Date(df1$Date), "%Y-%m")
count_all <- df1 %>%
dplyr::group_by(month_year) %>%
dplyr::summarise(total = sum(Count))
A summary of count_all looks like this:
month_year Date_Time total
Length:17542 Min. :2015-01-01 00:00:00.0 Min. : 12
Class :character 1st Qu.:2015-07-02 17:15:00.0 1st Qu.: 349
Mode :character Median :2016-01-01 11:30:00.0 Median : 2090
Mean :2016-01-01 11:44:40.2 Mean : 2593
3rd Qu.:2016-07-02 04:45:00.0 3rd Qu.: 4455
Max. :2016-12-31 23:00:00.0 Max. :15990
Why is Date_Time
being retained? And how can I prevent it form impacting the summary (as in prevent it from giving me 17,542 rows instead of the expected 24). If I remove the column before the summary like so:
df1$Date_Time <- NULL
Then it works fine, and a summary of the result looks like this:
month_year total
Length:24 Min. :1148276
Class :character 1st Qu.:1756898
Mode :character Median :1927154
Mean :1895161
3rd Qu.:2066043
Max. :2393675
This solution is fine, but I would like to know what the cause of the issue is so that I can avoid it in future (it was easy to catch the problem this time, but may not always be so straight forward).
Thanks in advance for the help!
CodePudding user response:
The dataset pedestrian
is a tsibble with sensor as a key and Date_Time as the index. Any operation you do on the tsibble will retain the index. You can remove the index by converting back to a tibble.
pedestrian %>%
as_tibble() %>%
mutate(ym = yearmonth(Date)) %>%
dplyr::group_by(ym) %>%
dplyr::summarise(total = sum(Count))