Hey I've got this table:
country date cases
----------------------------------------------------------
USA 2022-05-01 5
Benin 2022-05-28 2
USA 2021-05-17 3
USA 2022-05-05 7
Benin 2022-02-11 3
I want to group by country and then calculate the total number of cases per country each month of each year (without days where the month & year will be in a single column called date.):
country date cases
----------------------------------------------------------
USA 2022-05 12
USA 2021-05 3
Benin 2022-05 2
Benin 2022-02 3
I've recently asked a pre-question that was to group cases per each month and someone helped me with that code:
bymonth <- aggregate(cbind(cases) ~ substr(date, 1, 7), data=data1,
FUN=sum)
that was good but it sums up all cases in all countries per each month. I want to sum up for each country, each month of each year like in the second table above.
the name of the data is data1 and the columns are 'cases'/'date' (class DATE)/'country'
CodePudding user response:
If the date
column is of type Date
, then you can use format(date, "%Y-%m")
to foramt it as "yyyy-mm" (It's no longer a Date
object, but a character string).
aggregate(cases ~ country date2,
transform(df, date2 = format(date, "%Y-%m")),
FUN = sum)
# country date2 cases
# 1 USA 2021-05 3
# 2 Benin 2022-02 3
# 3 Benin 2022-05 2
# 4 USA 2022-05 12
If the date
column is of type character
, you should convert it to Date
by as.Date(date)
in advance.
If you have attached dplyr
, then the above method is equivalent to:
df %>%
count(country, date2 = format(date, "%Y-%m"), wt = cases, name = "cases")
# country date2 cases
# 1 Benin 2022-02 3
# 2 Benin 2022-05 2
# 3 USA 2021-05 3
# 4 USA 2022-05 12
where count()
is a shortcut of group_by(...) %>% summarise(cases = sum(cases))
.
Data
df = data.frame(
country = c('USA', 'Benin', 'USA', 'USA', 'Benin'),
date = as.Date(c('2022-05-01', '2022-05-28', '2021-05-17', '2022-05-05', '2022-02-11')),
cases = c(5, 2, 3, 7, 3)
)
CodePudding user response:
The lubridate package is not really necessary, but it does make things a little simpler
library(lubridate)
aggregate(cases~country year(date) month(date),data=df1,sum)
country year(date) month(date) cases
1 Benin 2022 2 3
2 USA 2021 5 3
3 Benin 2022 5 2
4 USA 2022 5 12
CodePudding user response:
If you prefer tidyverse, you can create new grouping columns in group_by
function.
df = data.frame(
'country'= c('usa', 'usa', 'benin', 'benin', 'usa'),
'date' = c('2022-05-02', '2021-04-03', '2022-05-02', '2022-02-03', '2022-05-05'),
'cases' = c(12, 3, 2, 4, 10)
)
library(tidyverse)
library(lubridate)
df %>% group_by(
country,
year=year(date),
month=month(date)) %>%
summarise(sum=sum(cases)) %>%
ungroup() %>%
mutate(year_mon = paste(year, month, sep='-')) %>%
select(country, year_mon, sum)
#> `summarise()` has grouped output by 'country', 'year'. You can override using the `.groups` argument.
#> # A tibble: 4 × 3
#> country year_mon sum
#> <chr> <chr> <dbl>
#> 1 benin 2022-2 4
#> 2 benin 2022-5 2
#> 3 usa 2021-4 3
#> 4 usa 2022-5 22
Created on 2022-05-13 by the reprex package (v2.0.1)
CodePudding user response:
Here's another way with tidyverse
, where we can simultaneously convert the date object to year-month and create our grouping variables, then summarise.
library(tidyverse)
df %>%
group_by(country, date = format(as.Date(df$date), "%Y-%m")) %>%
summarise(cases = sum(cases, na.rm = TRUE))
Output
country date cases
<chr> <chr> <dbl>
1 Benin 2022-02 3
2 Benin 2022-05 2
3 USA 2021-05 3
4 USA 2022-05 12
Data
df <- structure(list(country = c("USA", "Benin", "USA", "USA", "Benin"
), date = structure(c(19113, 19140, 18764, 19117, 19034), class = "Date"),
cases = c(5, 2, 3, 7, 3)), class = "data.frame", row.names = c(NA,
-5L))
CodePudding user response:
Here is how an alternative approach:
library(lubridate)
library(dplyr)
df %>%
mutate(Month_Yr = format_ISO8601(date, precision = "ym")) %>%
group_by(country, Month_Yr) %>%
summarise(cases = sum(cases))
country Month_Yr cases
<chr> <chr> <dbl>
1 Benin 2022-02 3
2 Benin 2022-05 2
3 USA 2021-05 3
4 USA 2022-05 12