How to SUM total number of a value per each month & year and join by other columns in R-CodePudding

Hey I've got this table:

     country              date              cases
----------------------------------------------------------
     USA              2022-05-01             5
     Benin            2022-05-28             2
     USA              2021-05-17             3
     USA              2022-05-05             7
     Benin            2022-02-11             3

I want to group by country and then calculate the total number of cases per country each month of each year (without days where the month & year will be in a single column called date.):

     country              date              cases
----------------------------------------------------------
      USA               2022-05               12
      USA               2021-05               3
      Benin             2022-05               2
      Benin             2022-02               3

I've recently asked a pre-question that was to group cases per each month and someone helped me with that code:

        bymonth <- aggregate(cbind(cases) ~ substr(date, 1, 7), data=data1, 
        FUN=sum)

that was good but it sums up all cases in all countries per each month. I want to sum up for each country, each month of each year like in the second table above.

the name of the data is data1 and the columns are 'cases'/'date' (class DATE)/'country'

CodePudding user response：

If the date column is of type Date, then you can use format(date, "%Y-%m") to foramt it as "yyyy-mm" (It's no longer a Date object, but a character string).

aggregate(cases ~ country   date2,
          transform(df, date2 = format(date, "%Y-%m")),
          FUN = sum)

#   country    date2 cases
# 1     USA  2021-05     3
# 2   Benin  2022-02     3
# 3   Benin  2022-05     2
# 4     USA  2022-05    12

If the date column is of type character, you should convert it to Date by as.Date(date) in advance.

If you have attached dplyr, then the above method is equivalent to:

df %>%
  count(country, date2 = format(date, "%Y-%m"), wt = cases, name = "cases")

#   country   date2 cases
# 1   Benin 2022-02     3
# 2   Benin 2022-05     2
# 3     USA 2021-05     3
# 4     USA 2022-05    12

where count() is a shortcut of group_by(...) %>% summarise(cases = sum(cases)).

Data

df = data.frame(
  country = c('USA', 'Benin', 'USA', 'USA', 'Benin'),
  date = as.Date(c('2022-05-01', '2022-05-28', '2021-05-17', '2022-05-05', '2022-02-11')),
  cases = c(5, 2, 3, 7, 3) 
)

CodePudding user response：

The lubridate package is not really necessary, but it does make things a little simpler

library(lubridate)
aggregate(cases~country year(date) month(date),data=df1,sum)

  country year(date) month(date) cases
1   Benin       2022           2     3
2     USA       2021           5     3
3   Benin       2022           5     2
4     USA       2022           5    12

CodePudding user response：

If you prefer tidyverse, you can create new grouping columns in group_by function.

df = data.frame(
    'country'= c('usa', 'usa', 'benin', 'benin', 'usa'),
    'date' = c('2022-05-02', '2021-04-03', '2022-05-02', '2022-02-03', '2022-05-05'),
    'cases' = c(12, 3, 2, 4, 10) 
)

library(tidyverse)
library(lubridate)

df %>% group_by(
    country, 
    year=year(date), 
    month=month(date)) %>% 
    summarise(sum=sum(cases)) %>% 
    ungroup() %>% 
    mutate(year_mon = paste(year, month, sep='-')) %>% 
    select(country, year_mon, sum)
#> `summarise()` has grouped output by 'country', 'year'. You can override using the `.groups` argument.
#> # A tibble: 4 × 3
#>   country year_mon   sum
#>   <chr>   <chr>    <dbl>
#> 1 benin   2022-2       4
#> 2 benin   2022-5       2
#> 3 usa     2021-4       3
#> 4 usa     2022-5      22

^{Created on 2022-05-13 by the reprex package (v2.0.1)}

CodePudding user response：

Here's another way with tidyverse, where we can simultaneously convert the date object to year-month and create our grouping variables, then summarise.

library(tidyverse)

df %>%
  group_by(country, date = format(as.Date(df$date), "%Y-%m")) %>%
  summarise(cases = sum(cases, na.rm = TRUE))

Output

  country date    cases
  <chr>   <chr>   <dbl>
1 Benin   2022-02     3
2 Benin   2022-05     2
3 USA     2021-05     3
4 USA     2022-05    12

Data

df <- structure(list(country = c("USA", "Benin", "USA", "USA", "Benin"
), date = structure(c(19113, 19140, 18764, 19117, 19034), class = "Date"), 
    cases = c(5, 2, 3, 7, 3)), class = "data.frame", row.names = c(NA, 
-5L))

CodePudding user response：

Here is how an alternative approach:

library(lubridate)
library(dplyr)

df %>% 
  mutate(Month_Yr = format_ISO8601(date, precision = "ym")) %>% 
  group_by(country, Month_Yr) %>% 
  summarise(cases = sum(cases))

   country Month_Yr cases
  <chr>   <chr>    <dbl>
1 Benin   2022-02      3
2 Benin   2022-05      2
3 USA     2021-05      3
4 USA     2022-05     12