Home > Enterprise >  R: Calculate sum over a column based on groups for panel data where one group has no data
R: Calculate sum over a column based on groups for panel data where one group has no data

Time:04-17

I have the following data:

structure(list(Name = c("A", "A", "A", "A", "A", "A", "B", "B", 
"B", "B", "B", "B", "C", "C", "C", "C", "C", "C"), Date = c("01.09.2018", 
"02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021", "07.11.2021", 
"01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021", 
"07.11.2021", "01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021", 
"06.11.2021", "07.11.2021"), Category = c(1L, 1L, NA, NA, 2L, 
2L, 3L, 3L, 3L, NA, NA, 4L, 4L, 2L, NA, 2L, 2L, NA), Size = c(34L, 
23L, 12L, 53L, 23L, 53L, 23L, 54L, 65L, 75L, 67L, 45L, 45L, 23L, 
23L, 12L, 12L, NA)), class = "data.frame", row.names = c(NA, 
-18L))

I would like to compute the daily sum of the Size for each Category. So for instance, for 01.09.2018 I would like to have the total size of all observations on 01.08.2018 belonging to the same category. The difficulty for me is that I have also observations where the category is NA.

So far I created the follwing code:

Data <- Data %>%
  group_by(Category, Date) %>%
  dplyr:: mutate(Sum_Size = sum(Size))

The problem is that I also have now the sum of the size for observations and days without a category (NA). R treats the category where there is no "real" category (NA) as an own category. However, I would like that R ignroes the "NA Category" and does not calculate the sum of the size for each day. So for instance, we can write "not applicable" in the rows where this would apply.

Could anyone help me here with the code?

Thank you very much already!

CodePudding user response:

EDIT: OP wants to keep the data that is Category == NA, so maybe this solution?

data_noNA <- data %>%
  group_by(Category, Date) %>%
  dplyr::summarize(Sum_Size = sum(Size, na.rm = TRUE)) %>%
  filter(!is.na(Category)) %>%
  # add back in info from missing columns after summarize
  left_join(data, by = c("Category", "Date"))

data2 <- bind_rows(data_noNA, data %>% filter(is.na(Category))); data2
# A tibble: 18 x 5
# Groups:   Category [5]
   Category Date       Sum_Size Name   Size
      <int> <chr>         <int> <chr> <int>
 1        1 01.09.2018       34 A        34
 2        1 02.09.2018       23 A        23
 3        2 02.09.2018       23 C        23
 4        2 05.11.2021       12 C        12
 5        2 06.11.2021       35 A        23
 6        2 06.11.2021       35 C        12
 7        2 07.11.2021       53 A        53
 8        3 01.09.2018       23 B        23
 9        3 02.09.2018       54 B        54
10        3 03.09.2018       65 B        65
11        4 01.09.2018       45 C        45
12        4 07.11.2021       45 B        45
13       NA 03.09.2018       NA A        12
14       NA 05.11.2021       NA A        53
15       NA 05.11.2021       NA B        75
16       NA 06.11.2021       NA B        67
17       NA 03.09.2018       NA C        23
18       NA 07.11.2021       NA C        NA

Something like this?

library(tidyverse)
data <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B", "B", 
                        "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"), Date = c("01.09.2018", 
                                                                                    "02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021", "07.11.2021", 
                                                                                    "01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021", 
                                                                                    "07.11.2021", "01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021", 
                                                                                    "06.11.2021", "07.11.2021"), Category = c(1L, 1L, NA, NA, 2L, 
                                                                                                                              2L, 3L, 3L, 3L, NA, NA, 4L, 4L, 2L, NA, 2L, 2L, NA), Size = c(34L, 
                                                                                                                                                                                            23L, 12L, 53L, 23L, 53L, 23L, 54L, 65L, 75L, 67L, 45L, 45L, 23L, 
                                                                                                                                                                                            23L, 12L, 12L, NA)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                     -18L))
data2 <- data %>%
  group_by(Category, Date) %>%
  dplyr::summarize(Sum_Size = sum(Size, na.rm = TRUE)) %>%
  filter(!is.na(Category)); data2
#> `summarise()` has grouped output by 'Category'. You can override using the
#> `.groups` argument.
#> # A tibble: 11 x 3
#> # Groups:   Category [4]
#>    Category Date       Sum_Size
#>       <int> <chr>         <int>
#>  1        1 01.09.2018       34
#>  2        1 02.09.2018       23
#>  3        2 02.09.2018       23
#>  4        2 05.11.2021       12
#>  5        2 06.11.2021       35
#>  6        2 07.11.2021       53
#>  7        3 01.09.2018       23
#>  8        3 02.09.2018       54
#>  9        3 03.09.2018       65
#> 10        4 01.09.2018       45
#> 11        4 07.11.2021       45

Created on 2022-04-16 by the reprex package (v2.0.1)

  • Related