I have the following data:
structure(list(Name = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "C", "C", "C", "C", "C", "C"), Date = c("01.09.2018",
"02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021", "07.11.2021",
"01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021",
"07.11.2021", "01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021",
"06.11.2021", "07.11.2021"), Category = c(1L, 1L, NA, NA, 2L,
2L, 3L, 3L, 3L, NA, NA, 4L, 4L, 2L, NA, 2L, 2L, NA), Size = c(34L,
23L, 12L, 53L, 23L, 53L, 23L, 54L, 65L, 75L, 67L, 45L, 45L, 23L,
23L, 12L, 12L, NA)), class = "data.frame", row.names = c(NA,
-18L))
I would like to compute the daily sum of the Size for each Category. So for instance, for 01.09.2018 I would like to have the total size of all observations on 01.08.2018 belonging to the same category. The difficulty for me is that I have also observations where the category is NA.
So far I created the follwing code:
Data <- Data %>%
group_by(Category, Date) %>%
dplyr:: mutate(Sum_Size = sum(Size))
The problem is that I also have now the sum of the size for observations and days without a category (NA). R treats the category where there is no "real" category (NA) as an own category. However, I would like that R ignroes the "NA Category" and does not calculate the sum of the size for each day. So for instance, we can write "not applicable" in the rows where this would apply.
Could anyone help me here with the code?
Thank you very much already!
CodePudding user response:
EDIT: OP wants to keep the data that is Category == NA, so maybe this solution?
data_noNA <- data %>%
group_by(Category, Date) %>%
dplyr::summarize(Sum_Size = sum(Size, na.rm = TRUE)) %>%
filter(!is.na(Category)) %>%
# add back in info from missing columns after summarize
left_join(data, by = c("Category", "Date"))
data2 <- bind_rows(data_noNA, data %>% filter(is.na(Category))); data2
# A tibble: 18 x 5
# Groups: Category [5]
Category Date Sum_Size Name Size
<int> <chr> <int> <chr> <int>
1 1 01.09.2018 34 A 34
2 1 02.09.2018 23 A 23
3 2 02.09.2018 23 C 23
4 2 05.11.2021 12 C 12
5 2 06.11.2021 35 A 23
6 2 06.11.2021 35 C 12
7 2 07.11.2021 53 A 53
8 3 01.09.2018 23 B 23
9 3 02.09.2018 54 B 54
10 3 03.09.2018 65 B 65
11 4 01.09.2018 45 C 45
12 4 07.11.2021 45 B 45
13 NA 03.09.2018 NA A 12
14 NA 05.11.2021 NA A 53
15 NA 05.11.2021 NA B 75
16 NA 06.11.2021 NA B 67
17 NA 03.09.2018 NA C 23
18 NA 07.11.2021 NA C NA
Something like this?
library(tidyverse)
data <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "C", "C", "C", "C", "C", "C"), Date = c("01.09.2018",
"02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021", "07.11.2021",
"01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021", "06.11.2021",
"07.11.2021", "01.09.2018", "02.09.2018", "03.09.2018", "05.11.2021",
"06.11.2021", "07.11.2021"), Category = c(1L, 1L, NA, NA, 2L,
2L, 3L, 3L, 3L, NA, NA, 4L, 4L, 2L, NA, 2L, 2L, NA), Size = c(34L,
23L, 12L, 53L, 23L, 53L, 23L, 54L, 65L, 75L, 67L, 45L, 45L, 23L,
23L, 12L, 12L, NA)), class = "data.frame", row.names = c(NA,
-18L))
data2 <- data %>%
group_by(Category, Date) %>%
dplyr::summarize(Sum_Size = sum(Size, na.rm = TRUE)) %>%
filter(!is.na(Category)); data2
#> `summarise()` has grouped output by 'Category'. You can override using the
#> `.groups` argument.
#> # A tibble: 11 x 3
#> # Groups: Category [4]
#> Category Date Sum_Size
#> <int> <chr> <int>
#> 1 1 01.09.2018 34
#> 2 1 02.09.2018 23
#> 3 2 02.09.2018 23
#> 4 2 05.11.2021 12
#> 5 2 06.11.2021 35
#> 6 2 07.11.2021 53
#> 7 3 01.09.2018 23
#> 8 3 02.09.2018 54
#> 9 3 03.09.2018 65
#> 10 4 01.09.2018 45
#> 11 4 07.11.2021 45
Created on 2022-04-16 by the reprex package (v2.0.1)