Home > Net >  Fill empty data frame based on multiple criteria in R (equiavalent to SUMIFS in Excel)
Fill empty data frame based on multiple criteria in R (equiavalent to SUMIFS in Excel)

Time:08-21

I am fairly new to R and happen to be stuck with the following problem: I want to fill in an empty data frame based on multiple criteria and sum the variable of interest. The data frames look like this:

The empty matrix, that I want to fill:

dat <- data.frame(year = 2021, 
                  month = 1:12,
                  green = "N/A",
                  yellow = "N/A",
                  red = "N/A",
                  blue = "N/A",
                  purple = "N/A")

dat

   year month green yellow red blue purple
1  2021     1   N/A    N/A N/A  N/A    N/A
2  2021     2   N/A    N/A N/A  N/A    N/A
3  2021     3   N/A    N/A N/A  N/A    N/A
4  2021     4   N/A    N/A N/A  N/A    N/A
5  2021     5   N/A    N/A N/A  N/A    N/A
6  2021     6   N/A    N/A N/A  N/A    N/A
7  2021     7   N/A    N/A N/A  N/A    N/A
8  2021     8   N/A    N/A N/A  N/A    N/A
9  2021     9   N/A    N/A N/A  N/A    N/A
10 2021    10   N/A    N/A N/A  N/A    N/A
11 2021    11   N/A    N/A N/A  N/A    N/A
12 2021    12   N/A    N/A N/A  N/A    N/A

And data frame with the original data:

dat_org <- data.frame(year = c(2020,2020,2020,2020,2020,2021,2021,2021,2021,2021,2021,2021,2021), 
                      month = c(1,4,6,9,12,1,1,1,5,5,5,10,10),
                      height = c(23,35,76,87,55,43,21,12,45,67,78,23,34),
                      colour = c("green", "green", "red", "yellow", "purple", "green", "blue", "blue", "blue", "red", "yellow", "yellow", "red"))

dat_org

   year month height colour
1  2020     1     23  green
2  2020     4     35  green
3  2020     6     76    red
4  2020     9     87 yellow
5  2020    12     55 purple
6  2021     1     43  green
7  2021     1     21   blue
8  2021     1     12   blue
9  2021     5     45   blue
10 2021     5     67    red
11 2021     5     78 yellow
12 2021    10     23 yellow
13 2021    10     34    red

Now, I want to transfer the "height" values in dat_org to dat, based on the year, month and colour. If no values in dat_org based on these criteria exist, I want to replace N/A with 0. If there are several rows, that meet the conditions, I want to sum up the variable (as it happend for year = 2021, month = 1 and colour ="blue").

The result would look like this:

dat

   year month green yellow red blue purple
1  2021     1    43      0   0   33      0
2  2021     2     0      0   0    0      0
3  2021     3     0      0   0    0      0
4  2021     4     0      0   0    0      0
5  2021     5     0     78  67   45      0
6  2021     6     0      0   0    0      0
7  2021     7     0      0   0    0      0
8  2021     8     0      0   0    0      0
9  2021     9     0      0   0    0      0
10 2021    10     0     23  34    0      0
11 2021    11     0      0   0    0      0
12 2021    12     0      0   0    0      0

I was first thinking to use "aggregate" or "group by", but then it only lists the existing months of df_org (here 1,5,10), but I also want to fill in N/A in df with a zero, if the conditions are not fulfilled.

The logic is similar to sumifs in Excel, but I am not sure how to apply this in R.

Thanks in advance for your help!

CodePudding user response:

@stefan, thanks for the quick response. If've tried your suggestion and it gives me the following results:

dat_org %>%
  group_by(year, month, colour) %>% summarise(height = sum(height), .groups = "drop") %>% 
  pivot_wider(names_from = colour, values_from = height, values_fill = 0)
  
   year month green   red yellow purple  blue
  <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
1  2020     1    23     0      0      0     0
2  2020     4    35     0      0      0     0
3  2020     6     0    76      0      0     0
4  2020     9     0     0     87      0     0
5  2020    12     0     0      0     55     0
6  2021     1    43     0      0      0    33
7  2021     5     0    67     78      0    45
8  2021    10     0    34     23      0     0

Now, what is still missing, are the months, where no values are given in dat_org, such as month 2,3,4,6,7,8,9,11,12 in 2021.

CodePudding user response:

This is basically the code from my comment which first aggregates the data and after that reshape to wide format. However, I added two pieces. First, I use tidyr::complete to add all months from 1 to 12 and finally I filter for year 2021

dat_org <- data.frame(year = c(2020,2020,2020,2020,2020,2021,2021,2021,2021,2021,2021,2021,2021), 
                      month = c(1,4,6,9,12,1,1,1,5,5,5,10,10),
                      height = c(23,35,76,87,55,43,21,12,45,67,78,23,34),
                      colour = c("green", "green", "red", "yellow", "purple", "green", "blue", "blue", "blue", "red", "yellow", "yellow", "red"))

library(dplyr)
library(tidyr)

colors <- unique(dat_org$colour)

dat_org %>% 
  group_by(year, month, colour) %>% 
  summarise(height = sum(height), .groups = "drop") %>% 
  complete(year, month = 1:12, colour, fill = list(height = 0))  %>% 
  pivot_wider(names_from = colour, values_from = height, values_fill = 0) %>%
  filter(year == 2021)
#> # A tibble: 12 × 7
#>     year month  blue green purple   red yellow
#>    <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
#>  1  2021     1    33    43      0     0      0
#>  2  2021     2     0     0      0     0      0
#>  3  2021     3     0     0      0     0      0
#>  4  2021     4     0     0      0     0      0
#>  5  2021     5    45     0      0    67     78
#>  6  2021     6     0     0      0     0      0
#>  7  2021     7     0     0      0     0      0
#>  8  2021     8     0     0      0     0      0
#>  9  2021     9     0     0      0     0      0
#> 10  2021    10     0     0      0    34     23
#> 11  2021    11     0     0      0     0      0
#> 12  2021    12     0     0      0     0      0
  •  Tags:  
  • r
  • Related