Home > Software design >  How to monthly summarise daily data, using dplyr, only if less than 10 days per month are NAs?
How to monthly summarise daily data, using dplyr, only if less than 10 days per month are NAs?

Time:10-28

I have a dataframe with daily meteorological data from a site and I want to summarize them in monthly averages but only if the number of NAs in each month is less than 10.

Data sample: exmpldf

DATE TMAX
1955-06-01 NA
1955-06-02 NA
1955-06-03 NA
1955-06-04 NA
1955-06-05 NA
1955-06-06 NA
1955-06-07 NA
1955-06-08 NA
1955-06-09 20
1955-06-10 23
1955-06-11 23
1955-06-12 21.5
1955-06-13 24
1955-06-14 26.5
1955-06-15 27
1955-06-16 27

For the monthly aggregation I used mutate from dplyr library to create a column "MONTH" and a column "YEAR"

exmpldf <- exmpldf %>%
  mutate(month(DATE), year(DATE))
names(exmpldf) <- c("DATE", "TMAX", "MONTH", "YEAR")
DATE TMAX MONTH YEAR
1955-06-01 NA 6 1955
1955-06-02 NA 6 1955
1955-06-03 NA 6 1955
1955-06-04 NA 6 1955
1955-06-05 NA 6 1955
1955-06-06 NA 6 1955
1955-06-07 NA 6 1955
1955-06-08 NA 6 1955
1955-06-09 20 6 1955
1955-06-10 23 6 1955
1955-06-11 23 6 1955
1955-06-12 21.5 6 1955
1955-06-13 24 6 1955
1955-06-14 26.5 6 1955
1955-06-15 27 6 1955
1955-06-16 27 6 1955

To create the monthly average I used summarize, also from dplyr:

exmpldfmeanMonth <- exmpldf %>%
  group_by(MONTH, YEAR) %>%
  summarise(TMAX = mean(TMAX))

The problem is that there are many months that have at least 1 NA and others have 25 or 30 NAs, in any case, the mean result is NA. You could add na.rm = T but then it calculates the mean even if there is only one data per month.

So I would like to generate a conditional that calculates the average only if there are 10 or less NAs per month, otherwise it should be considered as NA.

I tried without success:

exmpldfmeanMonth <- exmpldf %>%
  group_by(MONTH, YEAR) %>%
  summarise_if(is.na(exmpldf$TMAX)<=10, PRCP = mean(TMAX, na.rm = T), NA)

and:

exmpldfmeanMonth <- exmpldf %>%
  group_by(MONTH, YEAR) %>%
  summarise(PRCP = ifelse(is.na(exmpldf$TMAX)<=10, mean(TMAX, na.rm = T), NA))

Could you please guide me on how to solve this? Thank you very much in advance!

CodePudding user response:

Please, find one alternative based on your approach using the packages data.table and lubridate:

Reprex

  • Code
library(data.table)
library(lubridate)

setDT(df1)[, DATE := as_date(DATE)][,  c("month", "year") := c(lapply(.SD, month), lapply(.SD, year)), .SDcols = c("DATE")
                                    ][, .(PRCP = fifelse(sum(is.na(TMAX)) <= 10, mean(TMAX, na.rm = TRUE), NA_real_)), by = .(month, year)][]
  • Case 1: NA <= 10

1.1 Your data:

df1 <- data.frame(DATE = c("1955-06-01", "1955-06-02", "1955-06-03", "1955-06-04",
                           "1955-06-05", "1955-06-06", "1955-06-07", "1955-06-08",
                           "1955-06-09", "1955-06-10", "1955-06-11", "1955-06-12",
                           "1955-06-13", "1955-06-14", "1955-06-15", "1955-06-16"),
                  TMAX = c(NA, NA, NA, NA, NA, NA, NA, NA, 20, 23, 23, 21.5, 24, 26.5,
                           27, 27))

2.2 Output:

#>    month year PRCP
#> 1:     6 1955   24
  • Case 2: NA > 10

2.1 Your data:

df1 <- data.frame(DATE = c("1955-06-01", "1955-06-02", "1955-06-03", "1955-06-04",
                           "1955-06-05", "1955-06-06", "1955-06-07", "1955-06-08",
                           "1955-06-09", "1955-06-10", "1955-06-11", "1955-06-12",
                           "1955-06-13", "1955-06-14", "1955-06-15", "1955-06-16"),
                  TMAX = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 21.5, 24, 26.5,
                           27, 27))

2.2 Output:

#>    month year PRCP
#> 1:     6 1955   NA

Created on 2021-10-28 by the reprex package (v0.3.0)

CodePudding user response:

library(dplyr)
library(lubridate)

df %>% 
  mutate(month = month(DATE),
         year = year(DATE)) %>% 
  group_by(month, year) %>% 
  summarize(prcp = if (sum(is.na(TMAX)) <= 10) mean(TMAX, na.rm = T) else NA,
            .groups = "drop")

Alternatively, when you summarize you could count the number of NA and then add a mutate statement to conditionally change prcp:

df %>% 
  mutate(month = month(DATE),
         year = year(DATE)) %>% 
  group_by(month, year) %>% 
  summarize(prcp = mean(TMAX, na.rm = T),
            numna = sum(is.na(TMAX)), # count number of NA
            .groups = "drop") %>% 
  mutate(prcp = ifelse(numna > 10, NA, prcp)) %>% 
  select(-numna)

Output

In the data you've shown there is only one month and year combination and that grouping has more than 10 NA:

  month  year prcp 
1     6  1955 NA   
  • Related