Mean function in R with missing values, computed over blocks of 5 rows

I would like to calculate the mean of every 5 rows in my df. Here is my df:

Time	value
03/06/2021 06:15:00	NA
03/06/2021 06:16:00	NA
03/06/2021 06:17:00	20
03/06/2021 06:18:00	22
03/06/2021 06:19:00	25
03/06/2021 06:20:00	NA
03/06/2021 06:21:00	31
03/06/2021 06:22:00	23
03/06/2021 06:23:00	19
03/06/2021 06:24:00	25
03/06/2021 06:25:00	34
03/06/2021 06:26:00	42
03/06/2021 06:27:00	NA
03/06/2021 06:28:00	19
03/06/2021 06:29:00	17
03/06/2021 06:30:00	25

I already have a loop that correctly computes the mean for each block of 5 rows. My problem is with the mean function itself.
The problem is:
- if I set na.rm = FALSE, the mean is NA as soon as there is an NA in a block of 5 values.
- if I set na.rm = TRUE in the mean function, the result gives me averages that are shifted to take 5 values.
I would like the NAs not to interfere with the average: when there is an NA in a block of 5 values, the average should be computed on the remaining 4 values only.
How can I do this? Thanks for your help!

CodePudding user response：

You can solve your problem by introducing a dummy variable that groups your observations into sets of five and then calculating the mean within each group. Here's an MWE, based on the tidyverse, that assumes your data is in a data.frame named df.

library(tidyverse)

df %>% 
   mutate(Group = 1 + floor((row_number() - 1) / 5)) %>% 
   group_by(Group) %>% 
   summarise(Mean=mean(value, na.rm=TRUE), .groups="drop")
# A tibble: 4 × 2
  Group  Mean
  <dbl> <dbl>
1     1  22.3
2     2  24.5
3     3  28  
4     4  25
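
For readers who prefer not to load the tidyverse, the same grouping idea can be sketched in base R. This is only an illustration: the vector `value` is re-typed here from the question's data, and the names `group` and `block_means` are made up for the example.

```r
# Sample values re-typed from the question (assumed to match the df's value column)
value <- c(NA, NA, 20, 22, 25, NA, 31, 23, 19, 25, 34, 42, NA, 19, 17, 25)

# Block index: rows 1-5 -> 1, rows 6-10 -> 2, and so on
group <- (seq_along(value) - 1) %/% 5 + 1

# Mean within each block; na.rm = TRUE drops the NAs inside that block only,
# so the block boundaries stay fixed and a block with one NA is averaged over 4 values
block_means <- tapply(value, group, mean, na.rm = TRUE)
print(block_means)
```

Because the block index is built from the row position before any NA is removed, the averages are not shifted. A block that contains only NAs would come out as NaN.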

CodePudding user response：

A solution based on purrr::map_dfr:

library(purrr)

df <- data.frame(
  stringsAsFactors = FALSE,
                    time = c("03/06/2021 06:15:00","03/06/2021 06:16:00",
                             "03/06/2021 06:17:00",
                             "03/06/2021 06:18:00","03/06/2021 06:19:00",
                             "03/06/2021 06:20:00","03/06/2021 06:21:00",
                             "03/06/2021 06:22:00","03/06/2021 06:23:00",
                             "03/06/2021 06:24:00","03/06/2021 06:25:00",
                             "03/06/2021 06:26:00",
                             "03/06/2021 06:27:00","03/06/2021 06:28:00",
                             "03/06/2021 06:29:00","03/06/2021 06:30:00"),
                   value = c(NA,NA,20L,22L,
                             25L,NA,31L,23L,19L,25L,34L,42L,NA,19L,17L,
                             25L)
      )

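# Note: .x:(.x + 5) spans six consecutive rows and the windows overlap,
# so this computes a rolling mean rather than one mean per block of five.
map_dfr(1:(nrow(df)-5),
        ~ data.frame(Group = .x, Mean = mean(df$value[.x:(.x + 5)], na.rm = TRUE)))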
#>    Group     Mean
#> 1      1 22.33333
#> 2      2 24.50000
#> 3      3 24.20000
#> 4      4 24.00000
#> 5      5 24.60000
#> 6      6 26.40000
#> 7      7 29.00000
#> 8      8 28.60000
#> 9      9 27.80000
#> 10    10 27.40000
#> 11    11 27.40000
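
If non-overlapping blocks of five are wanted instead of a rolling window, one possible variant of the index-based approach is to step the start index by 5. The sketch below uses base R's `vapply` rather than purrr to stay self-contained; `starts`, `block_means`, and `block_df` are illustrative names, and `value` is re-typed from the question's data.

```r
# Sample values re-typed from the question (assumed to match df$value)
value <- c(NA, NA, 20, 22, 25, NA, 31, 23, 19, 25, 34, 42, NA, 19, 17, 25)
n <- length(value)

# First row of each block: 1, 6, 11, 16
starts <- seq(1, n, by = 5)

# Average rows s..s+4 of each block, clipped at the last row so the
# final (shorter) block is handled without indexing past the end
block_means <- vapply(starts, function(s) {
  idx <- s:min(s + 4, n)
  mean(value[idx], na.rm = TRUE)
}, numeric(1))

block_df <- data.frame(Group = seq_along(starts), Mean = block_means)
print(block_df)
```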

CodePudding user response：

If you want to take the average over every 5 minutes, you can use lubridate's floor_date/ceiling_date functions to round the time.

library(dplyr)
library(lubridate)

df %>%
  mutate(time = mdy_hms(time), 
         time = floor_date(time, '5 mins')) %>%
  group_by(time) %>%
  summarise(value = mean(value, na.rm = TRUE))

#  time                value
#  <dttm>              <dbl>
#1 2021-03-06 06:15:00  22.3
#2 2021-03-06 06:20:00  24.5
#3 2021-03-06 06:25:00  28  
#4 2021-03-06 06:30:00  25
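
The time-based grouping can also be approximated without lubridate by flooring the timestamps to multiples of 300 seconds. This is a rough sketch under a few assumptions: the dates are month-first (as `mdy_hms` implies), the time zone is taken to be UTC, and the names `time`, `bucket`, and `bucket_means` are chosen for illustration.

```r
# One timestamp per minute from 06:15 to 06:30 (UTC is an assumption)
time <- as.POSIXct("2021-03-06 06:15:00", tz = "UTC") + 60 * (0:15)
value <- c(NA, NA, 20, 22, 25, NA, 31, 23, 19, 25, 34, 42, NA, 19, 17, 25)

# Floor each timestamp to the start of its 5-minute (300-second) interval
bucket <- as.POSIXct(floor(as.numeric(time) / 300) * 300,
                     origin = "1970-01-01", tz = "UTC")

# Mean per interval, labelled by the interval's start time (HH:MM)
bucket_means <- tapply(value, format(bucket, "%H:%M"), mean, na.rm = TRUE)
print(bucket_means)
```

Unlike grouping by row position, this version still gives the right intervals when a minute is missing from the data, since the group depends on the timestamp and not on the row count.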