I am super new to R and programming as a whole.
I have a data set with over 2 million rows that I have split into 3 separate CSV files.
Currently the CSVs look like this (I removed some rows for simplicity):
Date Time Elevation
1 2011-01-01 0:00:00 3.532
2 2011-01-01 0:15:00 3.538
3 2011-01-01 0:30:00 3.541
4 2011-01-01 0:45:00 3.546
5 2011-01-01 1:00:00 3.548
6 2011-01-01 1:15:00 3.555
7 2011-01-01 1:30:00 3.561
8 2011-01-01 1:45:00 3.566
9 2011-01-01 2:00:00 3.568
......
Date Time Elevation
484985 2018-04-21 0:00:00 3.15264
484986 2018-04-21 0:05:00 3.15164
484987 2018-04-21 0:10:00 3.14964
484988 2018-04-21 0:15:00 3.14563
484989 2018-04-21 0:20:00 3.14263
484990 2018-04-21 0:25:00 3.13962
The time intervals are not consistent: some data were measured at 15-minute intervals and some at 5-minute intervals. I would like to get an output like this:
Date Time Elevation
1 2011-01-01 0:00:00
2 2011-01-01 1:00:00 3.5393
3 2011-01-01 2:00:00 3.5575
4 2011-01-01 3:00:00 3.5788
I tried this from an old post, but it didn't work; it only took the average of the entire day instead of by hour:
library(lubridate)
df[,2] <- ymd_hms(df[,2])
df %>%
  mutate(hour = hour(df[,2])) %>%
  group_by(hour) %>%
  summarise_at(vars(Elevation), mean)
And I currently have this, which splits the time into separate columns for hour, minute, and second:
library(readr)
library(tidyverse)
library(lubridate)
read_csv <- read.csv('dataset1.csv') %>%
  mutate(date_time = ymd_hms(paste0(Date, " ", Time)))
df1 <- read_csv %>%
  separate(Time, into = c("Hour", "Min", "Sec"), sep = ":", remove = FALSE)
I have spent a long time trying to figure it out and have gone through similar posts here but so far nothing has worked. Any help would be appreciated!
CodePudding user response:
If you are fine with a simple mean of the elevation, then cutting the date/time column into one-hour intervals and grouping with group_by() can handle this:
df <- structure(list(Date = c("2011-01-01", "2011-01-01", "2011-01-01",
"2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01",
"2011-01-01", "2018-04-21", "2018-04-21", "2018-04-21", "2018-04-21",
"2018-04-21", "2018-04-21"), Time = c("0:00:00", "0:15:00", "0:30:00",
"0:45:00", "1:00:00", "1:15:00", "1:30:00", "1:45:00", "2:00:00",
"0:00:00", "0:05:00", "0:10:00", "0:15:00", "0:20:00", "0:25:00"
), Elevation = c(3.532, 3.538, 3.541, 3.546, 3.548, 3.555, 3.561,
3.566, 3.568, 3.15264, 3.15164, 3.14964, 3.14563, 3.14263, 3.13962
)), class = "data.frame", row.names = c(NA, -15L))
# create the datetime column
df$datetime <- as.POSIXct(paste(df$Date, df$Time))

# group into one-hour bins and take the mean
library(dplyr)
df %>%
  group_by(hour = cut(datetime, "1 hour")) %>%
  summarize(Elevation = mean(Elevation))
At 2 million rows this will take some time but shouldn't be too long.
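To run this on the full data set, you can stack the three CSVs first; a minimal sketch, assuming the files are named dataset1.csv through dataset3.csv (placeholder names, adjust to yours) and share the Date/Time/Elevation layout:
library(dplyr)
# assumption: placeholder file names for the three CSVs
files <- c("dataset1.csv", "dataset2.csv", "dataset3.csv")
all_data <- files %>%
  lapply(read.csv) %>%  # read each file into a data frame
  bind_rows()           # stack into one ~2 million row data frame
# then build the datetime column and aggregate exactly as above
all_data$datetime <- as.POSIXct(paste(all_data$Date, all_data$Time))
all_data %>%
  group_by(hour = cut(datetime, "1 hour")) %>%
  summarize(Elevation = mean(Elevation))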
CodePudding user response:
We may use floor_date() from lubridate:
library(dplyr)
library(lubridate)
library(tidyr)
read_csv %>%
  group_by(grp = floor_date(date_time, "hour")) %>%
  summarise(Elevation = mean(Elevation, na.rm = TRUE)) %>%
  separate(grp, into = c("Date", "Time"), sep = " ")
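floor_date() truncates each timestamp to the start of its hour, so the 15-minute and 5-minute rows fall into the same hourly bins regardless of the sampling interval. For example:
library(lubridate)
floor_date(ymd_hms("2011-01-01 1:45:00"), "hour")
# [1] "2011-01-01 01:00:00 UTC"
floor_date(ymd_hms("2018-04-21 0:25:00"), "hour")
# [1] "2018-04-21 00:00:00 UTC"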
CodePudding user response:
library(tibble)
library(dplyr)
library(lubridate)
tibble::tribble(
  ~Date,        ~Time,     ~Elevation,
  "2011-01-01", "0:00:00", 3.532,
  "2011-01-01", "0:15:00", 3.538,
  "2011-01-01", "0:30:00", 3.541,
  "2011-01-01", "0:45:00", 3.546,
  "2011-01-01", "1:00:00", 3.548,
  "2011-01-01", "1:15:00", 3.555,
  "2011-01-01", "1:30:00", 3.561,
  "2011-01-01", "1:45:00", 3.566,
  "2011-01-01", "2:00:00", 3.568
) %>%
  dplyr::mutate(
    Date = lubridate::date(Date),
    Time = lubridate::hms(Time),
    Hour = lubridate::hour(Time)
  ) %>%
  dplyr::group_by(Date, Hour) %>%
  dplyr::summarise(Elevation = mean(Elevation, na.rm = TRUE))
# A tibble: 3 × 3
# Groups: Date [1]
Date Hour Elevation
<date> <dbl> <dbl>
1 2011-01-01 0 3.54
2 2011-01-01 1 3.56
3 2011-01-01 2 3.57
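Grouping by Date and Hour together is what fixes the problem in the original attempt, where grouping by hour alone pooled the same clock hour across all days. Note also that summarise() drops only the innermost grouping level, which is why the result above is still grouped by Date; a small sketch, assuming dplyr >= 1.0.0, if you want an ungrouped tibble back:
dplyr::summarise(Elevation = mean(Elevation, na.rm = TRUE),
                 .groups = "drop")  # replaces the final summarise() call above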