Home > Mobile >  R: Data is in random minute intervals, need to average it to hourly
R: Data is in random minute intervals, need to average it to hourly

Time:01-17

I am super new to R and programming as a whole.

I have a data set with over 2 million rows that I have split into 3 separate CSV files.

Currently the CSVs look like this (I removed some rows for simplicity):

    Date        Time        Elevation
1   2011-01-01  0:00:00     3.532   
2   2011-01-01  0:15:00     3.538   
3   2011-01-01  0:30:00     3.541   
4   2011-01-01  0:45:00     3.546   
5   2011-01-01  1:00:00     3.548   
6   2011-01-01  1:15:00     3.555   
7   2011-01-01  1:30:00     3.561   
8   2011-01-01  1:45:00     3.566
9   2011-01-01  2:00:00     3.568

......

     Date        Time        Elevation
484985  2018-04-21  0:00:00     3.15264 
484986  2018-04-21  0:05:00     3.15164 
484987  2018-04-21  0:10:00     3.14964     
484988  2018-04-21  0:15:00     3.14563 
484989  2018-04-21  0:20:00     3.14263
484990  2018-04-21  0:25:00     3.13962 

The time intervals are not consistent with some data being measured at 15 minute intervals, and some at 5 minute intervals. I would like to get an output like this:

       Date        Time        Elevation
1      2011-01-01  0:00:00          
2      2011-01-01  1:00:00        3.5393
3      2011-01-01  2:00:00        3.5575
4      2011-01-01  3:00:00        3.5788

I tried this from an old post but it didn't work it only took the average of the entire day instead of by hour:

library(lubridate)
df[,2] <- ymd_hms(df[,2])

df %>% mutate(hour = hour(df[,2])) %>% 
  group_by(hour) %>% summarise_at(vars(Elevation), mean)

And I currently have this which splits the time into separate columns for hour, minute and second:

library(readr)
library(tidyverse)

read_csv <- read.csv('dataset1.csv') %>% 
  mutate(date_time = ymd_hms(paste0(Date, " ", Time)),)

df1 <- read_csv %>% 
  separate(Time, into = c("Hour", "Min", "sec"), sep = ":", remove = F

I have spent a long time trying to figure it out and have gone through similar posts here but so far nothing has worked. Any help would be appreciated!

CodePudding user response:

If you are fine with a a simple mean of the elevation then cutting the date/time columns in 15 minute intervals and then the group_by function can handle this:

df <- structure(list(Date = c("2011-01-01", "2011-01-01", "2011-01-01", 
                              "2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01", 
                              "2011-01-01", "2018-04-21", "2018-04-21", "2018-04-21", "2018-04-21", 
                              "2018-04-21", "2018-04-21"), Time = c("0:00:00", "0:15:00", "0:30:00", 
                                                                    "0:45:00", "1:00:00", "1:15:00", "1:30:00", "1:45:00", "2:00:00", 
                                                                    "0:00:00", "0:05:00", "0:10:00", "0:15:00", "0:20:00", "0:25:00"
                              ), Elevation = c(3.532, 3.538, 3.541, 3.546, 3.548, 3.555, 3.561, 
                                               3.566, 3.568, 3.15264, 3.15164, 3.14964, 3.14563, 3.14263, 3.13962
                              )), class = "data.frame", row.names = c(NA, -15L))

#create the datetime column
df$datetime <- paste(df$Date, df$Time)
df$datetime <- as.POSIXct(df$datetime)

#cut(df$datetime, "1 hour")

#group and take the mean
library(dplyr)
df %>% group_by(cut(df$datetime, "1 hour")) %>%
   summarize(Elevation = mean(Elevation))

At 2 million rows this will take some time but shouldn't be too long.

CodePudding user response:

We may use floor_date

library(dplyr)
library(lubridate)
library(tidyr)
read_csv %>%
   group_by(grp = floor_date(date_time, "hour")) %>%
   summarise(Elevation = mean(Elevation, na.rm = TRUE)) %>%
   separate(grp, into = c("Date", "Time"), sep = " ") 

CodePudding user response:

library(tibble)
library(dplyr)
library(lubridate)

tibble::tribble(
         ~Date,      ~Time, ~Elevation,
  "2011-01-01",  "0:00:00",  3.532,
  "2011-01-01",  "0:15:00",  3.538,
  "2011-01-01",  "0:30:00",  3.541,
  "2011-01-01",  "0:45:00",  3.546,
  "2011-01-01",  "1:00:00",  3.548,
  "2011-01-01",  "1:15:00",  3.555,
  "2011-01-01",  "1:30:00",  3.561,
  "2011-01-01",  "1:45:00",  3.566,
  "2011-01-01",  "2:00:00",  3.568
  ) %>% 
  dplyr::mutate(
    Date = lubridate::date(Date),
    Time = lubridate::hms(Time),
    Hour = lubridate::hour(Time)) %>% 
  dplyr::group_by(Date, Hour) %>% 
  dplyr::summarise(Elevation = mean(Elevation, na.rm = TRUE))

# A tibble: 3 × 3
# Groups:   Date [1]
  Date        Hour Elevation
  <date>     <dbl>     <dbl>
1 2011-01-01     0      3.54
2 2011-01-01     1      3.56
3 2011-01-01     2      3.57
  • Related