I am super new to R and programming as a whole.
I have a data set with over 2 million rows that I have split into 3 separate CSV files.
Currently the CSVs look like this (I removed some rows for simplicity):
Date Time Elevation
1 2011-01-01 0:00:00 3.532
2 2011-01-01 0:15:00 3.538
3 2011-01-01 0:30:00 3.541
4 2011-01-01 0:45:00 3.546
5 2011-01-01 1:00:00 3.548
6 2011-01-01 1:15:00 3.555
7 2011-01-01 1:30:00 3.561
8 2011-01-01 1:45:00 3.566
9 2011-01-01 2:00:00 3.568
......
Date Time Elevation
484985 2018-04-21 0:00:00 3.15264
484986 2018-04-21 0:05:00 3.15164
484987 2018-04-21 0:10:00 3.14964
484988 2018-04-21 0:15:00 3.14563
484989 2018-04-21 0:20:00 3.14263
484990 2018-04-21 0:25:00 3.13962
The time intervals are not consistent: some data were measured at 15-minute intervals and some at 5-minute intervals. I would like to get an output like this:
Date Time Elevation
1 2011-01-01 0:00:00
2 2011-01-01 1:00:00 3.5393
3 2011-01-01 2:00:00 3.5575
4 2011-01-01 3:00:00 3.5788
I tried this from an old post, but it didn't work; it only took the average of the entire day instead of by hour:
library(lubridate)
df[,2] <- ymd_hms(df[,2])
df %>%
  mutate(hour = hour(df[,2])) %>%
  group_by(hour) %>%
  summarise_at(vars(Elevation), mean)
And I currently have this, which splits the time into separate columns for hour, minute, and second:
library(readr)
library(tidyverse)
library(lubridate)
read_csv <- read.csv('dataset1.csv') %>%
  mutate(date_time = ymd_hms(paste0(Date, " ", Time)))
df1 <- read_csv %>%
  separate(Time, into = c("Hour", "Min", "Sec"), sep = ":", remove = FALSE)
I have spent a long time trying to figure it out and have gone through similar posts here but so far nothing has worked. Any help would be appreciated!
CodePudding user response:
If you are fine with a simple mean of the elevation, then cutting the date/time column into one-hour intervals and grouping with group_by() can handle this:
df <- structure(list(Date = c("2011-01-01", "2011-01-01", "2011-01-01",
"2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01",
"2011-01-01", "2018-04-21", "2018-04-21", "2018-04-21", "2018-04-21",
"2018-04-21", "2018-04-21"), Time = c("0:00:00", "0:15:00", "0:30:00",
"0:45:00", "1:00:00", "1:15:00", "1:30:00", "1:45:00", "2:00:00",
"0:00:00", "0:05:00", "0:10:00", "0:15:00", "0:20:00", "0:25:00"
), Elevation = c(3.532, 3.538, 3.541, 3.546, 3.548, 3.555, 3.561,
3.566, 3.568, 3.15264, 3.15164, 3.14964, 3.14563, 3.14263, 3.13962
)), class = "data.frame", row.names = c(NA, -15L))
# create the datetime column
df$datetime <- as.POSIXct(paste(df$Date, df$Time))

# group into one-hour bins and take the mean
library(dplyr)
df %>%
  group_by(hour = cut(datetime, "1 hour")) %>%
  summarize(Elevation = mean(Elevation))
At 2 million rows this will take some time but shouldn't be too long.
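To run this on the full data set, you can stack the three CSVs first; a minimal sketch, assuming the files are named dataset1.csv through dataset3.csv (placeholder names, adjust to yours) and share the Date/Time/Elevation layout:
library(dplyr)
# assumption: placeholder file names for the three CSVs
files <- c("dataset1.csv", "dataset2.csv", "dataset3.csv")
all_data <- files %>%
  lapply(read.csv) %>%  # read each file into a data frame
  bind_rows()           # stack into one ~2 million row data frame
# then build the datetime column and aggregate exactly as above
all_data$datetime <- as.POSIXct(paste(all_data$Date, all_data$Time))
all_data %>%
  group_by(hour = cut(datetime, "1 hour")) %>%
  summarize(Elevation = mean(Elevation))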
CodePudding user response:
We may use floor_date() from lubridate:
library(dplyr)
library(lubridate)
library(tidyr)
read_csv %>%
  group_by(grp = floor_date(date_time, "hour")) %>%
  summarise(Elevation = mean(Elevation, na.rm = TRUE)) %>%
  separate(grp, into = c("Date", "Time"), sep = " ")
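floor_date() truncates each timestamp to the start of its hour, so the 15-minute and 5-minute rows fall into the same hourly bins regardless of the sampling interval. For example:
library(lubridate)
floor_date(ymd_hms("2011-01-01 1:45:00"), "hour")
# [1] "2011-01-01 01:00:00 UTC"
floor_date(ymd_hms("2018-04-21 0:25:00"), "hour")
# [1] "2018-04-21 00:00:00 UTC"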
CodePudding user response:
library(tibble)
library(dplyr)
library(lubridate)
tibble::tribble(
  ~Date,        ~Time,     ~Elevation,
  "2011-01-01", "0:00:00", 3.532,
  "2011-01-01", "0:15:00", 3.538,
  "2011-01-01", "0:30:00", 3.541,
  "2011-01-01", "0:45:00", 3.546,
  "2011-01-01", "1:00:00", 3.548,
  "2011-01-01", "1:15:00", 3.555,
  "2011-01-01", "1:30:00", 3.561,
  "2011-01-01", "1:45:00", 3.566,
  "2011-01-01", "2:00:00", 3.568
) %>%
  dplyr::mutate(
    Date = lubridate::date(Date),
    Time = lubridate::hms(Time),
    Hour = lubridate::hour(Time)
  ) %>%
  dplyr::group_by(Date, Hour) %>%
  dplyr::summarise(Elevation = mean(Elevation, na.rm = TRUE))
# A tibble: 3 × 3
# Groups: Date [1]
Date Hour Elevation
<date> <dbl> <dbl>
1 2011-01-01 0 3.54
2 2011-01-01 1 3.56
3 2011-01-01 2 3.57
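Grouping by Date and Hour together is what fixes the problem in the original attempt, where grouping by hour alone pooled the same clock hour across all days. Note also that summarise() drops only the innermost grouping level, which is why the result above is still grouped by Date; a small sketch, assuming dplyr >= 1.0.0, if you want an ungrouped tibble back:
dplyr::summarise(Elevation = mean(Elevation, na.rm = TRUE),
                 .groups = "drop")  # replaces the final summarise() call above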