I am trying to provide some summary information on a dataframe I have in R. I want to know the Start Date/Time
, End Date/Time
, and Total Days values were recorded
. Here is an examples of the dataframe
df = structure(list(Date_Time_GMT_3 = structure(c(1594233000, 1594533900, 1597235700,
1595234800, 1594336600, 1595237500),
class = c("POSIXct", "POSIXt"), tzone = "EST"),
`20874285_33MR` = c(14.996, 15.091, 15.187, 15.282, 15.378, 15.378),
`20874290_103MR` = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
`20874287_102MR` = c(NA_real_, 15.091, 15.187, 15.282, NA_real_, NA_real_),
`20874299_54MR` = c(NA_real_, 15.378, 15.378, NA_real_, NA_real_, NA_real_),
`20874316_AIR_90MR` = c(NA_real_, NA_real_, NA_real_,15.091, 15.187, 15.282)),
row.names = c(NA, 6L), class = "data.frame")
I have this code which tells me the total days each column had values (not including the N/A's)
library(dplyr)
df %>%
group_by(date = as.Date(Date_Time_GMT_3)) %>%
summarise(across(everything(), ~any(!is.na(.)))) %>%
summarise(across(-date, sum))
and I have this code to find the Start
and End
date/times for each column (where the NA's don't matter, therefore it's the start Date from the first record to the End date of the last record).
df_MetadataStart = df %>%
pivot_longer(-c(Date_Time_GMT_3)) %>%
select(name, Date_Time_GMT_3) %>%
group_by(name, col = rep(c('StartTime', 'EndTime'), length.out = n())) %>%
mutate(id = row_number()) %>%
tidyr::pivot_wider(names_from = col, values_from = Date_Time_GMT_3) %>%
ungroup() %>%
select(-id)
But there are 2 issues with these codes. First, the code I have for the Start/End Dates/Times
ends up having multiple inputs for 1 column. Second, with these 2 seperate codes, I am not getting the results I want. Ultimately, I want the final dataframe too look like so
Name Start Date End Date Total Days
<chr> <Pos> <Pos> <int>
CodePudding user response:
Using a helper function
myrleid <- function(x) {
r <- rle(x)
rep(seq_along(r$lengths), times = r$lengths)
}
We can group by name
and then by runs of non-NA
values to produce this:
library(dplyr)
library(tidyr) # pivot_longer
df %>%
pivot_longer(-Date_Time_GMT_3) %>%
arrange(Date_Time_GMT_3) %>%
group_by(name) %>%
mutate(grp = myrleid(is.na(value))) %>%
group_by(name, grp) %>%
summarize(
Start = min(Date_Time_GMT_3),
End = max(Date_Time_GMT_3),
TotalDays = as.numeric(max(Date_Time_GMT_3) - min(Date_Time_GMT_3), units = "days"),
value1 = value[1]) %>%
ungroup() %>%
filter(!is.na(value1)) %>%
select(-grp, -value1)
# # A tibble: 7 x 4
# name Start End TotalDays
# <chr> <dttm> <dttm> <dbl>
# 1 20874285_33MR 2020-07-08 13:30:00 2020-08-12 07:35:00 34.8
# 2 20874287_102MR 2020-07-12 01:05:00 2020-07-20 03:46:40 8.11
# 3 20874287_102MR 2020-08-12 07:35:00 2020-08-12 07:35:00 0
# 4 20874299_54MR 2020-07-12 01:05:00 2020-07-12 01:05:00 0
# 5 20874299_54MR 2020-08-12 07:35:00 2020-08-12 07:35:00 0
# 6 20874316_AIR_90MR 2020-07-09 18:16:40 2020-07-09 18:16:40 0
# 7 20874316_AIR_90MR 2020-07-20 03:46:40 2020-07-20 04:31:40 0.0312
FYI: the two rows for 102MR
and others are caused by the timestamps being out of order: I'm inferring they should be ordered before grouping by non-NA
-clusters, and arranging by the timestamp produces two clusters within those name
groups.