Home > other >  How to find start and end dates, and total days recorded for each column in a dataframe in R?
How to find start and end dates, and total days recorded for each column in a dataframe in R?

Time:04-07

I am trying to provide some summary information on a dataframe I have in R. I want to know the Start Date/Time, End Date/Time, and Total Days values were recorded. Here is an examples of the dataframe

df = structure(list(Date_Time_GMT_3 = structure(c(1594233000, 1594533900, 1597235700,
                                                  1595234800, 1594336600, 1595237500), 
                                                class = c("POSIXct",  "POSIXt"), tzone = "EST"),
                    `20874285_33MR` = c(14.996, 15.091, 15.187, 15.282, 15.378, 15.378), 
                    `20874290_103MR` = c(NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), 
                    `20874287_102MR` = c(NA_real_, 15.091, 15.187, 15.282, NA_real_, NA_real_), 
                    `20874299_54MR` = c(NA_real_, 15.378, 15.378, NA_real_, NA_real_, NA_real_), 
                    `20874316_AIR_90MR` = c(NA_real_,  NA_real_, NA_real_,15.091, 15.187, 15.282)), 
               row.names = c(NA, 6L), class = "data.frame")

I have this code which tells me the total days each column had values (not including the N/A's)

library(dplyr)

df %>%
  group_by(date = as.Date(Date_Time_GMT_3)) %>%
  summarise(across(everything(), ~any(!is.na(.)))) %>%
  summarise(across(-date, sum))

and I have this code to find the Start and End date/times for each column (where the NA's don't matter, therefore it's the start Date from the first record to the End date of the last record).

df_MetadataStart = df %>%
  pivot_longer(-c(Date_Time_GMT_3)) %>%
  select(name, Date_Time_GMT_3) %>%
  group_by(name, col = rep(c('StartTime', 'EndTime'), length.out = n())) %>%
  mutate(id = row_number()) %>%
  tidyr::pivot_wider(names_from = col, values_from = Date_Time_GMT_3) %>%
  ungroup() %>%
  select(-id)

But there are 2 issues with these codes. First, the code I have for the Start/End Dates/Times ends up having multiple inputs for 1 column. Second, with these 2 seperate codes, I am not getting the results I want. Ultimately, I want the final dataframe too look like so

Name       Start Date   End Date Total Days 
<chr>         <Pos>      <Pos>     <int>    

CodePudding user response:

Using a helper function

myrleid <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

We can group by name and then by runs of non-NA values to produce this:

library(dplyr)
library(tidyr) # pivot_longer
df %>%
  pivot_longer(-Date_Time_GMT_3) %>%
  arrange(Date_Time_GMT_3) %>%
  group_by(name) %>%
  mutate(grp = myrleid(is.na(value))) %>%
  group_by(name, grp) %>%
  summarize(
    Start = min(Date_Time_GMT_3),
    End = max(Date_Time_GMT_3),
    TotalDays = as.numeric(max(Date_Time_GMT_3) - min(Date_Time_GMT_3), units = "days"),
    value1 = value[1]) %>%
  ungroup() %>%
  filter(!is.na(value1)) %>%
  select(-grp, -value1)
# # A tibble: 7 x 4
#   name              Start               End                 TotalDays
#   <chr>             <dttm>              <dttm>                  <dbl>
# 1 20874285_33MR     2020-07-08 13:30:00 2020-08-12 07:35:00   34.8   
# 2 20874287_102MR    2020-07-12 01:05:00 2020-07-20 03:46:40    8.11  
# 3 20874287_102MR    2020-08-12 07:35:00 2020-08-12 07:35:00    0     
# 4 20874299_54MR     2020-07-12 01:05:00 2020-07-12 01:05:00    0     
# 5 20874299_54MR     2020-08-12 07:35:00 2020-08-12 07:35:00    0     
# 6 20874316_AIR_90MR 2020-07-09 18:16:40 2020-07-09 18:16:40    0     
# 7 20874316_AIR_90MR 2020-07-20 03:46:40 2020-07-20 04:31:40    0.0312

FYI: the two rows for 102MR and others are caused by the timestamps being out of order: I'm inferring they should be ordered before grouping by non-NA-clusters, and arranging by the timestamp produces two clusters within those name groups.

  • Related