I have a dataset with a series of dates listed for each variable. I want to calculate the average number of days for each variable. Ideally, I would get a numeric value that would have the average.
For example, VarA would be 1 because there is an average of one day between each value. Each variable has the dates in order (ex. VarA has the most recent date last).
How would I do this in R?
Example of my dataset:
Variable | Date1 | Date2 | Date3 | Date 4 |
---|---|---|---|---|
VarA | 09/01/21 | 09/02/21 | 09/03/21 | 09/04/21 |
VarB | 08/01/21 | 08/17/21 | 09/02/21 | |
VarC | 09/25/21 |
CodePudding user response:
I have a package on github which makes it easy to apply functions on pairs of variables.
In this case we can use dplyover::across2
to get the difference between Date2-Date1
, Date3-Date2
and Date4-Date3
. Then all we need is rowMeans
in which we wrap the call to across2
.
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover/
# approach
dat %>%
mutate(diff = rowMeans(
across2(Date1:Date3, # .x
Date2:Date4, # .y
~ as.numeric(.y - .x)),
na.rm = TRUE)
)
# result
#> # A tibble: 3 x 6
#> Variable Date1 Date2 Date3 Date4 diff
#> <chr> <date> <date> <date> <date> <dbl>
#> 1 VarA 2021-09-01 2021-09-02 2021-09-03 2021-09-04 1
#> 2 VarB 2021-08-01 2021-08-17 2021-09-02 NA 16
#> 3 VarC 2021-09-25 NA NA NA NaN
# the preprocessed data
dat <- read.table(header = TRUE, text =
"Variable Date1 Date2 Date3 Date4
VarA '09/01/21' '09/02/21' '09/03/21' '09/04/21'
VarB '08/01/21' '08/17/21' '09/02/21' ''
VarC '09/25/21' '' '' '' ")
dat <- dat %>% as_tibble %>%
mutate(across(starts_with("Date"),
~ lubridate::mdy(.x)))
Created on 2021-09-25 by the reprex package (v0.3.0)
Alternatively, we can stick with only {dplyr} by subtracting two across
calls from each other and wrapping them in rowMeans
.
library(dplyr)
# approach
dat %>%
mutate(diff = rowMeans(
across(Date2:Date4, as.numeric) - across(Date1:Date3, as.numeric),
na.rm = TRUE))
# result
#> # A tibble: 3 x 6
#> Variable Date1 Date2 Date3 Date4 diff
#> <chr> <date> <date> <date> <date> <dbl>
#> 1 VarA 2021-09-01 2021-09-02 2021-09-03 2021-09-04 1
#> 2 VarB 2021-08-01 2021-08-17 2021-09-02 NA 16
#> 3 VarC 2021-09-25 NA NA NA NaN
Created on 2021-09-25 by the reprex package (v0.3.0)
CodePudding user response:
You can take rowwise differences of 'Date' columns and take the average.
library(dplyr)
df %>%
mutate(across(starts_with('Date'), as.Date, '%m/%d/%y')) %>%
rowwise() %>%
mutate(diff = as.numeric(mean(diff(c_across(starts_with('Date'))), na.rm = TRUE)))
# Variable Date1 Date2 Date3 Date4 diff
# <chr> <date> <date> <date> <date> <dbl>
#1 VarA 2021-09-01 2021-09-02 2021-09-03 2021-09-04 1
#2 VarB 2021-08-01 2021-08-17 2021-09-02 NA 16
#3 VarC 2021-09-25 NA NA NA NaN
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Variable = c("VarA", "VarB", "VarC"), Date1 = c("09/01/21",
"08/01/21", "09/25/21"), Date2 = c("09/02/21", "08/17/21", ""
), Date3 = c("09/03/21", "09/02/21", ""), Date4 = c("09/04/21",
"", "")), class = "data.frame", row.names = c(NA, -3L))