I have a dataframe with two columns containing dates. I'd like to filter each sample id /variable/measurement_date combination based on the min difference greater than zero between measurement_date and date2. I've tried grouping the dataframe by sample id, variable and measurement_date, then filtering it by replacing all date differences less then zero with NA. Here is my code:
start_date <- as.Date('2021-01-01')
end_date <- as.Date('2021-12-30')
set.seed(1)
cheese <- data.frame(sample_id = c(1,1,1,1,1,1,1,1,1,1,1,1),
variable = c('a','a','a','a','a','a',
'b','b','b','b','b','b'),
value = runif(n = 12, min = 1, max = 10),
measurement_date = c('2021-06-01','2021-06-01','2021-06-01',
'2021-09-21','2021-09-21','2021-09-21',
'2021-06-01','2021-06-01','2021-06-01',
'2021-09-21','2021-09-21','2021-09-21'),
date2 = as.Date(sample( as.numeric(start_date): as.numeric(end_date), 12,
replace = T),
origin = '1970-01-01'))
library(dplyr)
filtered <- cheese %>% group_by(sample_id, variable, measurement_date)
filtered %>%
filter(difftime(measurement_date,date2) ==
min(replace(difftime(measurement_date, date2))>0,NA),na.rm=TRUE)
This is the sample dataframe:
sample_id variable value measurement_date date2
1 a 3.389578 2021-06-01 2021-03-26
1 a 4.349115 2021-06-01 2021-10-04
1 a 6.155680 2021-06-01 2021-12-28
1 a 9.173870 2021-09-21 2021-11-26
1 a 2.815137 2021-09-21 2021-09-20
1 a 9.085507 2021-09-21 2021-11-25
1 b 9.502077 2021-06-01 2021-03-20
1 b 6.947180 2021-06-01 2021-08-01
1 b 6.662026 2021-06-01 2021-02-06
1 b 1.556076 2021-09-21 2021-04-15
1 b 2.853771 2021-09-21 2021-08-05
1 b 2.589011 2021-09-21 2021-06-14
In this example, there are four sample_id / variable / measurement date combinations:
sample_id, variable, measurement_date
1,a,2021-06-01
1,b,2021-06-01
1,a,2021-09,21
1,b,2021-09-21
I'd like to group the dataframe by these combinations, then look at the difference between the measurement_date and date2, finding the minimum difference between the dates that is positive. Thus the filtered dataframe should look like this:
sample_id variable value measurement_date date2
1 a 3.389578 2021-06-01 2021-03-26
1 b 9.502077 2021-06-01 2021-03-20
1 a 2.815137 2021-09-21 2021-09-20
1 b 2.853771 2021-09-21 2021-08-05
but my code just returns an error..
> Error: Problem with `filter()` input `..1`.
i Input `..1` is `==...`.
x argument "values" is missing, with no default
EDIT
So I figured out how to get the output I was expecting, but had to create a new column and then filter based on that. I'd still like to know how to do this without creating the new column.
cheese$diff <- as.Date(cheese$measurement_date) - cheese$date2
filtered <- cheese %>% group_by(sample_id, variable, measurement_date) %>%
filter(diff == min(diff[diff>0]))
CodePudding user response:
We can replace the values that are less than or equal to 0 with an NA
, then use which.min
inside of slice
so that we don't have to create a new column.
library(tidyverse)
cheese %>%
mutate(measurement_date = as.Date(measurement_date)) %>%
group_by(sample_id, variable, measurement_date) %>%
slice(which.min((measurement_date - date2)*NA^((measurement_date - date2) <=0)))
Output
sample_id variable value measurement_date date2
<dbl> <chr> <dbl> <date> <date>
1 1 a 3.39 2021-06-01 2021-03-26
2 1 b 9.50 2021-06-01 2021-03-20
3 1 b 2.85 2021-08-22 2021-08-05
You could also directly use replace
instead of the shorter notation.
cheese %>%
mutate(measurement_date = as.Date(measurement_date)) %>%
group_by(sample_id, variable, measurement_date) %>%
slice(which.min(replace((measurement_date - date2), (measurement_date - date2)<=0, NA)))
Data
cheese <- structure(list(sample_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1), variable = c("a", "a", "a", "a", "a", "a", "b", "b", "b",
"b", "b", "b"), value = c(3.3895779682789, 4.34911509673111,
6.15568027016707, 9.17387010995299, 2.8151373793371, 9.08550716470927,
9.50207741744816, 6.94718013238162, 6.66202639508992, 1.55607643420808,
2.85377117409371, 2.58901077276096), measurement_date = c("2021-06-01",
"2021-06-01", "2021-06-01", "2021-08-22", "2021-08-22", "2021-08-22",
"2021-06-01", "2021-06-01", "2021-06-01", "2021-08-22", "2021-08-22",
"2021-08-22"), date2 = structure(c(18712, 18904, 18989, 18957,
18890, 18956, 18706, 18840, 18664, 18732, 18844, 18792), class = "Date")), class = "data.frame", row.names = c(NA,
-12L))