Home > Mobile >  Filter based on minimum date differences greater than zero dplyr
Filter based on minimum date differences greater than zero dplyr

Time:03-23

I have a dataframe with two columns containing dates. I'd like to filter each sample id /variable/measurement_date combination based on the min difference greater than zero between measurement_date and date2. I've tried grouping the dataframe by sample id, variable and measurement_date, then filtering it by replacing all date differences less then zero with NA. Here is my code:

start_date <- as.Date('2021-01-01')  
end_date <- as.Date('2021-12-30') 
set.seed(1)

cheese <- data.frame(sample_id = c(1,1,1,1,1,1,1,1,1,1,1,1),
                     variable = c('a','a','a','a','a','a',
                                  'b','b','b','b','b','b'),
                     value = runif(n = 12, min = 1, max = 10),
                     measurement_date = c('2021-06-01','2021-06-01','2021-06-01',
                                                  '2021-09-21','2021-09-21','2021-09-21',
                                                  '2021-06-01','2021-06-01','2021-06-01',
                                                  '2021-09-21','2021-09-21','2021-09-21'),
                     date2 = as.Date(sample( as.numeric(start_date): as.numeric(end_date), 12, 
                                             replace = T), 
                                             origin = '1970-01-01'))

library(dplyr)
filtered <- cheese %>% group_by(sample_id, variable, measurement_date)
filtered %>%
  filter(difftime(measurement_date,date2) == 
           min(replace(difftime(measurement_date, date2))>0,NA),na.rm=TRUE)

This is the sample dataframe:

sample_id   variable    value       measurement_date    date2
1           a           3.389578    2021-06-01          2021-03-26
1           a           4.349115    2021-06-01          2021-10-04
1           a           6.155680    2021-06-01          2021-12-28
1           a           9.173870    2021-09-21          2021-11-26
1           a           2.815137    2021-09-21          2021-09-20
1           a           9.085507    2021-09-21          2021-11-25
1           b           9.502077    2021-06-01          2021-03-20
1           b           6.947180    2021-06-01          2021-08-01
1           b           6.662026    2021-06-01          2021-02-06
1           b           1.556076    2021-09-21          2021-04-15
1           b           2.853771    2021-09-21          2021-08-05
1           b           2.589011    2021-09-21          2021-06-14

In this example, there are four sample_id / variable / measurement date combinations:

sample_id, variable, measurement_date
1,a,2021-06-01
1,b,2021-06-01
1,a,2021-09,21
1,b,2021-09-21

I'd like to group the dataframe by these combinations, then look at the difference between the measurement_date and date2, finding the minimum difference between the dates that is positive. Thus the filtered dataframe should look like this:

sample_id   variable    value       measurement_date    date2
1           a           3.389578    2021-06-01          2021-03-26
1           b           9.502077    2021-06-01          2021-03-20
1           a           2.815137    2021-09-21          2021-09-20
1           b           2.853771    2021-09-21          2021-08-05

but my code just returns an error..

> Error: Problem with `filter()` input `..1`.
i Input `..1` is `==...`.
x argument "values" is missing, with no default

EDIT

So I figured out how to get the output I was expecting, but had to create a new column and then filter based on that. I'd still like to know how to do this without creating the new column.

cheese$diff <- as.Date(cheese$measurement_date) - cheese$date2
filtered <- cheese %>% group_by(sample_id, variable, measurement_date) %>% 
    filter(diff == min(diff[diff>0]))

CodePudding user response:

We can replace the values that are less than or equal to 0 with an NA, then use which.min inside of slice so that we don't have to create a new column.

library(tidyverse)

cheese %>%
  mutate(measurement_date = as.Date(measurement_date)) %>%
  group_by(sample_id, variable, measurement_date) %>%
  slice(which.min((measurement_date - date2)*NA^((measurement_date - date2) <=0)))

Output

  sample_id variable value measurement_date date2     
      <dbl> <chr>    <dbl> <date>           <date>    
1         1 a         3.39 2021-06-01       2021-03-26
2         1 b         9.50 2021-06-01       2021-03-20
3         1 b         2.85 2021-08-22       2021-08-05

You could also directly use replace instead of the shorter notation.

cheese %>%
  mutate(measurement_date = as.Date(measurement_date)) %>%
  group_by(sample_id, variable, measurement_date) %>%
  slice(which.min(replace((measurement_date - date2), (measurement_date - date2)<=0, NA)))

Data

cheese <- structure(list(sample_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1), variable = c("a", "a", "a", "a", "a", "a", "b", "b", "b", 
"b", "b", "b"), value = c(3.3895779682789, 4.34911509673111, 
6.15568027016707, 9.17387010995299, 2.8151373793371, 9.08550716470927, 
9.50207741744816, 6.94718013238162, 6.66202639508992, 1.55607643420808, 
2.85377117409371, 2.58901077276096), measurement_date = c("2021-06-01", 
"2021-06-01", "2021-06-01", "2021-08-22", "2021-08-22", "2021-08-22", 
"2021-06-01", "2021-06-01", "2021-06-01", "2021-08-22", "2021-08-22", 
"2021-08-22"), date2 = structure(c(18712, 18904, 18989, 18957, 
18890, 18956, 18706, 18840, 18664, 18732, 18844, 18792), class = "Date")), class = "data.frame", row.names = c(NA, 
-12L))
  • Related