I have a very large dataframe with > 300 date columns (and >100,000 rows). Each row also contains an 'index date'. For each row, I want to identify whether any of the subsequent date columns contain a date which is within 6 months of the index date. I then want an output of a separate column which tells me WHICH of the date columns for that row contained the date which was within 6 months of the index date. I'm afraid I haven't really tried anything meaningful as am not sure where to start but I've provided a simplified example of what I am trying to achieve below. I would be very grateful for any ideas please!
Example dataframe
library(dplyr)
example <- data.frame(
'index_date' = as.Date(c('2006/01/01', '2007/01/01', '2008/01/01', '2009/01/01', '2010/01/01')),
'date_1' = as.Date(c('2006/02/01', '2004/01/01', '2004/01/01', '2004/01/01', '2004/01/01')),
'date_2' = as.Date(c('2011/02/01', '2006/12/01', '2005/01/01', '2005/01/01', '2005/01/01')),
'date_3' = as.Date(c('2012/02/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/03/01')),
'date_4' = as.Date(c('2013/03/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/06/01'))) %>%
# now create the `date_range` columns for each row. `start_range` is 6 months prior to the index date (182.6 days) and `end_range` is 6 months (182.6 days) after the index date
mutate(start_range = (index_date - 182.6),
end_range = (index_date 182.6))
This produces the following dataframe
index_date date_1 date_2 date_3 date_4 start_range end_range
1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2006-03-01 2005-07-02 2006-07-02
2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01 2006-07-02 2007-07-02
3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01 2007-07-02 2008-07-01
4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01 2008-07-02 2009-07-02
5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01 2009-07-02 2010-07-02
Now what I would like to do is create a further column called matching_date
which, for each row, tells me which of the date columns are within the date range. Note that row 5 contains two potential dates within range (date_3
and date_4
), so ideally I also need to incorporate a function which identifies the date which is closest to the index date (in this case it would be date_3
).
Desired output
index_date date_1 date_2 date_3 date_4 start_range end_range matching_date
1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2006-03-01 2005-07-02 2006-07-02 date_1
2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01 2006-07-02 2007-07-02 date_2
3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01 2007-07-02 2008-07-01 NA
4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01 2008-07-02 2009-07-02 NA
5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01 2009-07-02 2010-07-02 date_3
CodePudding user response:
An approach using dplyr
.
- Find the rows that fall into the range.
- Get the minimum distance of dates to midpoint of the ranges (
index_date
). - Match the minimum distance between
Min
andindex_date
of the matching rangeIs
with the associated colname.
library(dplyr)
example %>%
rowwise() %>%
mutate(Is = any(across(date_1:date_4, ~ .x >= start_range & .x <= end_range)),
Min = which.min(across(date_1:date_4, ~ abs(.x - index_date))),
matching_date = if_else(Is,
unlist(list(colnames(across(date_1:date_4))))[Min], NA_character_),
is = NULL, Min = NULL) %>%
ungroup() %>%
print(Inf)
# A tibble: 5 × 8
index_date date_1 date_2 date_3 date_4 start_range end_range
<date> <date> <date> <date> <date> <date> <date>
1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2013-03-01 2005-07-02 2006-07-02
2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01 2006-07-02 2007-07-02
3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01 2007-07-02 2008-07-01
4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01 2008-07-02 2009-07-02
5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01 2009-07-02 2010-07-02
matching_date
<chr>
1 date_1
2 date_2
3 NA
4 NA
5 date_3
Data
example <- structure(list(index_date = structure(c(13149, 13514, 13879,
14245, 14610), class = "Date"), date_1 = structure(c(13180, 12418,
12418, 12418, 12418), class = "Date"), date_2 = structure(c(15006,
13483, 12784, 12784, 12784), class = "Date"), date_3 = structure(c(15371,
15675, 15340, 15340, 14669), class = "Date"), date_4 = structure(c(15765,
15675, 15340, 15340, 14761), class = "Date"), start_range = structure(c(12966.4,
13331.4, 13696.4, 14062.4, 14427.4), class = "Date"), end_range = structure(c(13331.6,
13696.6, 14061.6, 14427.6, 14792.6), class = "Date")), class = "data.frame", row.names = c(NA,
-5L))
CodePudding user response:
Assuming that your records are uniquely identified by index_date
, you may go with this:
library(tidyverse)
example <- data.frame(
'index_date' = as.Date(c('2006/01/01', '2007/01/01', '2008/01/01', '2009/01/01', '2010/01/01')),
'date_1' = as.Date(c('2006/02/01', '2004/01/01', '2004/01/01', '2004/01/01', '2004/01/01')),
'date_2' = as.Date(c('2011/02/01', '2006/12/01', '2005/01/01', '2005/01/01', '2005/01/01')),
'date_3' = as.Date(c('2012/02/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/03/01')),
'date_4' = as.Date(c('2013/03/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/06/01'))) %>%
# now create the `date_range` columns for each row. `start_range` is 6 months prior to the index date (182.6 days) and `end_range` is 6 months (182.6 days) after the index date
mutate(start_range = (index_date - 182.6),
end_range = (index_date 182.6))
example_long <-
example |>
pivot_longer(
cols=starts_with("date"),
names_to="vars",
values_to="dates") |>
mutate(diff = dates - index_date) |>
rowwise() |>
mutate(matching_date = ifelse(between(dates, start_range, end_range), vars, NA)) |>
filter(!is.na(matching_date)) |>
group_by(index_date) |>
mutate(tie = which(diff == min(diff))) |>
filter(tie == row_number()) |>
select(index_date, matching_date)
left_join(example, example_long)
#> Joining, by = "index_date"
#> index_date date_1 date_2 date_3 date_4 start_range end_range
#> 1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2013-03-01 2005-07-02 2006-07-02
#> 2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01 2006-07-02 2007-07-02
#> 3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01 2007-07-02 2008-07-01
#> 4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01 2008-07-02 2009-07-02
#> 5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01 2009-07-02 2010-07-02
#> matching_date
#> 1 date_1
#> 2 date_2
#> 3 <NA>
#> 4 <NA>
#> 5 date_3
Created on 2023-01-05 with reprex v2.0.2