Merge 2 data frames using common date, plus 2 rows before and n-1 rows after

Time:06-28

I need to merge two data frames. The first data frame contains dates in YYYY-mm-dd format and event lengths:

datetime   length
2003-06-03      1
2003-06-07      1
2003-06-13      1
2003-06-17      3
2003-06-28      5
2003-07-10      1
2003-07-23      1
...

The second data frame contains dates in the same format and discharge data:

  datetime      q
2003-05-29   36.2
2003-05-30   34.6
2003-05-31   33.1
2003-06-01   30.7
2003-06-02   30.0
2003-06-03  153.0
2003-06-04   69.0
...

The second data frame is much larger. I want to merge/join only the following rows of the second data frame to the first:

  • all rows that have the same date as the first frame (I know this can be done with left_join(df1, df2, by = c("datetime")))
  • two rows before that row
  • n-1 rows after that row, where n = "length" value of row in first data frame.

I would also like to identify the rows belonging to the same event. Ideally I would have the following output (notice the event from 2003-06-17):

EventDatesNancy length q       event#
2003-06-03      1      153.0   1
2003-06-07      1      120.0   2
2003-06-13      1      45.3    3
2003-06-15      na     110.0   4
2003-06-16      na     53.1    4
2003-06-17      3      78.0    4
2003-06-18      na     167.0   4
2003-06-19      na     145.0   4
...

I hope this makes clear what I am trying to do.

CodePudding user response:

This might be one approach using tidyverse and fuzzyjoin.

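First, add an event number to each row of your first data.frame. Then add two columns for the start and end of each event's window: the start date is 2 days before the event date, and the end date is length - 1 days after it. Note that this treats "rows" as calendar days, so it assumes the second data.frame has one row per day with no gaps.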
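As a quick check of that window arithmetic, here is a small sketch using the 2003-06-17 event from the question (length 3). The variable names (event_date, event_length, window) are illustrative only and do not appear in the original code.

```r
# Illustrative values taken from the question's 2003-06-17 event
event_date   <- as.Date("2003-06-17")
event_length <- 3

start_date <- event_date - 2                 # two days before the event date
end_date   <- event_date + event_length - 1  # n - 1 days after the event date

# All dates covered by the window, inclusive of both ends
window <- seq(start_date, end_date, by = "day")
print(window)
```

The window runs from 2003-06-15 through 2003-06-19, which are the five rows the question lists for that event.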

Then, you can use fuzzy_inner_join to get the selected rows from the second data.frame. Here, you keep a row when its datetime in the second data.frame falls on or after the start date and on or before the end date of an event in the first data.frame.

library(tidyverse)
library(fuzzyjoin)

# Number the events and compute each event's date window
df1$event <- seq_len(nrow(df1))
df1$start_date <- df1$datetime - 2
df1$end_date <- df1$datetime + df1$length - 1

# Keep df2 rows where start_date <= datetime and end_date >= datetime
fuzzy_inner_join(
  df1,
  df2,
  by = c("start_date" = "datetime", "end_date" = "datetime"),
  match_fun = c(`<=`, `>=`)
) %>%
  select(datetime.y, length, q, event)

I tried this out with some made up data:

R> df1
     datetime length
1  2003-06-03      1
2  2003-06-12      1
3  2003-06-21      1
4  2003-06-30      3
5  2003-07-09      5
6  2003-07-18      1
7  2003-07-27      1
8  2003-08-05      2
9  2003-08-14      1
10 2003-08-23      1
11 2003-09-01      3

R> df2
     datetime  q
1  2003-06-03 44
2  2003-06-04 52
3  2003-06-05 34
4  2003-06-06 20
5  2003-06-07 57
6  2003-06-08 67
7  2003-06-09 63
8  2003-06-10 51
9  2003-06-11 56
10 2003-06-12 37
11 2003-06-13 16
12 2003-06-14 54
13 2003-06-15 46
14 2003-06-16  6
15 2003-06-17 32
16 2003-06-18 91
17 2003-06-19 61
18 2003-06-20 42
19 2003-06-21 28
20 2003-06-22 98
21 2003-06-23 77
22 2003-06-24 81
23 2003-06-25 13
24 2003-06-26 15
25 2003-06-27 73
26 2003-06-28 38
27 2003-06-29 27
28 2003-06-30 49
29 2003-07-01 10
30 2003-07-02 89
31 2003-07-03  9
32 2003-07-04 80
33 2003-07-05 68
34 2003-07-06 26
35 2003-07-07 31
36 2003-07-08 29
37 2003-07-09 84
38 2003-07-10 60
39 2003-07-11 19
40 2003-07-12 97
41 2003-07-13 35
42 2003-07-14 47
43 2003-07-15 70

This will give the following output:

   datetime.y length  q event
1  2003-06-03      1 44     1
2  2003-06-10      1 51     2
3  2003-06-11      1 56     2
4  2003-06-12      1 37     2
5  2003-06-19      1 61     3
6  2003-06-20      1 42     3
7  2003-06-21      1 28     3
8  2003-06-28      3 38     4
9  2003-06-29      3 27     4
10 2003-06-30      3 49     4
11 2003-07-01      3 10     4
12 2003-07-02      3 89     4
13 2003-07-07      5 31     5
14 2003-07-08      5 29     5
15 2003-07-09      5 84     5
16 2003-07-10      5 60     5
17 2003-07-11      5 19     5
18 2003-07-12      5 97     5
19 2003-07-13      5 35     5

If the desired output differs from the above, please let me know what should change so that I can correct it.
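One difference from the question's desired output is that length there appears only on the event date itself and is NA on the surrounding rows. A minimal sketch of that adjustment follows. It assumes the join result is kept before select(), so that datetime.x (the event date from df1) is still available. The joined data frame below is a hand-built stand-in for that result, with the event 4 rows copied from the output above.

```r
# Hand-built stand-in for the join result before select():
# event 4 (event date 2003-06-30, length 3), values copied from the output above
joined <- data.frame(
  datetime.x = rep(as.Date("2003-06-30"), 5),
  datetime.y = seq(as.Date("2003-06-28"), as.Date("2003-07-02"), by = "day"),
  length     = rep(3, 5),
  q          = c(38, 27, 49, 10, 89),
  event      = rep(4, 5)
)

# Blank out length on every row that is not the event date itself
joined$length[joined$datetime.y != joined$datetime.x] <- NA

# Keep the same columns as the original select()
out <- joined[, c("datetime.y", "length", "q", "event")]
print(out)
```

After this step only the 2003-06-30 row keeps length = 3.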


Data

df1 <- structure(list(datetime = structure(c(12206, 12215, 12224, 12233, 
12242, 12251, 12260, 12269, 12278, 12287, 12296), class = "Date"), 
    length = c(1, 1, 1, 3, 5, 1, 1, 2, 1, 1, 3), event = 1:11, 
    start_date = structure(c(12204, 12213, 12222, 12231, 12240, 
    12249, 12258, 12267, 12276, 12285, 12294), class = "Date"), 
    end_date = structure(c(12206, 12215, 12224, 12235, 12246, 
    12251, 12260, 12270, 12278, 12287, 12298), class = "Date")), row.names = c(NA, 
-11L), class = "data.frame")

df2 <- structure(list(datetime = structure(c(12206, 12207, 12208, 12209, 
12210, 12211, 12212, 12213, 12214, 12215, 12216, 12217, 12218, 
12219, 12220, 12221, 12222, 12223, 12224, 12225, 12226, 12227, 
12228, 12229, 12230, 12231, 12232, 12233, 12234, 12235, 12236, 
12237, 12238, 12239, 12240, 12241, 12242, 12243, 12244, 12245, 
12246, 12247, 12248), class = "Date"), q = c(44L, 52L, 34L, 20L, 
57L, 67L, 63L, 51L, 56L, 37L, 16L, 54L, 46L, 6L, 32L, 91L, 61L, 
42L, 28L, 98L, 77L, 81L, 13L, 15L, 73L, 38L, 27L, 49L, 10L, 89L, 
9L, 80L, 68L, 26L, 31L, 29L, 84L, 60L, 19L, 97L, 35L, 47L, 70L
)), class = "data.frame", row.names = c(NA, -43L))
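If fuzzyjoin is not available, the same selection can be sketched in base R. This is a different technique from the fuzzy join above: each event is expanded into one row per day of its window, and the result is merged with the discharge data on the date. The sketch uses a small subset of the made-up data so that it runs on its own, and the names windows and result are illustrative.

```r
# Small subset of the made-up data (events 2003-06-30 and 2003-07-09)
df1 <- data.frame(
  datetime = as.Date(c("2003-06-30", "2003-07-09")),
  length   = c(3, 5)
)
df2 <- data.frame(
  datetime = seq(as.Date("2003-06-28"), by = "day", length.out = 18),
  q = c(38, 27, 49, 10, 89, 9, 80, 68, 26, 31, 29, 84, 60, 19, 97, 35, 47, 70)
)

# Expand every event into one row per day of its window:
# from 2 days before the event date to length - 1 days after it
windows <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  data.frame(
    datetime = seq(df1$datetime[i] - 2,
                   df1$datetime[i] + df1$length[i] - 1,
                   by = "day"),
    length   = df1$length[i],
    event    = i
  )
}))

# Inner join on the date: keeps only window days that exist in df2
result <- merge(windows, df2, by = "datetime")
print(result)
```

As with the fuzzy join, a df2 row that falls inside two overlapping windows will appear once for each event.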