There's a hard task and a harder task here...
So let's say theres a dataset as follows:
DF1
ID source timestamp
001 D 3/17/2021 1:22:10 AM
002 A 1/02/2021 8:25:20 AM
002 A 5/17/2021 10:25:20 PM
003 B 5/11/2021 8:25:20 AM
003 D 5/17/2021 3:12:30 PM
004 A 5/17/2021 8:25:20 AM
004 B 5/17/2021 8:26:20 PM
004 C 5/17/2021 8:27:20 PM
Note: Timestamp is date and time in one var)
All I need to do, is have a new dataframe which is just the ID, SOURCE, AND TIMESTAMP
of the MOST RECENT ENTRY.
Final result should look like this:
ID source timestamp
001 D 3/17/2021 1:22:10 AM
002 A 5/17/2021 10:25:20 PM
003 D 5/17/2021 3:12:30 PM
004 C 5/17/2021 8:27:20 PM
Completely separately, I have this dataset:
DF2
ID event timestamp
001 y 3/17/2021 1:23:10 AM
002 y 3/12/2021 12:25:20 PM
003 y 5/12/2021 1:12:30 PM
004 y 5/17/2021 8:26:30 AM
I would like to join DF2 to DF1 based on the most recent ID. What I mean by this, DF1 has 2 entries for ID = 002
. I would like DF2 to be joined on ID but only for the timestamp that is closest without going past .
Final Result:
ID source df1.1timestamp event
001 D 3/17/2021 1:22:10 AM y
002 A 1/02/2021 8:25:20 AM y
002 A 5/17/2021 10:25:20 PM
003 B 5/11/2021 8:25:20 AM y
003 D 5/17/2021 3:12:30 PM
004 A 5/17/2021 8:25:20 AM
004 B 5/17/2021 8:26:20 PM y
004 C 5/17/2021 8:27:20 PM
As you can see here, the events in DF2 are joined to the most recent timestamp in DF1 . In the case of ID = 004
, the join occured on 5/17/2021 8:26:20 PM
because in DF2, the event occurred on 5/17/2021 8:26:30 AM
. 5/17/2021 8:26:20 PM
is the row in DF1 which was the event immediately prior to the event.
CodePudding user response:
This type of join isn't easily do-able in dplyr*, so here's a data.table solution.
Read in data. In the future please use dput
so that users can get the example data directly into their session.
library(data.table)
df1 <- fread('
ID , source , timestamp
001 , D , 3/17/2021 1:22:10 AM
002 , A , 1/02/2021 8:25:20 AM
002 , A , 5/17/2021 10:25:20 PM
003 , B , 5/11/2021 8:25:20 AM
003 , D , 5/17/2021 3:12:30 PM
004 , A , 5/17/2021 8:25:20 AM
004 , B , 5/17/2021 8:26:20 PM
004 , C , 5/17/2021 8:27:20 PM
')
df2 <- fread('
ID , event , timestamp
001 , y , 3/17/2021 1:23:10 AM
002 , y , 3/12/2021 12:25:20 PM
003 , y , 5/12/2021 1:12:30 PM
004 , y , 5/17/2021 8:26:30 AM
')
# not required here, but would be if not read in as data.table
setDT(df1)
setDT(df2)
Format date-times and set keys for data tables
df1[, timestamp := as.POSIXct(timestamp, format = '%m/%d/%Y %r')]
df2[, timestamp := as.POSIXct(timestamp, format = '%m/%d/%Y %r')]
setkey(df1, ID, timestamp)
setkey(df2, ID, timestamp)
Part 1
df1[, tail(.SD, 1), ID]
#> ID source timestamp
#> 1: 1 D 2021-03-17 01:22:10
#> 2: 2 A 2021-05-17 22:25:20
#> 3: 3 D 2021-05-17 15:12:30
#> 4: 4 C 2021-05-17 20:27:20
Part 2:
df1[df2, event := i.event, on = .(ID), mult = 'first']
df1
#> ID source timestamp event
#> 1: 1 D 2021-03-17 01:22:10 y
#> 2: 2 A 2021-01-02 08:25:20 y
#> 3: 2 A 2021-05-17 22:25:20 <NA>
#> 4: 3 B 2021-05-11 08:25:20 y
#> 5: 3 D 2021-05-17 15:12:30 <NA>
#> 6: 4 A 2021-05-17 08:25:20 y
#> 7: 4 B 2021-05-17 20:26:20 <NA>
#> 8: 4 C 2021-05-17 20:27:20 <NA>
Created on 2021-09-30 by the reprex package (v2.0.1)
* Although it looks like it will be soon: https://github.com/tidyverse/dplyr/pull/5910