Home > Net >  How to join based on a criteria using R/dplyr?
How to join based on a criteria using R/dplyr?

Time:10-02

There's a hard task and a harder task here...

So let's say theres a dataset as follows:

DF1

ID     source                    timestamp
001         D       3/17/2021   1:22:10 AM
002         A       1/02/2021   8:25:20 AM         
002         A       5/17/2021  10:25:20 PM
003         B       5/11/2021   8:25:20 AM
003         D       5/17/2021   3:12:30 PM
004         A       5/17/2021   8:25:20 AM
004         B       5/17/2021   8:26:20 PM
004         C       5/17/2021   8:27:20 PM

Note: Timestamp is date and time in one var)

All I need to do, is have a new dataframe which is just the ID, SOURCE, AND TIMESTAMP of the MOST RECENT ENTRY.

Final result should look like this:

ID     source                    timestamp
001         D       3/17/2021   1:22:10 AM
002         A       5/17/2021  10:25:20 PM
003         D       5/17/2021   3:12:30 PM
004         C       5/17/2021   8:27:20 PM

Completely separately, I have this dataset:

DF2

ID      event                    timestamp
001         y       3/17/2021   1:23:10 AM
002         y       3/12/2021  12:25:20 PM
003         y       5/12/2021   1:12:30 PM
004         y       5/17/2021   8:26:30 AM

I would like to join DF2 to DF1 based on the most recent ID. What I mean by this, DF1 has 2 entries for ID = 002. I would like DF2 to be joined on ID but only for the timestamp that is closest without going past .

Final Result:

ID     source               df1.1timestamp   event
001         D       3/17/2021   1:22:10 AM       y
002         A       1/02/2021   8:25:20 AM       y  
002         A       5/17/2021  10:25:20 PM
003         B       5/11/2021   8:25:20 AM       y
003         D       5/17/2021   3:12:30 PM
004         A       5/17/2021   8:25:20 AM
004         B       5/17/2021   8:26:20 PM       y
004         C       5/17/2021   8:27:20 PM

As you can see here, the events in DF2 are joined to the most recent timestamp in DF1 . In the case of ID = 004 , the join occured on 5/17/2021 8:26:20 PM because in DF2, the event occurred on 5/17/2021 8:26:30 AM. 5/17/2021 8:26:20 PM is the row in DF1 which was the event immediately prior to the event.

CodePudding user response:

This type of join isn't easily do-able in dplyr*, so here's a data.table solution.

Read in data. In the future please use dput so that users can get the example data directly into their session.

library(data.table)

df1 <- fread('
ID     , source    ,               timestamp
001    ,      D    ,   3/17/2021   1:22:10 AM
002    ,      A    ,   1/02/2021   8:25:20 AM         
002    ,      A    ,   5/17/2021  10:25:20 PM
003    ,      B    ,   5/11/2021   8:25:20 AM
003    ,      D    ,   5/17/2021   3:12:30 PM
004    ,      A    ,   5/17/2021   8:25:20 AM
004    ,      B    ,   5/17/2021   8:26:20 PM
004    ,      C    ,   5/17/2021   8:27:20 PM
')

df2 <- fread('
ID      , event ,                   timestamp
001     ,     y     ,  3/17/2021   1:23:10 AM
002     ,     y     ,  3/12/2021  12:25:20 PM
003     ,     y     ,  5/12/2021   1:12:30 PM
004     ,     y     ,  5/17/2021   8:26:30 AM
')

# not required here, but would be if not read in as data.table
setDT(df1)
setDT(df2)

Format date-times and set keys for data tables

df1[, timestamp := as.POSIXct(timestamp, format = '%m/%d/%Y %r')]
df2[, timestamp := as.POSIXct(timestamp, format = '%m/%d/%Y %r')]

setkey(df1, ID, timestamp)
setkey(df2, ID, timestamp)

Part 1


df1[, tail(.SD, 1), ID]
#>    ID source           timestamp
#> 1:  1      D 2021-03-17 01:22:10
#> 2:  2      A 2021-05-17 22:25:20
#> 3:  3      D 2021-05-17 15:12:30
#> 4:  4      C 2021-05-17 20:27:20

Part 2:

df1[df2, event := i.event, on = .(ID), mult = 'first']

df1
#>    ID source           timestamp event
#> 1:  1      D 2021-03-17 01:22:10     y
#> 2:  2      A 2021-01-02 08:25:20     y
#> 3:  2      A 2021-05-17 22:25:20  <NA>
#> 4:  3      B 2021-05-11 08:25:20     y
#> 5:  3      D 2021-05-17 15:12:30  <NA>
#> 6:  4      A 2021-05-17 08:25:20     y
#> 7:  4      B 2021-05-17 20:26:20  <NA>
#> 8:  4      C 2021-05-17 20:27:20  <NA>

Created on 2021-09-30 by the reprex package (v2.0.1)

* Although it looks like it will be soon: https://github.com/tidyverse/dplyr/pull/5910

  • Related