Sum up "points" between time points over two tables

I have two data sets. One is called "yourdf" and contains a start time called "Start" and some other columns. Let's assume the following data set for this.

yourdf <- data.frame (Start = c("2022-06-10 10:00:00", "2022-06-02 10:00:20", "2022-06-10 10:30:00", "2022-06-10 10:20:00", "2022-06-10 10:21:00"),
                    User = c("A", "A", "C", "A", "B"))

Then I have a second data set with time points and "credits" or "points".

mydf <- data.frame (Timepoint = c("2022-06-10 10:00:03", "2022-06-02 10:00:20", "2022-06-10 10:30:00", "2022-06-10 10:20:00", "2022-06-10 10:21:00",
                             "2022-06-01 10:10:10", "2022-06-11 11:00:10", "2022-06-10 10:00:00", "2022-06-10 13:00:00", "2022-04-12 10:00:00",
                             "2022-06-09 10:00:00", "2022-06-10 10:00:00", "2022-06-10 10:11:20", "2022-06-10 10:00:05", "2022-06-04 10:00:00", "2022-06-10 10:11:12"),
                    Points = c("2", "4", "1", "5", "12",
                             "3", "2", "3", "2", "1",
                             "7", "2", "1", "0", "2", "1"))

What I'm looking for: I want to sum up the credits of table mydf for specific time spans!

How do I have to "aggregate" the data? Let's work with the first time entry of table "yourdf", which is "2022-06-10 10:00:00". I want to add a period of 10 minutes to it, so I can add a column called "End".

This can be done with minutes() from the lubridate package:

library(lubridate)  # provides minutes()
yourdf["End"] <- as.POSIXct(yourdf$Start) + minutes(10)

Note: yourdf$Start has the class "character"
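
As a minimal sketch of the same step without lubridate: adding a number to a POSIXct value adds that many seconds, so the "End" column can also be built in base R. The time zone (UTC) is an assumption here, since the question does not state one.

```r
# Example data copied from the question.
yourdf <- data.frame(
  Start = c("2022-06-10 10:00:00", "2022-06-02 10:00:20", "2022-06-10 10:30:00",
            "2022-06-10 10:20:00", "2022-06-10 10:21:00"),
  User  = c("A", "A", "C", "A", "B")
)

# Assumption: the timestamps are interpreted as UTC.
# Start is character, so it has to be parsed before any arithmetic.
yourdf$Start <- as.POSIXct(yourdf$Start, tz = "UTC")

# Adding a number to a POSIXct adds seconds: 10 minutes = 10 * 60 seconds.
yourdf$End <- yourdf$Start + 10 * 60

print(yourdf)
```

Fixing the time zone explicitly avoids results that shift with the locale of the machine running the code.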

So we now have a new table yourdf with a "Start" and an "End" date.

Now I want to sum up all "points" of table "mydf" between the "Start" and the "End" date of table yourdf

Yourdf looks like this

                Start User                 End
1 2022-06-10 10:00:00    A 2022-06-10 10:10:00
2 2022-06-02 10:00:20    A 2022-06-02 10:10:20
3 2022-06-10 10:30:00    C 2022-06-10 10:40:00
4 2022-06-10 10:20:00    A 2022-06-10 10:30:00
5 2022-06-10 10:21:00    B 2022-06-10 10:31:00

Let's look at line 1. The start time is "2022-06-10 10:00:00" and the end time is "2022-06-10 10:10:00".

So I want to generate a table like this one

                Start                  End      Points
1 2022-06-10 10:00:00  2022-06-10 10:10:00          7
2 2022-06-02 10:00:20  2022-06-02 10:10:20
3 2022-06-10 10:30:00  2022-06-10 10:40:00
4 2022-06-10 10:20:00  2022-06-10 10:30:00
5 2022-06-10 10:21:00  2022-06-10 10:31:00

because just the entries

1  2022-06-10 10:00:03      2
8  2022-06-10 10:00:00      3
12 2022-06-10 10:00:00      2
14 2022-06-10 10:00:05      0

of table "mydf" are between the "Start" and the "End" point.
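
The selection rule described above can be sketched in base R as a check of the expected value for line 1. This is only an illustration under two assumptions that the question does not state: timestamps are parsed as UTC, and both ends of the window are inclusive. "Points" is stored as character in the example, so it is converted to numeric before summing.

```r
# Example data copied from the question.
yourdf <- data.frame(
  Start = c("2022-06-10 10:00:00", "2022-06-02 10:00:20", "2022-06-10 10:30:00",
            "2022-06-10 10:20:00", "2022-06-10 10:21:00"),
  User  = c("A", "A", "C", "A", "B")
)
mydf <- data.frame(
  Timepoint = c("2022-06-10 10:00:03", "2022-06-02 10:00:20", "2022-06-10 10:30:00",
                "2022-06-10 10:20:00", "2022-06-10 10:21:00", "2022-06-01 10:10:10",
                "2022-06-11 11:00:10", "2022-06-10 10:00:00", "2022-06-10 13:00:00",
                "2022-04-12 10:00:00", "2022-06-09 10:00:00", "2022-06-10 10:00:00",
                "2022-06-10 10:11:20", "2022-06-10 10:00:05", "2022-06-04 10:00:00",
                "2022-06-10 10:11:12"),
  Points = c("2", "4", "1", "5", "12", "3", "2", "3",
             "2", "1", "7", "2", "1", "0", "2", "1")
)

# Assumption: all timestamps share one time zone; UTC is used here.
starts <- as.POSIXct(yourdf$Start, tz = "UTC")
ends   <- starts + 10 * 60                 # 10 minutes, expressed in seconds
tp     <- as.POSIXct(mydf$Timepoint, tz = "UTC")
pts    <- as.numeric(mydf$Points)          # character -> numeric

# Assumption: Start and End are both inclusive.
# For each row of yourdf, sum the points whose Timepoint falls in the window.
yourdf$End    <- ends
yourdf$Points <- vapply(
  seq_along(starts),
  function(i) sum(pts[tp >= starts[i] & tp <= ends[i]]),
  numeric(1)
)

print(yourdf[1, c("Start", "End", "Points")])  # Points is 7 for line 1
```

This loop compares every window against every time point, which is easy to read but scales with the product of the two table sizes. For much larger tables a join-based approach is likely the better fit.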

How can I achieve this in R for large tables with thousands of entries?

Thanks a lot for your help!

CodePudding user response:

a data.table approach

library(data.table)
# set to data.table format
setDT(yourdf); setDT(mydf)
# set to posix timestamp
yourdf[, Start := as.POSIXct(Start)]
mydf[, Timepoint := as.POSIXct(Timepoint)]
# create a unique rowwise key for yourdf
yourdf[, id := .I][]
setkey(yourdf, id)
# loop over each row/key in yourdf (by = .EACHI),
# subset mydf based on Start + 10 minutes, and sum Points
yourdf[yourdf, .(Start, 
                 End = Start + 600,
                 Points = mydf[Timepoint