I have two data sets. One is called "yourdf" and contains a start time called "Start" and some other columns. Let's assume the following data set for this.
yourdf <- data.frame (Start = c("2022-06-10 10:00:00", "2022-06-02 10:00:20", "2022-06-10 10:30:00", "2022-06-10 10:20:00", "2022-06-10 10:21:00"),
User = c("A", "A", "C", "A", "B"))
Then I have a second data set with time points and "credits" or "points".
mydf <- data.frame (Timepoint = c("2022-06-10 10:00:03", "2022-06-02 10:00:20", "2022-06-10 10:30:00", "2022-06-10 10:20:00", "2022-06-10 10:21:00",
"2022-06-01 10:10:10", "2022-06-11 11:00:10", "2022-06-10 10:00:00", "2022-06-10 13:00:00", "2022-04-12 10:00:00",
"2022-06-09 10:00:00", "2022-06-10 10:00:00", "2022-06-10 10:11:20", "2022-06-10 10:00:05", "2022-06-04 10:00:00", "2022-06-10 10:11:12"),
Points = c("2", "4", "1", "5", "12",
"3", "2", "3", "2", "1",
"7", "2", "1", "0", "2", "1"))
What I'm looking for: I want to sum up the credits of table mydf for specific time spans!
How do I have to "aggregate" the data? Let's work with the first time entry of table "yourdf". This one is "2022-06-10 10:00:00". I want to add a period of 10 minutes to it, so I can add a column called "End".
This can be done by
yourdf["End"] <- as.POSIXct(yourdf$Start) + minutes(10)
Note: yourdf$Start has the class "character", and minutes() comes from the lubridate package.
So we have a new table yourdf with a "Start" and an "End" date.
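As a hedged aside, the same "End" column can also be sketched in base R without lubridate, because adding a number to a POSIXct value adds that many seconds. The tz = "UTC" argument is an assumption made here so the result does not depend on the local time zone; the post itself does not state one. Note that this sketch also converts "Start" from character to POSIXct.

```r
# Same example data as in the question
yourdf <- data.frame(
  Start = c("2022-06-10 10:00:00", "2022-06-02 10:00:20", "2022-06-10 10:30:00",
            "2022-06-10 10:20:00", "2022-06-10 10:21:00"),
  User = c("A", "A", "C", "A", "B")
)

# Assumption: timestamps are interpreted as UTC
yourdf$Start <- as.POSIXct(yourdf$Start, tz = "UTC")

# POSIXct arithmetic works in seconds, so 10 minutes = 10 * 60 seconds
yourdf$End <- yourdf$Start + 10 * 60

print(yourdf)
```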
Now I want to sum up all "points" of table "mydf" between the "Start" and the "End" date of table "yourdf".
yourdf looks like this:
                Start User                 End
1 2022-06-10 10:00:00    A 2022-06-10 10:10:00
2 2022-06-02 10:00:20    A 2022-06-02 10:10:20
3 2022-06-10 10:30:00    C 2022-06-10 10:40:00
4 2022-06-10 10:20:00    A 2022-06-10 10:30:00
5 2022-06-10 10:21:00    B 2022-06-10 10:31:00
Let's look at line 1. The start time is "2022-06-10 10:00:00" and the end time is "2022-06-10 10:10:00".
So I want to generate a table like this one
                Start                 End Points
1 2022-06-10 10:00:00 2022-06-10 10:10:00      7
2 2022-06-02 10:00:20 2022-06-02 10:10:20
3 2022-06-10 10:30:00 2022-06-10 10:40:00
4 2022-06-10 10:20:00 2022-06-10 10:30:00
5 2022-06-10 10:21:00 2022-06-10 10:31:00
because just the entries
1 2022-06-10 10:00:03 2
8 2022-06-10 10:00:00 3
12 2022-06-10 10:00:00 2
14 2022-06-10 10:00:05 0
of table "mydf" are between the "Start" and the "End" point.
How can I achieve this in R for large tables with thousands of entries?
Thanks a lot for your help!
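To make the expected result concrete, the window sums can be checked with a brute-force sketch in base R. This is only an illustration under stated assumptions: timestamps are read as UTC, both window bounds are treated as inclusive (the question only shows that a time point equal to "Start" counts), and "Points" is converted from character to numeric. The helper window_sum is a hypothetical name introduced here.

```r
# Example data from the question
yourdf <- data.frame(
  Start = c("2022-06-10 10:00:00", "2022-06-02 10:00:20", "2022-06-10 10:30:00",
            "2022-06-10 10:20:00", "2022-06-10 10:21:00"),
  User = c("A", "A", "C", "A", "B")
)
mydf <- data.frame(
  Timepoint = c("2022-06-10 10:00:03", "2022-06-02 10:00:20", "2022-06-10 10:30:00",
                "2022-06-10 10:20:00", "2022-06-10 10:21:00", "2022-06-01 10:10:10",
                "2022-06-11 11:00:10", "2022-06-10 10:00:00", "2022-06-10 13:00:00",
                "2022-04-12 10:00:00", "2022-06-09 10:00:00", "2022-06-10 10:00:00",
                "2022-06-10 10:11:20", "2022-06-10 10:00:05", "2022-06-04 10:00:00",
                "2022-06-10 10:11:12"),
  Points = c("2", "4", "1", "5", "12", "3", "2", "3", "2", "1",
             "7", "2", "1", "0", "2", "1")
)

# Assumption: all timestamps are UTC
start     <- as.POSIXct(yourdf$Start, tz = "UTC")
end       <- start + 10 * 60                 # 10-minute window, in seconds
timepoint <- as.POSIXct(mydf$Timepoint, tz = "UTC")
points    <- as.numeric(mydf$Points)         # Points is stored as character

# Hypothetical helper: sum of points with s <= Timepoint <= e (inclusive bounds)
window_sum <- function(s, e) sum(points[timepoint >= s & timepoint <= e])

# Row 1 only, the case worked through in the question
row1_sum <- window_sum(start[1], end[1])

# All rows: one scan of mydf per row of yourdf
all_sums <- vapply(seq_along(start),
                   function(i) window_sum(start[i], end[i]),
                   numeric(1))

result <- data.frame(Start = start, End = end, Points = all_sums)
print(result)
```

Under these assumptions row 1 sums to 7, matching the expected table. Because this scans mydf once per row of yourdf, it is meant as a correctness check rather than as a solution for large tables.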
CodePudding user response:
A data.table approach:
library(data.table)
# set to data.table format
setDT(yourdf); setDT(mydf)
# set to posix timestamp
yourdf[, Start := as.POSIXct(Start)]
mydf[, Timepoint := as.POSIXct(Timepoint)]
# create a unique rowwise key for yourdf
yourdf[, id := .I][]
setkey(yourdf, id)
# loop over each row/key in yourdf (by = .EACHI),
# subset mydf based on Start + 10 minutes, and sum Points
yourdf[yourdf, .(Start,
End = Start + 600,
Points = mydf[Timepoint