Here is my data:
library(data.table)
df1 <- fread('
id , date1 , date2
id_0001 , 2017-01-01, 2017-01-05
id_0002 , 2017-01-02, 2017-01-08
id_0003 , 2017-01-04, 2017-01-07
')
df2 <- fread('
date , value
2017-01-01, 1
2017-01-02, 2
2017-01-03, 5
2017-01-04, 5
2017-01-05, 5
2017-01-06, 3
2017-01-07, 4
2017-01-08, 7
2017-01-09, 5
2017-01-10, 1
2017-01-11, 5
')
I want to summarize (take the mean of) value from df2 for each id in df1, within the row-wise range between date1 and date2.
The desired result looks like this:
id | date1 | date2 | value |
---|---|---|---|
id_0001 | 2017-01-01 | 2017-01-05 | mean(c(1,2,5,5,5)) |
id_0002 | 2017-01-02 | 2017-01-08 | mean(c(2,5,5,5,3,4,7)) |
id_0003 | 2017-01-04 | 2017-01-07 | mean(c(5,5,3,4)) |
I know that I can expand df1 to one row per id per date between date1 and date2, left-join that to df2 by date, and then summarize, roughly like the sketch below.
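For illustration, here is a minimal sketch of that expand-and-join route (data.table syntax; d1, d2, expanded, and res are just scratch names, and it assumes the date columns can be coerced with as.Date(); the expanded table is the part that blows up as the data grows):
# Normalise the date columns to Date (fread may read them as IDate or character)
d1 <- copy(df1)[, `:=`(date1 = as.Date(date1), date2 = as.Date(date2))]
d2 <- copy(df2)[, date := as.Date(date)]

# Expand df1 to one row per id per day in [date1, date2] ...
expanded <- d1[, .(date = seq(date1, date2, by = "day")), by = .(id, date1, date2)]

# ... left-join the daily values onto the expanded rows, then aggregate back to one row per id
res <- d2[expanded, on = "date"][, .(value = mean(value)), by = .(id, date1, date2)]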
However, as the data volume grows, R can no longer handle vectors of that size when further analysis is needed. Is there a data.table way of doing this summary across the two tables?
CodePudding user response:
It seems you're after an approach like the one below:
library(data.table)
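# Non-equi join: for each row of df1 (.SD), average df2$value where date1 <= date <= date2
# (by = .EACHI gives one result per df1 row); assign it back to df1 by reference with :=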
df1[, value := df2[.SD, on = .(date >= date1, date <= date2), mean(value), by = .EACHI]$V1]
Output:
df1
id date1 date2 value
1: id_0001 2017-01-01 2017-01-05 3.600000
2: id_0002 2017-01-02 2017-01-08 4.428571
3: id_0003 2017-01-04 2017-01-07 4.250000
CodePudding user response:
With pmap and between:
library(purrr)
library(dplyr)
df1 %>%
  mutate(mean = pmap(across(date1:date2),
                     ~ mean(df2$value[between(df2$date, ..1, ..2)])))
# id date1 date2 mean
#1: id_0001 2017-01-01 2017-01-05 3.6
#2: id_0002 2017-01-02 2017-01-08 4.428571
#3: id_0003 2017-01-04 2017-01-07 4.25
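Note that pmap() returns a list-column here; if a plain numeric column is preferred, pmap_dbl() can be swapped in (same call otherwise):
df1 %>%
  mutate(mean = pmap_dbl(across(date1:date2),
                         ~ mean(df2$value[between(df2$date, ..1, ..2)])))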
CodePudding user response:
Not data.table, but this dplyr approach avoids joining and should (?) require less memory:
library(dplyr)
df1 %>%
  group_by(id) %>%   # one row per id, so date1/date2 are scalars inside mutate()
  mutate(value = mean(df2$value[df2$date >= date1 & df2$date <= date2])) %>%
  ungroup()
# A tibble: 3 × 4
id date1 date2 value
<chr> <date> <date> <dbl>
1 id_0001 2017-01-01 2017-01-05 3.6
2 id_0002 2017-01-02 2017-01-08 4.43
3 id_0003 2017-01-04 2017-01-07 4.25
Or a similar base R approach, using sapply() by rows instead of grouping by id:
df1$value <- sapply(
  seq(nrow(df1)),
  \(i) mean(df2$value[df2$date >= df1$date1[[i]] & df2$date <= df1$date2[[i]]])
)
id date1 date2 value
1: id_0001 2017-01-01 2017-01-05 3.600000
2: id_0002 2017-01-02 2017-01-08 4.428571
3: id_0003 2017-01-04 2017-01-07 4.250000
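If a type-stable result is preferred, vapply() is a drop-in replacement for sapply() here (a minor variant of the code above, not a different method):
df1$value <- vapply(
  seq_len(nrow(df1)),
  function(i) mean(df2$value[df2$date >= df1$date1[i] & df2$date <= df1$date2[i]]),
  numeric(1)
)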
CodePudding user response:
Two ways to solve your problem using the data.table package:
# Method 1
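# For each (date1, date2) group of df1, subset df2 to that date range and take mean(value)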
df1[, value := df2[date>=date1 & date<=date2, mean(value)], by=.(date1, date2)]
# Method 2
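# Same grouping, but non-equi join df2 against the group values (.BY) and average the matched rows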
df1[, value := df2[.BY, mean(value), on=.(date>=date1, date<=date2)], by=.(date1, date2)]
id date1 date2 value
<char> <IDat> <IDat> <num>
1: id_0001 2017-01-01 2017-01-05 3.600000
2: id_0002 2017-01-02 2017-01-08 4.428571
3: id_0003 2017-01-04 2017-01-07 4.250000
CodePudding user response:
Here is another non-equi join option within data.table:
df2[df1,
    on = .(date >= date1, date <= date2)         # non-equi join: one row per matching df2 date
][
  , .(value = mean(value)),                      # then aggregate,
    by = .(id, date1 = date, date2 = date.1)     # renaming the join columns back to date1/date2
]
which gives
id date1 date2 value
1: id_0001 2017-01-01 2017-01-05 3.600000
2: id_0002 2017-01-02 2017-01-08 4.428571
3: id_0003 2017-01-04 2017-01-07 4.250000