summarize row values based on two dates from another dataframe


Here is my data:

df1 <- fread('
   id ,      date1 ,    date2     
  id_0001 , 2017-01-01, 2017-01-05
  id_0002 , 2017-01-02, 2017-01-08
  id_0003 , 2017-01-04, 2017-01-07
 ')


df2 <- fread('
        date ,     value     
        2017-01-01,   1
        2017-01-02,   2
        2017-01-03,   5
        2017-01-04,   5
        2017-01-05,   5 
        2017-01-06,   3
        2017-01-07,   4
        2017-01-08,   7
        2017-01-09,   5
        2017-01-10,   1
        2017-01-11,   5 
 ')

I want to summarize (take the mean of) the value from df2 for each id in df1, within that row's range between date1 and date2.

The desired result looks like this:

id date1 date2 value
id_0001 2017-01-01 2017-01-05 mean(c(1,2,5,5,5))
id_0002 2017-01-02 2017-01-08 mean(c(2,5,5,5,3,4,7))
id_0003 2017-01-04 2017-01-07 mean(c(5,5,3,4))

I know that I can expand df1 to one row per date between date1 and date2, left_join to df2 by date, and then summarize. However, as the data volume increases, the expanded table becomes too large for R to handle when further analysis is needed. Is there a data.table way of doing this inter-data-frame summary?
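For concreteness, the expand-and-join baseline described above can be sketched as follows (a sketch, assuming the date columns were read as character by fread() and still need parsing):

```r
library(data.table)

# parse the date columns
df1[, `:=`(date1 = as.Date(date1), date2 = as.Date(date2))]
df2[, date := as.Date(date)]

# expand df1 to one row per id and day -- this is the step that blows up memory
expanded <- df1[, .(date = seq(date1, date2, by = "day")), by = .(id, date1, date2)]

# join the daily values on, then collapse back to one row per id
res <- df2[expanded, on = "date"][, .(value = mean(value)), by = .(id, date1, date2)]
```

The intermediate `expanded` table has one row per id per day, which is what makes this approach infeasible for long date ranges.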

CodePudding user response:

It sounds like you're after a non-equi join with `by = .EACHI`, which computes the mean over each id's matching rows in df2:

library(data.table)

df1[, value := df2[.SD, on = .(date >= date1, date <= date2), mean(value), by = .EACHI]$V1]

Output:

df1

        id      date1      date2    value
1: id_0001 2017-01-01 2017-01-05 3.600000
2: id_0002 2017-01-02 2017-01-08 4.428571
3: id_0003 2017-01-04 2017-01-07 4.250000

CodePudding user response:

With pmap and between:

library(purrr)
library(dplyr)
df1 %>% 
  mutate(mean = pmap_dbl(across(date1:date2), ~ mean(df2$value[between(df2$date, ..1, ..2)])))

#        id      date1      date2     mean
#1: id_0001 2017-01-01 2017-01-05      3.6
#2: id_0002 2017-01-02 2017-01-08 4.428571
#3: id_0003 2017-01-04 2017-01-07     4.25

CodePudding user response:

Not data.table, but this dplyr approach avoids a join and should (?) require less memory:

library(dplyr)

df1 %>% 
  group_by(id) %>% 
  mutate(value = mean(df2$value[df2$date >= date1 & df2$date <= date2])) %>% 
  ungroup()
# A tibble: 3 × 4
  id      date1      date2      value
  <chr>   <date>     <date>     <dbl>
1 id_0001 2017-01-01 2017-01-05  3.6 
2 id_0002 2017-01-02 2017-01-08  4.43
3 id_0003 2017-01-04 2017-01-07  4.25

Or a similar base R approach, using sapply() by rows instead of grouping by id:

df1$value <- sapply(
  seq(nrow(df1)),
  \(i) mean(df2$value[df2$date >= df1$date1[[i]] & df2$date <= df1$date2[[i]]])
)
        id      date1      date2    value
1: id_0001 2017-01-01 2017-01-05 3.600000
2: id_0002 2017-01-02 2017-01-08 4.428571
3: id_0003 2017-01-04 2017-01-07 4.250000

CodePudding user response:

Two ways to solve your problem using the data.table package:

# Method 1
df1[, value := df2[date >= date1 & date <= date2, mean(value)], by = .(date1, date2)]

# Method 2
df1[, value := df2[.BY, mean(value), on = .(date >= date1, date <= date2)], by = .(date1, date2)]


        id      date1      date2    value
    <char>     <IDat>     <IDat>    <num>
1: id_0001 2017-01-01 2017-01-05 3.600000
2: id_0002 2017-01-02 2017-01-08 4.428571
3: id_0003 2017-01-04 2017-01-07 4.250000

CodePudding user response:

Here is another non-equi join option within data.table:

df2[df1, on = .(date >= date1, date <= date2)][
  , .(value = mean(value)),
  by = .(id, date1 = date, date2 = date.1)
]

which gives

        id      date1      date2    value
1: id_0001 2017-01-01 2017-01-05 3.600000
2: id_0002 2017-01-02 2017-01-08 4.428571
3: id_0003 2017-01-04 2017-01-07 4.250000