Mutate data frame with data from another data frame-CodePudding

Reproducible Data:

df1 <- tibble(id = c("GR1","GR2"),
              area = c("A1","A2"),
              date1 = as.Date(c("2022-01-01","2022-01-02")),
              date2 = as.Date(c("2022-01-06","2022-01-08")))

set.seed(543)
df2 <- tibble(date3 = seq(as.Date("2022-01-01"), as.Date("2022-01-09"), "days"),
              temperature =runif(9, min = 28, max = 33),
              area = c("A1","A2","A1","A2","A1","A2","A1","A2","A1"))

Hello, I want to create a column in df1 with the average temperature resulting in a filter from df2. (In the real data frames I have 1036 rows in df1 and 26192 rows in df2. )

I tried this approach, but it doesn't work as I thought

df3 <- df1 %>%
  group_by(area) %>%
  mutate(average_temp = mean(filter(.data = df2, date3 >= df1$date1 & date3 <= df1$date2 & area == df1$area)$temperature))

I get this error

Warning messages:

1: Problem while computing average_temp = mean(...).

i longer object length is not a multiple of shorter object length

The expected result is

id	area	date1	date2	average_temp
GR1	A1	2022-01-01	2022-02-12	31.58708
GR2	A2	2022-01-02	2022-02-11	30.50867

This chunk of code by itself gives the expected result. So the problem must be something I'm not seeing in the iteration of rows withing the logic in mutate and dplyr syntax.

mean(filter(.data = df2, date3 >= df1$date1[2] & date3 <= df1$date2[2] & area == df1$area[2])$temperature)

CodePudding user response：

This is a non-equi or range-based join. Unfortunately, dplyr by itself cannot do that, so we need the help of another package. Options below:

fuzzyjoin

fuzzyjoin::fuzzy_left_join(
  df1, df2,
  by = c("area", date1="date3", date2="date3"),
  match_fun=list(`==`, `<=`, `>=`)
) %>%
  group_by(id, date1, date2) %>%
  summarize(
    area = area.x[1],
    avg = mean(temperature)
  ) %>%
  ungroup()
# `summarise()` has grouped output by 'id', 'date1'. You can override using the `.groups` argument.
# # A tibble: 2 x 5
#   id    date1      date2      area    avg
#   <chr> <date>     <date>     <chr> <dbl>
# 1 GR1   2022-01-01 2022-01-06 A1     31.6
# 2 GR2   2022-01-02 2022-01-08 A2     30.5

data.table

library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
DT1[DT2, avg := ave(i.temperature, id, FUN = mean),
    on = .(area, date1 <= date3, date2 >= date3) ]
#        id   area      date1      date2      avg
#    <char> <char>     <Date>     <Date>    <num>
# 1:    GR1     A1 2022-01-01 2022-01-06 31.58708
# 2:    GR2     A2 2022-01-02 2022-01-08 30.50867

(I know there's a more canonical way to do this without ave, but I ran out of time ...)

sqldf

# library(sqldf) # not required to load, per se
sqldf::sqldf(
  "select df1.id, df1.area, df1.date1, df1.date2,
     avg(df2.temperature) as avg
   from df1
     left join df2 on df1.area=df2.area
       and df2.date3 between df1.date1 and df1.date2
   group by df1.id, df1.area, df1.date1, df1.date2")
#    id area      date1      date2      avg
# 1 GR1   A1 2022-01-01 2022-01-06 31.58708
# 2 GR2   A2 2022-01-02 2022-01-08 30.50867