Reproducible Data:
df1 <- tibble(id = c("GR1","GR2"),
area = c("A1","A2"),
date1 = as.Date(c("2022-01-01","2022-01-02")),
date2 = as.Date(c("2022-01-06","2022-01-08")))
set.seed(543)
df2 <- tibble(date3 = seq(as.Date("2022-01-01"), as.Date("2022-01-09"), "days"),
temperature =runif(9, min = 28, max = 33),
area = c("A1","A2","A1","A2","A1","A2","A1","A2","A1"))
Hello, I want to create a column in df1 with the average temperature resulting in a filter from df2. (In the real data frames I have 1036 rows in df1 and 26192 rows in df2. )
I tried this approach, but it doesn't work as I thought
df3 <- df1 %>%
group_by(area) %>%
mutate(average_temp = mean(filter(.data = df2, date3 >= df1$date1 & date3 <= df1$date2 & area == df1$area)$temperature))
I get this error
Warning messages:
1: Problem while computing average_temp = mean(...)
.
i longer object length is not a multiple of shorter object length
The expected result is
id | area | date1 | date2 | average_temp |
---|---|---|---|---|
GR1 | A1 | 2022-01-01 | 2022-02-12 | 31.58708 |
GR2 | A2 | 2022-01-02 | 2022-02-11 | 30.50867 |
This chunk of code by itself gives the expected result. So the problem must be something I'm not seeing in the iteration of rows withing the logic in mutate and dplyr syntax.
mean(filter(.data = df2, date3 >= df1$date1[2] & date3 <= df1$date2[2] & area == df1$area[2])$temperature)
CodePudding user response:
This is a non-equi or range-based join. Unfortunately, dplyr
by itself cannot do that, so we need the help of another package. Options below:
fuzzyjoin
fuzzyjoin::fuzzy_left_join(
df1, df2,
by = c("area", date1="date3", date2="date3"),
match_fun=list(`==`, `<=`, `>=`)
) %>%
group_by(id, date1, date2) %>%
summarize(
area = area.x[1],
avg = mean(temperature)
) %>%
ungroup()
# `summarise()` has grouped output by 'id', 'date1'. You can override using the `.groups` argument.
# # A tibble: 2 x 5
# id date1 date2 area avg
# <chr> <date> <date> <chr> <dbl>
# 1 GR1 2022-01-01 2022-01-06 A1 31.6
# 2 GR2 2022-01-02 2022-01-08 A2 30.5
data.table
library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
DT1[DT2, avg := ave(i.temperature, id, FUN = mean),
on = .(area, date1 <= date3, date2 >= date3) ]
# id area date1 date2 avg
# <char> <char> <Date> <Date> <num>
# 1: GR1 A1 2022-01-01 2022-01-06 31.58708
# 2: GR2 A2 2022-01-02 2022-01-08 30.50867
(I know there's a more canonical way to do this without ave
, but I ran out of time ...)
sqldf
# library(sqldf) # not required to load, per se
sqldf::sqldf(
"select df1.id, df1.area, df1.date1, df1.date2,
avg(df2.temperature) as avg
from df1
left join df2 on df1.area=df2.area
and df2.date3 between df1.date1 and df1.date2
group by df1.id, df1.area, df1.date1, df1.date2")
# id area date1 date2 avg
# 1 GR1 A1 2022-01-01 2022-01-06 31.58708
# 2 GR2 A2 2022-01-02 2022-01-08 30.50867