Codes for the dfs are at the end.
I have two dataframes. The first df is meteo data from 3 different stations :
site date temp
X 2021-01-01 14
X 2021-01-02 NA
X 2021-01-03 10
X 2021-01-04 14
X 2021-01-05 10
X 2021-01-06 10
X 2021-01-07 13
X 2021-01-08 12
X 2021-01-09 13
X 2021-01-10 7
X 2021-01-11 9
X 2021-01-12 6
X 2021-01-13 8
Y 2021-01-01 10
Y 2021-01-02 14
Y 2021-01-03 5
Y 2021-01-04 7
Y 2021-01-05 7
Y 2021-01-06 9
Y 2021-01-07 6
Y 2021-01-08 12
Y 2021-01-09 10
Y 2021-01-10 9
Y 2021-01-11 13
Y 2021-01-12 13
Y 2021-01-13 NA
Y 2021-01-14 8
Y 2021-01-15 11
Y 2021-01-16 5
Y 2021-01-17 11
Y 2021-01-18 13
Y 2021-01-19 11
Y 2021-01-20 9
Y 2021-01-21 9
Y 2021-01-22 5
Y 2021-01-23 6
Y 2021-01-24 14
Y 2021-01-25 10
Y 2021-01-26 7
Z 2021-01-01 9
Z 2021-01-02 NA
Z 2021-01-03 12
Z 2021-01-04 6
Z 2021-01-05 5
Z 2021-01-06 7
Z 2021-01-07 7
Z 2021-01-08 5
Z 2021-01-09 7
Z 2021-01-10 7
Z 2021-01-11 15
Z 2021-01-12 8
Z 2021-01-13 5
Z 2021-01-14 6
Z 2021-01-15 5
Z 2021-01-16 12
Z 2021-01-17 8
Z 2021-01-18 7
Z 2021-01-19 6
Z 2021-01-20 13
Z 2021-01-21 14
Z 2021-01-22 8
Z 2021-01-23 11
Z 2021-01-24 7
The second df consists of observations made on the same site than the meteo stations.
There is a trap at each station. Every couple days, the trap is emptied and the different species that were trapped are counted separately. For each site in df2
, the date of pose
is always the day after the date of withdrawal
of the precedent entree (row). In this exemple, the species are in the obs
column. They are named A
, B
, C
, D
, F
and G
. freq
is the number of individuals that were trapped for that specie.
site pose withdrawal obs freq
X 2021-01-01 2021-01-03 A 31
X 2021-01-01 2021-01-03 B 42
X 2021-01-04 2021-01-05 A 14
X 2021-01-06 2021-01-13 D 16
X 2021-01-06 2021-01-13 F 36
Y 2021-01-01 2021-01-04 G 49
Y 2021-01-01 2021-01-04 A 29
Y 2021-01-01 2021-01-04 C 45
Y 2021-01-05 2021-01-14 D 25
Y 2021-01-05 2021-01-14 A 50
Y 2021-01-15 2021-01-14 B 40
Y 2021-01-19 2021-01-26 B 39
Z 2021-01-01 2021-01-03 C 25
Z 2021-01-04 2021-01-05 F 3
Z 2021-01-04 2021-01-05 B 16
Z 2021-01-06 2021-01-14 C 19
Z 2021-01-15 2021-01-19 A 12
Z 2021-01-15 2021-01-19 B 26
Z 2021-01-15 2021-01-19 F 2
Z 2021-01-20 2021-01-24 A 24
I want to add a mean_T
column to df2
where I would store the mean temperature for each entree in df2
.
For ID = 1
, the mean temperature would be calculated with the entrees 2021-01-01
, 2021-01-02
and 2021-01-03
in df1
, where site = 'X'
.
With simpler dfs, I used this code the get the mean temperature. It works if I only have one entree per date, per site in df2
, which is not the case.
df1 <- split(df1, with(df1, site), subset(df1, select = -site) )
df1 <- lapply(df1, function(x) x[(names(x) %in% c("ID", "date", "temp"))])
df2 <- split(df2, with(df2, site), subset(df2, select = -site) )
df2 <- lapply(df2, function(x) x[(names(x) %in% c("ID", "pose", "withdrawal"))])
library(dplyr)
library(tidyr)
Then, this code gave me the mean temperature. Credits go to @TarJae :
mean_X <- df2$X %>%
pivot_longer(-ID, values_to = "date") %>%
full_join(df1$X, by= "date") %>%
arrange(date) %>%
fill(ID, .direction = "down") %>%
group_by(ID) %>%
summarise(mean_T = mean(temp, na.rm = TRUE)) %>%
left_join(df2$X, by="ID")
This chunk of code also worked credits go to @Jon Spring :
df2 %>%
mutate(days = (withdrawal - pose 1) %>% as.integer) %>%
tidyr::uncount(days, .id = "row") %>%
transmute(ID, date = pose row - 1) %>%
left_join(df1) %>%
group_by(ID) %>%
summarize(mean_T = mean(temp)) %>%
right_join(df2)
Here is the code to generate the dfs :
df1 <- data.frame( site = c(rep('X', 13), rep('Y', 26), rep('Z', 24) ) ,
date = c( seq( as.Date("2021-01-01"), by="day", length.out=13 ),
seq( as.Date("2021-01-01"), by="day", length.out=26 ),
seq( as.Date("2021-01-01"), by="day", length.out=24 )) ,
temp = c(14, NA, 10, 14, 10, 10, 13, 12, 13, 7, 9, 6, 8, 10, 14, 5, 7, 7, 9, 6, 12,
10, 9, 13, 13, NA, 8, 11, 5, 11, 13, 11, 9, 9, 5, 6, 14, 10, 7, 9, NA, 12,
6, 5, 7, 7, 5, 7, 7, 15, 8, 5, 6, 5, 12, 8, 7, 6, 13, 14, 8, 11, 7) )
df2 <- data.frame( site = c( rep('X', 5), rep('Y', 7), rep('Z', 8) ) ,
pose = as.Date( c("2021-01-01", "2021-01-01", "2021-01-04", "2021-01-06",
"2021-01-06", "2021-01-01", "2021-01-01", "2021-01-01",
"2021-01-05", "2021-01-05", "2021-01-15", "2021-01-19" ,
"2021-01-01", "2021-01-04", "2021-01-04", "2021-01-06",
"2021-01-15", "2021-01-15", "2021-01-15", "2021-01-20") ) ,
withdrawal = as.Date( c( "2021-01-03", "2021-01-03", "2021-01-05", "2021-01-13",
"2021-01-13", "2021-01-04", "2021-01-04", "2021-01-04",
"2021-01-14", "2021-01-14", "2021-01-14", "2021-01-26" ,
"2021-01-03", "2021-01-05", "2021-01-05", "2021-01-14",
"2021-01-19", "2021-01-19", "2021-01-19", "2021-01-24" ) ) ,
obs = c( 'A', 'B', 'A', 'D', 'F', 'G', 'A', 'C', 'D', 'A', 'B', 'B' ,
'C', 'F', 'B', 'C', 'A', 'B', 'F', 'A') ,
freq = c(31, 42, 14, 16, 36, 49, 29, 45, 25, 50, 40, 39, 25, 3, 16, 19, 12, 26, 2, 24) )
df2 <- cbind(ID = 1:nrow(df2), df2)
English is not my first language. If something doesn't make sense, fell free to let me know in the comments.
CodePudding user response:
First I expand df2
to make a dataset with one row per day
df3 <- do.call(rbind,by(df2,
list(df2$ID),
function(d) data.frame(d,dates=d$pose:d$withdrawal)))
Now I merge df1
into this new dataset. I first need to convert the date to a numeric to match df3
df1$dates <- as.numeric(df1$date)
df4 <- merge(df1, df3,by=c("site", "dates"))
Now I can aggregate the new dataset by taking the mean temp over each day
aggregate(data=df4, temp ~ freq site obs pose withdrawal ID, mean)
freq site obs pose withdrawal ID temp
1 31 X A 2021-01-01 2021-01-03 1 12.000000
2 42 X B 2021-01-01 2021-01-03 2 12.000000
3 14 X A 2021-01-04 2021-01-05 3 12.000000
4 16 X D 2021-01-06 2021-01-13 4 9.750000
5 36 X F 2021-01-06 2021-01-13 5 9.750000
6 49 Y G 2021-01-01 2021-01-04 6 9.000000
7 29 Y A 2021-01-01 2021-01-04 7 9.000000
8 45 Y C 2021-01-01 2021-01-04 8 9.000000
9 25 Y D 2021-01-05 2021-01-14 9 9.666667
10 50 Y A 2021-01-05 2021-01-14 10 9.666667
11 40 Y B 2021-01-15 2021-01-14 11 9.500000
12 39 Y B 2021-01-19 2021-01-26 12 8.875000
13 25 Z C 2021-01-01 2021-01-03 13 10.500000
14 3 Z F 2021-01-04 2021-01-05 14 5.500000
15 16 Z B 2021-01-04 2021-01-05 15 5.500000
16 19 Z C 2021-01-06 2021-01-14 16 7.444444
17 12 Z A 2021-01-15 2021-01-19 17 7.600000
18 26 Z B 2021-01-15 2021-01-19 18 7.600000
19 2 Z F 2021-01-15 2021-01-19 19 7.600000
20 24 Z A 2021-01-20 2021-01-24 20 10.600000