I have a dataset that contains ~40,000 data points on chicken weights over a crop cycle in various sheds(called houses). The main variables are: House Number (1:4), Days in the Crop (Continuous from 0.000001 days up to ~39 days, though some houses end before 39) and Weight in grams (though for the first 0.5-0.8 days the weight is NA since no readings are taken, but these still need to be kept to show growth from day 0).
I'm trying to use a for loop to calculate the average weight each day on a per house basis, and then I want to store that data in a new dataframe. I managed to successfully do this initially by subsetting the data into separate houses and then using the below loop:
house_1 <- subset(df,house==1)
daily_averages <- data.frame(Day = 1:39, Average=NA)
for (i in 1:1){
lower = 0
upper = 1
weight = house_1$weight
days = house_1$days_in_crop
for (meanroll in 1:39) {
meanroll_val = mean(weight[days > lower & days < upper],na.rm=TRUE)
daily_averages$Average[meanroll] <- meanroll_val
lower=lower 1
upper=upper 1
}
}
This gives me a dataframe containing the average for each day, but only for House 1. Where Lower is the lower value of the day e.g. Start at 0 and Upper is the upper value of the day e.g. end of day 0 is 1 so any value between 0 and 1 is the first day. The i in 1:1 was so I didn't need to reset the lower and upper everytime I run it but also to allow me to hopefully convert it to using all 4 houses at once.
After this I tried building a second for loop that would store all averages for each day for each house in a single dataframe, what I have so far is below:
house_ID <- unique(df$house)
daily_data <- data.frame(HouseNumber=NA, Day = rep(1:39,length(house_ID), Average= NA)
for (house_no in 1:length(house_ID)){
lower= 0
upper = 1
weight = df$weight
days= df$days_in_crop
house_N <- subset(df,house==house_no)
with(house_N,
for(meanroll in 1:39){
meanroll_val = mean(weight[days > lower & days < upper],na.rm=TRUE)
daily_data$Average[meanroll] <- meanroll_val
lower=lower 1
upper=upper 1
})
}
Instead no values are being stored, I've also tried replacing
daily_data$Average[meanroll] <- meanroll_val
with
print(meanroll_val)
To see if the issue was not iterating through the subsets, instead I receive some values, followed by many NaN, some more values then some more NaN (though the combined number of values and NaN does total 156, 4*39)
Is there a better way to achieve this? I could just copy the output of the first for loop for each house, but in the future I'm likely to have many more houses so it will be more practical to have the second for loop
ETA: I also realise houses are not being assigned in that second for loop, but since it's also not working when using Print instead of storing the data, I don't think that's the issue
ETA: output of
dput(head(df, 20))
is:
structure(list(house = c(3L, 1L, 4L, 2L, 3L, 2L, 4L, 1L, 3L,
1L, 4L, 2L, 2L, 3L, 4L, 1L, 3L, 1L, 4L, 2L), days_in_crop = c(0.000856481,
0.001435185, 0.002025463, 0.003252315, 0.003657407, 0.003738426,
0.004479167, 0.005451389, 0.007766204, 0.008333333, 0.008912037,
0.010138889, 0.010497685, 0.010555556, 0.011365741, 0.012361111,
0.01462963, 0.015231481, 0.015798611, 0.017025463), weight = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c(NA,
20L), class = "data.frame")
A subset containing only days over 9 (note value is 9.1 since chickens tend to be sleeping at the start of the day so are not on sensors):
days_9_39 <- subset(df,days_in_crop > 9.1)
dput(head(days_9_39, 20))
structure(list(house = c(4L, 3L, 3L, 1L, 4L, 1L, 2L, 2L, 3L,
3L, 1L, 4L, 1L, 2L, 2L, 3L, 4L, 1L, 4L, 1L), days_in_crop = c(9.101712963,
9.101944444, 9.102847222, 9.10318287, 9.103229167, 9.106064815,
9.106145833, 9.1065625, 9.10837963, 9.108900463, 9.109849537,
9.110162037, 9.112581019, 9.113032407, 9.113159722, 9.114247685,
9.11462963, 9.116435185, 9.117118056, 9.119236111), weight = c(244L,
255L, 249L, 260L, 259L, 276L, NA, 260L, 254L, 254L, 244L, 265L,
268L, NA, 300L, 250L, 239L, 255L, 251L, 258L)), row.names = 10262:10281, class = "data.frame")
CodePudding user response:
Here is a base R solution.
by
splits the data by house, then aggregate
s by the integer part of days_in_crop
.
res <- by(days_9_39, days_9_39$house, \(X) {
X$days_in_crop <- floor(X$days_in_crop)
agg <- aggregate(weight ~ days_in_crop, X, mean, na.rm = TRUE)
agg$house <- X$house[1]
agg[c(3, 1, 2)]
})
res <- do.call(rbind, res)
res
#> house days_in_crop weight
#> 1 1 9 260.1667
#> 2 2 9 280.0000
#> 3 3 9 252.4000
#> 4 4 9 251.6000
Created on 2022-10-20 with reprex v2.0.2
Edit
A simpler solution is to aggregate weight by all other variables, with integer days.
The changed days_in_crop
is local
only.
res2 <- local({
days_9_39$days_in_crop <- floor(days_9_39$days_in_crop)
aggregate(weight ~ ., days_9_39, mean, na.rm = TRUE)
})
res2
#> house days_in_crop weight
#> 1 1 9 260.1667
#> 2 2 9 280.0000
#> 3 3 9 252.4000
#> 4 4 9 251.6000
all.equal(res, res2, check.attributes = FALSE)
#> [1] TRUE
Created on 2022-10-20 with reprex v2.0.2
Edit 2
To include other aggregate statistics, create a temporary, local function with all wanted statistics but note that the results column is a matrix, the result must be slightly processed before returning.
res3 <- local({
f <- function(x) {
c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}
days_9_39$days_in_crop <- floor(days_9_39$days_in_crop)
agg <- aggregate(weight ~ ., days_9_39, f)
cbind(agg[-ncol(agg)], agg[[ncol(agg)]])
})
res3
#> house days_in_crop mean sd
#> 1 1 9 260.1667 10.998485
#> 2 2 9 280.0000 28.284271
#> 3 3 9 252.4000 2.701851
#> 4 4 9 251.6000 10.620734
Created on 2022-10-20 with reprex v2.0.2