Home > Software engineering >  How can I utilise an incremental subset within a for loop in R?
How can I utilise an incremental subset within a for loop in R?

Time:10-20

I have a dataset that contains ~40,000 data points on chicken weights over a crop cycle in various sheds(called houses). The main variables are: House Number (1:4), Days in the Crop (Continuous from 0.000001 days up to ~39 days, though some houses end before 39) and Weight in grams (though for the first 0.5-0.8 days the weight is NA since no readings are taken, but these still need to be kept to show growth from day 0).

I'm trying to use a for loop to calculate the average weight each day on a per house basis, and then I want to store that data in a new dataframe. I managed to successfully do this initially by subsetting the data into separate houses and then using the below loop:

house_1 <- subset(df,house==1)

daily_averages <- data.frame(Day = 1:39, Average=NA)
for (i in 1:1){
            lower = 0
            upper = 1
            weight = house_1$weight
            days = house_1$days_in_crop
                        for (meanroll in 1:39) {
                          meanroll_val = mean(weight[days > lower & days < upper],na.rm=TRUE)
                          daily_averages$Average[meanroll] <- meanroll_val
                          lower=lower 1
                          upper=upper 1
              }
}

This gives me a dataframe containing the average for each day, but only for House 1. Where Lower is the lower value of the day e.g. Start at 0 and Upper is the upper value of the day e.g. end of day 0 is 1 so any value between 0 and 1 is the first day. The i in 1:1 was so I didn't need to reset the lower and upper everytime I run it but also to allow me to hopefully convert it to using all 4 houses at once.

After this I tried building a second for loop that would store all averages for each day for each house in a single dataframe, what I have so far is below:

house_ID <- unique(df$house)
daily_data <- data.frame(HouseNumber=NA, Day = rep(1:39,length(house_ID), Average= NA)
    
     for (house_no in 1:length(house_ID)){
       lower= 0
       upper = 1
       weight = df$weight
       days= df$days_in_crop
       house_N <- subset(df,house==house_no)
         with(house_N,
          for(meanroll in 1:39){
            meanroll_val = mean(weight[days > lower & days < upper],na.rm=TRUE)
            daily_data$Average[meanroll] <- meanroll_val
            lower=lower 1
            upper=upper 1
          })
     }

Instead no values are being stored, I've also tried replacing

daily_data$Average[meanroll] <- meanroll_val

with

print(meanroll_val)

To see if the issue was not iterating through the subsets, instead I receive some values, followed by many NaN, some more values then some more NaN (though the combined number of values and NaN does total 156, 4*39)

Is there a better way to achieve this? I could just copy the output of the first for loop for each house, but in the future I'm likely to have many more houses so it will be more practical to have the second for loop

ETA: I also realise houses are not being assigned in that second for loop, but since it's also not working when using Print instead of storing the data, I don't think that's the issue

ETA: output of

dput(head(df, 20))

is:

structure(list(house = c(3L, 1L, 4L, 2L, 3L, 2L, 4L, 1L, 3L, 
1L, 4L, 2L, 2L, 3L, 4L, 1L, 3L, 1L, 4L, 2L), days_in_crop = c(0.000856481, 
0.001435185, 0.002025463, 0.003252315, 0.003657407, 0.003738426, 
0.004479167, 0.005451389, 0.007766204, 0.008333333, 0.008912037, 
0.010138889, 0.010497685, 0.010555556, 0.011365741, 0.012361111, 
0.01462963, 0.015231481, 0.015798611, 0.017025463), weight = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c(NA, 
20L), class = "data.frame")

A subset containing only days over 9 (note value is 9.1 since chickens tend to be sleeping at the start of the day so are not on sensors):

days_9_39 <- subset(df,days_in_crop > 9.1)
dput(head(days_9_39, 20))

structure(list(house = c(4L, 3L, 3L, 1L, 4L, 1L, 2L, 2L, 3L, 
3L, 1L, 4L, 1L, 2L, 2L, 3L, 4L, 1L, 4L, 1L), days_in_crop = c(9.101712963, 
9.101944444, 9.102847222, 9.10318287, 9.103229167, 9.106064815, 
9.106145833, 9.1065625, 9.10837963, 9.108900463, 9.109849537, 
9.110162037, 9.112581019, 9.113032407, 9.113159722, 9.114247685, 
9.11462963, 9.116435185, 9.117118056, 9.119236111), weight = c(244L, 
255L, 249L, 260L, 259L, 276L, NA, 260L, 254L, 254L, 244L, 265L, 
268L, NA, 300L, 250L, 239L, 255L, 251L, 258L)), row.names = 10262:10281, class = "data.frame")

CodePudding user response:

Here is a base R solution.
by splits the data by house, then aggregates by the integer part of days_in_crop.

res <- by(days_9_39, days_9_39$house, \(X) {
  X$days_in_crop <- floor(X$days_in_crop)
  agg <- aggregate(weight ~ days_in_crop, X, mean, na.rm = TRUE)
  agg$house <- X$house[1]
  agg[c(3, 1, 2)]
})

res <- do.call(rbind, res)
res
#>   house days_in_crop   weight
#> 1     1            9 260.1667
#> 2     2            9 280.0000
#> 3     3            9 252.4000
#> 4     4            9 251.6000

Created on 2022-10-20 with reprex v2.0.2


Edit

A simpler solution is to aggregate weight by all other variables, with integer days.
The changed days_in_crop is local only.

res2 <- local({
  days_9_39$days_in_crop <- floor(days_9_39$days_in_crop)
  aggregate(weight ~ ., days_9_39, mean, na.rm = TRUE)
})
res2
#>   house days_in_crop   weight
#> 1     1            9 260.1667
#> 2     2            9 280.0000
#> 3     3            9 252.4000
#> 4     4            9 251.6000

all.equal(res, res2, check.attributes = FALSE)
#> [1] TRUE

Created on 2022-10-20 with reprex v2.0.2


Edit 2

To include other aggregate statistics, create a temporary, local function with all wanted statistics but note that the results column is a matrix, the result must be slightly processed before returning.

res3 <- local({
  f <- function(x) {
    c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
  }
  days_9_39$days_in_crop <- floor(days_9_39$days_in_crop)
  agg <- aggregate(weight ~ ., days_9_39, f)
  cbind(agg[-ncol(agg)], agg[[ncol(agg)]])
})
res3
#>   house days_in_crop     mean        sd
#> 1     1            9 260.1667 10.998485
#> 2     2            9 280.0000 28.284271
#> 3     3            9 252.4000  2.701851
#> 4     4            9 251.6000 10.620734

Created on 2022-10-20 with reprex v2.0.2

  • Related