Home > Software design >  Looking for a way of creating a cumulative average for individuals within a dataset
Looking for a way of creating a cumulative average for individuals within a dataset

Time:07-08

I have a dataset with multiple individuals (Cow_ID) and their daily difference in temperature looking like this:

df<-data.frame("Cow_ID"=c("192","192","192","192","201","201","201","201","652","652","652","652"),"Day"=c("1","2","3","4","1","2","3","4","1","2","3","4"),"Temp_Diff"=c("0.2","0.1","na","0.8","0.4","0.1","0.7","na","0.1","0.3","na","0.8"))

I want to calculate a cumulative mean for each individual in a new column. I need it to exclude the na's too.

I managed to get some code working to run this for the whole data set, but I could not find a way to get it to stop at the end of an individual and start again for the next one.

df<-df%>%mutate(new=replace(Temp_Diff,!is.na(Temp_Diff),cummean(Temp_Diff[!is.na(Temp_Diff)])))%>%fill(new)

Please excuse the sloppy coding, I am very new to this!

CodePudding user response:

If it is by 'Cow_ID', then do a group by on that column. Also, make sure to convert the "na" to NA before applying the is.na as well as convert the type as the 'Temp_Diff' column was character class as a result of "na" character element which doesn't work with cummean

library(dplyr)
df %>% 
  na_if("na") %>%
  type.convert(as.is = TRUE) %>% 
  group_by(Cow_ID) %>% 
  mutate(new = replace(Temp_Diff, !is.na(Temp_Diff), 
      cummean(Temp_Diff[!is.na(Temp_Diff)]))) %>%
  ungroup()

-output

# A tibble: 12 × 4
   Cow_ID   Day Temp_Diff    new
    <int> <int>     <dbl>  <dbl>
 1    192     1       0.2  0.2  
 2    192     2       0.1  0.15 
 3    192     3      NA   NA    
 4    192     4       0.8  0.367
 5    201     1       0.4  0.4  
 6    201     2       0.1  0.25 
 7    201     3       0.7  0.4  
 8    201     4      NA   NA    
 9    652     1       0.1  0.1  
10    652     2       0.3  0.2  
11    652     3      NA   NA    
12    652     4       0.8  0.4  

Or using data.table, convert the 'data.frame' to 'data.table' (setDT), conver the 'Temp_Diff' to numeric column with as.numeric returns NA for "na" while doing the force conversion, then create the 'new' column by specifying a logical condition in i to subset only the non-NA elements, do the cummean on 'j', grouped by 'Cow_ID' and assign (:=) to a new column 'new'

library(data.table)
 setDT(df)[, Temp_Diff := as.numeric(Temp_Diff)][!is.na(Temp_Diff), 
      new := cummean(Temp_Diff), by = Cow_ID]
  • Related