Home > Software engineering >  Apply function to subsets of dataset
Apply function to subsets of dataset

Time:09-24

I want to apply a function (in this case calculating a mean) on subsets of a dataset. The dataset includes (a variable amount of) measurements, taken on different intervals. I want to perform multiple steps, but I am a bit tangled up in the code.

  1. I want to calculate a mean of DF$value per time on a period of 1 per ID. Cannot figure out how to loop the apply() through the increments of time

  2. time needs to be renamed as year; 0-1 = year 1, 1-2 = year 2

  3. If no 'value' is found in a year, the previously calculated mean of the previous year needs to be carried forward

The function probably needs multiple arguments into an apply() on a subset, but I cannot figure out how. Also the time subset needs to cycle through a period of 1 time per ID

DF
ID    time     value
1     0.1      3
1     0.5      5
1     2.1      6
1     3.3      12
2     0.3      1
2     0.4      3
2     0.6      5
2     1.2      4
2     1.5      8
2     2.6      2
2     2.7      14
3     0.1      1.1
3     0.4      2.3
3     1.3      6
3     1.5      3
3     1.6      6

So it will turn into

DF2
ID    year     mean_value
1     1        4
1     2        4   (=carried forward)
1     3        6
1     4        12
2     1        3
2     2        6
2     3        8
3     2        5

CodePudding user response:

Here is one option -

  • Use ceiling to round up the time values.
  • for each ID and year calculate the average value.
  • Use complete to create the missing year value.
  • fill to carry forward the average value.
library(dplyr)
library(tidyr)

df %>%
  group_by(ID, year = ceiling(time)) %>%
  summarise(mean_value = mean(value)) %>%
  complete(year = min(year):max(year)) %>%
  fill(mean_value) %>%
  ungroup

#     ID  year mean_value
#  <int> <dbl>      <dbl>
#1     1     1        4  
#2     1     2        4  
#3     1     3        6  
#4     1     4       12  
#5     2     1        3  
#6     2     2        6  
#7     2     3        8  
#8     3     1        1.7
#9     3     2        5  

data

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L), time = c(0.1, 0.5, 2.1, 3.3, 0.3, 0.4, 
0.6, 1.2, 1.5, 2.6, 2.7, 0.1, 0.4, 1.3, 1.5, 1.6), value = c(3, 
5, 6, 12, 1, 3, 5, 4, 8, 2, 14, 1.1, 2.3, 6, 3, 6)), 
class = "data.frame", row.names = c(NA, -16L))
  • Related