How to impute data based on subgroup and date in R

Time:03-08

I have a dataset with meteorological information. It includes the variables:

  • DATE: measured every hour along year 2020
  • STATION: code of the meteorological station where information was gathered
  • CITY: name of the city where the station is placed
  • 10 numerical variables like temperature, precipitation, wind speed etc

The problem is that not all stations measure the same variables, so there are many missing values. Even where a station has a given sensor, some dates are missing because the sensor may have been offline or failed at those times.

What I want (but do not know how to achieve) is: given a missing value in a station, fill it with the mean of the values recorded at the same DATE by the other stations IN THE SAME CITY.

For example, consider this minimal example:

library(tibble)  # tibble(); the lubridate package must also be installed for ymd_hm()

df <- tibble(
  DATE = lubridate::ymd_hm(rep(c('2020-01-01 12:00', '2020-01-02 12:00', '2020-01-03 12:00', '2020-01-04 12:00', '2020-01-05 12:00'), 4)),
  STATION = rep(c('A', 'B', 'C', 'D'), each = 5),
  CITY = rep(c('LONDON', 'PARIS'), each = 10),
  var = c(NA, NA, NA, 4, 5, 6, 7, 8, NA, 10, NA, NA, 13, 14, 15, 16, 17, 18, 19, 20)
)

The first observation belongs to STATION A in CITY LONDON and is missing. STATION B is also in LONDON and has the value 6 on that day, so I would expect the missing value to be imputed with 6.

Row 9 belongs to STATION B in CITY LONDON and is missing. STATION A has the value 4 on that day, so I would expect that missing value to be imputed with 4. And so on. The expected output would be:

   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20

Note that the real data has 5 different cities with 4 different stations each, with observations taken every hour.

Answer:

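library(dplyr)

# Mean of the non-missing values for each DATE and CITY
city_means <- df %>%
  group_by(DATE, CITY) %>%
  summarize(city_mean = mean(var, na.rm = TRUE), .groups = "drop")

# Attach the city mean to every row and use it only where var is missing
df %>%
  left_join(city_means, by = c("DATE", "CITY")) %>%
  mutate(var = if_else(is.na(var), city_mean, var)) %>%
  select(!city_mean)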

If you don't like the join approach, you can also do this:

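df %>% 
  rowwise() %>% 
  # For each row, average var over all rows of df with the same CITY and DATE
  mutate(var = if_else(
    is.na(var),
    mean(df$var[df$CITY == CITY & df$DATE == DATE], na.rm = TRUE),
    var)) %>% 
  ungroup()  # drop the rowwise grouping so the result is a plain tibble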

Output (under either approach):

   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20
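
Since the real data has about 10 numeric variables, the same idea can probably be applied to several columns at once with a grouped mutate() and across(), without a join. The following is only a sketch under stated assumptions: the column names temp and wind and the small df2 tibble are hypothetical, made up for illustration, and it assumes dplyr 1.0 or later. It also guards against the case where every station in a city is missing at a given time, where mean(..., na.rm = TRUE) would otherwise return NaN.

```r
library(dplyr)
library(tibble)

# Hypothetical data: two stations in one city, two timestamps, two variables.
# `temp` and `wind` are illustrative names, not columns from the question.
df2 <- tibble(
  DATE    = rep(as.POSIXct(c("2020-01-01 12:00", "2020-01-02 12:00"), tz = "UTC"), times = 2),
  STATION = rep(c("A", "B"), each = 2),
  CITY    = "LONDON",
  temp    = c(NA, 7, 5, NA),
  wind    = c(10, NA, NA, NA)
)

imputed <- df2 %>%
  group_by(DATE, CITY) %>%
  mutate(across(
    c(temp, wind),
    # If no station in this city has a value at this time, leave the NAs alone;
    # otherwise replace each NA with the mean of the non-missing values.
    ~ if (all(is.na(.x))) .x else coalesce(.x, mean(.x, na.rm = TRUE))
  )) %>%
  ungroup()

print(imputed)
```

Because the missing value is itself dropped by na.rm = TRUE, the group mean is the mean of the other stations in the same city at that time. In this sketch, wind stays NA on 2020-01-02 because neither station recorded it.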