How to impute data based on subgroup and date in R

I have a dataset with meteorological information. It includes the variables:

DATE: timestamp of the measurement, taken every hour throughout 2020
STATION: code of the meteorological station where the information was gathered
CITY: name of the city where the station is located
10 numerical variables such as temperature, precipitation, wind speed, etc.

The problem is that not all stations measure the same variables, so there are a lot of missing values. Also, even at stations that have a given sensor, not all dates are recorded, because the sensor may have been offline or failed on some of them.

What I want (but do not know how to achieve) is: given some missing data at a station, fill the missing values with the mean of the values recorded at the same date and time by the other stations IN THE SAME CITY.

So for example consider this minimal code example:

library(tibble)

df = tibble(
  DATE = lubridate::ymd_hm(rep(c('2020-01-01 12:00', '2020-01-02 12:00','2020-01-03 12:00','2020-01-04 12:00','2020-01-05 12:00'), 4)),
  STATION = rep(c('A', 'B', 'C', 'D'), each=5),
  CITY = rep(c('LONDON', 'PARIS'), each=10),
  var = c(NA,NA,NA,4,5,6,7,8,NA,10,NA,NA,13,14,15,16,17,18,19,20)
)

The first observation belongs to STATION A and CITY LONDON and is missing. But STATION B is also in LONDON and on that day it has the value 6, so I would expect that missing value to be imputed with 6.

Row 9 belongs to STATION B and CITY LONDON and is missing. But STATION A has the value 4 on that day, so I would expect that missing value to be imputed with 4. And so on. The expected output would be:

   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20

Be aware that in reality there are 5 different cities with 4 different stations each, with observations taken every hour.

CodePudding user response：

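library(dplyr)

# Mean per timestamp and city, ignoring missing values
city_means <- df %>%
  group_by(DATE, CITY) %>%
  summarize(city_mean = mean(var, na.rm = TRUE), .groups = "drop")

# Attach the city mean to every row and use it only where var is missing
left_join(df, city_means, by = c("DATE", "CITY")) %>%
  mutate(var = if_else(is.na(var), city_mean, var)) %>%
  select(!city_mean)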

If you don't like the join approach, you can also do this:

# Row-by-row alternative: simple, but slow on large data
df %>%
  rowwise() %>%
  mutate(var = if_else(
    is.na(var),
    mean(df[df$CITY == CITY & df$DATE == DATE, ]$var, na.rm = TRUE),
    var)) %>%
  ungroup()

Output (under either approach):

   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20
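
Since the real data has 10 numerical variables and several stations per city, a grouped mutate() with across() may be more convenient than repeating the join for each column. The following is only a sketch, not part of the original answer: the second column var2 and the helper mean_or_na() are hypothetical names added for illustration. The helper returns NA instead of NaN when every station in a city is missing at a given timestamp, a case that mean(x, na.rm = TRUE) alone would turn into NaN.

```r
library(dplyr)
library(tibble)

# Example data from the question, plus a hypothetical second variable `var2`
df <- tibble(
  DATE = lubridate::ymd_hm(rep(c("2020-01-01 12:00", "2020-01-02 12:00",
                                 "2020-01-03 12:00", "2020-01-04 12:00",
                                 "2020-01-05 12:00"), 4)),
  STATION = rep(c("A", "B", "C", "D"), each = 5),
  CITY = rep(c("LONDON", "PARIS"), each = 10),
  var = c(NA, NA, NA, 4, 5, 6, 7, 8, NA, 10, NA, NA, 13, 14, 15, 16, 17, 18, 19, 20),
  var2 = c(1, NA, 3, 4, 5, 2, NA, 4, 5, 6, 7, 8, 9, NA, 11, 8, 9, 10, 12, 13)
)

# Hypothetical helper: mean of the non-missing values,
# or NA (rather than NaN) when every value in the group is missing
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

imputed <- df %>%
  group_by(DATE, CITY) %>%                       # one group per timestamp and city
  mutate(across(c(var, var2),                    # list all measurement columns here
                ~ coalesce(.x, mean_or_na(.x)))) %>%
  ungroup()
```

With more than two stations per city, each missing value is replaced by the mean of whichever stations in that city did report at that timestamp. In this example, var2 stays NA for LONDON on 2020-01-02 because neither station A nor station B has a value to borrow from.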