I have a dataset with meteorological information. It includes the following variables:
- DATE: measured every hour throughout 2020
- STATION: code of the meteorological station where the information was gathered
- CITY: name of the city where the station is located
- 10 numerical variables such as temperature, precipitation, wind speed, etc.
The problem is that not all stations measure the same variables, so there are a lot of missing values. Also, even in stations that have a given sensor, not all dates are recorded, because the sensor may be offline or fail on some dates.
What I want (but do not know how to achieve) is: given some missing data in a station, fill the missing values with the mean of the values from the other stations IN THE SAME CITY at the same date.
So for example consider this minimal code example:
library(dplyr)  # provides tibble() and the %>% pipe used below

df <- tibble(
  DATE = lubridate::ymd_hm(rep(c('2020-01-01 12:00', '2020-01-02 12:00', '2020-01-03 12:00', '2020-01-04 12:00', '2020-01-05 12:00'), 4)),
  STATION = rep(c('A', 'B', 'C', 'D'), each = 5),
  CITY = rep(c('LONDON', 'PARIS'), each = 10),
  var = c(NA, NA, NA, 4, 5, 6, 7, 8, NA, 10, NA, NA, 13, 14, 15, 16, 17, 18, 19, 20)
)
The first observation belongs to STATION A in CITY LONDON and is missing. But STATION B is also in LONDON and has a value of 6 on that day, so I would expect that missing value to be imputed with 6.
Row 9 belongs to STATION B in CITY LONDON and is missing. But STATION A has a value of 4 on that day, so I would expect that missing value to be imputed with 4. And so on; the expected output would be:
   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20
Be aware that in reality there are 5 different cities with 4 different stations each, with observations taken every hour.
CodePudding user response:
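# Mean of the available readings per timestamp and city (NAs ignored)
city_means <- df %>%
  group_by(DATE, CITY) %>%
  summarize(city_mean = mean(var, na.rm = TRUE), .groups = "drop")

# Attach the city mean to every row and use it only where var is missing
left_join(df, city_means, by = c("DATE", "CITY")) %>%
  mutate(var = if_else(is.na(var), city_mean, var)) %>%
  select(!city_mean)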
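One edge case to keep in mind with the join approach: if every station in a city is missing at a given timestamp, mean(..., na.rm = TRUE) returns NaN rather than NA. The sketch below shows this on a made-up two-row tibble (the names toy and filled are hypothetical, not from the question) and converts the NaN back to NA so the value still reads as missing.

```r
library(dplyr)

# Hypothetical toy data: both LONDON stations are missing at the same timestamp
toy <- tibble(
  DATE    = as.Date(c("2020-01-01", "2020-01-01")),
  STATION = c("A", "B"),
  CITY    = c("LONDON", "LONDON"),
  var     = c(NA_real_, NA_real_)
)

filled <- toy %>%
  group_by(DATE, CITY) %>%
  summarize(city_mean = mean(var, na.rm = TRUE), .groups = "drop") %>%
  # The mean of zero remaining values is NaN; map it back to NA
  mutate(city_mean = if_else(is.nan(city_mean), NA_real_, city_mean))
```

Whether such gaps should stay NA or be filled some other way (for example from neighbouring hours) depends on the analysis; this only keeps them from silently becoming NaN.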
If you don't like the join approach, you can also do this:
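# Slower on large data: the full df is filtered once for every row
df %>%
  rowwise() %>%
  mutate(var = if_else(
    is.na(var),
    mean(df$var[df$CITY == CITY & df$DATE == DATE], na.rm = TRUE),
    var
  )) %>%
  ungroup()  # drop the rowwise grouping from the result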
Output (under either approach):
   DATE                STATION CITY     var
   <dttm>              <chr>   <chr>  <dbl>
 1 2020-01-01 12:00:00 A       LONDON     6
 2 2020-01-02 12:00:00 A       LONDON     7
 3 2020-01-03 12:00:00 A       LONDON     8
 4 2020-01-04 12:00:00 A       LONDON     4
 5 2020-01-05 12:00:00 A       LONDON     5
 6 2020-01-01 12:00:00 B       LONDON     6
 7 2020-01-02 12:00:00 B       LONDON     7
 8 2020-01-03 12:00:00 B       LONDON     8
 9 2020-01-04 12:00:00 B       LONDON     4
10 2020-01-05 12:00:00 B       LONDON    10
11 2020-01-01 12:00:00 C       PARIS     16
12 2020-01-02 12:00:00 C       PARIS     17
13 2020-01-03 12:00:00 C       PARIS     13
14 2020-01-04 12:00:00 C       PARIS     14
15 2020-01-05 12:00:00 C       PARIS     15
16 2020-01-01 12:00:00 D       PARIS     16
17 2020-01-02 12:00:00 D       PARIS     17
18 2020-01-03 12:00:00 D       PARIS     18
19 2020-01-04 12:00:00 D       PARIS     19
20 2020-01-05 12:00:00 D       PARIS     20
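Since the real data has about 10 numerical variables, a grouped mutate() with across() may be more convenient than repeating the imputation once per column. This is a minimal sketch, not part of the answer above: the tibble wx and the column names temperature and precipitation are made up for illustration, standing in for the real measurement columns.

```r
library(dplyr)

# Hypothetical data: two measured variables instead of the single `var`
wx <- tibble(
  DATE    = rep(as.Date(c("2020-01-01", "2020-01-02")), times = 2),
  STATION = rep(c("A", "B"), each = 2),
  CITY    = "LONDON",
  temperature   = c(NA, 7, 6, NA),
  precipitation = c(1, NA, 3, 4)
)

imputed <- wx %>%
  group_by(DATE, CITY) %>%
  # Within each (DATE, CITY) group, replace NAs by the mean of the
  # non-missing readings from the other stations in that group
  mutate(across(
    c(temperature, precipitation),
    ~ if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)
  )) %>%
  ungroup()
```

If every numeric column should be filled, where(is.numeric) could replace the explicit column list inside across(). Because the mean is computed within each group, no join is needed.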