I have dataset input
with a couple of missing values. and I have to create dataset output
with the following logic:
- If there is a missing in any of the columns
b
,c
, ord
, then check the correspondenta
column and fill up the missing with the correspondent value from that row to the specific column.
I tried to do that with _join
functions from dplyr
but was unsuccessful.
I can do it manually, but this option is off the table because I have a big dataset with multiple instances like that.
Input
library(dplyr)
input <- tibble( a = rep(c("A", "B", "C", "D"),2 ),
b = c(1:3, NA, rep(NA,4)),
c = c(21:28),
d = c(rep(NA,4), 54, NA, 34,11)) %>%
arrange(a)
Input view
# A tibble: 8 × 4
# a b c d
# <chr> <int> <int> <dbl>
#1 A 1 21 NA
#2 A NA 25 54
#3 B 2 22 NA
#4 B NA 26 NA
#5 C 3 23 NA
#6 C NA 27 34
#7 D NA 24 NA
#8 D NA 28 11
Output - expected view
# A tibble: 8 × 4
# a b c d
# <chr> <int> <int> <dbl>
# 1 A 1 21 54
# 2 A 1 25 54
# 3 B 2 22 NA
# 4 B 2 26 NA
# 5 C 3 23 34
# 6 C 3 27 34
# 7 D NA 24 11
# 8 D NA 28 11
CodePudding user response:
Use function na.locf
from package zoo
to carry the last observation forward or in the opposite direction.
suppressPackageStartupMessages(library(dplyr))
input <- tibble( a = rep(c("A", "B", "C", "D"),2 ),
b = c(1:3, NA, rep(NA,4)),
c = c(21:28),
d = c(rep(NA,4), 54, NA, 34,11)) %>%
arrange(a)
input %>%
group_by(a) %>%
mutate(across(b:d, zoo::na.locf, na.rm = FALSE)) %>%
mutate(across(b:d, zoo::na.locf, na.rm = FALSE, fromLast = TRUE))
#> # A tibble: 8 × 4
#> # Groups: a [4]
#> a b c d
#> <chr> <int> <int> <dbl>
#> 1 A 1 21 54
#> 2 A 1 25 54
#> 3 B 2 22 NA
#> 4 B 2 26 NA
#> 5 C 3 23 34
#> 6 C 3 27 34
#> 7 D NA 24 11
#> 8 D NA 28 11
Created on 2022-05-14 by the reprex package (v2.0.1)
CodePudding user response:
This is hasty imputation:
library(dplyr)
input %>%
group_by(a) %>%
mutate(across(b:d, ~ if_else(is.na(.), na.omit(.)[1], .))) %>%
ungroup()
# # A tibble: 8 x 4
# a b c d
# <chr> <int> <int> <dbl>
# 1 A 1 21 54
# 2 A 1 25 54
# 3 B 2 22 NA
# 4 B 2 26 NA
# 5 C 3 23 34
# 6 C 3 27 34
# 7 D NA 24 11
# 8 D NA 28 11
I think the group_by(a)
is fairly intuitive and makes sense. The "hasty" part of my first sentence is that we find the first non-NA
value and use it. Other imputation techniques may use the average, median, previous valid data ("locf" as in Rui's answer), or random sampling.
The mice
package specializes in imputation.