Home > Back-end >  using R to impute missing data with the mean of the available data if there was less than 10% missin
using R to impute missing data with the mean of the available data if there was less than 10% missin

Time:08-09

How to use R to impute missing data with the mean of the available data across rows if there was less than 10% missing data across rows?

CodePudding user response:

I would use {dplyr} and {naniar}

dplyr::mutate_if(iris2, ~ mean(is.na(.x)) > .1, naniar::impute_mean)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 1       5.100000    3.500000          1.4         0.2     setosa
#> 2       4.900000    3.072308          1.4         0.2     setosa
#> 3       4.700000    3.200000          1.3         0.2     setosa
#> 4       5.813077    3.100000          1.5         0.2     setosa
#> 5       5.000000    3.600000          1.4         0.2     setosa
#> 6       5.400000    3.900000          1.7         0.4     setosa
#> 7       4.600000    3.400000          1.4         0.3     setosa
#> 8       5.000000    3.400000          1.5         0.2     setosa
#> 9       4.400000    2.900000          1.4         0.2     setosa
#> 10      4.900000    3.100000          1.5         0.1     setosa
#> 11      5.400000    3.700000          1.5         0.2     setosa
#> 12      4.800000    3.400000          1.6         0.2     setosa
#> 13      4.800000    3.000000          1.4         0.1     setosa
#> 14      5.813077    3.000000          1.1         0.1     setosa
#> 15      5.800000    4.000000          1.2         0.2     setosa
#> 16      5.700000    4.400000          1.5         0.4     setosa
#> 17      5.400000    3.900000          1.3         0.4     setosa
#> 18      5.100000    3.500000          1.4         0.3     setosa
#> 19      5.700000    3.800000          1.7         0.3     setosa
#> 20      5.100000    3.800000          1.5         0.3     setosa
#> 21      5.400000    3.400000          1.7         0.2     setosa
#> 22      5.100000    3.700000          1.5         0.4     setosa
#> 23      4.600000    3.600000          1.0         0.2     setosa
#> 24      5.100000    3.300000          1.7         0.5     setosa
#> 25      4.800000    3.400000          1.9         0.2     setosa
#> 26      5.000000    3.000000          1.6         0.2     setosa
#> 27      5.000000    3.400000          1.6         0.4     setosa
#> 28      5.813077    3.500000          1.5         0.2     setosa
#> 29      5.200000    3.400000          1.4         0.2     setosa
#> 30      4.700000    3.200000          1.6         0.2     setosa
#> 31      4.800000    3.100000          1.6         0.2     setosa
#> 32      5.400000    3.400000          1.5         0.4     setosa
#> 33      5.200000    4.100000          1.5         0.1     setosa
#> 34      5.500000    4.200000          1.4         0.2     setosa
#> 35      4.900000    3.100000          1.5         0.2     setosa
#> 36      5.000000    3.200000          1.2         0.2     setosa
#> 37      5.500000    3.500000          1.3         0.2     setosa
#> 38      4.900000    3.600000          1.4         0.1     setosa
#> 39      4.400000    3.000000          1.3         0.2     setosa
#> 40      5.813077    3.400000          1.5         0.2     setosa
#> 41      5.000000    3.072308          1.3         0.3     setosa
#> 42      4.500000    3.072308          1.3         0.3     setosa
#> 43      4.400000    3.072308          1.3         0.2     setosa
#> 44      5.000000    3.500000          1.6         0.6     setosa
#> 45      5.100000    3.800000          1.9         0.4     setosa
#> 46      4.800000    3.000000          1.4         0.3     setosa
#> 47      5.100000    3.800000          1.6         0.2     setosa
#> 48      4.600000    3.072308          1.4         0.2     setosa
#> 49      5.300000    3.072308          1.5         0.2     setosa
#> 50      5.000000    3.300000          1.4         0.2     setosa
#> 51      7.000000    3.072308          4.7         1.4 versicolor
#> 52      6.400000    3.200000          4.5         1.5 versicolor
#> 53      6.900000    3.100000          4.9         1.5 versicolor
#> 54      5.500000    2.300000          4.0         1.3 versicolor
#> 55      6.500000    2.800000          4.6         1.5 versicolor
#> 56      5.700000    2.800000          4.5         1.3 versicolor
#> 57      6.300000    3.300000          4.7         1.6 versicolor
#> 58      4.900000    2.400000          3.3         1.0 versicolor
#> 59      6.600000    2.900000          4.6         1.3 versicolor
#> 60      5.200000    2.700000          3.9         1.4 versicolor
#> 61      5.000000    2.000000          3.5         1.0 versicolor
#> 62      5.813077    3.000000          4.2         1.5 versicolor
#> 63      6.000000    2.200000          4.0         1.0 versicolor
#> 64      6.100000    2.900000          4.7         1.4 versicolor
#> 65      5.600000    2.900000          3.6         1.3 versicolor
#> 66      6.700000    3.072308          4.4         1.4 versicolor
#> 67      5.600000    3.000000          4.5         1.5 versicolor
#> 68      5.800000    2.700000          4.1         1.0 versicolor
#> 69      6.200000    2.200000          4.5         1.5 versicolor
#> 70      5.813077    2.500000          3.9         1.1 versicolor
#> 71      5.900000    3.200000          4.8         1.8 versicolor
#> 72      6.100000    3.072308          4.0         1.3 versicolor
#> 73      6.300000    2.500000          4.9         1.5 versicolor
#> 74      6.100000    2.800000          4.7         1.2 versicolor
#> 75      6.400000    2.900000          4.3         1.3 versicolor
#> 76      6.600000    3.000000          4.4         1.4 versicolor
#> 77      6.800000    2.800000          4.8         1.4 versicolor
#> 78      6.700000    3.000000          5.0         1.7 versicolor
#> 79      5.813077    3.072308          4.5         1.5 versicolor
#> 80      5.813077    2.600000          3.5         1.0 versicolor
#> 81      5.500000    2.400000          3.8         1.1 versicolor
#> 82      5.500000    2.400000          3.7         1.0 versicolor
#> 83      5.800000    2.700000          3.9         1.2 versicolor
#> 84      6.000000    2.700000          5.1         1.6 versicolor
#> 85      5.400000    3.000000          4.5         1.5 versicolor
#> 86      6.000000    3.400000          4.5         1.6 versicolor
#> 87      6.700000    3.072308          4.7         1.5 versicolor
#> 88      6.300000    2.300000          4.4         1.3 versicolor
#> 89      5.600000    3.000000          4.1         1.3 versicolor
#> 90      5.813077    2.500000          4.0         1.3 versicolor
#> 91      5.500000    2.600000          4.4         1.2 versicolor
#> 92      6.100000    3.000000          4.6         1.4 versicolor
#> 93      5.800000    3.072308          4.0         1.2 versicolor
#> 94      5.000000    2.300000          3.3         1.0 versicolor
#> 95      5.600000    2.700000          4.2         1.3 versicolor
#> 96      5.700000    3.000000          4.2         1.2 versicolor
#> 97      5.700000    2.900000          4.2         1.3 versicolor
#> 98      5.813077    2.900000          4.3         1.3 versicolor
#> 99      5.100000    2.500000          3.0         1.1 versicolor
#> 100     5.700000    2.800000          4.1         1.3 versicolor
#> 101     5.813077    3.300000          6.0         2.5  virginica
#> 102     5.800000    3.072308          5.1         1.9  virginica
#> 103     5.813077    3.000000          5.9         2.1  virginica
#> 104     6.300000    2.900000          5.6         1.8  virginica
#> 105     6.500000    3.000000          5.8         2.2  virginica
#> 106     7.600000    3.000000          6.6         2.1  virginica
#> 107     4.900000    2.500000          4.5         1.7  virginica
#> 108     7.300000    3.072308          6.3         1.8  virginica
#> 109     6.700000    2.500000          5.8         1.8  virginica
#> 110     7.200000    3.600000          6.1         2.5  virginica
#> 111     5.813077    3.200000          5.1         2.0  virginica
#> 112     6.400000    2.700000          5.3         1.9  virginica
#> 113     6.800000    3.000000          5.5         2.1  virginica
#> 114     5.700000    2.500000          5.0         2.0  virginica
#> 115     5.800000    3.072308          5.1         2.4  virginica
#> 116     5.813077    3.200000          5.3         2.3  virginica
#> 117     6.500000    3.072308          5.5         1.8  virginica
#> 118     7.700000    3.800000          6.7         2.2  virginica
#> 119     7.700000    2.600000          6.9         2.3  virginica
#> 120     6.000000    2.200000          5.0         1.5  virginica
#> 121     6.900000    3.200000          5.7         2.3  virginica
#> 122     5.600000    3.072308          4.9         2.0  virginica
#> 123     7.700000    3.072308          6.7         2.0  virginica
#> 124     6.300000    2.700000          4.9         1.8  virginica
#> 125     6.700000    3.300000          5.7         2.1  virginica
#> 126     5.813077    3.200000          6.0         1.8  virginica
#> 127     6.200000    2.800000          4.8         1.8  virginica
#> 128     6.100000    3.000000          4.9         1.8  virginica
#> 129     6.400000    2.800000          5.6         2.1  virginica
#> 130     7.200000    3.000000          5.8         1.6  virginica
#> 131     7.400000    3.072308          6.1         1.9  virginica
#> 132     5.813077    3.800000          6.4         2.0  virginica
#> 133     5.813077    3.072308          5.6         2.2  virginica
#> 134     6.300000    2.800000          5.1         1.5  virginica
#> 135     6.100000    2.600000          5.6         1.4  virginica
#> 136     7.700000    3.000000          6.1         2.3  virginica
#> 137     5.813077    3.400000          5.6         2.4  virginica
#> 138     6.400000    3.100000          5.5         1.8  virginica
#> 139     6.000000    3.000000          4.8         1.8  virginica
#> 140     6.900000    3.100000          5.4         2.1  virginica
#> 141     6.700000    3.100000          5.6         2.4  virginica
#> 142     6.900000    3.100000          5.1         2.3  virginica
#> 143     5.813077    2.700000          5.1         1.9  virginica
#> 144     5.813077    3.200000          5.9         2.3  virginica
#> 145     6.700000    3.300000          5.7         2.5  virginica
#> 146     6.700000    3.000000          5.2         2.3  virginica
#> 147     6.300000    2.500000          5.0         1.9  virginica
#> 148     6.500000    3.000000          5.2         2.0  virginica
#> 149     6.200000    3.400000          5.4         2.3  virginica
#> 150     5.900000    3.000000          5.1         1.8  virginica

CodePudding user response:

library(tidyverse)

tribble(
  ~a, ~b, ~c,
  1, 1, 1,
  3, 4, 2,
  1, 1, 1,
  3, 4, 2,
  1, 1, 1,
  3, 4, 2,
  1, 1, 1,
  3, NA, 2,
  NA, NA, 1,
  NA, NA, NA,
  1, 4, 2
) |>
  mutate(across(
    everything(),
    ~ if_else(is.na(.) & mean(is.na(.)) < 0.1,
      mean(., na.rm = TRUE), .
    )
  ))
#> # A tibble: 11 × 3
#>        a     b     c
#>    <dbl> <dbl> <dbl>
#>  1     1     1   1  
#>  2     3     4   2  
#>  3     1     1   1  
#>  4     3     4   2  
#>  5     1     1   1  
#>  6     3     4   2  
#>  7     1     1   1  
#>  8     3    NA   2  
#>  9    NA    NA   1  
#> 10    NA    NA   1.5
#> 11     1     4   2

Created on 2022-08-09 by the reprex package (v2.0.1)

CodePudding user response:

Here’s a base R solution, using made-up data:

# compute % missing by row
row_pct_na <- rowSums(is.na(fakedata)) / ncol(fakedata)

# replace missings conditionally
for (i in seq_along(fakedata)) {
  fakedata[[i]][is.na(fakedata[[i]]) & row_pct_na < .1] <- mean(fakedata[[i]], na.rm = TRUE)
}

Check results: rows that had <10% missing should now have 0% missing, but rows that had >=10% missing should still have same percent missing.

new_row_pct_na <- rowSums(is.na(fakedata)) / ncol(fakedata)

# before imputation
row_pct_na

# [1] 0.07692308 0.03846154 0.03846154 0.11538462 0.03846154 0.07692308
# [7] 0.00000000 0.00000000 0.00000000 0.11538462

# after imputation
new_row_pct_na
# [1] 0.0000000 0.0000000 0.0000000 0.1153846 0.0000000 0.0000000 0.0000000
# [8] 0.0000000 0.0000000 0.1153846

Data prep:

set.seed(1)

fakedata <- list()
for (letter in letters) {
  fakedata[[letter]] <- rnorm(10)
  fakedata[[letter]][runif(10) > .95] <- NA
}

fakedata <- as.data.frame(fakedata)
  • Related