Home > Blockchain >  replace NAs in data frame with 'average if' of row
replace NAs in data frame with 'average if' of row

Time:01-05

I have some data where each unique ID is a member of a group. There are some IDs with missing data, for these I'd like to take the average of the other members of the same group for that row.

For example, with the below data I'd like to replace the "NA" for id 3 in row V_2 with the average of the other Group A members for that row (average of 21 & 22). Similarly for id 7 in row V_3 it would be the average of 34 & 64.

Group=rep(c('A', 'B', 'C'), each=3)
id=1:9
V_1 = t(c(10,20,30,40,10,10,20,35,65))
V_2 = t(c(21,22,"NA",42,12,12,22,32,63))
V_3 = t(c(24,24,34,44,14,14,"NA",34,64))

df <- as.data.frame(rbind(Group, id, V_1, V_2, V_3))
df

Group  A  A  A  B  B  B  C  C  C
id     1  2  3  4  5  6  7  8  9
X     10 20 30 40 10 10 20 35 65
X.1   21 22 NA 42 12 12 22 32 63
X.2   24 24 34 44 14 14 NA 34 64

CodePudding user response:

Structuring the data the tidy way might make it easier. Package {Hmisc} offers a convenience impute helper (since this is such a frequent task). That way you could proceed as follows:

  • tidy the data
## example dataframe df:
set.seed(4711)
df <- data.frame(Group = gl(3, 3, labels = LETTERS[1:3]),
                 id = 1:9,
                 V_1 = sample(c(NA, 1:8)),
                 V_2 = sample(c(NA, 1:8)),
                 V_3 = sample(c(NA, 1:8))
                 )
## > df |> head()
##   Group id V_1 V_2 V_3
## 1     A  1   1   7   6
## 2     A  2   4   8   2
## 3     A  3   3   2   3
## 4     B  4   6   4   1
## 5     B  5   5   3   8
## 6     B  6  NA  NA   4
  • use {Hmisc} and {dplyr} together with the pipeline notation:
library(dplyr)
library(Hmisc)

df_imputed <- 
  df |>  mutate(across(V_1:V_3, impute, mean))
> df_imputed |> head()
  Group id V_1 V_2 V_3
1     A  1 1.0 7.0   6
2     A  2 4.0 8.0   2
3     A  3 3.0 2.0   3
4     B  4 6.0 4.0   1
5     B  5 5.0 3.0   8
6     B  6 4.5 4.5   4

Should you now prefer to replace missing values with groupwise medians instead of total means, the tidy arrangement (together with {dplyr}) requires only one additional group_by clause:

  df |> 
  group_by(Group) |>
  mutate(across(V_1:V_3, impute, median))

CodePudding user response:

An approach using dplyr. The warnings occur because data frame columns are all character in your example (because the character class Group is in row 1). So ideally the whole data frame should be transposed...

library(dplyr)
library(tidyr)

tibble(data.frame(t(df))) %>% 
  group_by(Group) %>% 
  mutate(across(X:X.2, ~ as.numeric(.x))) %>% 
  mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm=T)))) %>% 
  t() %>% 
  as.data.frame()
        V1   V2   V3   V4   V5   V6   V7   V8   V9
Group    A    A    A    B    B    B    C    C    C
id       1    2    3    4    5    6    7    8    9
X       10   20   30   40   10   10   20   35   65
X.1   21.0 22.0 21.5 42.0 12.0 12.0 22.0 32.0 63.0
X.2     24   24   34   44   14   14   49   34   64
Warning messages:
1: Problem while computing `..1 = across(X:X.2, ~as.numeric(.x))`.
ℹ NAs introduced by coercion
ℹ The warning occurred in group 1: Group = "A". 
2: Problem while computing `..1 = across(X:X.2, ~as.numeric(.x))`.
ℹ NAs introduced by coercion
ℹ The warning occurred in group 3: Group = "C".
Same example using transposed data
df_t %>% 
  group_by(Group) %>% 
  mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm=T)))) %>%
  ungroup()
# A tibble: 9 × 5
  Group    id     X   X.1   X.2
  <chr> <dbl> <dbl> <dbl> <dbl>
1 A         1    10  21      24
2 A         2    20  22      24
3 A         3    30  21.5    34
4 B         4    40  42      44
5 B         5    10  12      14
6 B         6    10  12      14
7 C         7    20  22      49
8 C         8    35  32      34
9 C         9    65  63      64

with transpose back to wider format

df_t %>% 
  group_by(Group) %>% 
  mutate(across(X:X.2, ~ replace_na(.x, mean(.x, na.rm=T)))) %>%
  t() %>%
  as.data.frame()
        V1   V2   V3   V4   V5   V6   V7   V8   V9
Group    A    A    A    B    B    B    C    C    C
id       1    2    3    4    5    6    7    8    9
X       10   20   30   40   10   10   20   35   65
X.1   21.0 22.0 21.5 42.0 12.0 12.0 22.0 32.0 63.0
X.2     24   24   34   44   14   14   49   34   64

transposed data

df_t <- structure(list(Group = c("A", "A", "A", "B", "B", "B", "C", "C", 
"C"), id = c(1, 2, 3, 4, 5, 6, 7, 8, 9), X = c(10, 20, 30, 40, 
10, 10, 20, 35, 65), X.1 = c(21, 22, NA, 42, 12, 12, 22, 32, 
63), X.2 = c(24, 24, 34, 44, 14, 14, NA, 34, 64)), class = "data.frame", row.names = c("V1", 
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9"))
  • Related