Replacing NA values with subset mean in R?-CodePudding

 head(df)
   county state_abr population unemp health_ins poverty SNAP no_comp no_internet home_broad broad_num broad_avail broad_cost price_bbn
1 Autauga        AL      55869   2.7        7.1    15.4 12.7    NA        20.9       78.9         0         0.0   67.32586     35.00
2 Baldwin        AL     223234   2.7       10.2    10.6  7.5    NA        21.3       78.1         0         0.0   67.32586     35.00
3 Barbour        AL      24686   3.8       11.2    28.9 27.4    NA        38.9       60.4         4        99.2   74.99000     35.00
4    Bibb        AL      22394   3.1        7.9    14.0 12.4    23.7        33.8       66.1         0         0.0   67.32586     35.00
5  Blount        AL      57826   2.7       11.0    NA   9.5    21.3        NA       68.5         0         0.0   67.32586     35.00
6 Bullock        AL      10101   3.6       10.8    31.4 25.9    27.1        40.1       58.9         1        40.1   57.99000     71.95

 Sublette        WY       9831   4.4       13.4     8.4  2.2     5.4        17.5       81.7         3        19.5      59.65 Sweetwater        WY      42343   3.9       12.0    12.0  5.8     7.7        16.1       82.4         5        95.1      63.30
    Teton        WY      23464   2.7       10.0     7.1  2.1     4.2        13.6       85.9         6        96.0      69.99
    Uinta        WY      20226   3.9       12.2    12.5  7.1     6.1        11.5       88.2         5        73.9      63.30
 Washakie        WY       7805   3.9       15.4    12.4  4.9    12.1        21.5       78.3         5        86.1      64.36
   Weston        WY       6927   2.9       13.3    17.4  4.7    13.8        26.1       73.3         2        52.0      66.67

I have this dataframe in R and I want to replace NA values with numeric values. The easy way is to get the mean of the column and replace it with the NA, but I want to be more precise.

Because my data frame is split up into states, in this case I'm just using a subset of WY and AL, I want to calculate the mean for that state and apply it accordingly to the NA value.

So, for example there's an 'NA' for no_comp on line 1 with state_abr AL. If I took the mean for no_comp, it would also include the mean of WY, which I don't want. I want to just calculate the mean for no_comp with state_abr 'AL' and apply it to the corresponding NA value.

CodePudding user response：

We may group by 'state_abr', loop over the numeric columns in mutate with across and replace the NA with mean value using na.aggregate from zoo. By default na.aggregate uses FUN = mean

library(zoo)
library(dplyr)
df1 <- df1 %>%
    group_by(state_abr) %>%
    mutate(across(where(is.numeric), na.aggregate)) %>%
    ungroup

Or if we want to not use additional package

df1 <- df1 %>%
      group_by(state_abr) %>%
      mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), 
             mean(.x, na.rm = TRUE))))

CodePudding user response：

Or you can use this base R one-liner:

df[which(is.na(df$no_comp) == TRUE),]$no_comp <- ave(df$no_comp,df$state_abr, FUN = function(x) mean(x,na.rm = TRUE))[which(is.na(df$no_comp) == TRUE)]

#Data:
county <-c("Autauga","Baldwin","Barbour","Bibb","Blount","Bullock","Sublette","Teton","Uinta","Washakie","Weston")
state_abr <- c(rep("AL",6),rep("WY",5))
population <- c(55869,223234,24686,22394,57826,10101,9831,23464,20226,7805,6927)
unemp <-  c(2.7,2.7,3.8,3.1,2.7,3.6,4.4,2.7,3.9,3.9,2.9)
no_comp <- c(NA,NA,NA,23.7,21.3,27.1,5.4,4.2,6.1,12.1,13.8)

df <- data.frame(county,state_abr,population,unemp,no_comp)