How to replace missing points in a data set?-CodePudding

I want to write a function in R that receives any data set as input, such that the data set has some missing points (NA). Now I want to use mean function to replace some numbers/values for missing points (NA) in the data set. What I am thinking is a function like this:

x<function(data,type=c("mean", lag=2))

Indeed, it should compute the mean of the two numbers later and two numbers before of the missing point (because I considered lag as 2 in the function). For example, if the missing point is in place 12th then the function should compute the mean of the numbers in places 10th, 11th, 13th, and 14th and substitute the result for the missing point at place 12th. In particular cases, for example, if the missing point is in the last place, and we do not have two numbers later, the function should compute the mean of all the data of the corresponding column and substitute for the missing point. Here I give an example to make it clear. Consider the following data set:

3  7 8 0  8  12 2
5  8 9 2  8  9  1
1  2 4 5  0  6  7
5  6 0 NA 3  9  10
7  2 3 6  11 14 2
4  8 7 4  5  3  NA

In the above data set, the first NA should be replaced with the mean of numbers 2, 5 (two data before), and 6 and 4 (two data after) which is (2 5 6 4)/4 is equal to 17/4. And the last NA should be replaced with the mean of the last column which is (2 1 7 10 2)/5 is equal to 22/5.

My question is how can I add some codes (if, if-else, or other loops) to the above function to make a complete function to satisfy the above explanations. I should highlight that I want to use the family of apply functions.

CodePudding user response：

First we can define a function that smooths a single vector:

library(dplyr)

smooth = function(vec, n=2){
    # Lead and lag the vector twice in both directions
    purrr::map(1:n, function(i){
        cbind(
            lead(vec, i),
            lag(vec, i)
        )
    }) %>%
        # Bind the matrix together
        do.call(cbind, .) %>%
        # Take the mean of each row, ie the smoothed version at each position
        # If there are NAs in the mean, it will itself be NA
        rowMeans() %>%
        # In order, take a) original values b) locally smoothed values
        # c) globally smoothed values (ie the entire mean ignoring NAs)
        coalesce(vec, ., mean(vec, na.rm=TRUE))
}

> smooth(c(0, 2, 5, NA, 6, 4))
[1] 0.00 2.00 5.00 4.25 6.00 4.00
> smooth(c(2, 1, 7, 10, 2, NA))
[1]  2.0  1.0  7.0 10.0  2.0  4.4

Then we can apply it to each column:

> c(3, 7, 8, 0, 8, 12, 2, 5, 8, 9, 2, 8, 9, 1, 1, 2, 4, 5, 0, 6, 7, 5, 6, 0, NA, 3, 9, 10, 7, 2, 3, 6, 11, 14, 2, 4, 8, 7, 4, 5, 3, NA) %>% 
    matrix(byrow=TRUE, ncol=7) %>%
    as_tibble(.name_repair="universal") %>%                        
    mutate(across(everything(), smooth))
# A tibble: 6 × 7
   ...1  ...2  ...3  ...4  ...5  ...6  ...7
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     3     7     8  0        8    12   2  
2     5     8     9  2        8     9   1  
3     1     2     4  5        0     6   7  
4     5     6     0  4.25     3     9  10  
5     7     2     3  6       11    14   2  
6     4     8     7  4        5     3   4.4

CodePudding user response：

Please find below one solution using the data.table library.

Reprex

Your data:

m1 <- "3  7 8 0  8  12 2
       5  8 9 2  8  9  1
       1  2 4 5  0  6  7
       5  6 0 NA 3  9  10
       7  2 3 6  11 14 2
       4  8 7 4  5  3  NA"

myData<- read.table(text=m1,h=F)

Code for the function replaceNA

library(data.table)

replaceNA <- function(data){
  
  setDT(data)
  
  # Create a data.table identifying rows and cols indexes of NA values in the data.table
  NA_DT <- as.data.table(which(is.na(data), arr.ind=TRUE))
  
  # Select row and column indexes of NAs that are not at the last row in the data.table
  NA_not_Last <- NA_DT[row < nrow(data)]
  
  # Select row and column indexes of NA that is at the last row in the data.table
  NA_Last <- NA_DT[row == nrow(data)]
  
  # Create a vector of column names where NA values are not at the last row in the data.table
  Cols_NA_not_Last <- colnames(data)[NA_not_Last[,col]]
  
  
  # Create a vector of column names where NA values are at the last row in the data.table
  Cols_NA_Last <- colnames(data)[NA_Last[,col]]
  
  # Replace NA values that are not at the last row in the data.table by the mean of the values located 
  # in the two previous lines and the two following lines of the line containing the NA value
  data[, (Cols_NA_not_Last) := lapply(.SD, function(x) replace(x, which(is.na(x)), mean(c(x[which(is.na(x))-2], x[which(is.na(x))-1], x[which(is.na(x)) 1], x[which(is.na(x)) 2]), na.rm = TRUE))), .SDcols = Cols_NA_not_Last][]
  
  # Replace NA values that are at the last row in the data.table by the mean of all the values in the column where the NA value is found 
  data[, (Cols_NA_Last) := lapply(.SD, function(x) replace(x, which(is.na(x)), mean(x, na.rm = TRUE))), .SDcols = Cols_NA_Last][]

  return(data)
}

Test of the function with your data

replaceNA(myData)
#>    V1 V2 V3   V4 V5 V6   V7
#> 1:  3  7  8 0.00  8 12  2.0
#> 2:  5  8  9 2.00  8  9  1.0
#> 3:  1  2  4 5.00  0  6  7.0
#> 4:  5  6  0 4.25  3  9 10.0
#> 5:  7  2  3 6.00 11 14  2.0
#> 6:  4  8  7 4.00  5  3  4.4

^{Created on 2021-11-08 by the reprex package (v2.0.1)}