I want to write a function in R
that receives any data set as input, such that the data set has some missing points (NA
). Now I want to use mean
function to replace some numbers/values for missing points (NA
) in the data set. What I am thinking is a function like this:
x<function(data,type=c("mean", lag=2))
Indeed, it should compute the mean of the two numbers later and two numbers before of the missing point (because I considered lag
as 2
in the function). For example, if the missing point is in place 12th then the function should compute the mean of the numbers in places 10th, 11th, 13th, and 14th and substitute the result for the missing point at place 12th. In particular cases, for example, if the missing point is in the last place, and we do not have two numbers later, the function should compute the mean of all the data of the corresponding column and substitute for the missing point. Here I give an example to make it clear. Consider the following data set:
3 7 8 0 8 12 2
5 8 9 2 8 9 1
1 2 4 5 0 6 7
5 6 0 NA 3 9 10
7 2 3 6 11 14 2
4 8 7 4 5 3 NA
In the above data set, the first NA
should be replaced with the mean of numbers 2
, 5
(two data before), and 6
and 4
(two data after) which is (2 5 6 4)/4
is equal to 17/4
. And the last NA
should be replaced with the mean of the last column which is (2 1 7 10 2)/5
is equal to 22/5
.
My question is how can I add some codes (if
, if-else
, or other loops) to the above function to make a complete function to satisfy the above explanations. I should highlight that I want to use the family of apply
functions.
CodePudding user response:
First we can define a function that smooths a single vector:
library(dplyr)
smooth = function(vec, n=2){
# Lead and lag the vector twice in both directions
purrr::map(1:n, function(i){
cbind(
lead(vec, i),
lag(vec, i)
)
}) %>%
# Bind the matrix together
do.call(cbind, .) %>%
# Take the mean of each row, ie the smoothed version at each position
# If there are NAs in the mean, it will itself be NA
rowMeans() %>%
# In order, take a) original values b) locally smoothed values
# c) globally smoothed values (ie the entire mean ignoring NAs)
coalesce(vec, ., mean(vec, na.rm=TRUE))
}
> smooth(c(0, 2, 5, NA, 6, 4))
[1] 0.00 2.00 5.00 4.25 6.00 4.00
> smooth(c(2, 1, 7, 10, 2, NA))
[1] 2.0 1.0 7.0 10.0 2.0 4.4
Then we can apply it to each column:
> c(3, 7, 8, 0, 8, 12, 2, 5, 8, 9, 2, 8, 9, 1, 1, 2, 4, 5, 0, 6, 7, 5, 6, 0, NA, 3, 9, 10, 7, 2, 3, 6, 11, 14, 2, 4, 8, 7, 4, 5, 3, NA) %>%
matrix(byrow=TRUE, ncol=7) %>%
as_tibble(.name_repair="universal") %>%
mutate(across(everything(), smooth))
# A tibble: 6 × 7
...1 ...2 ...3 ...4 ...5 ...6 ...7
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 7 8 0 8 12 2
2 5 8 9 2 8 9 1
3 1 2 4 5 0 6 7
4 5 6 0 4.25 3 9 10
5 7 2 3 6 11 14 2
6 4 8 7 4 5 3 4.4
CodePudding user response:
Please find below one solution using the data.table
library.
Reprex
- Your data:
m1 <- "3 7 8 0 8 12 2
5 8 9 2 8 9 1
1 2 4 5 0 6 7
5 6 0 NA 3 9 10
7 2 3 6 11 14 2
4 8 7 4 5 3 NA"
myData<- read.table(text=m1,h=F)
- Code for the function
replaceNA
library(data.table)
replaceNA <- function(data){
setDT(data)
# Create a data.table identifying rows and cols indexes of NA values in the data.table
NA_DT <- as.data.table(which(is.na(data), arr.ind=TRUE))
# Select row and column indexes of NAs that are not at the last row in the data.table
NA_not_Last <- NA_DT[row < nrow(data)]
# Select row and column indexes of NA that is at the last row in the data.table
NA_Last <- NA_DT[row == nrow(data)]
# Create a vector of column names where NA values are not at the last row in the data.table
Cols_NA_not_Last <- colnames(data)[NA_not_Last[,col]]
# Create a vector of column names where NA values are at the last row in the data.table
Cols_NA_Last <- colnames(data)[NA_Last[,col]]
# Replace NA values that are not at the last row in the data.table by the mean of the values located
# in the two previous lines and the two following lines of the line containing the NA value
data[, (Cols_NA_not_Last) := lapply(.SD, function(x) replace(x, which(is.na(x)), mean(c(x[which(is.na(x))-2], x[which(is.na(x))-1], x[which(is.na(x)) 1], x[which(is.na(x)) 2]), na.rm = TRUE))), .SDcols = Cols_NA_not_Last][]
# Replace NA values that are at the last row in the data.table by the mean of all the values in the column where the NA value is found
data[, (Cols_NA_Last) := lapply(.SD, function(x) replace(x, which(is.na(x)), mean(x, na.rm = TRUE))), .SDcols = Cols_NA_Last][]
return(data)
}
- Test of the function with your data
replaceNA(myData)
#> V1 V2 V3 V4 V5 V6 V7
#> 1: 3 7 8 0.00 8 12 2.0
#> 2: 5 8 9 2.00 8 9 1.0
#> 3: 1 2 4 5.00 0 6 7.0
#> 4: 5 6 0 4.25 3 9 10.0
#> 5: 7 2 3 6.00 11 14 2.0
#> 6: 4 8 7 4.00 5 3 4.4
Created on 2021-11-08 by the reprex package (v2.0.1)