I have a time series data frame consisting of 980 observations and 980 variables. There are a lot of NAs in the data set. I need to find the perfect interval with the most data (variables) in a given amount of observations. Let's say I want to limit my search to 50 observations (out of 980), how do I find the one observation-intervall that contains the most information (variables)? Preferably, I need to find an interval that has zero missing values.
Please find below a snippet of my data frame so that you can better understand what I mean:
I tried na.omit()
but then I end up with zero observations because apparently every observation contains at least one missing value (variable).
CodePudding user response:
You can use this code to select rows which have a percentage of non NA-values:
## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)
Output sample data:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA 11 NA 31 41 NA 61 71 81 NA
2 NA 12 NA NA NA 52 62 NA NA NA
3 3 13 23 NA NA 53 63 NA NA NA
4 4 NA NA NA NA NA NA NA NA 94
5 5 15 NA NA NA 55 65 NA NA 95
6 NA 16 26 36 46 56 66 76 NA NA
7 NA 17 27 NA 47 57 67 77 NA NA
8 8 NA NA NA 48 58 NA 78 88 NA
9 9 19 29 NA 49 NA NA NA 89 99
10 10 NA 30 NA 50 60 NA 80 NA NA
Use this to select rows with an amount of available data:
## Remove rows with more than 50% NA
dat[which(rowMeans(is.na(dat)) < 0.5), ]
Output:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA 11 NA 31 41 NA 61 71 81 NA
6 NA 16 26 36 46 56 66 76 NA NA
7 NA 17 27 NA 47 57 67 77 NA NA
9 9 19 29 NA 49 NA NA NA 89 99
You can change the 0.5 in the code above to for example 50/980.
CodePudding user response:
Using the same reproducible example data from @Quinten
set.seed(0)
df = matrix(1:100, 10, 10)
df[sample(1:100, 50)] = NA
df = data.frame(df)
We can find how many variables have data in each row using
rs = rowSums(!is.na(df))
The row with the most data is then given by
df[which.max(rs),]
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 6 NA 16 26 36 46 56 66 76 NA NA
And the N rows with the most data are given by
N=4
df[sort(order(rs, decreasing = T)[1:N]),]
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 NA 11 NA 31 41 NA 61 71 81 NA
# 6 NA 16 26 36 46 56 66 76 NA NA
# 7 NA 17 27 NA 47 57 67 77 NA NA
# 9 9 19 29 NA 49 NA NA NA 89 99