How do I find the interval with the most values in a time series data frame with a lot of NAs?-CodePudding

I have a time series data frame consisting of 980 observations and 980 variables. There are a lot of NAs in the data set. I need to find the perfect interval with the most data (variables) in a given amount of observations. Let's say I want to limit my search to 50 observations (out of 980), how do I find the one observation-intervall that contains the most information (variables)? Preferably, I need to find an interval that has zero missing values.

Please find below a snippet of my data frame so that you can better understand what I mean:

I tried na.omit() but then I end up with zero observations because apparently every observation contains at least one missing value (variable).

CodePudding user response：

You can use this code to select rows which have a percentage of non NA-values:

## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)

Output sample data:

   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1  NA 11 NA 31 41 NA 61 71 81  NA
2  NA 12 NA NA NA 52 62 NA NA  NA
3   3 13 23 NA NA 53 63 NA NA  NA
4   4 NA NA NA NA NA NA NA NA  94
5   5 15 NA NA NA 55 65 NA NA  95
6  NA 16 26 36 46 56 66 76 NA  NA
7  NA 17 27 NA 47 57 67 77 NA  NA
8   8 NA NA NA 48 58 NA 78 88  NA
9   9 19 29 NA 49 NA NA NA 89  99
10 10 NA 30 NA 50 60 NA 80 NA  NA

Use this to select rows with an amount of available data:

## Remove rows with more than 50% NA
dat[which(rowMeans(is.na(dat)) < 0.5), ]

Output:

  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA 11 NA 31 41 NA 61 71 81  NA
6 NA 16 26 36 46 56 66 76 NA  NA
7 NA 17 27 NA 47 57 67 77 NA  NA
9  9 19 29 NA 49 NA NA NA 89  99

You can change the 0.5 in the code above to for example 50/980.

CodePudding user response：

Using the same reproducible example data from @Quinten

set.seed(0)
df = matrix(1:100, 10, 10)
df[sample(1:100, 50)] = NA
df = data.frame(df)

We can find how many variables have data in each row using

rs = rowSums(!is.na(df))

The row with the most data is then given by

df[which.max(rs),]
#   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 6 NA 16 26 36 46 56 66 76 NA  NA

And the N rows with the most data are given by

N=4
df[sort(order(rs, decreasing = T)[1:N]),]
#   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 NA 11 NA 31 41 NA 61 71 81  NA
# 6 NA 16 26 36 46 56 66 76 NA  NA
# 7 NA 17 27 NA 47 57 67 77 NA  NA
# 9  9 19 29 NA 49 NA NA NA 89  99