I have a dataframe that contains NA
values, and I want to remove some rows that have an NA
(i.e., not complete cases). However, I only want to remove rows at the beginning and ending of the dataframe. So, I want to keep any rows that have an NA
that are not in the first or last rows of the dataframe. What is the most efficient way to simultaneously remove these rows with NA
s without using a row index? This is related to my previous question, but I also want to remove the first rows at the same time. There are other posts that also focus on removing only the first rows, but not both.
Data
df <- structure(list(var1 = 1:15,
var2 = c(3, NA, 3, NA, 2, NA, 3, 4, 2, NA, 4, 2, 45, 2, 1),
var3 = c(6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, NA, NA, NA, NA),
var4 = c(NA, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, NA)),
class = "data.frame", row.names = c(NA, -15L))
Expected Output
So, in this example, I removed rows 1 to 2, and 12 to 15 since they have an NA
and row 3 and 11 does not have an NA
.
var1 var2 var3 var4
1 3 3 8 8
2 4 NA 9 9
3 5 2 10 10
4 6 NA 11 11
5 7 3 12 12
6 8 4 13 13
7 9 2 14 14
8 10 NA 15 15
9 11 4 16 16
I know that I could have 2 statements in filter to remove the top and bottom rows (shown below). But I'm wondering if there is a more efficient way to do this with really large datasets (open to any method tidyverse
, base R, data.table
, etc.).
library(dplyr)
df %>%
filter(cumsum(complete.cases(.)) != 0 &
rev(cumsum(rev(complete.cases(.)))) != 0)
CodePudding user response:
I would do
na_count <- rowSums(is.na(df))
df <- df %>%
slice(min(which(na_count==0)):max(which(na_count==0)))
Output
> df
var1 var2 var3 var4
1 3 3 8 8
2 4 NA 9 9
3 5 2 10 10
4 6 NA 11 11
5 7 3 12 12
6 8 4 13 13
7 9 2 14 14
8 10 NA 15 15
9 11 4 16 16
CodePudding user response:
base R
r <- rle(complete.cases(df))
str(r, vec.len = 9)
# List of 2
# $ lengths: int [1:9] 2 1 1 1 1 3 1 1 4
# $ values : logi [1:9] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
# - attr(*, "class")= chr "rle"
r$values[ -c(1, length(r$values)) ] <- TRUE
str(r, vec.len = 9)
# List of 2
# $ lengths: int [1:9] 2 1 1 1 1 3 1 1 4
# $ values : logi [1:9] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
# - attr(*, "class")= chr "rle"
df[inverse.rle(r),]
# var1 var2 var3 var4
# 3 3 3 8 8
# 4 4 NA 9 9
# 5 5 2 10 10
# 6 6 NA 11 11
# 7 7 3 12 12
# 8 8 4 13 13
# 9 9 2 14 14
# 10 10 NA 15 15
# 11 11 4 16 16
dplyr
For your question of efficiency, you can adapt the rle
solution to dplyr as well (that should be trivial), but I see no reason why the use of complete.cases
and cumany
/rev
would be a problem. You can improve on your attempt by not calculating complete.cases(.)
twice as you're doing, storing it in an interim column.
library(dplyr)
df %>%
mutate(aux = complete.cases(cur_data())) %>%
filter(cumany(aux) & rev(cumany(rev(aux))))
# var1 var2 var3 var4 aux
# 1 3 3 8 8 TRUE
# 2 4 NA 9 9 FALSE
# 3 5 2 10 10 TRUE
# 4 6 NA 11 11 FALSE
# 5 7 3 12 12 TRUE
# 6 8 4 13 13 TRUE
# 7 9 2 14 14 TRUE
# 8 10 NA 15 15 FALSE
# 9 11 4 16 16 TRUE
data.table
(Just an adaptation of the dplyr version.)
library(data.table)
setDT(df)
df[, aux := complete.cases(.SD)
][ cumsum(aux) > 0 & rev(cumsum(rev(aux)) > 0), ]
# var1 var2 var3 var4 aux
# <int> <num> <int> <int> <lgcl>
# 1: 3 3 8 8 TRUE
# 2: 4 NA 9 9 FALSE
# 3: 5 2 10 10 TRUE
# 4: 6 NA 11 11 FALSE
# 5: 7 3 12 12 TRUE
# 6: 8 4 13 13 TRUE
# 7: 9 2 14 14 TRUE
# 8: 10 NA 15 15 FALSE
# 9: 11 4 16 16 TRUE
CodePudding user response:
continuing a rle
love fest:
(which(rle(rowSums(df_NA))$values != 'NA')[1]):dplyr::last(which(rle(rowSums(df_NA))$values != 'NA'))
[1] 3 4 5 6 7 8 9 10 11
or, dispensing with dplyr
(which(rle(rowSums(df_NA))$values != 'NA')[1]):(which(rle(rowSums(df_NA))$values != 'NA'))[[(length(which(rle(rowSums(df_NA))$values != 'NA')))]]
[1] 3 4 5 6 7 8 9 10 11
CodePudding user response:
Another possible solution (thanks, @r2evans, for suggesting complete.cases
):
library(dplyr)
df %>%
mutate(aux = !complete.cases(.)) %>%
filter(!cumall(aux)) %>%
arrange(desc(var1)) %>%
filter(!cumall(aux)) %>%
arrange(var1) %>%
select(-aux)
#> var1 var2 var3 var4
#> 1 3 3 8 8
#> 2 4 NA 9 9
#> 3 5 2 10 10
#> 4 6 NA 11 11
#> 5 7 3 12 12
#> 6 8 4 13 13
#> 7 9 2 14 14
#> 8 10 NA 15 15
#> 9 11 4 16 16