Home > database >  Simultaneously remove the first and last rows of a data frame until reaching a row that does not hav
Simultaneously remove the first and last rows of a data frame until reaching a row that does not hav

Time:04-08

I have a dataframe that contains NA values, and I want to remove some rows that have an NA (i.e., not complete cases). However, I only want to remove rows at the beginning and ending of the dataframe. So, I want to keep any rows that have an NA that are not in the first or last rows of the dataframe. What is the most efficient way to simultaneously remove these rows with NAs without using a row index? This is related to my previous question, but I also want to remove the first rows at the same time. There are other posts that also focus on removing only the first rows, but not both.

Data

df <- structure(list(var1 = 1:15, 
                     var2 = c(3, NA, 3, NA, 2, NA, 3, 4, 2, NA, 4, 2, 45, 2, 1), 
                     var3 = c(6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, NA, NA, NA, NA), 
                     var4 = c(NA, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, NA)), 
                class = "data.frame", row.names = c(NA, -15L))

Expected Output

So, in this example, I removed rows 1 to 2, and 12 to 15 since they have an NA and row 3 and 11 does not have an NA.

  var1 var2 var3 var4
1    3    3    8    8
2    4   NA    9    9
3    5    2   10   10
4    6   NA   11   11
5    7    3   12   12
6    8    4   13   13
7    9    2   14   14
8   10   NA   15   15
9   11    4   16   16

I know that I could have 2 statements in filter to remove the top and bottom rows (shown below). But I'm wondering if there is a more efficient way to do this with really large datasets (open to any method tidyverse, base R, data.table, etc.).

library(dplyr)

df %>% 
  filter(cumsum(complete.cases(.)) != 0 & 
           rev(cumsum(rev(complete.cases(.)))) != 0)

CodePudding user response:

I would do

na_count <- rowSums(is.na(df))
df <- df %>%
  slice(min(which(na_count==0)):max(which(na_count==0)))

Output

> df
  var1 var2 var3 var4
1    3    3    8    8
2    4   NA    9    9
3    5    2   10   10
4    6   NA   11   11
5    7    3   12   12
6    8    4   13   13
7    9    2   14   14
8   10   NA   15   15
9   11    4   16   16

CodePudding user response:

base R

r <- rle(complete.cases(df))
str(r, vec.len = 9)
# List of 2
#  $ lengths: int [1:9] 2 1 1 1 1 3 1 1 4
#  $ values : logi [1:9] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
#  - attr(*, "class")= chr "rle"
r$values[ -c(1, length(r$values)) ] <- TRUE
str(r, vec.len = 9)
# List of 2
#  $ lengths: int [1:9] 2 1 1 1 1 3 1 1 4
#  $ values : logi [1:9] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
#  - attr(*, "class")= chr "rle"
df[inverse.rle(r),]
#    var1 var2 var3 var4
# 3     3    3    8    8
# 4     4   NA    9    9
# 5     5    2   10   10
# 6     6   NA   11   11
# 7     7    3   12   12
# 8     8    4   13   13
# 9     9    2   14   14
# 10   10   NA   15   15
# 11   11    4   16   16

dplyr

For your question of efficiency, you can adapt the rle solution to dplyr as well (that should be trivial), but I see no reason why the use of complete.cases and cumany/rev would be a problem. You can improve on your attempt by not calculating complete.cases(.) twice as you're doing, storing it in an interim column.

library(dplyr)
df %>%
  mutate(aux = complete.cases(cur_data())) %>%
  filter(cumany(aux) & rev(cumany(rev(aux))))
#   var1 var2 var3 var4   aux
# 1    3    3    8    8  TRUE
# 2    4   NA    9    9 FALSE
# 3    5    2   10   10  TRUE
# 4    6   NA   11   11 FALSE
# 5    7    3   12   12  TRUE
# 6    8    4   13   13  TRUE
# 7    9    2   14   14  TRUE
# 8   10   NA   15   15 FALSE
# 9   11    4   16   16  TRUE

data.table

(Just an adaptation of the dplyr version.)

library(data.table)
setDT(df)
df[, aux := complete.cases(.SD)
  ][ cumsum(aux) > 0 & rev(cumsum(rev(aux)) > 0), ]
#     var1  var2  var3  var4    aux
#    <int> <num> <int> <int> <lgcl>
# 1:     3     3     8     8   TRUE
# 2:     4    NA     9     9  FALSE
# 3:     5     2    10    10   TRUE
# 4:     6    NA    11    11  FALSE
# 5:     7     3    12    12   TRUE
# 6:     8     4    13    13   TRUE
# 7:     9     2    14    14   TRUE
# 8:    10    NA    15    15  FALSE
# 9:    11     4    16    16   TRUE

CodePudding user response:

continuing a rle love fest:

(which(rle(rowSums(df_NA))$values != 'NA')[1]):dplyr::last(which(rle(rowSums(df_NA))$values != 'NA'))
[1]  3  4  5  6  7  8  9 10 11

or, dispensing with dplyr

(which(rle(rowSums(df_NA))$values != 'NA')[1]):(which(rle(rowSums(df_NA))$values != 'NA'))[[(length(which(rle(rowSums(df_NA))$values != 'NA')))]]
[1]  3  4  5  6  7  8  9 10 11

CodePudding user response:

Another possible solution (thanks, @r2evans, for suggesting complete.cases):

library(dplyr)

df %>% 
  mutate(aux = !complete.cases(.)) %>% 
  filter(!cumall(aux)) %>% 
  arrange(desc(var1)) %>% 
  filter(!cumall(aux)) %>%
  arrange(var1) %>% 
  select(-aux)

#>   var1 var2 var3 var4
#> 1    3    3    8    8
#> 2    4   NA    9    9
#> 3    5    2   10   10
#> 4    6   NA   11   11
#> 5    7    3   12   12
#> 6    8    4   13   13
#> 7    9    2   14   14
#> 8   10   NA   15   15
#> 9   11    4   16   16
  • Related