Create column to sum consecutive TRUE or FALSE values, then remove all sequences with consecutive NA

I have a dataframe (df) with id, date, time, and location (lat and lon) columns. My goal is to create a column that holds the length of each run of consecutive NAs, so that I can remove series whose runs of consecutive NAs are longer than a given threshold.

Here's an example of my data:

table <- "id   date    time   lat   lon
 1 A     2011-10-03 05:00:00  35.0 -53.4
 2 A     2011-10-03 06:00:00  35.1 -53.4
 3 A     2011-10-03 07:00:00  NA    NA  
 4 A     2011-10-03 08:00:00  NA    NA  
 5 A     2011-10-03 09:00:00  35.1 -53.4
 6 A     2011-10-03 10:00:00  36.2 -53.6
 7 A     2011-10-03 23:00:00  36.6 -53.6
 8 B     2012-11-08 05:00:00  35.8 -53.4
 9 B     2012-11-08 06:00:00  NA    NA  
10 B     2012-11-08 07:00:00  36.0 -53.4
11 B     2012-11-08 08:00:00  NA    NA  
12 B     2012-11-08 09:00:00  NA    NA  
13 B     2012-11-08 10:00:00  36.5 -53.4
14 B     2012-11-08 23:00:00  36.6 -53.4
15 B     2012-11-09 00:00:00  NA    NA  
16 B     2012-11-09 01:00:00  NA    NA  
17 B     2012-11-09 02:00:00  NA    NA  
18 B     2012-11-09 03:00:00  NA    NA  
19 B     2012-11-09 04:00:00  NA    NA  
20 B     2012-11-09 05:00:00  36.6 -53.5"

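# Packages used below (%>%, mutate, unite)
library(dplyr)
library(tidyr)

# Create a dataframe from the table above (the unnamed first column becomes the row names)
df <- read.table(text = table, header = TRUE)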
df

df <- df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime))

I created a new TRUE/FALSE column for NA values:

df$gap <- is.na(df$lat)
head(df)

# A tibble: 6 x 5
  id    datetime              lat   lon gap   
  <chr> <dttm>              <dbl> <dbl> <lgl>
1 A     2011-10-03 05:00:00  35   -53.4 FALSE
2 A     2011-10-03 06:00:00  35.1 -53.4 FALSE
3 A     2011-10-03 07:00:00  NA    NA   TRUE
4 A     2011-10-03 08:00:00  NA    NA   TRUE
5 A     2011-10-03 09:00:00  35.1 -53.4 FALSE
6 A     2011-10-03 10:00:00  36.2 -53.6 FALSE

Then I tried various solutions to get the length of each run of consecutive TRUEs or FALSEs, but the best I could come up with is this:

df <- df %>%
  group_by(id, grp = with(rle(gap), rep(seq_along(lengths), lengths))) %>%
  mutate(length = seq_along(grp)) %>%
  ungroup() %>%
  select(-grp)
head(df)

# A tibble: 6 x 6
   id    datetime              lat   lon gap   length
   <chr> <dttm>              <dbl> <dbl> <lgl>  <int>
 1 A     2011-10-03 05:00:00  35   -53.4 FALSE      1
 2 A     2011-10-03 06:00:00  35.1 -53.4 FALSE      2
 3 A     2011-10-03 07:00:00  NA    NA   TRUE       1
 4 A     2011-10-03 08:00:00  NA    NA   TRUE       2
 5 A     2011-10-03 09:00:00  35.1 -53.4 FALSE      1
 6 A     2011-10-03 10:00:00  36.2 -53.6 FALSE      2

The issue is that the above numbers the rows within each run (1, 2, 3, 4, 5, etc.), whereas I want every row in a run of points or NAs to carry the total number of consecutive TRUEs or FALSEs (i.e. 5, 5, 5, 5, 5).

The desired output would be:

table <- "id      datetime              lat   lon   gap  length
1  A     2011-10-03 05:00:00  35   -53.4 FALSE 2
2  A     2011-10-03 06:00:00  35.1 -53.4 FALSE 2
3  A     2011-10-03 07:00:00  NA    NA   TRUE  2
4  A     2011-10-03 08:00:00  NA    NA   TRUE  2
5  A     2011-10-03 09:00:00  35.1 -53.4 FALSE 3
6  A     2011-10-03 10:00:00  36.2 -53.6 FALSE 3
7  A     2011-10-03 23:00:00  36.6 -53.6 FALSE 3
8  B     2012-11-08 05:00:00  35.8 -53.4 FALSE 1
9  B     2012-11-08 06:00:00  NA    NA   TRUE  1
10 B     2012-11-08 07:00:00  36   -53.4 FALSE 1
11 B     2012-11-08 08:00:00  NA    NA   TRUE  2
12 B     2012-11-08 09:00:00  NA    NA   TRUE  2
13 B     2012-11-08 10:00:00  36.5 -53.4 FALSE 2
14 B     2012-11-08 23:00:00  36.6 -53.4 FALSE 2
15 B     2012-11-09 00:00:00  NA    NA   TRUE  5
16 B     2012-11-09 01:00:00  NA    NA   TRUE  5
17 B     2012-11-09 02:00:00  NA    NA   TRUE  5
18 B     2012-11-09 03:00:00  NA    NA   TRUE  5
19 B     2012-11-09 04:00:00  NA    NA   TRUE  5
20 B     2012-11-09 05:00:00  36.6 -53.5 FALSE 1"

From here, I need to delete any ID from the dataset that has a run of 5 or more consecutive NAs. The catch is that I do not want to remove an ID that has a length of 5 or more for non-NA values (i.e. IDs with 5 or more consecutive lat/lon positions in a row need to remain).

In this example, the desired output would be only individual A, because B had a run of 5 consecutive NAs:

table <- "id      datetime              lat   lon   gap  length
1  A     2011-10-03 05:00:00  35   -53.4 FALSE 2
2  A     2011-10-03 06:00:00  35.1 -53.4 FALSE 2
3  A     2011-10-03 07:00:00  NA    NA   TRUE  2
4  A     2011-10-03 08:00:00  NA    NA   TRUE  2
5  A     2011-10-03 09:00:00  35.1 -53.4 FALSE 3
6  A     2011-10-03 10:00:00  36.2 -53.6 FALSE 3
7  A     2011-10-03 23:00:00  36.6 -53.6 FALSE 3"

But I need to make sure the code that removes gaps of length 5 or greater does not remove IDs with lat/lon positions of length 5 or greater. I do not know where to start with this portion of my problem.

Any help would be appreciated

Answer:

tidyverse

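library(dplyr)

df %>% 
  group_by(id) %>% 
  # number the runs of NA / non-NA rows within each id
  mutate(grp = data.table::rleid(is.na(lat))) %>% 
  group_by(grp, .add = TRUE) %>% 
  # NAs per run: the run length for NA runs, 0 for non-NA runs
  mutate(res = sum(is.na(lat))) %>% 
  group_by(id) %>% 
  # drop every id that has a run of 5 or more NAs
  filter(!any(res >= 5)) %>% 
  select(-c(grp, res)) %>% 
  ungroup()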

# A tibble: 7 x 4
  id    datetime              lat   lon
  <chr> <dttm>              <dbl> <dbl>
1 A     2011-10-03 05:00:00  35   -53.4
2 A     2011-10-03 06:00:00  35.1 -53.4
3 A     2011-10-03 07:00:00  NA    NA  
4 A     2011-10-03 08:00:00  NA    NA  
5 A     2011-10-03 09:00:00  35.1 -53.4
6 A     2011-10-03 10:00:00  36.2 -53.6
7 A     2011-10-03 23:00:00  36.6 -53.6
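
If you also want the length column from the question (every row carrying the total length of its run), here is a minimal base R sketch. The small `df_demo` data frame and the `run_length()` helper are made up for illustration and are not part of the original data or of any package.

```r
# Small stand-in for the question's data: an id plus a lat column with gaps.
df_demo <- data.frame(
  id  = c("A", "A", "A", "A", "A", "B", "B", "B"),
  lat = c(35.0, 35.1, NA, NA, 35.1, 35.8, NA, 36.0)
)

# Hypothetical helper: total length of the run (NA or non-NA) each element belongs to.
run_length <- function(x) {
  r <- rle(is.na(x))
  rep(r$lengths, times = r$lengths)
}

df_demo$gap <- is.na(df_demo$lat)

# Apply the helper separately within each id so runs never span two ids.
df_demo$length <- ave(seq_along(df_demo$lat), df_demo$id,
                      FUN = function(i) run_length(df_demo$lat[i]))

print(df_demo)
```

In the dplyr pipeline from the question, replacing `seq_along(grp)` with `n()` inside `mutate()` should give the same kind of result, since `n()` returns the size of the current group.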

data.table

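library(data.table)
library(magrittr)  # provides %>%

# run id within each id, then replace it with the run length
setDT(df)[, grp := rleid(is.na(lat)), by = list(id)] %>% 
  .[, grp := .N, by = list(grp, id)] %>% 
  # drop an id only when a run of 5 or more rows is an NA run
  .[, .SD[!any(grp >= 5 & is.na(lat))], by = id] %>% 
  .[]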

   id            datetime  lat   lon grp
1:  A 2011-10-03 05:00:00 35.0 -53.4   2
2:  A 2011-10-03 06:00:00 35.1 -53.4   2
3:  A 2011-10-03 07:00:00   NA    NA   2
4:  A 2011-10-03 08:00:00   NA    NA   2
5:  A 2011-10-03 09:00:00 35.1 -53.4   3
6:  A 2011-10-03 10:00:00 36.2 -53.6   3
7:  A 2011-10-03 23:00:00 36.6 -53.6   3
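
To check that a rule like this reacts only to NA runs, it can help to test it on data where one ID has a long run of valid positions and another has a long gap. The sketch below does that in base R. The IDs "C" and "D", the data values, and the `max_na_run()` helper are all hypothetical, and the threshold of 5 is taken from the question.

```r
# Hypothetical data: "C" has six consecutive fixes, "D" has five consecutive NAs.
df_check <- data.frame(
  id  = rep(c("C", "D"), each = 7),
  lat = c(35, 35.1, 35.2, 35.3, 35.4, 35.5, NA,
          36, NA, NA, NA, NA, NA, 36.1)
)

# Hypothetical helper: longest run of consecutive NAs in a vector (0 if none).
max_na_run <- function(x) {
  r <- rle(is.na(x))
  max(c(0L, r$lengths[r$values]))
}

# Longest NA gap per id, then keep only ids whose longest gap is below 5.
longest_gap <- tapply(df_check$lat, df_check$id, max_na_run)
keep_ids    <- names(longest_gap)[longest_gap < 5]
df_kept     <- df_check[df_check$id %in% keep_ids, ]

print(longest_gap)
print(df_kept)
```

Here "C" should survive despite its six consecutive positions, while "D" is dropped for its gap of five NAs.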