Create column to sum consecutive TRUE or FALSE values, then remove all sequences with consecutive NA

Time:10-07

I have a data frame (df) with id, date, time, and location (lat and lon) columns. My goal is to create a column that holds the length of each run of consecutive NAs, so that I can remove runs of consecutive NAs that are longer than a given threshold.

Here's an example of my data:

table <- "id   date    time   lat   lon
 1 A     2011-10-03 05:00:00  35.0 -53.4
 2 A     2011-10-03 06:00:00  35.1 -53.4
 3 A     2011-10-03 07:00:00  NA    NA  
 4 A     2011-10-03 08:00:00  NA    NA  
 5 A     2011-10-03 09:00:00  35.1 -53.4
 6 A     2011-10-03 10:00:00  36.2 -53.6
 7 A     2011-10-03 23:00:00  36.6 -53.6
 8 B     2012-11-08 05:00:00  35.8 -53.4
 9 B     2012-11-08 06:00:00  NA    NA  
10 B     2012-11-08 07:00:00  36.0 -53.4
11 B     2012-11-08 08:00:00  NA    NA  
12 B     2012-11-08 09:00:00  NA    NA  
13 B     2012-11-08 10:00:00  36.5 -53.4
14 B     2012-11-08 23:00:00  36.6 -53.4
15 B     2012-11-09 00:00:00  NA    NA  
16 B     2012-11-09 01:00:00  NA    NA  
17 B     2012-11-09 02:00:00  NA    NA  
18 B     2012-11-09 03:00:00  NA    NA  
19 B     2012-11-09 04:00:00  NA    NA  
20 B     2012-11-09 05:00:00  36.6 -53.5"

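# Load the packages used below (%>%, mutate, unite)
library(dplyr)
library(tidyr)

# Create a data frame from the table above
df <- read.table(text = table, header = TRUE)
df

# Combine date and time into a single datetime column
df <- df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime))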

I created a new TRUE/FALSE column for NA values:

df$gap <- is.na(df$lat)
head(df)

# A tibble: 6 x 5
  id    datetime              lat   lon gap   
  <chr> <dttm>              <dbl> <dbl> <lgl>
1 A     2011-10-03 05:00:00  35   -53.4 FALSE
2 A     2011-10-03 06:00:00  35.1 -53.4 FALSE
3 A     2011-10-03 07:00:00  NA    NA   TRUE
4 A     2011-10-03 08:00:00  NA    NA   TRUE
5 A     2011-10-03 09:00:00  35.1 -53.4 FALSE
6 A     2011-10-03 10:00:00  36.2 -53.6 FALSE

I then tried various solutions to sum consecutive TRUEs or FALSEs, but the best I could come up with is this:

df <- df %>%
  group_by(id, grp = with(rle(gap), rep(seq_along(lengths), lengths))) %>%
  mutate(length = seq_along(grp)) %>%
  ungroup() %>%
  select(-grp)
head(df)

# A tibble: 6 x 6
   id    datetime              lat   lon gap   length
   <chr> <dttm>              <dbl> <dbl> <lgl>  <int>
 1 A     2011-10-03 05:00:00  35   -53.4 FALSE      1
 2 A     2011-10-03 06:00:00  35.1 -53.4 FALSE      2
 3 A     2011-10-03 07:00:00  NA    NA   TRUE       1
 4 A     2011-10-03 08:00:00  NA    NA   TRUE       2
 5 A     2011-10-03 09:00:00  35.1 -53.4 FALSE      1
 6 A     2011-10-03 10:00:00  36.2 -53.6 FALSE      2

The issue is that this numbers the rows within each run (1, 2, 3, 4, 5, etc.), whereas I want every row in a run of points or NAs to contain the total number of consecutive TRUEs or FALSEs (i.e. 5, 5, 5, 5, 5).

The desired output would be:

table <- "id      datetime              lat   lon   gap  length
1  A     2011-10-03 05:00:00  35   -53.4 FALSE 2
2  A     2011-10-03 06:00:00  35.1 -53.4 FALSE 2
3  A     2011-10-03 07:00:00  NA    NA   TRUE  2
4  A     2011-10-03 08:00:00  NA    NA   TRUE  2
5  A     2011-10-03 09:00:00  35.1 -53.4 FALSE 3
6  A     2011-10-03 10:00:00  36.2 -53.6 FALSE 3
7  A     2011-10-03 23:00:00  36.6 -53.6 FALSE 3
8  B     2012-11-08 05:00:00  35.8 -53.4 FALSE 1
9  B     2012-11-08 06:00:00  NA    NA   TRUE  1
10 B     2012-11-08 07:00:00  36   -53.4 FALSE 1
11 B     2012-11-08 08:00:00  NA    NA   TRUE  2
12 B     2012-11-08 09:00:00  NA    NA   TRUE  2
13 B     2012-11-08 10:00:00  36.5 -53.4 FALSE 2
14 B     2012-11-08 23:00:00  36.6 -53.4 FALSE 2
15 B     2012-11-09 00:00:00  NA    NA   TRUE  5
16 B     2012-11-09 01:00:00  NA    NA   TRUE  5
17 B     2012-11-09 02:00:00  NA    NA   TRUE  5
18 B     2012-11-09 03:00:00  NA    NA   TRUE  5
19 B     2012-11-09 04:00:00  NA    NA   TRUE  5
20 B     2012-11-09 05:00:00  36.6 -53.5 FALSE 1"

From here, I need to delete any ID from the dataset that has a run of 5 or more consecutive NAs. The catch is that I do not want to remove an ID that has a length of 5 or more for non-NA values (i.e. IDs with 5 or more consecutive lat/lon positions in a row need to remain).

In this example, the desired output would contain only individual A, because B has a run of 5 consecutive NAs:

table <- "id      datetime              lat   lon   gap  length
1  A     2011-10-03 05:00:00  35   -53.4 FALSE 2
2  A     2011-10-03 06:00:00  35.1 -53.4 FALSE 2
3  A     2011-10-03 07:00:00  NA    NA   TRUE  2
4  A     2011-10-03 08:00:00  NA    NA   TRUE  2
5  A     2011-10-03 09:00:00  35.1 -53.4 FALSE 3
6  A     2011-10-03 10:00:00  36.2 -53.6 FALSE 3
7  A     2011-10-03 23:00:00  36.6 -53.6 FALSE 3"

But I need to make sure the code that removes gaps of length 5 or greater does not remove IDs with lat/lon positions of length 5 or greater. I do not know where to start with this portion of my problem.

Any help would be appreciated

CodePudding user response:

tidyverse

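library(dplyr)

df %>% 
  group_by(id) %>% 
  # label each run of consecutive NA / non-NA values within an id
  mutate(grp = data.table::rleid(is.na(lat))) %>% 
  group_by(grp, .add = TRUE) %>% 
  # number of NAs in the run (0 for runs of valid positions)
  mutate(res = sum(is.na(lat))) %>% 
  group_by(id) %>% 
  # drop every id that has a run of 5 or more NAs
  filter(!any(res >= 5)) %>% 
  select(-c(grp, res)) %>% 
  ungroup()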

# A tibble: 7 x 4
  id    datetime              lat   lon
  <chr> <dttm>              <dbl> <dbl>
1 A     2011-10-03 05:00:00  35   -53.4
2 A     2011-10-03 06:00:00  35.1 -53.4
3 A     2011-10-03 07:00:00  NA    NA  
4 A     2011-10-03 08:00:00  NA    NA  
5 A     2011-10-03 09:00:00  35.1 -53.4
6 A     2011-10-03 10:00:00  36.2 -53.6
7 A     2011-10-03 23:00:00  36.6 -53.6
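
To see what the run labelling and the per-run NA count are doing, here is a minimal base R sketch on a toy vector. The vector and the names `run_id`, `run_len` and `na_len` are made up for illustration and are not part of the question's data; `rle()` is used here in place of `rleid()`, so treat it as a rough equivalent rather than the exact code above.

```r
# Toy latitude vector (illustrative only): 2 valid, 2 NA, 1 valid, 5 NA
lat <- c(35.0, 35.1, NA, NA, 35.1, NA, NA, NA, NA, NA)

gap <- is.na(lat)                                 # TRUE where the position is missing
r   <- rle(gap)                                   # run-length encoding of the TRUE/FALSE runs

run_id  <- rep(seq_along(r$lengths), r$lengths)   # one label per run
run_len <- rep(r$lengths, r$lengths)              # total run length repeated on every row
na_len  <- ifelse(gap, run_len, 0L)               # run length for NA runs, 0 for valid runs

data.frame(lat, gap, run_id, run_len, na_len)

# TRUE if any run of NAs reaches the threshold of 5
any(na_len >= 5)
```

Because `na_len` is 0 for runs of valid positions, a long stretch of good lat/lon fixes can never trip the `>= 5` test.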

data.table

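library(data.table)
library(magrittr)  # provides %>% if dplyr is not already loaded

setDT(df)[, grp := rleid(is.na(lat)), by = list(id)] %>% 
  # replace the run id with the run length
  .[, grp := .N, by = list(grp, id)] %>% 
  # drop ids only when a run of 5 or more consists of NAs, not valid positions
  .[, .SD[!any(grp >= 5 & is.na(lat))], by = id] %>% 
  .[]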

   id            datetime  lat   lon grp
1:  A 2011-10-03 05:00:00 35.0 -53.4   2
2:  A 2011-10-03 06:00:00 35.1 -53.4   2
3:  A 2011-10-03 07:00:00   NA    NA   2
4:  A 2011-10-03 08:00:00   NA    NA   2
5:  A 2011-10-03 09:00:00 35.1 -53.4   3
6:  A 2011-10-03 10:00:00 36.2 -53.6   3
7:  A 2011-10-03 23:00:00 36.6 -53.6   3
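
If you want to check that IDs with long runs of valid positions are kept, a small base R sketch along these lines may help. The helper `max_na_run()`, the `toy` data frame and the IDs C and D are hypothetical and exist only for this check; the threshold of 5 comes from the question.

```r
# Hypothetical helper: length of the longest run of consecutive NAs in x
max_na_run <- function(x) {
  r <- rle(is.na(x))
  na_runs <- r$lengths[r$values]          # keep only the runs made of NAs
  if (length(na_runs) == 0) 0L else max(na_runs)
}

# Toy data (not the question's data):
# C has 6 valid positions in a row, D has 5 NAs in a row
toy <- data.frame(
  id  = rep(c("C", "D"), each = 6),
  lat = c(35.0, 35.1, 35.2, 35.3, 35.4, 35.5,
          36.0, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Longest NA run per id, then keep the ids that stay below the threshold
longest  <- tapply(toy$lat, toy$id, max_na_run)
keep_ids <- names(longest)[longest < 5]
result   <- toy[toy$id %in% keep_ids, ]

unique(result$id)
```

In this toy case only C should remain: its run of six valid positions is ignored by the NA count, while D is dropped for its five consecutive NAs.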