Remove NAs if gap is greater than certain time interval or certain number of rows-CodePudding

I have a dataframe of locations with NA values for some positions at certain datetimes. I would like to estimate positions for these NA values, but when there are more than 3 NA values in a row (gaps of more than 3 hours), I would like to remove those from the dataset (i.e. I do not want to estimate positions for gaps greater than 3 rows/3 hours of NAs).

Here's an example of my data:

table <- "id   date    time   lat   lon
 1 A     2011-10-03 05:00:00  35.0 -53.4
 2 A     2011-10-03 06:00:00  35.1 -53.4
 3 A     2011-10-03 07:00:00  NA    NA  
 4 A     2011-10-03 08:00:00  NA    NA  
 5 A     2011-10-03 09:00:00  35.1 -53.4
 6 A     2011-10-03 10:00:00  36.2 -53.6
 7 A     2011-10-03 23:00:00  36.6 -53.6
 8 B     2012-11-08 05:00:00  35.8 -53.4
 9 B     2012-11-08 06:00:00  NA    NA  
10 B     2012-11-08 07:00:00  36.0 -53.4
11 B     2012-11-08 08:00:00  NA    NA  
12 B     2012-11-08 09:00:00  NA    NA  
13 B     2012-11-08 10:00:00  36.5 -53.4
14 B     2012-11-08 23:00:00  36.6 -53.4
15 B     2012-11-09 00:00:00  NA    NA  
16 B     2012-11-09 01:00:00  NA    NA  
17 B     2012-11-09 02:00:00  NA    NA  
18 B     2012-11-09 03:00:00  NA    NA  
19 B     2012-11-09 04:00:00  NA    NA  
20 B     2012-11-09 05:00:00  36.6 -53.5"

#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df

df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime))

And here is an example of the desired output: (Notice how rows 15-19 are now removed because this was a gap of 5 NA values/5 hours).

table <- "id        datetime   lat   lon
 1 A     2011-10-03 05:00:00  35.0 -53.4
 2 A     2011-10-03 06:00:00  35.1 -53.4
 3 A     2011-10-03 07:00:00  NA    NA  
 4 A     2011-10-03 08:00:00  NA    NA  
 5 A     2011-10-03 09:00:00  35.1 -53.4
 6 A     2011-10-03 10:00:00  36.2 -53.6
 7 A     2011-10-03 23:00:00  36.6 -53.6
 8 B     2012-11-08 05:00:00  35.8 -53.4
 9 B     2012-11-08 06:00:00  NA    NA  
10 B     2012-11-08 07:00:00  36.0 -53.4
11 B     2012-11-08 08:00:00  NA    NA  
12 B     2012-11-08 09:00:00  NA    NA  
13 B     2012-11-08 10:00:00  36.5 -53.4
14 B     2012-11-08 23:00:00  36.6 -53.4 
15 B     2012-11-09 05:00:00  36.6 -53.5"

Besides individually selecting specific rows (which I cannot do because this dataset is large), I cannot figure out how to tell R to keep NAs only if they are in groups of 3 or less (3 hours or less). Any help would be appreciated!

CodePudding user response：

df %>%
  group_by(grp1 = cumsum(!is.na(lat) & !is.na(lon)), grp2 = !is.na(lat) & !is.na(lon)) %>%
  filter((!is.na(lat) & !is.na(lon)) | n() <= 3) %>%
  ungroup()
# # A tibble: 15 x 6
#    id    datetime              lat   lon  grp1 grp2 
#    <chr> <dttm>              <dbl> <dbl> <int> <lgl>
#  1 A     2011-10-03 05:00:00  35   -53.4     1 TRUE 
#  2 A     2011-10-03 06:00:00  35.1 -53.4     2 TRUE 
#  3 A     2011-10-03 07:00:00  NA    NA       2 FALSE
#  4 A     2011-10-03 08:00:00  NA    NA       2 FALSE
#  5 A     2011-10-03 09:00:00  35.1 -53.4     3 TRUE 
#  6 A     2011-10-03 10:00:00  36.2 -53.6     4 TRUE 
#  7 A     2011-10-03 23:00:00  36.6 -53.6     5 TRUE 
#  8 B     2012-11-08 05:00:00  35.8 -53.4     6 TRUE 
#  9 B     2012-11-08 06:00:00  NA    NA       6 FALSE
# 10 B     2012-11-08 07:00:00  36   -53.4     7 TRUE 
# 11 B     2012-11-08 08:00:00  NA    NA       7 FALSE
# 12 B     2012-11-08 09:00:00  NA    NA       7 FALSE
# 13 B     2012-11-08 10:00:00  36.5 -53.4     8 TRUE 
# 14 B     2012-11-08 23:00:00  36.6 -53.4     9 TRUE 
# 15 B     2012-11-09 05:00:00  36.6 -53.5    10 TRUE

This creates two (temporary) groups: one increments every time we have a non-NA row (of lat/lon), and then the second further subsets it so that we look at only NA-full rows (or not).

An alternative that only creates one new grouping column:

df %>%
  mutate(tmpdttm = if_else(!is.na(lat) & !is.na(lon), datetime, datetime[NA])) %>%
  tidyr::fill(tmpdttm) %>%
  group_by(tmpdttm) %>%
  filter(!is.na(lat) | n() <= 3) %>%
  ungroup()
# # A tibble: 15 x 5
#    id    datetime              lat   lon tmpdttm            
#    <chr> <dttm>              <dbl> <dbl> <dttm>             
#  1 A     2011-10-03 05:00:00  35   -53.4 2011-10-03 05:00:00
#  2 A     2011-10-03 06:00:00  35.1 -53.4 2011-10-03 06:00:00
#  3 A     2011-10-03 07:00:00  NA    NA   2011-10-03 06:00:00
#  4 A     2011-10-03 08:00:00  NA    NA   2011-10-03 06:00:00
#  5 A     2011-10-03 09:00:00  35.1 -53.4 2011-10-03 09:00:00
#  6 A     2011-10-03 10:00:00  36.2 -53.6 2011-10-03 10:00:00
#  7 A     2011-10-03 23:00:00  36.6 -53.6 2011-10-03 23:00:00
#  8 B     2012-11-08 05:00:00  35.8 -53.4 2012-11-08 05:00:00
#  9 B     2012-11-08 06:00:00  NA    NA   2012-11-08 05:00:00
# 10 B     2012-11-08 07:00:00  36   -53.4 2012-11-08 07:00:00
# 11 B     2012-11-08 08:00:00  NA    NA   2012-11-08 07:00:00
# 12 B     2012-11-08 09:00:00  NA    NA   2012-11-08 07:00:00
# 13 B     2012-11-08 10:00:00  36.5 -53.4 2012-11-08 10:00:00
# 14 B     2012-11-08 23:00:00  36.6 -53.4 2012-11-08 23:00:00
# 15 B     2012-11-09 05:00:00  36.6 -53.5 2012-11-09 05:00:00

CodePudding user response：

I broke mine up as a two-step process using tidyverse

df1 <- df %>% 
       group_by(id) %>% 
       mutate(gn = cumsum(!(is.na(lat) & is.na(lag(lat, default = 0))))) %>% 
       ungroup()
df1 %>% 
       group_by(id, gn) %>% 
       summarise(count = n()) %>% ungroup() %>% 
       filter(count < 5) %>% 
       inner_join(df1, by = c('id','gn'))

CodePudding user response：

Here's a tidyverse solution that uses rleid from data.table

library(data.table)
library(tidyverse)

df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime)) %>%
  group_by(datetime, new = rleid(is.na(lat))) %>% 
  ungroup() %>% 
  group_by(lat,lon,new) %>% 
  filter(n()<3) %>% 
  select(-new)

This gives us:

# A tibble: 15 x 5
     new id    datetime              lat   lon
   <int> <chr> <dttm>              <dbl> <dbl>
 1     1 A     2011-10-03 05:00:00  35   -53.4
 2     1 A     2011-10-03 06:00:00  35.1 -53.4
 3     2 A     2011-10-03 07:00:00  NA    NA  
 4     2 A     2011-10-03 08:00:00  NA    NA  
 5     3 A     2011-10-03 09:00:00  35.1 -53.4
 6     3 A     2011-10-03 10:00:00  36.2 -53.6
 7     3 A     2011-10-03 23:00:00  36.6 -53.6
 8     3 B     2012-11-08 05:00:00  35.8 -53.4
 9     4 B     2012-11-08 06:00:00  NA    NA  
10     5 B     2012-11-08 07:00:00  36   -53.4
11     6 B     2012-11-08 08:00:00  NA    NA  
12     6 B     2012-11-08 09:00:00  NA    NA  
13     7 B     2012-11-08 10:00:00  36.5 -53.4
14     7 B     2012-11-08 23:00:00  36.6 -53.4
15     9 B     2012-11-09 05:00:00  36.6 -53.5