Home > OS >  Remove NAs if gap is greater than certain time interval or certain number of rows
Remove NAs if gap is greater than certain time interval or certain number of rows

Time:09-29

I have a dataframe of locations with NA values for some positions at certain datetimes. I would like to estimate positions for these NA values, but when there are more than 3 NA values in a row (gaps of more than 3 hours), I would like to remove those from the dataset (i.e. I do not want to estimate positions for gaps greater than 3 rows/3 hours of NAs).

Here's an example of my data:

table <- "id   date    time   lat   lon
 1 A     2011-10-03 05:00:00  35.0 -53.4
 2 A     2011-10-03 06:00:00  35.1 -53.4
 3 A     2011-10-03 07:00:00  NA    NA  
 4 A     2011-10-03 08:00:00  NA    NA  
 5 A     2011-10-03 09:00:00  35.1 -53.4
 6 A     2011-10-03 10:00:00  36.2 -53.6
 7 A     2011-10-03 23:00:00  36.6 -53.6
 8 B     2012-11-08 05:00:00  35.8 -53.4
 9 B     2012-11-08 06:00:00  NA    NA  
10 B     2012-11-08 07:00:00  36.0 -53.4
11 B     2012-11-08 08:00:00  NA    NA  
12 B     2012-11-08 09:00:00  NA    NA  
13 B     2012-11-08 10:00:00  36.5 -53.4
14 B     2012-11-08 23:00:00  36.6 -53.4
15 B     2012-11-09 00:00:00  NA    NA  
16 B     2012-11-09 01:00:00  NA    NA  
17 B     2012-11-09 02:00:00  NA    NA  
18 B     2012-11-09 03:00:00  NA    NA  
19 B     2012-11-09 04:00:00  NA    NA  
20 B     2012-11-09 05:00:00  36.6 -53.5"

#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df

df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime))

And here is an example of the desired output: (Notice how rows 15-19 are now removed because this was a gap of 5 NA values/5 hours).

table <- "id        datetime   lat   lon
 1 A     2011-10-03 05:00:00  35.0 -53.4
 2 A     2011-10-03 06:00:00  35.1 -53.4
 3 A     2011-10-03 07:00:00  NA    NA  
 4 A     2011-10-03 08:00:00  NA    NA  
 5 A     2011-10-03 09:00:00  35.1 -53.4
 6 A     2011-10-03 10:00:00  36.2 -53.6
 7 A     2011-10-03 23:00:00  36.6 -53.6
 8 B     2012-11-08 05:00:00  35.8 -53.4
 9 B     2012-11-08 06:00:00  NA    NA  
10 B     2012-11-08 07:00:00  36.0 -53.4
11 B     2012-11-08 08:00:00  NA    NA  
12 B     2012-11-08 09:00:00  NA    NA  
13 B     2012-11-08 10:00:00  36.5 -53.4
14 B     2012-11-08 23:00:00  36.6 -53.4 
15 B     2012-11-09 05:00:00  36.6 -53.5"

Besides individually selecting specific rows (which I cannot do because this dataset is large), I cannot figure out how to tell R to keep NAs only if they are in groups of 3 or less (3 hours or less). Any help would be appreciated!

CodePudding user response:

df %>%
  group_by(grp1 = cumsum(!is.na(lat) & !is.na(lon)), grp2 = !is.na(lat) & !is.na(lon)) %>%
  filter((!is.na(lat) & !is.na(lon)) | n() <= 3) %>%
  ungroup()
# # A tibble: 15 x 6
#    id    datetime              lat   lon  grp1 grp2 
#    <chr> <dttm>              <dbl> <dbl> <int> <lgl>
#  1 A     2011-10-03 05:00:00  35   -53.4     1 TRUE 
#  2 A     2011-10-03 06:00:00  35.1 -53.4     2 TRUE 
#  3 A     2011-10-03 07:00:00  NA    NA       2 FALSE
#  4 A     2011-10-03 08:00:00  NA    NA       2 FALSE
#  5 A     2011-10-03 09:00:00  35.1 -53.4     3 TRUE 
#  6 A     2011-10-03 10:00:00  36.2 -53.6     4 TRUE 
#  7 A     2011-10-03 23:00:00  36.6 -53.6     5 TRUE 
#  8 B     2012-11-08 05:00:00  35.8 -53.4     6 TRUE 
#  9 B     2012-11-08 06:00:00  NA    NA       6 FALSE
# 10 B     2012-11-08 07:00:00  36   -53.4     7 TRUE 
# 11 B     2012-11-08 08:00:00  NA    NA       7 FALSE
# 12 B     2012-11-08 09:00:00  NA    NA       7 FALSE
# 13 B     2012-11-08 10:00:00  36.5 -53.4     8 TRUE 
# 14 B     2012-11-08 23:00:00  36.6 -53.4     9 TRUE 
# 15 B     2012-11-09 05:00:00  36.6 -53.5    10 TRUE 

This creates two (temporary) groups: one increments every time we have a non-NA row (of lat/lon), and then the second further subsets it so that we look at only NA-full rows (or not).

An alternative that only creates one new grouping column:

df %>%
  mutate(tmpdttm = if_else(!is.na(lat) & !is.na(lon), datetime, datetime[NA])) %>%
  tidyr::fill(tmpdttm) %>%
  group_by(tmpdttm) %>%
  filter(!is.na(lat) | n() <= 3) %>%
  ungroup()
# # A tibble: 15 x 5
#    id    datetime              lat   lon tmpdttm            
#    <chr> <dttm>              <dbl> <dbl> <dttm>             
#  1 A     2011-10-03 05:00:00  35   -53.4 2011-10-03 05:00:00
#  2 A     2011-10-03 06:00:00  35.1 -53.4 2011-10-03 06:00:00
#  3 A     2011-10-03 07:00:00  NA    NA   2011-10-03 06:00:00
#  4 A     2011-10-03 08:00:00  NA    NA   2011-10-03 06:00:00
#  5 A     2011-10-03 09:00:00  35.1 -53.4 2011-10-03 09:00:00
#  6 A     2011-10-03 10:00:00  36.2 -53.6 2011-10-03 10:00:00
#  7 A     2011-10-03 23:00:00  36.6 -53.6 2011-10-03 23:00:00
#  8 B     2012-11-08 05:00:00  35.8 -53.4 2012-11-08 05:00:00
#  9 B     2012-11-08 06:00:00  NA    NA   2012-11-08 05:00:00
# 10 B     2012-11-08 07:00:00  36   -53.4 2012-11-08 07:00:00
# 11 B     2012-11-08 08:00:00  NA    NA   2012-11-08 07:00:00
# 12 B     2012-11-08 09:00:00  NA    NA   2012-11-08 07:00:00
# 13 B     2012-11-08 10:00:00  36.5 -53.4 2012-11-08 10:00:00
# 14 B     2012-11-08 23:00:00  36.6 -53.4 2012-11-08 23:00:00
# 15 B     2012-11-09 05:00:00  36.6 -53.5 2012-11-09 05:00:00

CodePudding user response:

I broke mine up as a two-step process using tidyverse

df1 <- df %>% 
       group_by(id) %>% 
       mutate(gn = cumsum(!(is.na(lat) & is.na(lag(lat, default = 0))))) %>% 
       ungroup()
df1 %>% 
       group_by(id, gn) %>% 
       summarise(count = n()) %>% ungroup() %>% 
       filter(count < 5) %>% 
       inner_join(df1, by = c('id','gn'))

CodePudding user response:

Here's a tidyverse solution that uses rleid from data.table

library(data.table)
library(tidyverse)

df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime)) %>%
  group_by(datetime, new = rleid(is.na(lat))) %>% 
  ungroup() %>% 
  group_by(lat,lon,new) %>% 
  filter(n()<3) %>% 
  select(-new)

This gives us:

# A tibble: 15 x 5
     new id    datetime              lat   lon
   <int> <chr> <dttm>              <dbl> <dbl>
 1     1 A     2011-10-03 05:00:00  35   -53.4
 2     1 A     2011-10-03 06:00:00  35.1 -53.4
 3     2 A     2011-10-03 07:00:00  NA    NA  
 4     2 A     2011-10-03 08:00:00  NA    NA  
 5     3 A     2011-10-03 09:00:00  35.1 -53.4
 6     3 A     2011-10-03 10:00:00  36.2 -53.6
 7     3 A     2011-10-03 23:00:00  36.6 -53.6
 8     3 B     2012-11-08 05:00:00  35.8 -53.4
 9     4 B     2012-11-08 06:00:00  NA    NA  
10     5 B     2012-11-08 07:00:00  36   -53.4
11     6 B     2012-11-08 08:00:00  NA    NA  
12     6 B     2012-11-08 09:00:00  NA    NA  
13     7 B     2012-11-08 10:00:00  36.5 -53.4
14     7 B     2012-11-08 23:00:00  36.6 -53.4
15     9 B     2012-11-09 05:00:00  36.6 -53.5
  • Related