I have a dataframe of locations with NA values for some positions at certain datetimes. I would like to estimate positions for these NA values, but when there are more than 3 NA values in a row (gaps of more than 3 hours), I would like to remove those from the dataset (i.e. I do not want to estimate positions for gaps greater than 3 rows/3 hours of NAs).
Here's an example of my data:
table <- "id date time lat lon
1 A 2011-10-03 05:00:00 35.0 -53.4
2 A 2011-10-03 06:00:00 35.1 -53.4
3 A 2011-10-03 07:00:00 NA NA
4 A 2011-10-03 08:00:00 NA NA
5 A 2011-10-03 09:00:00 35.1 -53.4
6 A 2011-10-03 10:00:00 36.2 -53.6
7 A 2011-10-03 23:00:00 36.6 -53.6
8 B 2012-11-08 05:00:00 35.8 -53.4
9 B 2012-11-08 06:00:00 NA NA
10 B 2012-11-08 07:00:00 36.0 -53.4
11 B 2012-11-08 08:00:00 NA NA
12 B 2012-11-08 09:00:00 NA NA
13 B 2012-11-08 10:00:00 36.5 -53.4
14 B 2012-11-08 23:00:00 36.6 -53.4
15 B 2012-11-09 00:00:00 NA NA
16 B 2012-11-09 01:00:00 NA NA
17 B 2012-11-09 02:00:00 NA NA
18 B 2012-11-09 03:00:00 NA NA
19 B 2012-11-09 04:00:00 NA NA
20 B 2012-11-09 05:00:00 36.6 -53.5"
#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df
df %>%
unite(datetime, date, time, sep = ' ') %>%
mutate(datetime = lubridate::ymd_hms(datetime))
And here is an example of the desired output: (Notice how rows 15-19 are now removed because this was a gap of 5 NA values/5 hours).
table <- "id datetime lat lon
1 A 2011-10-03 05:00:00 35.0 -53.4
2 A 2011-10-03 06:00:00 35.1 -53.4
3 A 2011-10-03 07:00:00 NA NA
4 A 2011-10-03 08:00:00 NA NA
5 A 2011-10-03 09:00:00 35.1 -53.4
6 A 2011-10-03 10:00:00 36.2 -53.6
7 A 2011-10-03 23:00:00 36.6 -53.6
8 B 2012-11-08 05:00:00 35.8 -53.4
9 B 2012-11-08 06:00:00 NA NA
10 B 2012-11-08 07:00:00 36.0 -53.4
11 B 2012-11-08 08:00:00 NA NA
12 B 2012-11-08 09:00:00 NA NA
13 B 2012-11-08 10:00:00 36.5 -53.4
14 B 2012-11-08 23:00:00 36.6 -53.4
15 B 2012-11-09 05:00:00 36.6 -53.5"
Besides individually selecting specific rows (which I cannot do because this dataset is large), I cannot figure out how to tell R to keep NAs only if they are in groups of 3 or less (3 hours or less). Any help would be appreciated!
CodePudding user response:
df %>%
group_by(grp1 = cumsum(!is.na(lat) & !is.na(lon)), grp2 = !is.na(lat) & !is.na(lon)) %>%
filter((!is.na(lat) & !is.na(lon)) | n() <= 3) %>%
ungroup()
# # A tibble: 15 x 6
# id datetime lat lon grp1 grp2
# <chr> <dttm> <dbl> <dbl> <int> <lgl>
# 1 A 2011-10-03 05:00:00 35 -53.4 1 TRUE
# 2 A 2011-10-03 06:00:00 35.1 -53.4 2 TRUE
# 3 A 2011-10-03 07:00:00 NA NA 2 FALSE
# 4 A 2011-10-03 08:00:00 NA NA 2 FALSE
# 5 A 2011-10-03 09:00:00 35.1 -53.4 3 TRUE
# 6 A 2011-10-03 10:00:00 36.2 -53.6 4 TRUE
# 7 A 2011-10-03 23:00:00 36.6 -53.6 5 TRUE
# 8 B 2012-11-08 05:00:00 35.8 -53.4 6 TRUE
# 9 B 2012-11-08 06:00:00 NA NA 6 FALSE
# 10 B 2012-11-08 07:00:00 36 -53.4 7 TRUE
# 11 B 2012-11-08 08:00:00 NA NA 7 FALSE
# 12 B 2012-11-08 09:00:00 NA NA 7 FALSE
# 13 B 2012-11-08 10:00:00 36.5 -53.4 8 TRUE
# 14 B 2012-11-08 23:00:00 36.6 -53.4 9 TRUE
# 15 B 2012-11-09 05:00:00 36.6 -53.5 10 TRUE
This creates two (temporary) groups: one increments every time we have a non-NA
row (of lat/lon), and then the second further subsets it so that we look at only NA
-full rows (or not).
An alternative that only creates one new grouping column:
df %>%
mutate(tmpdttm = if_else(!is.na(lat) & !is.na(lon), datetime, datetime[NA])) %>%
tidyr::fill(tmpdttm) %>%
group_by(tmpdttm) %>%
filter(!is.na(lat) | n() <= 3) %>%
ungroup()
# # A tibble: 15 x 5
# id datetime lat lon tmpdttm
# <chr> <dttm> <dbl> <dbl> <dttm>
# 1 A 2011-10-03 05:00:00 35 -53.4 2011-10-03 05:00:00
# 2 A 2011-10-03 06:00:00 35.1 -53.4 2011-10-03 06:00:00
# 3 A 2011-10-03 07:00:00 NA NA 2011-10-03 06:00:00
# 4 A 2011-10-03 08:00:00 NA NA 2011-10-03 06:00:00
# 5 A 2011-10-03 09:00:00 35.1 -53.4 2011-10-03 09:00:00
# 6 A 2011-10-03 10:00:00 36.2 -53.6 2011-10-03 10:00:00
# 7 A 2011-10-03 23:00:00 36.6 -53.6 2011-10-03 23:00:00
# 8 B 2012-11-08 05:00:00 35.8 -53.4 2012-11-08 05:00:00
# 9 B 2012-11-08 06:00:00 NA NA 2012-11-08 05:00:00
# 10 B 2012-11-08 07:00:00 36 -53.4 2012-11-08 07:00:00
# 11 B 2012-11-08 08:00:00 NA NA 2012-11-08 07:00:00
# 12 B 2012-11-08 09:00:00 NA NA 2012-11-08 07:00:00
# 13 B 2012-11-08 10:00:00 36.5 -53.4 2012-11-08 10:00:00
# 14 B 2012-11-08 23:00:00 36.6 -53.4 2012-11-08 23:00:00
# 15 B 2012-11-09 05:00:00 36.6 -53.5 2012-11-09 05:00:00
CodePudding user response:
I broke mine up as a two-step process using tidyverse
df1 <- df %>%
group_by(id) %>%
mutate(gn = cumsum(!(is.na(lat) & is.na(lag(lat, default = 0))))) %>%
ungroup()
df1 %>%
group_by(id, gn) %>%
summarise(count = n()) %>% ungroup() %>%
filter(count < 5) %>%
inner_join(df1, by = c('id','gn'))
CodePudding user response:
Here's a tidyverse
solution that uses rleid
from data.table
library(data.table)
library(tidyverse)
df %>%
unite(datetime, date, time, sep = ' ') %>%
mutate(datetime = lubridate::ymd_hms(datetime)) %>%
group_by(datetime, new = rleid(is.na(lat))) %>%
ungroup() %>%
group_by(lat,lon,new) %>%
filter(n()<3) %>%
select(-new)
This gives us:
# A tibble: 15 x 5
new id datetime lat lon
<int> <chr> <dttm> <dbl> <dbl>
1 1 A 2011-10-03 05:00:00 35 -53.4
2 1 A 2011-10-03 06:00:00 35.1 -53.4
3 2 A 2011-10-03 07:00:00 NA NA
4 2 A 2011-10-03 08:00:00 NA NA
5 3 A 2011-10-03 09:00:00 35.1 -53.4
6 3 A 2011-10-03 10:00:00 36.2 -53.6
7 3 A 2011-10-03 23:00:00 36.6 -53.6
8 3 B 2012-11-08 05:00:00 35.8 -53.4
9 4 B 2012-11-08 06:00:00 NA NA
10 5 B 2012-11-08 07:00:00 36 -53.4
11 6 B 2012-11-08 08:00:00 NA NA
12 6 B 2012-11-08 09:00:00 NA NA
13 7 B 2012-11-08 10:00:00 36.5 -53.4
14 7 B 2012-11-08 23:00:00 36.6 -53.4
15 9 B 2012-11-09 05:00:00 36.6 -53.5