I have a data frame (df) with id, date, time, and location (lat and lon) columns. My goal is to create a column that holds the length of each run of consecutive NAs, so that I can remove series of consecutive NAs that exceed a certain length.
Here's an example of my data:
table <- "id date time lat lon
1 A 2011-10-03 05:00:00 35.0 -53.4
2 A 2011-10-03 06:00:00 35.1 -53.4
3 A 2011-10-03 07:00:00 NA NA
4 A 2011-10-03 08:00:00 NA NA
5 A 2011-10-03 09:00:00 35.1 -53.4
6 A 2011-10-03 10:00:00 36.2 -53.6
7 A 2011-10-03 23:00:00 36.6 -53.6
8 B 2012-11-08 05:00:00 35.8 -53.4
9 B 2012-11-08 06:00:00 NA NA
10 B 2012-11-08 07:00:00 36.0 -53.4
11 B 2012-11-08 08:00:00 NA NA
12 B 2012-11-08 09:00:00 NA NA
13 B 2012-11-08 10:00:00 36.5 -53.4
14 B 2012-11-08 23:00:00 36.6 -53.4
15 B 2012-11-09 00:00:00 NA NA
16 B 2012-11-09 01:00:00 NA NA
17 B 2012-11-09 02:00:00 NA NA
18 B 2012-11-09 03:00:00 NA NA
19 B 2012-11-09 04:00:00 NA NA
20 B 2012-11-09 05:00:00 36.6 -53.5"
# Create a data frame from the table above
library(dplyr)
library(tidyr)
df <- read.table(text = table, header = TRUE)
df
df <- df %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime))
I created a new TRUE/FALSE column for NA values:
df$gap <- is.na(df$lat)
head(df)
# A tibble: 6 x 5
id datetime lat lon gap
<chr> <dttm> <dbl> <dbl> <lgl>
1 A 2011-10-03 05:00:00 35 -53.4 FALSE
2 A 2011-10-03 06:00:00 35.1 -53.4 FALSE
3 A 2011-10-03 07:00:00 NA NA TRUE
4 A 2011-10-03 08:00:00 NA NA TRUE
5 A 2011-10-03 09:00:00 35.1 -53.4 FALSE
6 A 2011-10-03 10:00:00 36.2 -53.6 FALSE
I then tried various solutions to count consecutive TRUEs or FALSEs, but the best I could come up with is this:
df <- df %>%
  group_by(id, grp = with(rle(gap), rep(seq_along(lengths), lengths))) %>%
  mutate(length = seq_along(grp)) %>%
  ungroup() %>%
  select(-grp)
head(df)
# A tibble: 6 x 6
id datetime lat lon gap length
<chr> <dttm> <dbl> <dbl> <lgl> <int>
1 A 2011-10-03 05:00:00 35 -53.4 FALSE 1
2 A 2011-10-03 06:00:00 35.1 -53.4 FALSE 2
3 A 2011-10-03 07:00:00 NA NA TRUE 1
4 A 2011-10-03 08:00:00 NA NA TRUE 2
5 A 2011-10-03 09:00:00 35.1 -53.4 FALSE 1
6 A 2011-10-03 10:00:00 36.2 -53.6 FALSE 2
The issue is that the above numbers the rows within each run (1, 2, 3, 4, 5, etc.), whereas I want every row of a run of positions or NAs to contain the total number of consecutive TRUEs or FALSEs in that run (i.e. 5, 5, 5, 5, 5).
The desired output would be:
table <- "id datetime lat lon gap length
1 A 2011-10-03 05:00:00 35 -53.4 FALSE 2
2 A 2011-10-03 06:00:00 35.1 -53.4 FALSE 2
3 A 2011-10-03 07:00:00 NA NA TRUE 2
4 A 2011-10-03 08:00:00 NA NA TRUE 2
5 A 2011-10-03 09:00:00 35.1 -53.4 FALSE 3
6 A 2011-10-03 10:00:00 36.2 -53.6 FALSE 3
7 A 2011-10-03 23:00:00 36.6 -53.6 FALSE 3
8 B 2012-11-08 05:00:00 35.8 -53.4 FALSE 1
9 B 2012-11-08 06:00:00 NA NA TRUE 1
10 B 2012-11-08 07:00:00 36 -53.4 FALSE 1
11 B 2012-11-08 08:00:00 NA NA TRUE 2
12 B 2012-11-08 09:00:00 NA NA TRUE 2
13 B 2012-11-08 10:00:00 36.5 -53.4 FALSE 2
14 B 2012-11-08 23:00:00 36.6 -53.4 FALSE 2
15 B 2012-11-09 00:00:00 NA NA TRUE 5
16 B 2012-11-09 01:00:00 NA NA TRUE 5
17 B 2012-11-09 02:00:00 NA NA TRUE 5
18 B 2012-11-09 03:00:00 NA NA TRUE 5
19 B 2012-11-09 04:00:00 NA NA TRUE 5
20 B 2012-11-09 05:00:00 36.6 -53.5 FALSE 1"
From here, I need to delete any ID from the dataset that has a run of 5 or more consecutive NAs. The catch is that I do not want to remove an ID that has a run of 5 or more non-NA values (i.e. IDs with 5 or more consecutive lat/lon positions in a row need to remain).
In this example, the desired output would contain only individual A, because B has a run of 5 consecutive NAs:
table <- "id datetime lat lon gap length
1 A 2011-10-03 05:00:00 35 -53.4 FALSE 2
2 A 2011-10-03 06:00:00 35.1 -53.4 FALSE 2
3 A 2011-10-03 07:00:00 NA NA TRUE 2
4 A 2011-10-03 08:00:00 NA NA TRUE 2
5 A 2011-10-03 09:00:00 35.1 -53.4 FALSE 3
6 A 2011-10-03 10:00:00 36.2 -53.6 FALSE 3
7 A 2011-10-03 23:00:00 36.6 -53.6 FALSE 3"
But I need to make sure the code that removes gaps of length 5 or greater does not remove IDs with lat/lon positions of length 5 or greater. I do not know where to start with this portion of my problem.
Any help would be appreciated.
CodePudding user response:
tidyverse
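library(dplyr)
df %>%
  group_by(id) %>%
  # number the runs of consecutive NA / non-NA values within each id
  mutate(grp = data.table::rleid(is.na(lat))) %>%
  group_by(grp, .add = TRUE) %>%
  # NAs per run: the run length for NA runs, 0 for runs of valid positions
  mutate(res = sum(is.na(lat))) %>%
  group_by(id) %>%
  # drop every id that contains an NA run of 5 or more
  filter(!any(res >= 5)) %>%
  select(-c(grp, res)) %>%
  ungroup()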
# A tibble: 7 x 4
id datetime lat lon
<chr> <dttm> <dbl> <dbl>
1 A 2011-10-03 05:00:00 35 -53.4
2 A 2011-10-03 06:00:00 35.1 -53.4
3 A 2011-10-03 07:00:00 NA NA
4 A 2011-10-03 08:00:00 NA NA
5 A 2011-10-03 09:00:00 35.1 -53.4
6 A 2011-10-03 10:00:00 36.2 -53.6
7 A 2011-10-03 23:00:00 36.6 -53.6
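If the intermediate length column from the question is also wanted, it can be built with n() inside each run. The following is only a minimal sketch under a few assumptions: it uses a small hypothetical toy data frame rather than the question's table, and it uses dplyr::consecutive_id() (available from dplyr 1.1.0) in place of data.table::rleid. The names toy, runs and kept are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical, not the question's full table):
# id "A" has a run of 2 NAs, id "B" has a run of 5 NAs.
toy <- data.frame(
  id  = rep(c("A", "B"), times = c(5, 7)),
  lat = c(35.0, NA, NA, 35.1, 36.2,
          35.8, NA, NA, NA, NA, NA, 36.6)
)

runs <- toy %>%
  mutate(gap = is.na(lat)) %>%
  group_by(id) %>%
  mutate(run = consecutive_id(gap)) %>%  # run id, restarting within each id
  group_by(id, run) %>%
  mutate(length = n()) %>%               # total run length, repeated on every row
  ungroup() %>%
  select(-run)

# Drop every id with an NA run of 5 or more; runs of valid positions are ignored
# because the condition requires gap to be TRUE.
kept <- runs %>%
  group_by(id) %>%
  filter(!any(gap & length >= 5)) %>%
  ungroup()
```

Testing gap & length >= 5, rather than length alone, is what keeps IDs that have long runs of valid positions.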
data.table
library(data.table)
# grp first holds the run id within each id, then the number of NAs in that run
# (0 for runs of valid positions), so long runs of valid positions never remove an id
setDT(df)[, grp := rleid(is.na(lat)), by = list(id)] %>%
  .[, grp := sum(is.na(lat)), by = list(grp, id)] %>%
  .[, .SD[!any(grp >= 5)], by = id] %>%
  .[]
id datetime lat lon grp
1: A 2011-10-03 05:00:00 35.0 -53.4 0
2: A 2011-10-03 06:00:00 35.1 -53.4 0
3: A 2011-10-03 07:00:00 NA NA 2
4: A 2011-10-03 08:00:00 NA NA 2
5: A 2011-10-03 09:00:00 35.1 -53.4 0
6: A 2011-10-03 10:00:00 36.2 -53.6 0
7: A 2011-10-03 23:00:00 36.6 -53.6 0
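As a cross-check that does not depend on either package, the same rule can be sketched in base R with rle(). This is only an illustration under stated assumptions: the toy data frame, the helper max_na_run and the threshold of 5 are hypothetical and not part of the original data. Here id "A" has six valid positions in a row and must be kept, while id "B" has five NAs in a row and must be dropped.

```r
# Toy data (hypothetical): "A" has 6 valid positions in a row, "B" has 5 NAs in a row
toy <- data.frame(
  id  = rep(c("A", "B"), each = 7),
  lat = c(35.0, 35.1, 35.1, 36.2, 36.6, 36.7, NA,
          35.8, NA, NA, NA, NA, NA, 36.6)
)

# Longest run of consecutive NAs in a vector (0 if there are none)
max_na_run <- function(x) {
  r <- rle(is.na(x))
  max(c(0L, r$lengths[r$values]))  # keep only the lengths of the NA runs
}

# One value per id: the longest NA run found for that id
longest <- tapply(toy$lat, toy$id, max_na_run)

# Keep the ids whose longest NA run is shorter than 5
keep_ids <- names(longest)[longest < 5]
result <- toy[toy$id %in% keep_ids, ]
```

Because only the NA runs (r$values is TRUE) enter the maximum, the length of a run of valid positions has no effect on which IDs are removed.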