I have a data.frame with a lot of observations. I am trying to filter/subset the data in way that only shows unique ID's that have index values that have more than one index value and where these values are ordered in series (1-2, 2-3, 1-3, 4-5, etc.). Any ID that only has one row or only one specific value within the index column should be removed.
Example of what the data looks like below:
df:
ID index
9930 1
9930 2
9930 3
9930 NA
9929 3
9929 3
9915 1
9916 3
9916 4
9999 7
9999 8
9999 10
Example of the desired output from above data:
df_new:
ID index
9916 3
9916 4
9930 1
9930 2
9930 3
9999 7
9999 8
Any help would be greatly appreciated!
CodePudding user response:
We could use diff
to create a grouping column after grouping by 'ID' and then filter
the groups having more than 1 row
library(dplyr)
df %>%
na.omit %>%
group_by(ID) %>%
mutate(new = cumsum(c(TRUE, diff(index) != 1))) %>%
group_by(new, .add = TRUE) %>%
filter(n() > 1) %>%
ungroup %>%
select(-new) %>%
arrange(ID)
-output
# A tibble: 7 x 2
ID index
<int> <int>
1 9916 3
2 9916 4
3 9930 1
4 9930 2
5 9930 3
6 9999 7
7 9999 8
data
df <- structure(list(ID = c(9930L, 9930L, 9930L, 9930L, 9929L, 9929L,
9915L, 9916L, 9916L, 9999L, 9999L, 9999L), index = c(1L, 2L,
3L, NA, 3L, 3L, 1L, 3L, 4L, 7L, 8L, 10L)),
class = "data.frame", row.names = c(NA,
-12L))
CodePudding user response:
A similar approach to @akrun that uses lag
rather than diff
:
df %>%
arrange(ID, index) %>%
group_by(ID) %>%
filter(n() > 1, !is.na(index)) %>%
mutate(tmp = index - lag(index)) %>%
filter(max(tmp, na.rm = TRUE) > 0, tmp %in% c(1, NA))
select(-tmp)
# A tibble: 7 x 2
ID index
<dbl> <dbl>
1 9916 3
2 9916 4
3 9930 1
4 9930 2
5 9930 3
6 9999 7
7 9999 8
CodePudding user response:
And this is another possible solution with the clever f
function
library(tidyverse)
df = tribble(
~ID, ~index,
9930, 1,
9930, 2,
9930, 3,
9930, NA,
9929, 3,
9929, 3,
9915, 1,
9916, 3,
9916, 4,
9999, 7,
9999, 8,
9999, 10)
f = function(x) if(length(x)>1) x-lag(x, default = 2*x[1]-x[2]) else 0
df %>% filter(!is.na(index)) %>%
arrange(ID, index) %>%
group_by(ID) %>%
mutate(nID = n(),
sInddex = f(index)) %>%
filter(nID>1, sInddex==1) %>%
select(-nID, -sInddex)
output
# A tibble: 7 x 2
# Groups: ID [3]
ID index
<dbl> <dbl>
1 9916 3
2 9916 4
3 9930 1
4 9930 2
5 9930 3
6 9999 7
7 9999 8