Home > Back-end >  Filter/Subset Data Based Off Order of Values in Index Column as they are Assigned to a Unique ID in
Filter/Subset Data Based Off Order of Values in Index Column as they are Assigned to a Unique ID in

Time:09-17

I have a data.frame with a lot of observations. I am trying to filter/subset the data in way that only shows unique ID's that have index values that have more than one index value and where these values are ordered in series (1-2, 2-3, 1-3, 4-5, etc.). Any ID that only has one row or only one specific value within the index column should be removed.

Example of what the data looks like below:

df:
 ID    index    
9930      1      
9930      2
9930      3
9930     NA      
9929      3       
9929      3     
9915      1      
9916      3      
9916      4
9999      7
9999      8
9999     10       

Example of the desired output from above data:

df_new:
ID    index
9916    3
9916    4
9930    1
9930    2
9930    3
9999    7
9999    8

Any help would be greatly appreciated!

CodePudding user response:

We could use diff to create a grouping column after grouping by 'ID' and then filter the groups having more than 1 row

library(dplyr)
df %>%
    na.omit %>%
    group_by(ID) %>% 
    mutate(new = cumsum(c(TRUE, diff(index) != 1))) %>% 
    group_by(new, .add = TRUE) %>% 
    filter(n() > 1) %>% 
    ungroup %>% 
    select(-new) %>%
    arrange(ID)

-output

# A tibble: 7 x 2
     ID index
  <int> <int>
1  9916     3
2  9916     4
3  9930     1
4  9930     2
5  9930     3
6  9999     7
7  9999     8

data

df <- structure(list(ID = c(9930L, 9930L, 9930L, 9930L, 9929L, 9929L, 
9915L, 9916L, 9916L, 9999L, 9999L, 9999L), index = c(1L, 2L, 
3L, NA, 3L, 3L, 1L, 3L, 4L, 7L, 8L, 10L)), 
class = "data.frame", row.names = c(NA, 
-12L))

CodePudding user response:

A similar approach to @akrun that uses lag rather than diff:

df %>%
    arrange(ID, index) %>%
    group_by(ID) %>%
    filter(n() > 1, !is.na(index)) %>%
    mutate(tmp = index - lag(index)) %>%
    filter(max(tmp, na.rm = TRUE) > 0, tmp %in% c(1, NA)) 
    select(-tmp)
# A tibble: 7 x 2
     ID index
  <dbl> <dbl>
1  9916     3
2  9916     4
3  9930     1
4  9930     2
5  9930     3
6  9999     7
7  9999     8

CodePudding user response:

And this is another possible solution with the clever f function

library(tidyverse)

df = tribble(
  ~ID,    ~index,   
  9930,     1,     
  9930,     2,
  9930,     3,
  9930,    NA,      
  9929,     3,      
  9929,     3,    
  9915,     1,     
  9916,     3,     
  9916,     4,
  9999,     7,
  9999,     8,
  9999,    10)  

f = function(x) if(length(x)>1) x-lag(x, default = 2*x[1]-x[2]) else 0

df %>% filter(!is.na(index)) %>% 
  arrange(ID, index) %>% 
  group_by(ID) %>% 
  mutate(nID = n(), 
         sInddex = f(index)) %>% 
  filter(nID>1, sInddex==1) %>% 
  select(-nID, -sInddex)

output

# A tibble: 7 x 2
# Groups:   ID [3]
     ID index
  <dbl> <dbl>
1  9916     3
2  9916     4
3  9930     1
4  9930     2
5  9930     3
6  9999     7
7  9999     8
  • Related