Retaining observations inside a group_id that meet multiple conditions-CodePudding

I'm trying to sort some data, which is labelled with an id, group_id, and a status code. Here's an example:

library(tidyverse)
sample <- tibble(id = c(1:20),
                 group_id = c(rep(1,3),rep(4,4),rep(8,5),rep(13,3),rep(16,1),rep(17,1),rep(18,3)),
                 status_code = c("initial","initial","final","initial","initial","initial","sold","initial","initial","initial","initial","error",
                            "initial","initial","final","final","error","initial","final","final")
)


sample %>%
  head()

# A tibble: 6 x 3
     id group_id status_code 
  <int>    <dbl> <chr>  
1     1        1 initial
2     2        1 initial
3     3        1 final  
4     4        4 initial
5     5        4 initial
6     6        4 initial

What I would like to do is group them by group_id and retain only observations that meet the following criteria:

status code (final, sold, error) appears more than once in the group_id
there is more than one occurence of either "final", "sold", or "error"

From my example above, group_id == 18 should be filtered because the code"final" appears twice. But this should also be the case had one of the "final" codes been replaced with either "sold", or "error".

Essentially, I'd like to be able to easily split the data as follows:

# Conditions met
# A tibble: 3 x 3
     id group_id status_code
  <int>    <dbl> <chr>      
1    18       18 initial    
2    19       18 final      
3    20       18 final   

# Conditions not met
# A tibble: 17 x 3
      id group_id status_code
   <int>    <dbl> <chr>      
 1     1        1 initial    
 2     2        1 initial    
 3     3        1 final      
 4     4        4 initial    
 5     5        4 initial    
 6     6        4 initial    
 7     7        4 sold       
 8     8        8 initial    
 9     9        8 initial    
10    10        8 initial    
11    11        8 initial    
12    12        8 error      
13    13       13 initial    
14    14       13 initial    
15    15       13 final      
16    16       16 final      
17    17       17 error

My data obviously has more columns, which should all be retained in each subframe. So far I've been able to achieve this by counting, keeping the status_code and filtering the main data again:

condition <- sample %>%
  filter(status_code != "initial") %>% # Removing "initial" status_code
  count(group_id) %>% # Counting the number of remaining occurences in each group
  filter(n > 1) %>% # Keeping groups where more than one occurence
  select(group_id) # Selecting the group_ids

fitlered_sample <- left_join(condition,sample, by ="group_id")

# A tibble: 3 x 3
  group_id    id status_code
     <dbl> <int> <chr>      
1       18    18 initial    
2       18    19 final      
3       18    20 final

It works, but I was hoping this could be achieved without creating a "condition" dataframe. I'd prefer a dplyr solution, but I can also work with base or data.table

CodePudding user response：

We may group_by 'group_id' and then filter on the count of 'status_code' not equal to 'initial'

library(dplyr)
sample %>% 
 group_by(group_id) %>%
 filter(sum(status_code != "initial") > 1) %>%
 ungroup

For the other data

sample %>% 
  group_by(group_id) %>% 
  filter(sum(status_code != "initial") == 1) %>%
  ungroup

Or use group_split to create a list of datasets

lst1 <- sample %>% 
 group_by(group_id) %>% 
 mutate(n = sum(status_code != "initial") != 1) %>% 
 ungroup %>% 
 group_split(n, .keep = FALSE)

-output

lst1
[[1]]
# A tibble: 17 × 3
      id group_id status_code
   <int>    <dbl> <chr>      
 1     1        1 initial    
 2     2        1 initial    
 3     3        1 final      
 4     4        4 initial    
 5     5        4 initial    
 6     6        4 initial    
 7     7        4 sold       
 8     8        8 initial    
 9     9        8 initial    
10    10        8 initial    
11    11        8 initial    
12    12        8 error      
13    13       13 initial    
14    14       13 initial    
15    15       13 final      
16    16       16 final      
17    17       17 error      

[[2]]
# A tibble: 3 × 3
     id group_id status_code
  <int>    <dbl> <chr>      
1    18       18 initial    
2    19       18 final      
3    20       18 final

CodePudding user response：

With data.table:

library(data.table)
setDT(sample)

sample[,.SD[sum(status_code=="final")>1|
            sum(status_code=="sold") >1|
            sum(status_code=="error")>1],
       by=group_id]

   group_id    id status_code
      <num> <int>      <char>
1:       18    18     initial
2:       18    19       final
3:       18    20       final

An the opposite:

sample[,.SD[!(sum(status_code=="final")>1|
              sum(status_code=="sold") >1|
              sum(status_code=="error")>1)],
       by=group_id]


    group_id    id status_code
       <num> <int>      <char>
 1:        1     1     initial
 2:        1     2     initial
 3:        1     3       final
 4:        4     4     initial
 5:        4     5     initial
 6:        4     6     initial
 7:        4     7        sold
 8:        8     8     initial
 9:        8     9     initial
10:        8    10     initial
11:        8    11     initial
12:        8    12       error
13:       13    13     initial
14:       13    14     initial
15:       13    15       final
16:       16    16       final
17:       17    17       error