I'm trying to sort some data, which is labelled with an id, group_id, and a status code. Here's an example:
library(tidyverse)
sample <- tibble(id = c(1:20),
group_id = c(rep(1,3),rep(4,4),rep(8,5),rep(13,3),rep(16,1),rep(17,1),rep(18,3)),
status_code = c("initial","initial","final","initial","initial","initial","sold","initial","initial","initial","initial","error",
"initial","initial","final","final","error","initial","final","final")
)
sample %>%
head()
# A tibble: 6 x 3
id group_id status_code
<int> <dbl> <chr>
1 1 1 initial
2 2 1 initial
3 3 1 final
4 4 4 initial
5 5 4 initial
6 6 4 initial
What I would like to do is group them by group_id
and retain only observations that meet the following criteria:
- status code (final, sold, error) appears more than once in the
group_id
- there is more than one occurence of either "final", "sold", or "error"
From my example above, group_id == 18
should be filtered because the code"final" appears twice. But this should also be the case had one of the "final" codes been replaced with either "sold", or "error".
Essentially, I'd like to be able to easily split the data as follows:
# Conditions met
# A tibble: 3 x 3
id group_id status_code
<int> <dbl> <chr>
1 18 18 initial
2 19 18 final
3 20 18 final
# Conditions not met
# A tibble: 17 x 3
id group_id status_code
<int> <dbl> <chr>
1 1 1 initial
2 2 1 initial
3 3 1 final
4 4 4 initial
5 5 4 initial
6 6 4 initial
7 7 4 sold
8 8 8 initial
9 9 8 initial
10 10 8 initial
11 11 8 initial
12 12 8 error
13 13 13 initial
14 14 13 initial
15 15 13 final
16 16 16 final
17 17 17 error
My data obviously has more columns, which should all be retained in each subframe. So far I've been able to achieve this by counting, keeping the status_code and filtering the main data again:
condition <- sample %>%
filter(status_code != "initial") %>% # Removing "initial" status_code
count(group_id) %>% # Counting the number of remaining occurences in each group
filter(n > 1) %>% # Keeping groups where more than one occurence
select(group_id) # Selecting the group_ids
fitlered_sample <- left_join(condition,sample, by ="group_id")
# A tibble: 3 x 3
group_id id status_code
<dbl> <int> <chr>
1 18 18 initial
2 18 19 final
3 18 20 final
It works, but I was hoping this could be achieved without creating a "condition" dataframe. I'd prefer a dplyr
solution, but I can also work with base or data.table
CodePudding user response:
We may group_by 'group_id' and then filter
on the count of 'status_code' not equal to 'initial'
library(dplyr)
sample %>%
group_by(group_id) %>%
filter(sum(status_code != "initial") > 1) %>%
ungroup
For the other data
sample %>%
group_by(group_id) %>%
filter(sum(status_code != "initial") == 1) %>%
ungroup
Or use group_split
to create a list
of datasets
lst1 <- sample %>%
group_by(group_id) %>%
mutate(n = sum(status_code != "initial") != 1) %>%
ungroup %>%
group_split(n, .keep = FALSE)
-output
lst1
[[1]]
# A tibble: 17 × 3
id group_id status_code
<int> <dbl> <chr>
1 1 1 initial
2 2 1 initial
3 3 1 final
4 4 4 initial
5 5 4 initial
6 6 4 initial
7 7 4 sold
8 8 8 initial
9 9 8 initial
10 10 8 initial
11 11 8 initial
12 12 8 error
13 13 13 initial
14 14 13 initial
15 15 13 final
16 16 16 final
17 17 17 error
[[2]]
# A tibble: 3 × 3
id group_id status_code
<int> <dbl> <chr>
1 18 18 initial
2 19 18 final
3 20 18 final
CodePudding user response:
With data.table
:
library(data.table)
setDT(sample)
sample[,.SD[sum(status_code=="final")>1|
sum(status_code=="sold") >1|
sum(status_code=="error")>1],
by=group_id]
group_id id status_code
<num> <int> <char>
1: 18 18 initial
2: 18 19 final
3: 18 20 final
An the opposite:
sample[,.SD[!(sum(status_code=="final")>1|
sum(status_code=="sold") >1|
sum(status_code=="error")>1)],
by=group_id]
group_id id status_code
<num> <int> <char>
1: 1 1 initial
2: 1 2 initial
3: 1 3 final
4: 4 4 initial
5: 4 5 initial
6: 4 6 initial
7: 4 7 sold
8: 8 8 initial
9: 8 9 initial
10: 8 10 initial
11: 8 11 initial
12: 8 12 error
13: 13 13 initial
14: 13 14 initial
15: 13 15 final
16: 16 16 final
17: 17 17 error