Consider the following dataset. Each group contains either one or two individuals, and an individual may have several rows.
group <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
individualID <- c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7)
X <- rbinom(12, 1, 0.5)
df1 <- data.frame(group, individualID, X)
> df1
   group individualID X
1      1            1 0
2      1            1 1
3      1            2 1
4      1            2 1
5      2            3 1
6      2            3 1
7      3            5 1
8      3            5 1
9      3            6 1
10     3            6 1
11     4            7 0
12     4            7 1
From the above, groups 1 and 3 have 2 individuals each, whereas groups 2 and 4 have 1 individual each.
> aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
  group individualID
1     1            2
2     2            1
3     3            2
4     4            1
How can I subset the data, without using the dplyr package, to keep only the groups that have more than one individual, i.e. omit the groups with a single individual?
I should end up with only group 1 and group 3.
CodePudding user response:
There are more concise ways for sure, but here is the general idea.
# use your code to get the counts by group
df1_counts <- aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
# create a vector of groups where the count is > 1
keep_groups <- df1_counts$group[df1_counts$individualID > 1]
# filter the rows to only groups you want to keep
df1[df1$group %in% keep_groups, ]
#    group individualID X
# 1      1            1 0
# 2      1            1 0
# 3      1            2 1
# 4      1            2 0
# 7      3            5 1
# 8      3            5 1
# 9      3            6 0
# 10     3            6 1
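As one of the more concise ways mentioned above, here is a minimal sketch using tapply instead of aggregate. The set.seed call and the name n_ind are my own additions for illustration and are not part of the question. The seed only makes the random X column repeatable.

```r
# Rebuild the example data; set.seed is an assumption added so X is repeatable
set.seed(1)
group <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
individualID <- c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7)
df1 <- data.frame(group, individualID, X = rbinom(12, 1, 0.5))

# Number of distinct individuals per group, returned as a named vector
n_ind <- tapply(df1$individualID, df1$group, function(x) length(unique(x)))

# The names of n_ind are the group labels; keep those with more than one individual
keep_groups <- as.numeric(names(n_ind)[n_ind > 1])

# Keep only the rows that belong to those groups
res <- df1[df1$group %in% keep_groups, ]
```

With the example data, res should contain only the rows of groups 1 and 3.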
CodePudding user response:
Another option is the tidyverse: after grouping by 'group', filter the rows where the number of distinct (n_distinct) elements in 'individualID' is greater than 1.
library(dplyr)
df1 %>%
  group_by(group) %>%
  filter(n_distinct(individualID) > 1) %>%
  ungroup()
# A tibble: 8 × 3
  group individualID     X
  <dbl>        <dbl> <int>
1     1            1     0
2     1            1     0
3     1            2     1
4     1            2     1
5     3            5     0
6     3            5     0
7     3            6     1
8     3            6     0
Or with subset and ave from base R:
subset(df1, ave(individualID, group, FUN = function(x) length(unique(x))) > 1)
   group individualID X
1      1            1 0
2      1            1 0
3      1            2 1
4      1            2 1
7      3            5 0
8      3            5 0
9      3            6 1
10     3            6 0
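If the same filter is needed in several places, the ave approach could be wrapped in a small helper. This is only a sketch: the function name keep_multi_groups and its arguments are hypothetical, and the set.seed call is an assumption added for repeatability. It runs ave over row indices rather than over the ID column itself, so it should also work when the IDs are not numeric.

```r
# Hypothetical helper: keep rows of groups with more than one distinct ID
keep_multi_groups <- function(df, group_col, id_col) {
  ids <- df[[id_col]]
  # ave works on row indices, so the per-group count stays numeric
  # even if the IDs are character or factor
  n_ind <- ave(seq_along(ids), df[[group_col]],
               FUN = function(i) length(unique(ids[i])))
  df[n_ind > 1, , drop = FALSE]
}

# Example data from the question; set.seed is an assumption for repeatability
set.seed(1)
group <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4)
individualID <- c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7)
df1 <- data.frame(group, individualID, X = rbinom(12, 1, 0.5))

res <- keep_multi_groups(df1, "group", "individualID")
```

Here res should hold the eight rows of groups 1 and 3, with the original row names preserved as in the subset output.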