I'm trying to find the number of participants per gene at different time points. I'm attempting to do this with a nested for loop, but I can't seem to figure it out. Here's what I've been trying:
IgH_CDR3_post_challenge_unique<- select(IgH_CDR3_post_challenge_unique, cdr3aa, gene, ID, Timepoint)
participant_list <- unique(IgH_CDR3_post_challenge_unique$gene)
time_list<- unique(IgH_CDR3_post_challenge_unique$Timepoint)
for (c in participant_list)
{
for(i in time_list)
{
IgH_CDR3_post_challenge_unique <- filter(IgH_CDR3_post_challenge_unique, Timepoint==time_list[i] )
}
IgH_CDR3_post_challenge_unique$participant_per_gene[IgH_CDR3_post_challenge_unique$gene == c] <- length(unique(IgH_CDR3_post_challenge_unique$ID[IgH_CDR3_post_challenge_unique$gene == c]))
}
I would like the loops to end up calculating the number of participants per gene for each timepoint.
My data looks something like this:
gene | Timepoint | ID |
---|---|---|
1 | C0 | SP1 |
2 | C1 | SP2 |
1 | C0 | SP4 |
3 | C0 | SP2 |
Answer:
This can be done without a loop, using dplyr. Loops tend to get slow and cumbersome when your data becomes large.
First, use group_by to group the data by the relevant columns (here Timepoint and gene), then use summarise to count the number of unique IDs within each group.
library(dplyr)
# dat is the data frame shown in the question
dat %>% group_by(Timepoint, gene) %>% summarise(n = length(unique(ID)), .groups = "drop")
# A tibble: 3 × 3
  Timepoint  gene     n
  <chr>     <int> <int>
1 C0            1     2
2 C0            3     1
3 C1            2     1
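If you want something you can run end to end, here is a minimal sketch that rebuilds the sample table from the question. The column types and the object names dat, counts and dat_with_counts are my assumptions, not something from your real data. It also swaps in n_distinct(ID), which is dplyr's shorthand for length(unique(ID)), and shows a mutate variant in case you want the count stored as a column on every row, which is what the loop in the question seemed to be aiming for.

```r
library(dplyr)

# Sample data copied from the table in the question (column types are assumed).
dat <- data.frame(
  gene      = c(1L, 2L, 1L, 3L),
  Timepoint = c("C0", "C1", "C0", "C0"),
  ID        = c("SP1", "SP2", "SP4", "SP2")
)

# One row per Timepoint/gene pair, with the number of distinct participants.
counts <- dat %>%
  group_by(Timepoint, gene) %>%
  summarise(n = n_distinct(ID), .groups = "drop")

# Same count attached to every original row instead of a summary table.
dat_with_counts <- dat %>%
  group_by(Timepoint, gene) %>%
  mutate(participant_per_gene = n_distinct(ID)) %>%
  ungroup()

print(counts)
print(dat_with_counts)
```

Use summarise when you want one row per group, and mutate when you need to keep every original row (for example to filter on the count later).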