I have a vector and a dataframe that look like below:
class.interest <- c("a", "b", "c", "d", "e")
df <- data.frame("Subject" = c(rep("A",3), rep("B",3), rep("C",5), "D", "E"),
"Class" = c("a", "b", "f", "b", "b", "e", "a", "b", "c", "e", "f", "c", "f"))
What I'd like to do is count how many times the elements in class.interest
appear for each Subject
in df
. I hope to get something like this as a result:
# of occurrences count count.from.subject
0 1 E
1 1 D
2 2 A,B
3 0 NA
4 1 C
5 0 NA
To explain this output further,
the first column (# of occurences
) just refers to how many times the elements in class.interest
can possibly appear for each Subject
(so it'll always be a list of integers between zero and the length of class.interest
vector)
the second column (count
) was counted like this:
For subject A
, there are 3 entries in Class
, but only two elements in class.interest
(exluding f
). For subject B
, there are two elements (there are 3 entries in the data frame for B
, but "b"
occurs twice). In the same way, subject C
has 4 elements, D
has 1 element and E
has none.
So I'm just putting what I described above into the output now - for instance, only subject E
did not have any of the elements in class.interest
, hence the first row of output. Or both subject A
and B
had two elements in class.interest
, third row of count
becomes 2.
Can anyone give me any suggestions on how to count something this way?
edit:
Thanks for all the wonderful suggestions! Would it also be possible to preserve the information about Class
rather than just counting the frequencies using summarise
? This would make the output much larger and I'm not sure what a good format would be for rows that have count
higher than 1, maybe something like below?
# of occurrences count count.from.subject count.class
0 1 E NA
1 1 D c
2 2 A a,b
2 2 B b,e
3 0 NA NA
4 1 C a,b,c,e
5 0 NA NA
CodePudding user response:
You could also do:
df %>%
distinct() %>%
left_join(data.frame(Class= class.interest, count = 1)) %>%
group_by(Subject) %>%
summarise(count=sum(count, na.rm = TRUE))%>%
group_by(count) %>%
summarise(n=n(), Subject=toString(Subject))%>%
complete(count = 0:5, fill=list(n = 0))
# A tibble: 6 x 3
count n Subject
<dbl> <int> <chr>
1 0 1 E
2 1 1 D
3 2 2 A, B
4 3 0 NA
5 4 1 C
6 5 0 NA
CodePudding user response:
df %>% filter(Class != "f") %>%
group_by(Subject) %>%
summarise(`# of occurrences` = n_distinct(Class)) %>%
group_by(`# of occurrences`) %>%
summarise(count = length(Subject),
count.from.subject = paste(Subject, collapse = ","))
Edit:
You can use also use mutate
with group_by
instead of summarise
, which will append the same value to each element in the group:
(with complete you can extend the missing values)
df %>%
mutate(Class = na_if(Class, "f")) %>%
group_by(Subject) %>%
summarise(`# of occurrences` = n_distinct(na.omit(Class)),
count.class = na_if(paste(sort(unique(na.omit(Class))), collapse = ","), "")) %>%
group_by(`# of occurrences`) %>%
mutate(count = n()) %>% ungroup() %>%
complete(`# of occurrences` = 0:5, fill = list(count = 0)) %>%
transmute(`# of occurrences`, count, count.from.subject = Subject, count.class)
CodePudding user response:
You can do this:
left_join(
tibble(num_occurrence = 0:length(class.interest)),
group_by(df, Subject) %>%
summarize(num_occurrence = sum(unique(Class) %in% class.interest)) %>%
group_by(num_occurrence) %>%
summarize(count = n(), count.from_subject = toString(c(Subject)))
) %>% mutate(count = ifelse(is.na(count), 0, count))
Output:
num_occurrence count count.from_subject
<int> <dbl> <chr>
1 0 1 E
2 1 1 D
3 2 2 A, B
4 3 0 NA
5 4 1 C
6 5 0 NA