Background
I've got this R
dataframe, d
:
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
event = c("G12","G12","G12","B4","B4","A24","L5","L5"),
stringsAsFactors=FALSE)
It looks like this:
As you can see, it's got 2 distinct ID
's in it, with each one having events
, some of which repeat / are duplicated any number of times.
The Problem
I'd like to figure out what the average number of repeated event
is per ID
in this dataframe.
At a glance, you see that id
= a has 2 events that repeat -- G12, which repeats twice (with 3 entries total) and B4, which repeats once (with 2 total entries). id
= b has 1 event that repeats: L5. Note that how many times each repeat/duplicate occurs is irrelevant to me here; it just matters that there's at least one duplicate event
per ID
.
So the result I want is a simple tabulation of that mean:
(2 events that repeat 1 event that repeats) / 2 people = 1.5
What I've tried
I've gotten kinda close thanks to posts like this, but I'm not quite there:
d %>% summarise(mean = mean(duplicated(event)))
This runs, but it doesn't factor in the fact that the duplication is occurring within ID
(or at least, that's the way I see it).
CodePudding user response:
Is this what you want?
d %>%
group_by(event , ID) %>%
dplyr::summarise(n = n()) %>%
filter(n > 1) %>%
group_by(ID) %>%
dplyr::summarise(n = n()) %>%
summarise(result = mean(n))
result
1 1.5
for d
ID event
1 Bill 1
2 Bill 1
3 Bill 2
4 Bill 2
5 Bill 3
6 Bill 3
7 Bill 4
8 Bill 4
9 Sarah 1
10 Sarah 1
11 Sarah 2
12 Sarah 2
13 Sarah 3
14 Sarah 3
15 Sarah 4
16 Sarah 4
17 Sarah 5
18 Sarah 5
19 Jane 1
20 Jane 1
result
1 3.333333
little checkup for code
d %>%
group_by(event , ID) %>%
dplyr::summarise(n = n()) %>%
filter(n > 1) %>%
group_by(ID) %>%
dplyr::summarise(n = n())
ID n
<chr> <int>
1 Bill 4
2 Jane 1
3 Sarah 5
CodePudding user response:
For each ID
you may count the event
which occur more than once and then take the ratio.
library(dplyr)
d %>%
group_by(ID) %>%
summarise(dup = sum(table(event) > 1)) %>%
summarise(ratio = mean(dup))
# ratio
# <dbl>
#1 1.5