In R, what's the average number of repeated column values per ID?-CodePudding

Background

I've got this R dataframe, d:

d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","G12","G12","B4","B4","A24","L5","L5"),
                stringsAsFactors=FALSE)

It looks like this:

As you can see, it's got 2 distinct ID's in it, with each one having events, some of which repeat / are duplicated any number of times.

The Problem

I'd like to figure out what the average number of repeated event is per ID in this dataframe.

At a glance, you see that id= a has 2 events that repeat -- G12, which repeats twice (with 3 entries total) and B4, which repeats once (with 2 total entries). id= b has 1 event that repeats: L5. Note that how many times each repeat/duplicate occurs is irrelevant to me here; it just matters that there's at least one duplicate event per ID.

So the result I want is a simple tabulation of that mean:

(2 events that repeat 1 event that repeats) / 2 people = 1.5

What I've tried

I've gotten kinda close thanks to posts like this, but I'm not quite there:

d %>% summarise(mean = mean(duplicated(event)))

This runs, but it doesn't factor in the fact that the duplication is occurring within ID (or at least, that's the way I see it).

CodePudding user response：

Is this what you want?

d %>%
  group_by(event , ID) %>%
  dplyr::summarise(n = n()) %>%
  filter(n > 1) %>%
  group_by(ID) %>%
  dplyr::summarise(n = n()) %>%
  summarise(result = mean(n))

  result
1    1.5

for d

  ID event
1   Bill     1
2   Bill     1
3   Bill     2
4   Bill     2
5   Bill     3
6   Bill     3
7   Bill     4
8   Bill     4
9  Sarah     1
10 Sarah     1
11 Sarah     2
12 Sarah     2
13 Sarah     3
14 Sarah     3
15 Sarah     4
16 Sarah     4
17 Sarah     5
18 Sarah     5
19  Jane     1
20  Jane     1

    result
1 3.333333

little checkup for code

d %>%
  group_by(event , ID) %>%
  dplyr::summarise(n = n()) %>%
  filter(n > 1) %>%
  group_by(ID) %>%
  dplyr::summarise(n = n())

  ID        n
  <chr> <int>
1 Bill      4
2 Jane      1
3 Sarah     5

CodePudding user response：

For each ID you may count the event which occur more than once and then take the ratio.

library(dplyr)

d %>%
  group_by(ID) %>%
  summarise(dup = sum(table(event) > 1)) %>%
  summarise(ratio = mean(dup))

#  ratio
#  <dbl>
#1   1.5