Home > Enterprise >  In R, conditionally count IDs who meet an ANY condition on a certain attribute
In R, conditionally count IDs who meet an ANY condition on a certain attribute

Time:06-09

Background

I've got a dataframe df:

> df <- data.frame(ID = c("a","a","a", "b", "c","c","c","c","d","d","d","d"),
                 treatment = c(0,1,1,0,0,0,0,0,0,0,0,1),
                 event = c(0,1,0,0,1,1,1,1,0,0,1,0),
                 stringsAsFactors=FALSE) 
> df
   ID treatment event
1   a         0     0
2   a         1     1
3   a         1     0
4   b         0     0
5   c         0     1
6   c         0     1
7   c         0     1
8   c         0     1
9   d         0     0
10  d         0     0
11  d         0     1
12  d         1     0

It's got 4 people (ID) in it: a, b, c, and d. Each ID can have one record, like b does, or several records, like the other 3 do. They each have values for treatment, which is binary, and event, which is binary too.

The problem

I want to count how many distinct IDs who have ANY treatment==1 or NO treatment==1 ever (in other words, all treatment==0) also have event==1 and event==0. Here's what I'd like the result to look like:

> [some R code]

event  treatment  n
    0          0  1           
    0          1  0
    1          0  1
    1          1  2

To break this out into English:

  • IDs a and d and have got ANY treatment==1 and ANY event==1
  • ID b has got no treatment EVER and no event EVER
  • ID c has no treatment==1 ever (i.e. all treatment==0) but does have any event==1 (as you can see, he's actually got ALL treatment==1)

What I've tried

I know I'm close, but the dplyr code I'm using is getting tripped up by the fact that IDs have rows with both treatment==1 and treatment==0, and the same for event -- so R is double-counting everyone, showing me with 8 people instead of 4:

df %>% 
  group_by(event, treatment) %>%
  distinct(ID) %>%
  count(event) %>%
  as.data.frame() %>%
  mutate(Percent = round((n/sum(n))*100, digits = 1))

  event treatment n Percent
1     0         0 3    37.5
2     0         1 2    25.0
3     1         0 2    25.0
4     1         1 1    12.5

I'm not married to dplyr on this, but it's what I'm most familiar with (as opposed to data.table, for instance). The real dataset this code is going to be run on has several million rows, just FYI.

CodePudding user response:

You may try

df %>%
  group_by(ID) %>%
  summarize(treatment = as.numeric(sum(treatment) > 0), 
            event = as.numeric(sum(event) > 0)) %>%
  select(-ID) %>%
  count(treatment, event)

  treatment event     n
      <dbl> <dbl> <int>
1         0     0     1
2         0     1     1
3         1     1     2

CodePudding user response:

In base R:

data.frame(t(table(aggregate(.~ID, df, \(x) (sum(x)>0))[-1])))

  event treatment Freq
1     0         0    1
2     1         0    1
3     0         1    0
4     1         1    2
  • Related