Background
I've got a dataframe df
:
> df <- data.frame(ID = c("a","a","a", "b", "c","c","c","c","d","d","d","d"),
treatment = c(0,1,1,0,0,0,0,0,0,0,0,1),
event = c(0,1,0,0,1,1,1,1,0,0,1,0),
stringsAsFactors=FALSE)
> df
ID treatment event
1 a 0 0
2 a 1 1
3 a 1 0
4 b 0 0
5 c 0 1
6 c 0 1
7 c 0 1
8 c 0 1
9 d 0 0
10 d 0 0
11 d 0 1
12 d 1 0
It's got 4 people (ID
) in it: a, b, c, and d. Each ID
can have one record, like b
does, or several records, like the other 3 do. They each have values for treatment
, which is binary, and event
, which is binary too.
The problem
I want to count how many distinct ID
s who have ANY treatment==1
or NO treatment==1
ever (in other words, all treatment==0
) also have event==1
and event==0
. Here's what I'd like the result to look like:
> [some R code]
event treatment n
0 0 1
0 1 0
1 0 1
1 1 2
To break this out into English:
ID
s a and d and have got ANYtreatment==1
and ANYevent==1
ID
b has got notreatment
EVER and noevent
EVERID
c has notreatment==1
ever (i.e. alltreatment==0
) but does have anyevent==1
(as you can see, he's actually got ALLtreatment==1
)
What I've tried
I know I'm close, but the dplyr
code I'm using is getting tripped up by the fact that IDs
have rows with both treatment==1
and treatment==0
, and the same for event
-- so R
is double-counting everyone, showing me with 8 people instead of 4:
df %>%
group_by(event, treatment) %>%
distinct(ID) %>%
count(event) %>%
as.data.frame() %>%
mutate(Percent = round((n/sum(n))*100, digits = 1))
event treatment n Percent
1 0 0 3 37.5
2 0 1 2 25.0
3 1 0 2 25.0
4 1 1 1 12.5
I'm not married to dplyr
on this, but it's what I'm most familiar with (as opposed to data.table
, for instance). The real dataset this code is going to be run on has several million rows, just FYI.
CodePudding user response:
You may try
df %>%
group_by(ID) %>%
summarize(treatment = as.numeric(sum(treatment) > 0),
event = as.numeric(sum(event) > 0)) %>%
select(-ID) %>%
count(treatment, event)
treatment event n
<dbl> <dbl> <int>
1 0 0 1
2 0 1 1
3 1 1 2
CodePudding user response:
In base R:
data.frame(t(table(aggregate(.~ID, df, \(x) (sum(x)>0))[-1])))
event treatment Freq
1 0 0 1
2 1 0 1
3 0 1 0
4 1 1 2