I am trying to create a count of the number of observations and extract percentages from those counts.
I have data that looks like this:
UserID substance_use
43124 0
43124 1
43124 0
43124 0
43124 1
215 1
215 1
I want a count of the total observations of substance use like this:
UserID substance_use count
43124 0 5
43124 1 5
43124 0 5
43124 0 5
43124 1 5
215 1 2
215 1 2
And then, I want the % of visits that participants said yes and no, like this:
UserID substance_use count percent_yes percent_no
43124 0 5 40% 60%
43124 1 5 40% 60%
43124 0 5 40% 60%
43124 0 5 40% 60%
43124 1 5 40% 0%
215 1 2 100% 0
215 1 2 100% 0%
I tried using the count function for the first part, but it's not taking into account the 0's. Any help would be greatly appreciated.
CodePudding user response:
We may use add_count
to create the 'count' column and then get the mean
of the binary column by group to return the percent_yes
and subtract 100 from the percent_yes to return percent_no
library(dplyr)
df1 %>%
add_count(UserID, name = 'count') %>%
group_by(UserID) %>%
mutate(percent_yes = 100 * mean(substance_use),
percent_no = 100 - percent_yes) %>%
ungroup
-output
# A tibble: 7 × 5
UserID substance_use count percent_yes percent_no
<int> <int> <int> <dbl> <dbl>
1 43124 0 5 40 60
2 43124 1 5 40 60
3 43124 0 5 40 60
4 43124 0 5 40 60
5 43124 1 5 40 60
6 215 1 2 100 0
7 215 1 2 100 0
NOTE: Here, we assumed no missing values in 'substance_use' column
data
df1 <- structure(list(UserID = c(43124L, 43124L, 43124L, 43124L, 43124L,
215L, 215L), substance_use = c(0L, 1L, 0L, 0L, 1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-7L))
CodePudding user response:
Not sure if you need the aggregated (!) info in each row or rather want to summarize per ID, in which case you can do: (using the same df1 data as provided in akrun‘s answer)
library(tidyverse)
library(scales)
df1 %>%
group_by(UserID) %>%
summarize(count = n(),
percent_yes = percent(sum(substance_use == 1)/count),
percent_no = percent(sum(substance_use == 0)/count))
which gives:
# A tibble: 2 x 4
UserID count percent_yes percent_no
<int> <int> <chr> <chr>
1 215 2 100% 0%
2 43124 5 40% 60%