I have a dataset of Reddit users and their posts, and I am trying to create an indicator variable that is coded 1 if the user's number of posts is at or above the 80th percentile, and 0 otherwise. Essentially, I want to categorize users as "active" versus "passive".
I have created a variable that counts the number of posts by username:
df <-
df %>% group_by(username) %>% mutate(count = n()) %>%
ungroup() # drop the grouping so later summaries run over all rows, not within each user
#count(username, sort = TRUE)
Here is a data example:
df %>%
select(username, count) %>%
head(., 4)
output:
username count
<chr> <int>
cyz 14
crash 12
conan 7
xyz 13
I have tried the following to identify users whose number of posts is in the top 20 percent (at or above the 80th percentile):
df %>%
group_by(username) %>%
do(tidy(t(quantile(.$count))))
Here is a data example for the variable "count", which holds, for each row (post), the total number of posts by that row's user.
dput(df$count)
output:
c(15L, 9L, 1L, 1L, 1L, 1L, 1L, 1L, 15L, 15L, 15L, 1L, 15L, 1L,
1L, 15L, 1L, 1L, 15L, 2L, 15L, 1L, 15L, 1L, 15L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 15L, 191L, 3L, 191L,
191L, 1L, 191L, 191L, 2L, 191L, 191L, 1L, 191L, 1L, 191L, 191L,
191L, 3L, 191L, 98L, 191L, 1L, 191L, 2L, 191L, 9L, 1L, 191L,
1L, 1L, 3L, 191L, 191L, 191L, 2L, 3L, 1L, 1L, 2L, 2L, 191L, 191L,
191L, 191L, 17L, 1L, 3L, 4L, 3L, 22L, 2L, 3L, 3L, 191L)
CodePudding user response:
You could use mutate to get a new column with activity coded as you expected.
EDIT: updated the dataframe with the supplied dput for count variable.
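library(dplyr)
# `count` is assumed to hold the integer vector from the dput() output in the question
df <- data.frame(ID = as.character(1:92),
                 count = count)
# code a row 1 when its count is at or above the 80th percentile of all rows
df_with_activity <- df %>%
  mutate(active = ifelse(count >= quantile(count, 0.8), 1, 0))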
ID count active
1 1 15 0
2 2 9 0
3 3 1 0
4 4 1 0
5 5 1 0
6 6 1 0
7 7 1 0
8 8 1 0
9 9 15 0
10 10 15 0
11 11 15 0
12 12 1 0
13 13 15 0
14 14 1 0
15 15 1 0
16 16 15 0
17 17 1 0
18 18 1 0
19 19 15 0
20 20 2 0
21 21 15 0
22 22 1 0
23 23 15 0
24 24 1 0
25 25 15 0
26 26 2 0
27 27 1 0
28 28 1 0
29 29 1 0
30 30 1 0
31 31 1 0
32 32 1 0
33 33 1 0
34 34 1 0
35 35 1 0
36 36 1 0
37 37 1 0
38 38 3 0
39 39 15 0
40 40 191 1
41 41 3 0
42 42 191 1
43 43 191 1
44 44 1 0
45 45 191 1
46 46 191 1
47 47 2 0
48 48 191 1
49 49 191 1
50 50 1 0
51 51 191 1
52 52 1 0
53 53 191 1
54 54 191 1
55 55 191 1
56 56 3 0
57 57 191 1
58 58 98 0
59 59 191 1
60 60 1 0
61 61 191 1
62 62 2 0
63 63 191 1
64 64 9 0
65 65 1 0
66 66 191 1
67 67 1 0
68 68 1 0
69 69 3 0
70 70 191 1
71 71 191 1
72 72 191 1
73 73 2 0
74 74 3 0
75 75 1 0
76 76 1 0
77 77 2 0
78 78 2 0
79 79 191 1
80 80 191 1
81 81 191 1
82 82 191 1
83 83 17 0
84 84 1 0
85 85 3 0
86 86 4 0
87 87 3 0
88 88 22 0
89 89 2 0
90 90 3 0
91 91 3 0
92 92 191 1
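To see why a row with count 98 is still coded 0, it can help to compute the cutoff on its own. The following is a base R sketch using the count vector from the question's dput output; the names count_vec, cutoff and active are only illustrative.

```r
# Count vector copied from the dput() output in the question
count_vec <- c(15L, 9L, 1L, 1L, 1L, 1L, 1L, 1L, 15L, 15L, 15L, 1L, 15L, 1L,
  1L, 15L, 1L, 1L, 15L, 2L, 15L, 1L, 15L, 1L, 15L, 2L, 1L, 1L,
  1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 15L, 191L, 3L, 191L,
  191L, 1L, 191L, 191L, 2L, 191L, 191L, 1L, 191L, 1L, 191L, 191L,
  191L, 3L, 191L, 98L, 191L, 1L, 191L, 2L, 191L, 9L, 1L, 191L,
  1L, 1L, 3L, 191L, 191L, 191L, 2L, 3L, 1L, 1L, 2L, 2L, 191L, 191L,
  191L, 191L, 17L, 1L, 3L, 4L, 3L, 22L, 2L, 3L, 3L, 191L)

# 80th percentile across all rows (default quantile type), without the "80%" name
cutoff <- quantile(count_vec, 0.8, names = FALSE)

# 1 when the row's count is at or above the cutoff, 0 otherwise
active <- as.integer(count_vec >= cutoff)

cutoff       # the threshold used for the indicator
sum(active)  # number of rows coded as active
```

More than 20 percent of the rows have a count of 191, so the cutoff lands on 191 itself and only those 24 rows are coded 1.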
And these are the ones that should be labelled active:
df_with_activity %>%
filter(active == 1)
ID count active
1 40 191 1
2 42 191 1
3 43 191 1
4 45 191 1
5 46 191 1
6 48 191 1
7 49 191 1
8 51 191 1
9 53 191 1
10 54 191 1
11 55 191 1
12 57 191 1
13 59 191 1
14 61 191 1
15 63 191 1
16 66 191 1
17 70 191 1
18 71 191 1
19 72 191 1
20 79 191 1
21 80 191 1
22 81 191 1
23 82 191 1
24 92 191 1
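One caveat: in the original data each row is a post, so a user's count appears once per post and prolific users weigh more heavily in the quantile. If the 80th percentile is meant to be taken across users, one option is to collapse to one row per user first and then join the flag back onto the posts. This is a minimal sketch, assuming a post-level data frame with a username column; the toy data and the names posts, user_flags, posts_flagged and row_level_cutoff are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: one row per post, ten made-up users
posts <- data.frame(
  username = rep(paste0("user", 1:10), times = c(1, 1, 1, 1, 1, 1, 1, 1, 5, 10)),
  stringsAsFactors = FALSE
)

# Collapse to one row per user so each user counts once in the quantile
user_flags <- posts %>%
  count(username, name = "count") %>%
  mutate(active = as.integer(count >= quantile(count, 0.8)))

# Join the user-level count and flag back onto the post-level rows
posts_flagged <- posts %>%
  left_join(user_flags, by = "username")

# For comparison: the cutoff when every post row is treated as an observation
row_level_cutoff <- posts_flagged %>%
  summarise(cutoff = quantile(count, 0.8, names = FALSE)) %>%
  pull(cutoff)
```

With this toy data the user-level cutoff is 1.8, so user9 and user10 are flagged, while the row-level cutoff is 10 and would flag only user10. Note also that if the data frame is still grouped by username when mutate runs, the quantile is computed within each user and every row ends up coded 1, so the grouping should be dropped with ungroup() first.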