I have a data frame:
df <- structure(list(ID = c("x1", "x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x2", "x3", "x3", "x3", "x3", "x3", "x3", "x1", "x1", "x1", "x1", "x1", "x1", "x2", "x2", "x2", "x2", "x2", "x2", "x3", "x3", "x3", "x3", "x3", "x3"), col1=c("a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a1","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2","a2"), col2 = c("a", "b", "c", "d", "e", "f", "a", "b", "c", "d", "e", "f","a", "b", "c", "d", "e", "f","a", "b", "c", "d", "e", "f", "a", "b", "c", "d", "e", "f","a", "b", "c", "d", "e", "f"), col3 = c(2,13,1,21,0,5,3,0,6,4,50,0,0,0,0,9,5,0,51,3,6,0,0,9,89,4,29,1,4,17,6,16,9,1,0,0)),
class = "data.frame", row.names = c(NA,-36L))
ID col1 col2 col3
x1 a1 a 2
x1 a1 b 13
x1 a1 c 1
x1 a1 d 21
x1 a1 e 0
x1 a1 f 5
x2 a1 a 3
x2 a1 b 0
x2 a1 c 6
x2 a1 d 4
x2 a1 e 50
x2 a1 f 0
x3 a1 a 0
x3 a1 b 0
x3 a1 c 0
x3 a1 d 9
x3 a1 e 5
x3 a1 f 0
x1 a2 a 51
x1 a2 b 3
x1 a2 c 6
x1 a2 d 0
x1 a2 e 0
x1 a2 f 9
x2 a2 a 89
x2 a2 b 4
x2 a2 c 29
x2 a2 d 1
x2 a2 e 4
x2 a2 f 17
x3 a2 a 6
x3 a2 b 16
x3 a2 c 9
x3 a2 d 1
x3 a2 e 0
x3 a2 f 0
I want to count the unique IDs that have "a", "b" or "c" >0 (more than zero), then "d" or "e" >0, and finally "f">0. Then get the sum of all (abc), (de) and (f) separately in a different column. So the result would look like the following:
df2<- structure(list(col1=c("a1","a1","a1","a2","a2","a2"), col2 = c("abc", "de", "f", "abc", "de", "f"), count.ID = c(2,3,1,3,2,2), total=c(25,89,5,213,6,26)),
class = "data.frame", row.names = c(NA,-6L))
col1 col2 count.ID total
a1 abc 2 25
a1 de 3 89
a1 f 1 5
a2 abc 3 213
a2 de 2 6
a2 f 2 26
How is this possible in R? Thanks
CodePudding user response:
One premise would be to create a frame of grouping variables mapping old col2
to the new combined col2
, and then merge
/join
it to the original data.
dplyr
library(dplyr)
groups <- data.frame(col2=c("a","b","c","d","e","f"), col2b=c("abc","abc","abc","de","de","f"))
left_join(df, groups, by = "col2") %>%
group_by(col1, col2 = col2b) %>%
summarize(count.ID = length(unique(ID[col3 > 0])), total = sum(col3)) %>%
ungroup()
# # A tibble: 6 x 4
# col1 col2 count.ID total
# <chr> <chr> <int> <dbl>
# 1 a1 abc 2 25
# 2 a1 de 3 89
# 3 a1 f 1 5
# 4 a2 abc 3 213
# 5 a2 de 2 6
# 6 a2 f 2 26