I have in my dataset a variable called APP_ID which values can be repeated, and several columns which names start by “NICE_”. I’m trying to count, by APP_ID, the number of different “NICE_” columns that contains a value different than zero. For instance, for the following APP_ID the value should be 7 because there are ones in 7 different columns.:
APP_ID | NICE_1 | NICE_2 | NICE_3 | NICE_4 | NICE_5 | NICE_6 | NICE_7 | NICE_8 | NICE_9 | NICE_10 | NICE_11 | NICE_12 | NICE_13 | NICE_14 | NICE_15 | NICE_16 | NICE_17 | NICE_18 | NICE_19 | NICE_20 | NICE_21 | NICE_22 | NICE_23 | NICE_24 | NICE_25 | NICE_26 | NICE_27 | NICE_28 | NICE_29 | NICE_30 | NICE_31 | NICE_32 | NICE_33 | NICE_34 | NICE_35 | NICE_36 | NICE_37 | NICE_38 | NICE_39 | NICE_40 | NICE_41 | NICE_42 | NICE_43 | NICE_44 | NICE_45 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
38 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The point is that I managed to do so, by using:
data_QI %>%
arrange(APP_ID) %>% # Just to have them in order
select(APP_ID, starts_with("NICE_")) %>% # Select only the variables I’ll work with
group_by(APP_ID) %>%
summarize_all(list(sum)) %>%
mutate(APPNICENUM = rowSums(. != 0) - 1 ) # -1 to not count the column APP_ID.
The problem comes when I try to ungroup the observations. I mean, I need the repeated APP_ID to have the same APPNICENUM. However, adding ungroup() to the pipe doesn’t seem to work.
I wonder if there is a way to ungroup the data frame once all those operations are done. Or maybe there is a different (and simplier?) approach to my problem.
CodePudding user response:
summarise_all
will automatically remove the grouping, so there is no need for ungroup
afterwards:
library(tidyverse)
data <- tribble(
~APP_ID, ~NICE_1, ~NICE_2,
38, 0, 0,
38, 0, 0,
38, 1, 1
)
data
#> # A tibble: 3 x 3
#> APP_ID NICE_1 NICE_2
#> <dbl> <dbl> <dbl>
#> 1 38 0 0
#> 2 38 0 0
#> 3 38 1 1
result <-
data %>%
group_by(APP_ID) %>%
summarize_all(list(sum)) %>%
# -1 to not count APP_ID
mutate(APPNICENUM = rowSums(. != 0) - 1)
result
#> # A tibble: 1 x 4
#> APP_ID NICE_1 NICE_2 APPNICENUM
#> <dbl> <dbl> <dbl> <dbl>
#> 1 38 1 1 2
result %>% groups()
#> list()
Created on 2021-10-06 by the reprex package (v2.0.1)
CodePudding user response:
Do you want to keep the same number of rows as your original data but just add a new column which has count of columns that have 1 in it for each APP_ID
?
Try this answer -
library(dplyr)
result <- data_QI %>%
group_by(APP_ID) %>%
mutate(res = sum(colSums(select(cur_data(), starts_with('NICE'))) > 0)) %>%
ungroup
result
CodePudding user response:
We may do
library(dplyr)
data_QI %>%
group_by(APP_ID) %>%
summarise(res = sum(across(everything(), sum)), .groups = 'drop')