I have a dataset with names from some variables corresponding to each month (e.g. var1_M1, var1_M2 var2_M1 ...) and I need to summarize by each month (after conditional) and identify the first case when I find a value > 90 and finally indicate the corresponding month.
library(dplyr)
ID <- c("Dave", "Joe", "Steve")
var1_M1 <- c(10, 10, 90)
var1_M2 <- c(30, 90, 95)
var1_M3 <- c(90, 100, 95)
var2_M1 <- c(10, 90, 20)
var2_M2 <- c(33, 10, 100)
var2_M3 <- c(90, 10, 50)
data <- tibble(ID, var1_M1, var1_M2, var1_M3, var2_M1, var2_M2, var2_M3)
# A tibble: 3 x 7
ID var1_M1 var1_M2 var1_M3 var2_M1 var2_M2 var2_M3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Dave 10 30 90 10 33 90
2 Joe 10 90 100 90 10 10
3 Steve 90 95 95 20 100 50
data %>%
mutate_at(vars(matches(c("M1","M2","M3"))),
list(~ ifelse(. > 90, .,0))) #%>%
#mutate() using map_dfc and if_any ??
Expected output
# A tibble: 3 x 5
ID M1 M2 M3 output
<chr> <dbl> <dbl> <dbl> <dbl>
1 Dave 0 0 0 0
2 Joe 0 0 100 3
3 Steve 0 195 95 2
CodePudding user response:
We may reshape to 'long' format with pivot_longer
, grouped by 'ID', get the sum
across
the columns, where the value is greater than 90, and create the output
of the first column index value not 0 with max.col
library(tidyr)
library(dplyr)
pivot_longer(data, cols = -ID, names_to = c(".value"),
names_pattern = ".*_(.*)") %>%
group_by(ID) %>%
summarise(across(everything(), ~ sum(.x[.x > 90]))) %>%
mutate(output = max.col(across(starts_with("M")) >0, "first") *
if_any(starts_with("M"), ~ .x > 0))
-output
# A tibble: 3 × 5
ID M1 M2 M3 output
<chr> <dbl> <dbl> <dbl> <int>
1 Dave 0 0 0 0
2 Joe 0 0 100 3
3 Steve 0 195 95 2
With pivot_longer
, we specified a regex in names_pattern
to capture the characters after the _
as a group ((...)
) in column names i.e. M1
, M2
, M3
) using .*
- to match zero or more characters, and specify the names_to
as .value
so that it reshape to long format with the column names as M1
, M2
, M3
combining all the var_
columns.
After we summarise
with the sum
, use max.col
to find the first index of column values greater than 0. In the first row, i.e. for Dave id, all the values are 0, thus max.col
returns the index 1. In order to make it 0, multiply with TRUE/FALSE output from if_any
(TRUE
-> 1, FALSE
-> 0). i.e. it may be better to create a column to understand it better
pivot_longer(data, cols = -ID, names_to = c(".value"),
names_pattern = ".*_(.*)") %>%
group_by(ID) %>%
summarise(across(everything(), ~ sum(.x[.x > 90]))) %>%
mutate(if_any_ind = if_any(starts_with("M"), ~ .x > 0),
output_ind =max.col(across(starts_with("M")) >0, "first"),
output = if_any_ind * output_ind)
-output
# A tibble: 3 × 7
ID M1 M2 M3 if_any_ind output_ind output
<chr> <dbl> <dbl> <dbl> <lgl> <int> <int>
1 Dave 0 0 0 FALSE 1 0
2 Joe 0 0 100 TRUE 3 3
3 Steve 0 195 95 TRUE 2 2
ie. any value x 0 -> 0 and value x 1 -> 1