aggregating repeat entries prior to summarizing on other variables-CodePudding

df<-data.frame(id=c(1,2,3,4,3,4,3,1,1,2),event=c('a','a','a','a','a','a','a','b','b','b'),var1=c(1,1,0,0,1,0,1,1,0,1),var2=c(1,0,0,0,1,1,0,1,1,0),var_total=c(1,1,0,0,1,1,1,1,1,1))
df
   id event var1 var2 var_total
1   1     a    1    1         1
2   2     a    1    0         1
3   3     a    0    0         0
4   4     a    0    0         0
5   3     a    1    1         1
6   4     a    0    1         1
7   3     a    1    0         1
8   1     b    1    1         1
9   1     b    0    1         1
10  2     b    1    0         1

I am reformatting/cleaning data from a data entry site that produces very diagonal data. I have got it into a manageable form at the moment, but I still have one problem. There are events that repeat, and I would like each id/event combination to be unique. As you can see, lines 5 and 6 are duplicate in that aspect, but the variables are not identical. The variables are a binary response (yes=1,no=0), but if there is any yes in the event the variable should be 1. Additionally, the 'var_total' column should be 1 if ANY of the variables are positive.

My data set has 77 of these 'repeat events' out of over 6000 entries, and it's likely to change everytime more data is entered. How do I isolate ids with repeat events so I can aggregate() (?summarise) them and be sure it's done correctly? There are over 15 variables. I need to report number of ids per event for all variables.

CodePudding user response：

library(dplyr)
df %>%
  group_by(id, event) %>%
  summarize(
    across(var1:var2, ~  any(. > 0)),
    var_total =  ((var1   var2) > 0),
    .groups = "drop"
  ) %>%
  arrange(event, id)
# # A tibble: 6 x 5
#      id event  var1  var2 var_total
#   <dbl> <chr> <int> <int>     <int>
# 1     1 a         1     1         1
# 2     2 a         1     0         1
# 3     3 a         1     1         1
# 4     4 a         0     1         1
# 5     1 b         1     1         1
# 6     2 b         1     0         1

Notes:

The arrange is purely getting it back to the order you had in your question, not required for the operation of the code.
summarize is wiping out var_total and then recreating it based on the logic you stated (either var* is 1).
I could have easily used across(var1:var2, max) instead of the ~ any(. > 0), it produces the same results here. I shows the ~ any version purely to demonstrating something a little more complex than max.