df<-data.frame(id=c(1,2,3,4,3,4,3,1,1,2),event=c('a','a','a','a','a','a','a','b','b','b'),var1=c(1,1,0,0,1,0,1,1,0,1),var2=c(1,0,0,0,1,1,0,1,1,0),var_total=c(1,1,0,0,1,1,1,1,1,1))
df
id event var1 var2 var_total
1 1 a 1 1 1
2 2 a 1 0 1
3 3 a 0 0 0
4 4 a 0 0 0
5 3 a 1 1 1
6 4 a 0 1 1
7 3 a 1 0 1
8 1 b 1 1 1
9 1 b 0 1 1
10 2 b 1 0 1
I am reformatting/cleaning data from a data entry site that produces very diagonal data. I have got it into a manageable form at the moment, but I still have one problem. There are events that repeat, and I would like each id/event combination to be unique. As you can see, lines 5 and 6 are duplicate in that aspect, but the variables are not identical. The variables are a binary response (yes=1,no=0), but if there is any yes in the event the variable should be 1. Additionally, the 'var_total' column should be 1 if ANY of the variables are positive.
My data set has 77 of these 'repeat events' out of over 6000 entries, and it's likely to change everytime more data is entered. How do I isolate ids with repeat events so I can aggregate() (?summarise) them and be sure it's done correctly? There are over 15 variables. I need to report number of ids per event for all variables.
CodePudding user response:
library(dplyr)
df %>%
group_by(id, event) %>%
summarize(
across(var1:var2, ~ any(. > 0)),
var_total = ((var1 var2) > 0),
.groups = "drop"
) %>%
arrange(event, id)
# # A tibble: 6 x 5
# id event var1 var2 var_total
# <dbl> <chr> <int> <int> <int>
# 1 1 a 1 1 1
# 2 2 a 1 0 1
# 3 3 a 1 1 1
# 4 4 a 0 1 1
# 5 1 b 1 1 1
# 6 2 b 1 0 1
Notes:
- The
arrange
is purely getting it back to the order you had in your question, not required for the operation of the code. summarize
is wiping outvar_total
and then recreating it based on the logic you stated (eithervar*
is 1).- I could have easily used
across(var1:var2, max)
instead of the~ any(. > 0)
, it produces the same results here. I shows the~ any
version purely to demonstrating something a little more complex thanmax
.