I have a dataset where I would like to count the occurences of a number of variables for multiple subsets. Ideally, this would be automatic.
My dataset looks somewhat like this:
var1 <- c("Checked", "Checked", "Unchecked")
var2 <- c("Unchecked", "Checked", "Unchecked")
var3 <- c("Checked", "Unchecked", "Unchecked")
varA <- c("Unchecked", "Checked", "Checked")
varB <- c("Unchecked", "Checked", "Checked")
varC <- c("Checked", "Unchecked", "Checked")
dummy <- cbind(var1,var2,var3,varA,varB,varC)
For everyone who checked box var1, I would like to count the amount of "checked" boxes for varA, varB and varC. Same for var2 and var3.
The ideal end result would be a dataframe that looks somewhat like this: where the rows indicate the subsets and the rows the counts of "Checked" for varA, varB and varC respectively.
varA varB varC
var1 1 1 1
var2 1 1 0
var3 0 0 1
Bonus points for being able to easily convert this to proportions (eta: of checked vs unchecked)!
I figured out I should convert the "Checked" "Unchecked" to 0 and 1, and these should be (converted to) numeric:
dummy[dummy == "Checked"] <- 1
dummy[dummy == "Unchecked"] <- 0
dummy <- as.data.frame(apply(dummy, 2, as.numeric))
dummy now looks like this, so far so good.
var1 var2 var3 varA varB varC
1 1 0 1 0 0 1
2 1 1 0 1 1 0
3 0 0 0 1 1 1
However, now I am stuck. I can of course manually calculate the sum of columns 4:6 with the subset function and compile all of that in a new dataframe, but since my real dataset has way more variables and subsets, this is not an ideal solution.
Thanks! First Q here, so I tried to be precise but will fine-tune the Q if needed :)
CodePudding user response:
You can do:
dummy <- data.frame(var1,var2,var3,varA,varB,varC)
dummy %>%
pivot_longer(cols = matches('\\d$')) %>%
group_by(name) %>%
summarize(across(starts_with('var'), ~sum(. == 'Checked' & value == 'Checked')))
# A tibble: 3 x 4
name varA varB varC
<chr> <int> <int> <int>
1 var1 1 1 1
2 var2 1 1 0
3 var3 0 0 1