I am working with the R programming language.
I am trying to create a game similar to "Guess Who?" (https://en.wikipedia.org/wiki/Guess_Who?), a game in which players try to narrow down an in-game character based on a series of guesses.
Here is a dataset I simulated that contains the "counts" for athletes having different characteristics:
hair_color = factor(c("black", "brown", "blonde", "bald"))
glasses = factor(c("yes", "no", "contact lenses"))
sport = factor(c("football", "basketball", "tennis"))
gender = factor(c("male", "female", "other"))
problem = expand.grid(var1 = hair_color, var2 = glasses, var3 = sport, var4 = gender)
problem$counts = as.integer(rnorm(108, 20,5))
dataset = problem
    var1 var2     var3 var4 counts
1  black  yes football male     22
2  brown  yes football male     16
3 blonde  yes football male     12
4   bald  yes football male     22
5  black   no football male     14
6  brown   no football male     19
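Since rnorm() draws new values on every run, the counts shown above will differ from session to session. A minimal sketch of a reproducible version of the same simulation, assuming an arbitrary seed of 123 (the seed is not part of the original post):

```r
set.seed(123)  # arbitrary seed, an assumption added for reproducibility

hair_color <- factor(c("black", "brown", "blonde", "bald"))
glasses <- factor(c("yes", "no", "contact lenses"))
sport <- factor(c("football", "basketball", "tennis"))
gender <- factor(c("male", "female", "other"))

# One row per combination of levels: 4 * 3 * 3 * 3 = 108 rows
dataset <- expand.grid(var1 = hair_color, var2 = glasses, var3 = sport, var4 = gender)

# One simulated count per row; as.integer() truncates toward zero
dataset$counts <- as.integer(rnorm(nrow(dataset), mean = 20, sd = 5))
```

Using nrow(dataset) instead of a hard-coded 108 keeps the simulation valid if levels are added or removed later.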
I then wrote a function that lets the user select rows from this dataset corresponding to a certain profile of characteristics:
my_function <- function(dataset, var1 = NULL, var2 = NULL, var3 = NULL, var4 = NULL) {
  # Create a logical vector to store the rows that match the specified criteria
  selection <- rep(TRUE, nrow(dataset))
  # Filter rows based on the specified levels of var1
  if (!is.null(var1)) {
    selection <- selection & dataset$var1 %in% var1
  }
  # Filter rows based on the specified levels of var2
  if (!is.null(var2)) {
    selection <- selection & dataset$var2 %in% var2
  }
  # Filter rows based on the specified levels of var3
  if (!is.null(var3)) {
    selection <- selection & dataset$var3 %in% var3
  }
  # Filter rows based on the specified levels of var4
  if (!is.null(var4)) {
    selection <- selection & dataset$var4 %in% var4
  }
  # Select the rows that match the specified criteria
  selected_rows <- dataset[selection, ]
  # Return the selected rows
  return(selected_rows)
}
Now, to call the function and select all rows where the hair is "BLACK OR BROWN" AND the glasses are YES:
head(my_function(dataset, var1 = c("black", "brown"), var2 = c("yes")))
    var1 var2       var3 var4 counts
1  black  yes   football male     22
2  brown  yes   football male     16
13 black  yes basketball male     14
14 brown  yes basketball male      9
25 black  yes     tennis male     13
As another example, to select all rows where the hair is "BLACK OR BROWN" AND the glasses are NO:
head(my_function(dataset, var1 = c("black", "brown"), var2 = c("no")))
    var1 var2       var3 var4 counts
5  black   no   football male     14
6  brown   no   football male     19
17 black   no basketball male     17
18 brown   no basketball male     27
This leads me to my question - suppose I wanted to know the following: What is the (conditional) probability that an athlete wears glasses, given that they have black or brown hair?
Manually, I could answer the question like this:
a = my_function(dataset, var1 = c("black", "brown"), var2 = c("yes"))
b = my_function(dataset, var1 = c("black", "brown"), var2 = c("no"))
prob_yes = sum(a$counts) / (sum(a$counts) + sum(b$counts))
prob_no = sum(b$counts) / (sum(a$counts) + sum(b$counts))
> prob_yes
[1] 0.481203
> prob_no
[1] 0.518797
I was wondering if I could somehow generalize this function. Suppose I wanted my function to take as inputs:
- Which variables and which levels of these variables (e.g. - not all variables need to be selected)
- Which variable (single variable) should the conditional probability be calculated on (e.g. "glasses")
And as an output:
- The probabilities for all levels of this variable should be calculated
As an example - the desired function could be called like this:
my_function(dataset, input_var_list = list(var1 = c("black", "brown"), var3 = c("football")), conditional_var = c("var2"))
And this desired function would return:
- The probability of wearing glasses given that the athlete has black/brown hair and plays football
- The probability of not wearing glasses given that the athlete has black/brown hair and plays football
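The inputs and outputs described above could be sketched roughly as follows. This is only one possible shape, not a definitive implementation: the name cond_prob, the count_col argument, and the toy data frame are all hypothetical and not from the original post. It assumes the filters arrive as a named list (a plain c() would flatten the names) and that probabilities should be weighted by the counts column.

```r
# Hypothetical helper: cond_prob() and count_col are illustrative names
cond_prob <- function(dataset, input_var_list = list(), conditional_var,
                      count_col = "counts") {
  stopifnot(conditional_var %in% names(dataset),
            all(names(input_var_list) %in% names(dataset)))

  # Keep rows that match every supplied variable/level filter
  selection <- rep(TRUE, nrow(dataset))
  for (v in names(input_var_list)) {
    selection <- selection & dataset[[v]] %in% input_var_list[[v]]
  }
  selected_rows <- dataset[selection, , drop = FALSE]

  # Sum the counts within each level of the conditioning variable
  totals <- tapply(selected_rows[[count_col]],
                   selected_rows[[conditional_var]], sum)
  totals[is.na(totals)] <- 0  # levels with no matching rows get zero weight

  totals / sum(totals)
}

# Toy data (made up for illustration) with deterministic counts
toy <- expand.grid(var1 = c("black", "brown", "blonde"),
                   var2 = c("yes", "no"),
                   var3 = c("football", "tennis"),
                   stringsAsFactors = TRUE)
toy$counts <- 1:12

p <- cond_prob(toy,
               input_var_list = list(var1 = c("black", "brown"), var3 = "football"),
               conditional_var = "var2")
p  # yes = 3/12 = 0.25, no = 9/12 = 0.75
```

Looping over the names of the list means variables that are not mentioned are simply left unfiltered, which covers the "not all variables need to be selected" requirement.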
Can someone please help me re-write this function?
Thanks!
CodePudding user response:
Not the most efficient approach, but I stayed on the path you went on... Is this what you are looking for?
my_function <- function(dataset, var1 = NULL, var2 = NULL, var3 = NULL, var4 = NULL,
                        conditional_var = NULL) {
  # Create a logical vector to store the rows that match the specified criteria
  selection <- rep(TRUE, nrow(dataset))
  # Filter rows based on the specified levels of var1
  if (!is.null(var1)) {
    selection <- selection & dataset$var1 %in% var1
  }
  # Filter rows based on the specified levels of var2
  if (!is.null(var2)) {
    selection <- selection & dataset$var2 %in% var2
  }
  # Filter rows based on the specified levels of var3
  if (!is.null(var3)) {
    selection <- selection & dataset$var3 %in% var3
  }
  # Filter rows based on the specified levels of var4
  if (!is.null(var4)) {
    selection <- selection & dataset$var4 %in% var4
  }
  # Select the rows that match the specified criteria
  selected_rows <- dataset[selection, ]
  # Return the selected rows, or, if conditional_var is given, the proportion
  # of selected rows per level of that variable (rows are counted, the
  # counts column is not used as a weight)
  if (is.null(conditional_var)) {
    return(selected_rows)
  } else {
    return(prop.table(table(selected_rows[conditional_var])))
  }
}
my_function(dataset, var1 = c("black", "brown"), var3 = c("football"), conditional_var = c("var2"))
output
var2
contact lenses             no            yes
     0.3333333      0.3333333      0.3333333
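Note that every level comes out at exactly 1/3 here. table() counts rows, and the expand.grid() data has one row per combination of levels, so the counts column never enters the calculation. If the probabilities are meant to be weighted by counts, as in the manual calculation in the question, something like the following might be closer. It is a sketch on made-up rows standing in for selected_rows, using xtabs() to sum counts per level:

```r
# Toy rows standing in for `selected_rows` (values are made up for illustration)
selected_rows <- data.frame(
  var2 = factor(c("yes", "no", "contact lenses", "yes", "no", "contact lenses")),
  counts = c(22L, 14L, 10L, 16L, 19L, 19L)
)

# Row-based: each level appears in two rows, so every share is 1/3
row_share <- prop.table(table(selected_rows["var2"]))

# Count-weighted: sum `counts` within each level, then normalise
weighted_share <- prop.table(xtabs(counts ~ var2, data = selected_rows))
weighted_share  # contact lenses = 0.29, no = 0.33, yes = 0.38
```

Inside the function, the count-weighted variant would need the conditioning column by name, for example by building the formula with reformulate(conditional_var, response = "counts").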