I am currently writing my master's thesis. My university provided me with student data that contains several variables, such as age, gender, subject, faculty, courses attended, grades received, whether the student churned, and many more. My task is to analyze this data in order to predict which students will churn and which will earn a degree. Before doing that, I want to do an exploratory data analysis. I am stuck at the point where I want to calculate proportions across two categorical variables: the subject and whether a student churned or not.
I created an easy example for the statistics I want to calculate:
Subject_Churn_df <- data.frame(Subject = c("Math", "Engineering", "IT", "Math", "IT", "IT", "Engineering"),
Churn = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No"))
Now I want to determine what proportion of the students in each subject churned.
I tried the following code:
library(dplyr)  # provides %>% and select()
Subject_Churn_df %>%
  select(Subject, Churn) %>%
  table() %>%
  prop.table()
but as a result I get
             Churn
Subject              No       Yes
  Engineering 0.1428571 0.1428571
  IT          0.1428571 0.2857143
  Math        0.1428571 0.1428571
In this case the proportions are calculated relative to the whole sample. However, I want the churn rate for every subject, e.g.
Engineering 0.5
IT 0.666667
Math 0.5
I would be grateful for every tip/solution. Thanks very much in advance.
Answer:
While this question is probably a better fit for Stack Overflow, your specific issue is that you did not pass the margin argument in the prop.table() call (proportions() is a newer name for the same function), so you are getting proportions over the entire table. Because the call to table() puts the subjects in rows, you want to pass 1 to margin, like so:
Subject_Churn_df <- data.frame(Subject = c("Math", "Engineering", "IT", "Math", "IT", "IT", "Engineering"),
Churn = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No"))
proportions(table(Subject_Churn_df), margin = 1L)
Which results in:
             Churn
Subject              No       Yes
  Engineering 0.5000000 0.5000000
  IT          0.3333333 0.6666667
  Math        0.5000000 0.5000000
This is, I believe, what you wanted.
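If only one number per subject is needed, the "Yes" column of the row-wise table can be pulled out on its own. The following is a minimal sketch in base R using the same toy data; the object names row_props, churn_rate, and churn_rate_alt are made up for illustration, and the tapply() line is just one alternative way to get the same values.

```r
# Same toy data as in the question.
Subject_Churn_df <- data.frame(
  Subject = c("Math", "Engineering", "IT", "Math", "IT", "IT", "Engineering"),
  Churn   = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No")
)

# Row-wise proportions (margin = 1), then keep only the "Yes" column.
# prop.table() is used here because it also exists in older R versions.
row_props  <- prop.table(table(Subject_Churn_df), margin = 1)
churn_rate <- row_props[, "Yes"]

# Alternative: the mean of a logical vector is the share of TRUE values,
# so averaging (Churn == "Yes") within each subject gives the churn rate.
churn_rate_alt <- tapply(Subject_Churn_df$Churn == "Yes",
                         Subject_Churn_df$Subject, mean)

print(churn_rate)
```

Both objects hold one churn rate per subject, in alphabetical order of the subject names. With this data, IT comes out at 2/3, because two of the three IT students churned.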