How to calculate the proportion of two categorical variables in R

Time:12-28

I am currently writing my master's thesis. My university provided me with student data that contains several variables such as age, gender, subject, faculty, courses visited, grades received, whether the student churned, and many more. My task is to analyze this data in order to predict which students will churn and which will earn a degree. Before doing that, I want to do an exploratory data analysis. I am stuck at the point where I want to calculate the proportion of two categorical variables: the subject and whether a student churned or not.

I created an easy example for the statistics I want to calculate:

Subject_Churn_df <- data.frame(Subject = c("Math", "Engineering", "IT", "Math", "IT", "IT", "Engineering"),
                               Churn = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No"))

Now I want to determine which proportion of which subject churned.

I tried the following code:

library(dplyr)  # provides %>% and select()

Subject_Churn_df %>%
  select(Subject, Churn) %>%
  table() %>%
  prop.table()

but as a result I get

             Churn
Subject              No       Yes
  Engineering 0.1428571 0.1428571
  IT          0.1428571 0.2857143
  Math        0.1428571 0.1428571

In this case the proportion is calculated over the whole sample. However, I want to have the churn rate for every subject, e.g.

Engineering 0.5
IT 0.666667
Math 0.5

I would be grateful for every tip/solution. Thanks very much in advance.

CodePudding user response:

While this is probably better suited to Stack Overflow, your specific issue is that you did not pass a margin argument in your prop.table call (proportions, used below, is the newer name for the same function), so you are getting the proportion over the entire table. As the call to table puts the subjects in rows, you want to pass 1 to margin, like so:

Subject_Churn_df <- data.frame(Subject = c("Math", "Engineering", "IT", "Math", "IT", "IT", "Engineering"),
                               Churn = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No"))

proportions(table(Subject_Churn_df), margin = 1L)

Which results in:

             Churn
Subject              No       Yes
  Engineering 0.5000000 0.5000000
  IT          0.3333333 0.6666667
  Math        0.5000000 0.5000000

Which is, I believe, what you wanted.
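
If you only need one churn rate per subject rather than the full table, a minimal sketch is to keep just the "Yes" column of the row-wise table. This assumes Churn only ever takes the values "Yes" and "No"; the names row_props, churn_rate and churn_rate_check are purely illustrative and not part of the original question. The tapply line at the end is an optional cross-check that gets the same numbers without building a table.

```r
Subject_Churn_df <- data.frame(
  Subject = c("Math", "Engineering", "IT", "Math", "IT", "IT", "Engineering"),
  Churn   = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No")
)

# Row-wise proportions: margin = 1 makes each subject's row sum to 1.
# prop.table() is used here because it also exists in older R versions.
row_props <- prop.table(table(Subject_Churn_df), margin = 1)

# Keep only the "Yes" column: one churn rate per subject.
churn_rate <- row_props[, "Yes"]
print(churn_rate)

# Cross-check without a table: the mean of a logical vector is the
# share of TRUE values, computed here separately for each subject.
churn_rate_check <- tapply(Subject_Churn_df$Churn == "Yes",
                           Subject_Churn_df$Subject, mean)
print(churn_rate_check)
```

With the example data, IT comes out at about 0.667 (two of its three students churned), while Engineering and Math are both 0.5.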
