I am trying logistic regression on a dataset. I have successfully split my dataset into train and test sets. The regression model also fits fine; however, when I apply it to my test set I only get an outcome for 393 observations, even though my test dataset has 480 rows. How can I compare the two and find the mismatch, or find out what went wrong?
My data has no NAs.
I am trying to create a confusion matrix.
This is my code:
n=nrow(wine_log)
shuffled=wine_log[sample(n),]
train_indices=1:round(0.7*n)
test_indices=(round(0.7*n)+1):n
#Making a new dataset
train=shuffled[train_indices,]
test=shuffled[test_indices,]
wmodel = glm(final_take~., family = binomial, data=train)
summary(wmodel)
result1 = predict(wmodel, newdata = test, type = 'response')
result1 = ifelse(result > 0.5, 1, 0) # Can someone also explain how removing this will affect the outcome?
result1
> table(result1)
result1
  0   1
255 138
> table(test$final_take)
 Bad Good
 418   62
structure(list(fixed_acid = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4,
7.9, 7.3, 7.8, 7.5), vol_acid = c(0.7, 0.88, 0.76, 0.28, 0.7,
0.66, 0.6, 0.65, 0.58, 0.5), c_acid = c(0, 0, 0.04, 0.56, 0,
0, 0.06, 0, 0.02, 0.36), res_sugar = c(1.9, 2.6, 2.3, 1.9, 1.9,
1.8, 1.6, 1.2, 2, 6.1), chlorides = c(0.076, 0.098, 0.092, 0.075,
0.076, 0.075, 0.069, 0.065, 0.073, 0.071), free_siox = c(11,
25, 15, 17, 11, 13, 15, 15, 9, 17), total_diox = c(34, 67, 54,
60, 34, 40, 59, 21, 18, 102), density = c(0.9978, 0.9968, 0.997,
0.998, 0.9978, 0.9978, 0.9964, 0.9946, 0.9968, 0.9978), pH = c(3.51,
3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35), sulphates = c(0.56,
0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8), alcohol = c(9.4,
9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5), final_take = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("Bad", "Good"
), class = "factor")), row.names = c(NA, -10L), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"),
CodePudding user response:
Your line of code here:
result1 = ifelse(result > 0.5, 1, 0)
should be referencing result1 in the ifelse statement. I'm guessing that result is another object in your environment that doesn't have 480 rows.
So you should use this instead:
result1 = ifelse(result1 > 0.5, 1, 0)
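To confirm that the mismatch comes from a stale object rather than from predict itself, it may help to compare lengths at each step. The following is only a minimal sketch on made-up data: toy, x, y, fit, prob and pred are hypothetical names, not objects from the question.

```r
# Hypothetical toy data standing in for wine_log (not the asker's data)
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- factor(ifelse(toy$x + rnorm(100) > 0, "Good", "Bad"))

train <- toy[1:70, ]
test  <- toy[71:100, ]

fit  <- glm(y ~ x, family = binomial, data = train)
prob <- predict(fit, newdata = test, type = "response")

# With no NAs in newdata, predict() returns one value per test row
length(prob) == nrow(test)

# Threshold the same object that predict() returned, so the length is preserved
pred <- ifelse(prob > 0.5, 1, 0)
length(pred) == nrow(test)
```

If either comparison came back FALSE on the real data, that would point to the step where rows are being lost.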
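You also asked what this line of code is doing. It's basically a threshold for your predictions from the glm model. If the prediction from the model is greater than 0.50, then you are translating the prediction to a "1". If it's less than or equal to 0.50, then you are translating that prediction to a "0". It's a way to convert a probability to a TRUE/FALSE or 1/0. If you remove the line, result1 stays a vector of probabilities between 0 and 1, so table(result1) would count nearly every distinct probability separately instead of giving you two classes for a confusion matrix.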
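Since the goal is a confusion matrix, here is a rough sketch of how the thresholded predictions could be cross-tabulated against the true labels. It uses simulated data, and toy, x, idx, prob, pred and conf_mat are hypothetical names. It also assumes, as in the question's data, that "Bad" is the first factor level and "Good" the second.

```r
# Simulated stand-in for the wine data (an assumption, not the real dataset)
set.seed(42)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$final_take <- factor(ifelse(toy$x + rnorm(n) > 0.5, "Good", "Bad"),
                         levels = c("Bad", "Good"))

# 70/30 split on shuffled row indices
idx   <- sample(n)
train <- toy[idx[1:140], ]
test  <- toy[idx[141:n], ]

wmodel <- glm(final_take ~ x, family = binomial, data = train)
prob   <- predict(wmodel, newdata = test, type = "response")

# For a factor response, glm models the probability of the second level ("Good"),
# so probabilities above the threshold map to "Good"
pred <- factor(ifelse(prob > 0.5, "Good", "Bad"), levels = c("Bad", "Good"))

# Rows are predictions, columns are the true labels
conf_mat <- table(predicted = pred, actual = test$final_take)
conf_mat

# Share of test rows on the diagonal (correct predictions)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Mapping the predictions back to the same labels as test$final_take, rather than leaving them as 0/1, keeps the rows and columns of the table directly comparable.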