Regression and Prediction using R

I want to perform the following tasks using the fastfood dataset from the openintro package in R.

a) Create a regression predicting whether or not a restaurant is McDonalds or Subway based on calories, sodium, and protein. (McDonalds should be 1, Subway 0).

Save the coefficients to Q2.

b) Use data from only restaurants with between 50 and 60 items in the data set. Predict total fat from cholesterol, total carbs, vitamin a, and restaurant. Remove any nonsignificant predictors and run again.

Assign the strongest standardized regression coefficient to Q5.

Here's my code.

library(tidyverse)    
library(openintro)    
library(lm.beta)

fastfood <- openintro::fastfood    
head(fastfood)

#Solving for part (a)    
fit_1 <- lm(I(restaurant %in% c("Subway", "Mcdonalds")) ~ calories + sodium + protein, data = fastfood)

Q2 <- round(summary(fit_1)$coefficients,2)

#Solving for part (b)    
newdata <- fastfood[ which(fastfood$item>=50 & fastfood$item <= 60), ]    
df = sort(sample(nrow(newdata), nrow(data)*.7))    
newdata_train<-data[df,]    
newdata_test<-data[-df,]    
fit_5 <- lm(I(total_fat) ~ cholesterol + total_carb + vit_a + restaurant, data = newdata)
prediction_5 <- predict(fit_5, newdata = newdata_test)

Q5 <- lm.beta(fit_5)

But I'm not getting the desired results.

Here is the desired output:

output for part (a):

output for part (b):

CodePudding user response:

The first question requires logistic regression rather than linear regression, since the aim is to predict a binary outcome. The most sensible way to do this is, as the question suggests, to remove all the restaurants except McDonald's and Subway, then create a new binary variable to mark which rows are McDonald's and which aren't:

library(dplyr) 

fastfood <- openintro::fastfood %>% 
  filter(restaurant %in% c("Mcdonalds", "Subway")) %>%
  mutate(is_mcdonalds = restaurant == "Mcdonalds")

The logistic regression is done like this:

fit_1 <- glm(is_mcdonalds ~ calories + sodium + protein, 
             family = "binomial", data = fastfood)

And your coefficients are obtained like this:

Q2 <- round(coef(fit_1), 2)

Q2
#> (Intercept)    calories      sodium     protein 
#>       -1.24        0.00        0.00        0.06
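
To read coefficients like these, it may help to remember that they are on the log-odds scale, so a value that rounds to 0.00 is not necessarily unimportant when the predictor ranges over hundreds or thousands of units. The following is a minimal sketch on simulated data (the `toy` data frame and its generating values are assumptions for illustration, not the fastfood data) showing how to convert coefficients to odds ratios and get a predicted probability:

```r
# Toy data (simulated, NOT the fastfood data): a binary outcome and one predictor.
set.seed(1)
toy <- data.frame(protein = runif(200, 0, 60))
toy$is_mcdonalds <- rbinom(200, 1, plogis(-1.2 + 0.06 * toy$protein))

fit_toy <- glm(is_mcdonalds ~ protein, family = "binomial", data = toy)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios.
log_odds <- coef(fit_toy)
odds_ratios <- exp(log_odds)

# Predicted probability for a hypothetical item with 30 g of protein.
p_30 <- predict(fit_toy, newdata = data.frame(protein = 30), type = "response")
```

Here `type = "response"` asks `predict()` for probabilities rather than log-odds.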

The second question requires that you filter out any restaurants with more than 60 or fewer than 50 items:

fastfood <- openintro::fastfood %>%
  group_by(restaurant) %>%
  filter(n() >= 50 & n() <= 60)
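
If it is unclear what the grouped `filter()` is doing, the same idea can be sketched in base R: count the rows per restaurant, then keep only the restaurants whose count falls in the range. The data frame and restaurant names below are made up for illustration:

```r
# Toy data with hypothetical restaurant names and item counts (45, 55, 70).
toy <- data.frame(
  restaurant = rep(c("A", "B", "C"), times = c(45, 55, 70)),
  total_fat  = seq_len(170)
)

# Count rows (menu items) per restaurant.
counts <- table(toy$restaurant)

# Keep restaurants whose item count is between 50 and 60 inclusive.
keep <- names(counts)[counts >= 50 & counts <= 60]
toy_filtered <- toy[toy$restaurant %in% keep, ]

unique(toy_filtered$restaurant)
```

Note that the condition applies to the number of rows per restaurant, not to the values in the `item` column, which is where the code in the question goes wrong.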

We now fit the described regression and examine it to look for non-significant regressors:

fit_2 <- lm(total_fat ~ cholesterol + vit_a + total_carb + restaurant,
            data = fastfood)

summary(fit_2)
#> 
#> Call:
#> lm(formula = total_fat ~ cholesterol + vit_a + total_carb + restaurant, 
#>     data = fastfood)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -24.8280  -2.9417   0.9397   5.1450  21.0494 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         -1.20102    2.08029  -0.577 0.564751    
#> cholesterol          0.26932    0.01129  23.853  < 2e-16 ***
#> vit_a                0.01159    0.01655   0.701 0.484895    
#> total_carb           0.16327    0.03317   4.922 2.64e-06 ***
#> restaurantMcdonalds -4.90272    1.94071  -2.526 0.012778 *  
#> restaurantSonic      6.43353    1.89014   3.404 0.000894 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 7.611 on 125 degrees of freedom
#>   (34 observations deleted due to missingness)
#> Multiple R-squared:  0.8776, Adjusted R-squared:  0.8727 
#> F-statistic: 179.2 on 5 and 125 DF,  p-value: < 2.2e-16

We note that vit_a is non-significant and drop it from our model:

fit_3 <- update(fit_2, . ~ . - vit_a)

Now we get the standardized coefficients and round them:

coefs <- round(coef(lm.beta::lm.beta(fit_3)), 2)

and Q5 will be the maximum value of these coefficients:

Q5 <- coefs[which.max(coefs)]

Q5
#> cholesterol 
#>        0.82
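
One caveat: `which.max()` picks the largest value, which would miss a strong negative coefficient. If "strongest" is taken to mean largest in absolute value, a sketch like the following may be safer. It uses the built-in `mtcars` data as a stand-in and computes standardized slopes by hand with the usual textbook formula (raw slope times sd of the predictor divided by sd of the response), which I assume matches what `lm.beta` reports for a model with an intercept:

```r
# Built-in mtcars data stands in for the fastfood data here.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Standardized slope = raw slope * sd(predictor) / sd(response).
b <- coef(fit)[-1]                          # drop the intercept
x <- model.matrix(fit)[, -1, drop = FALSE]  # predictor columns only
beta_std <- b * apply(x, 2, sd) / sd(mtcars$mpg)

# "Strongest" taken as largest in absolute value, so a large negative
# coefficient is not missed.
strongest <- beta_std[which.max(abs(beta_std))]
```

With the model above, the same pattern would be `coefs[which.max(abs(coefs))]`.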

Created on 2022-02-26 by the reprex package (v2.0.1)