I have this code in R:
mp <- max_power$max
names(mp) <- max_power$athlete
x$normalized_power <- as.vector(x$power / mp[x$athlete])
x$temp_dist <- abs(x$temperature - 60)
x$log_time <- log(x$time)
normalized_power <- x$normalized_power
fit <- glm(
cbind(normalized_power, 1 - normalized_power) ~ log_time temp_dist,
family = binomial(),
data = x
)
I understand everything except for the cbind() in the equation. I am trying to replicate this using python:
self.pmax = max(self.power_array)
normalized_power = pd.DataFrame([x / self.pmax for x in self.power_array])
one_minus_np = pd.DataFrame([1 - x for x in [x / self.pmax for x in self.power_array]])
log_time = [math.log(x) for x in self.time_array]
temp_diff = [abs(x - 60) for x in self.temp_array]
df = pd.concat([normalized_power, one_minus_np], axis=1)
df.columns = ["np", "1-np"]
df["log_time"] = log_time
df["temp_diff"] = temp_diff
model = sm.glm("np ~ log_time temp_diff", df, family=family.Binomial()).fit()
Can someone explain what the cbind() is doing and how it can be replicated?
CodePudding user response:
Usually when you run a logistic regression, your outcome variable (on the left of the regression formula) is a simple TRUE/FALSE (or 1/0 or success/failure) column that indicates absence or presence of the characteristic you are trying to model.
However, we can sometimes have the situation where we have a column of counts of successes and another column of counts of failures.
For example, suppose I get 8 men to take 10 shots at a basketball hoop and record their scores. I also measure their height because I want to know whether this predicts their accuracy.
My data might look something like this:
baskets <- data.frame(height = c(1.5, 1.95, 1.8, 1.76, 1.52, 1.91, 1.66, 1.68),
shots_on_target = c(4, 9, 6, 8, 3, 9, 5, 5))
baskets
#> height shots_on_target
#> 1 1.50 4
#> 2 1.95 9
#> 3 1.80 6
#> 4 1.76 8
#> 5 1.52 3
#> 6 1.91 9
#> 7 1.66 5
#> 8 1.68 5
If I want to run a logistic regression on these data, I need to pass a column of successes and a column of failures as the outcome variable. Fortunately, glm
allows us to do just that. All we need to do is bind the columns together with cbind
- this will convert the success / failures columns into a single 2-column matrix.
Of course, I don't have a failures column, but since I know that each person had 10 shots, I can easily create it by doing 10 - shots_on_target
Therefore, my model can be created like so:
model <- glm(cbind(shots_on_target, 10 - shots_on_target) ~ height,
data = baskets, family = binomial)
summary(model)
#>
#> Call:
#> glm(formula = cbind(shots_on_target, 10 - shots_on_target) ~
#> height, family = binomial, data = baskets)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.94529 -0.31349 0.00671 0.52595 0.80363
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -10.094 3.038 -3.323 0.000892 ***
#> height 6.182 1.788 3.457 0.000546 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 16.944 on 7 degrees of freedom
#> Residual deviance: 2.564 on 6 degrees of freedom
#> AIC: 26.534
#>
#> Number of Fisher Scoring iterations: 4
and we can see that height was positively predictive of the number of shots on target.
In your example, the format for the outcome variable is similar to my example. However, it doesn't make a lot of sense for the numbers to be normalized to between 0 and 1. The two input columns should really be integers for this approach to make sense.
Created on 2021-11-04 by the reprex package (v2.0.0)