I am trying to fit a logistic regression that predicts cancer (1) or no cancer (0) from several categorical variables in the dataset.
The raw dataset is large, so to reduce the number of rows it includes a "count" column giving the number of times each combination of variable values occurred.
How do I incorporate this count column into the logistic regression?
my_model <- glm(cancer ~ age_grp + density + race + bmi, family = binomial(), data = bcancer)
Thank you! Any guidance is much appreciated!
Dataset from BCSC: https://www.bcsc-research.org/data/rfdataset/dataset
Answer:
You seem to have data like this.
head(dat)
#   cancer age_grp race bmi count
# 1      0       1    1  18   561
# 2      1       1    1  18   997
# 3      0       2    1  18   321
# 4      1       2    1  18   153
# 5      0       3    1  18    74
# 6      1       3    1  18   228
You could fit a weighted regression, with count as the weights.
summary(glm(cancer ~ age_grp + race + bmi, family=binomial(), data=dat,
            weights=count))$coef
#                 Estimate  Std. Error    z value     Pr(>|z|)
# (Intercept)  0.364588477 0.041604639   8.763169 1.898369e-18
# age_grp      0.009726589 0.002182186   4.457269 8.301035e-06
# race         0.020779774 0.005636968   3.686339 2.275036e-04
# bmi         -0.021827620 0.001754685 -12.439623 1.592543e-35
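The question mentions categorical predictors, whereas the call above treats age_grp and race as numeric. Here is a minimal sketch of the same weighted fit with those columns converted to factors. It uses the simulated data from the Data section, and it assumes the integer codes are category labels and that bmi stays numeric; in the real BCSC file the column names and codings may differ.

```r
# Simulated data, same as in the Data section
set.seed(42)
dat <- expand.grid(cancer = 0:1, age_grp = 1:7, race = 1:3, bmi = 18:26)
dat$count <- sample(1e3, nrow(dat), replace = TRUE)

# Assumption: age_grp and race are category codes, so convert them to factors
dat_f <- transform(dat, age_grp = factor(age_grp), race = factor(race))

# Same weighted logistic regression, now with one dummy per non-reference level
fit_f <- glm(cancer ~ age_grp + race + bmi, family = binomial(),
             data = dat_f, weights = count)
summary(fit_f)$coef
```

With factors, each level other than the reference level gets its own coefficient, so the coefficient table has more rows than the numeric version.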
You could also try to "unpack" the data,
dat_unpack <- do.call(rbind.data.frame,
                      apply(dat, 1, \(x)
                            t(replicate(x['count'], x[setdiff(names(x), 'count')]))))
head(dat_unpack)
#   cancer age_grp race bmi
# 1      0       1    1  18
# 2      0       1    1  18
# 3      0       1    1  18
# 4      0       1    1  18
# 5      0       1    1  18
# 6      0       1    1  18
but that is wasted effort since, apart from the usual rounding error, the results are identical.
summary(glm(cancer ~ age_grp + race + bmi, family=binomial(), data=dat_unpack))$coef
#                 Estimate  Std. Error    z value     Pr(>|z|)
# (Intercept)  0.364588477 0.041604640   8.763169 1.898374e-18
# age_grp      0.009726589 0.002182186   4.457268 8.301070e-06
# race         0.020779774 0.005636970   3.686338 2.275043e-04
# bmi         -0.021827620 0.001754685 -12.439621 1.592570e-35
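If you do want the unpacked data, repeating row indices with rep() is a shorter route than the apply()/replicate() construction above. The sketch below uses the simulated data from the Data section; the names dat_long, fit_w and fit_long are just illustrative. It also checks that the weighted and unpacked fits give the same coefficients, up to a small tolerance.

```r
# Simulated data, same as in the Data section
set.seed(42)
dat <- expand.grid(cancer = 0:1, age_grp = 1:7, race = 1:3, bmi = 18:26)
dat$count <- sample(1e3, nrow(dat), replace = TRUE)

# Repeat each row index `count` times, then drop the count column
dat_long <- dat[rep(seq_len(nrow(dat)), times = dat$count), names(dat) != "count"]
rownames(dat_long) <- NULL

# Weighted fit on the compact data vs. unweighted fit on the unpacked data
fit_w    <- glm(cancer ~ age_grp + race + bmi, family = binomial(),
                data = dat, weights = count)
fit_long <- glm(cancer ~ age_grp + race + bmi, family = binomial(),
                data = dat_long)

# Compare coefficients, allowing for small numerical differences
all.equal(coef(fit_w), coef(fit_long), tolerance = 1e-6)
```

The unpacked data frame has sum(dat$count) rows, so on a large dataset the weighted fit is much cheaper in memory.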
Data
set.seed(42)
dat <- expand.grid(cancer=0:1, age_grp=1:7, race=1:3, bmi=18:26)
dat$count <- sample(1e3, nrow(dat), replace=TRUE)