How to include "count" frequency of variable combination in logistic regression?

I am trying to build a logistic regression model that predicts cancer (1) or no cancer (0) from various categorical variables in the dataset.

In the raw data set, there is a "count" column that indicates the frequency of each combination.

The dataset is large, so to reduce the number of rows, the providers collapsed identical rows and added a "count" column giving the number of times each combination of variables occurred.

How do I incorporate this count column in the logistic regression?

my_model <- glm(cancer ~ age_grp + density + race + bmi, bcancer)

Thank you! Any guidance is much appreciated!

Dataset from BCSC: https://www.bcsc-research.org/data/rfdataset/dataset

Answer:

You seem to have data like this.

head(dat)
#   cancer age_grp race bmi count
# 1      0       1    1  18   561
# 2      1       1    1  18   997
# 3      0       2    1  18   321
# 4      1       2    1  18   153
# 5      0       3    1  18    74
# 6      1       3    1  18   228

You could fit a weighted logistic regression, using count as the weights.

summary(glm(cancer ~ age_grp + race + bmi, family=binomial(), dat, 
            weights=count))$coef
#                 Estimate  Std. Error    z value     Pr(>|z|)
# (Intercept)  0.364588477 0.041604639   8.763169 1.898369e-18
# age_grp      0.009726589 0.002182186   4.457269 8.301035e-06
# race         0.020779774 0.005636968   3.686339 2.275036e-04
# bmi         -0.021827620 0.001754685 -12.439623 1.592543e-35
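The same model can also be written in R's grouped binomial form, where each row holds the number of cases and non-cases for one covariate combination. The following is a minimal sketch on the simulated data from the end of this answer, not something taken from the BCSC dataset itself; the names `cases`, `controls`, `wide`, `n_cancer`, and `n_no_cancer` are made up for illustration, and the sketch assumes every covariate combination appears with both outcomes (otherwise the merge would need `all = TRUE` and missing counts filled with 0).

```r
# Simulated data, identical to the "Data" block of this answer
set.seed(42)
dat <- expand.grid(cancer = 0:1, age_grp = 1:7, race = 1:3, bmi = 18:26)
dat$count <- sample(1e3, nrow(dat), replace = TRUE)

# Split by outcome and rename the count column (hypothetical names)
covars   <- c("age_grp", "race", "bmi")
cases    <- dat[dat$cancer == 1, c(covars, "count")]
controls <- dat[dat$cancer == 0, c(covars, "count")]
names(cases)[names(cases) == "count"]       <- "n_cancer"
names(controls)[names(controls) == "count"] <- "n_no_cancer"

# One row per covariate combination with both counts side by side
wide <- merge(cases, controls, by = covars)

# Grouped binomial response: cbind(successes, failures)
fit_grouped <- glm(cbind(n_cancer, n_no_cancer) ~ age_grp + race + bmi,
                   family = binomial(), data = wide)

# Weighted fit on the original 0/1 rows, for comparison
fit_weighted <- glm(cancer ~ age_grp + race + bmi, family = binomial(),
                    data = dat, weights = count)

# Compare the coefficient estimates of the two parameterisations
all.equal(coef(fit_grouped), coef(fit_weighted), tolerance = 1e-5)
```

Both forms maximise the same likelihood, so the coefficients and standard errors should agree; the reported residual deviance and degrees of freedom will differ because the rows are grouped differently.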

You could also try to "unpack" the data,

dat_unpack <- do.call(rbind.data.frame, 
                apply(dat, 1, \(x)
                      t(replicate(x['count'], x[setdiff(names(x), 'count')]))))

head(dat_unpack)
#   cancer age_grp race bmi
# 1      0       1    1  18
# 2      0       1    1  18
# 3      0       1    1  18
# 4      0       1    1  18
# 5      0       1    1  18
# 6      0       1    1  18

but that is wasted effort: apart from the usual rounding errors, the results are identical to the weighted fit.

summary(glm(cancer ~ age_grp + race + bmi, family=binomial(), dat_unpack))$coef
#                 Estimate  Std. Error    z value     Pr(>|z|)
# (Intercept)  0.364588477 0.041604640   8.763169 1.898374e-18
# age_grp      0.009726589 0.002182186   4.457268 8.301070e-06
# race         0.020779774 0.005636970   3.686338 2.275043e-04
# bmi         -0.021827620 0.001754685 -12.439621 1.592570e-35
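If you do want the unpacked data, for instance to check the equivalence yourself, repeating row indices with `rep()` is a shorter route than the `apply()` construction. This is a sketch on the same simulated data; `dat_long`, `fit_long`, and `fit_weighted` are illustrative names, and it assumes all counts are positive integers.

```r
# Simulated data, identical to the "Data" block of this answer
set.seed(42)
dat <- expand.grid(cancer = 0:1, age_grp = 1:7, race = 1:3, bmi = 18:26)
dat$count <- sample(1e3, nrow(dat), replace = TRUE)

# Repeat each row index `count` times, then drop the count column
dat_long <- dat[rep(seq_len(nrow(dat)), times = dat$count),
                names(dat) != "count"]
rownames(dat_long) <- NULL

# Fit on the unpacked rows and on the weighted, collapsed rows
fit_long <- glm(cancer ~ age_grp + race + bmi, family = binomial(),
                data = dat_long)
fit_weighted <- glm(cancer ~ age_grp + race + bmi, family = binomial(),
                    data = dat, weights = count)

# Compare the coefficient estimates
all.equal(coef(fit_long), coef(fit_weighted), tolerance = 1e-5)

# The residual degrees of freedom are counted differently:
df.residual(fit_long)      # based on the number of unpacked rows
df.residual(fit_weighted)  # based on the number of collapsed rows
```

The coefficient table is what carries over between the two fits. Quantities that depend on the number of rows, such as the residual degrees of freedom, reflect the collapsed row count in the weighted fit, so read those with care.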

Data

set.seed(42)
dat <- expand.grid(cancer=0:1, age_grp=1:7, race=1:3, bmi=18:26)
dat$count <- sample(1e3, nrow(dat), replace=TRUE)