I would like to analyse some determinants of high blood pressure, specifically whether low or high levels of particular blood components influence the risk of high blood pressure. I have the following sample data:
set.seed(1234)
df <- tibble::tibble(high_bp = sample(c("low", "high"), 1000, TRUE),
amylase = sample(c("low", "high"), 1000, TRUE),
proteins = sample(c("low", "high"), 1000, TRUE),
sodium = sample(c("low", "high"), 1000, TRUE))
high_bp amylase proteins sodium
<chr> <chr> <chr> <chr>
1 high low high low
2 high high high high
3 low high high high
4 high low high low
5 high low low low
6 high low low high
...
I want to reshape it so that I get a risk score for each possible combination of predictors, where the risk score is a simple relative frequency, i.e. for each combination:
risk_high_bp = (number of observations with high_bp == "high") / (total number of observations with that combination)
The result could look like this (only three of the possible combinations are shown here, but I would like all of them; the order is not relevant, and the numbers for risk_high_bp are invented):
result_example_small <- tibble::tibble(amylase = c("low", "low", "high"),
proteins = c("low", "high", "high"),
sodium = c("low", "low", "high"),
risk_high_bp = c(0.05, 0.14, 0.4))
amylase proteins sodium risk_high_bp
<chr> <chr> <chr> <dbl>
1 low low low 0.05
2 low high low 0.14
3 high high high 0.4
Any help is much appreciated. Thank you :)
CodePudding user response:
You can tabulate how often high BP is seen in association with the predictor variables by doing:
library(dplyr)
score_df <- df %>%
group_by(amylase, proteins, sodium) %>%
summarise(risk_high_bp = sum(high_bp == 'high')/n())
score_df
#> # A tibble: 8 x 4
#> # Groups: amylase, proteins [4]
#> amylase proteins sodium risk_high_bp
#> <chr> <chr> <chr> <dbl>
#> 1 high high high 0.457
#> 2 high high low 0.534
#> 3 high low high 0.558
#> 4 high low low 0.442
#> 5 low high high 0.484
#> 6 low high low 0.454
#> 7 low low high 0.470
#> 8 low low low 0.486
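Since each risk is just a proportion within a group, it can help to keep the group sizes next to it, so that you can see how many observations each estimate rests on. A minimal sketch of that idea, rebuilding the sample data from the question; the names score_counts, n_obs and n_high are my own and not part of the original code, and .groups = "drop" is only there to return an ungrouped tibble:

```r
library(dplyr)

# Sample data as in the question
set.seed(1234)
df <- tibble::tibble(high_bp  = sample(c("low", "high"), 1000, TRUE),
                     amylase  = sample(c("low", "high"), 1000, TRUE),
                     proteins = sample(c("low", "high"), 1000, TRUE),
                     sodium   = sample(c("low", "high"), 1000, TRUE))

# score_counts, n_obs and n_high are illustrative names (assumptions)
score_counts <- df %>%
  group_by(amylase, proteins, sodium) %>%
  summarise(n_obs        = n(),                     # observations in this combination
            n_high       = sum(high_bp == "high"),  # of which have high BP
            risk_high_bp = mean(high_bp == "high"), # same as n_high / n_obs
            .groups      = "drop")                  # return an ungrouped tibble

score_counts
```

The mean of a logical vector is the proportion of TRUE values, so mean(high_bp == "high") gives the same number as sum(high_bp == "high") / n().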
You can then join these back onto the original data using a left join:
left_join(df, score_df, by = c("amylase", "proteins", "sodium"))
#> # A tibble: 1,000 x 5
#> high_bp amylase proteins sodium risk_high_bp
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 high high high low 0.534
#> 2 high low low low 0.486
#> 3 high high low high 0.558
#> 4 high low high high 0.484
#> 5 low high high high 0.457
#> 6 high high low high 0.558
#> 7 low low low low 0.486
#> 8 low low low low 0.486
#> 9 low high high low 0.534
#> 10 high high high low 0.534
#> # ... with 990 more rows
Notice that where you have the same three values for amylase, proteins and sodium, you get the same value for risk_high_bp. This number is the proportion of people with that particular combination of factors who also turned out to have high BP.
Note though, that this is an unusual way to assess predictors of a binary outcome. Typically we would use logistic regression here. That allows you to say something useful about each of the predictors, including which of them actually turned out to have an association with raised blood pressure, and how strong the effect was. It would also give a more believable prediction for each combination.
To do that, you could replace 'high' and 'low' with 1s and 0s, or equivalently TRUE and FALSE:
df <- as.data.frame(df == 'high')
Now run a logistic regression model:
mod <- glm(high_bp ~ amylase * proteins * sodium, data = df, family = binomial)
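Once a model like this is fitted, one possible way to look at the individual predictors is to inspect the coefficient table and to compare the full model against a simpler one without interactions. The following is only a sketch under my own assumptions: the names dat, full_mod and main_mod are hypothetical, the data are rebuilt from the question's code, and a likelihood-ratio test via anova() is just one of several reasonable ways to compare nested models.

```r
# Rebuild the sample data and convert to logical (TRUE = "high")
set.seed(1234)
raw <- data.frame(high_bp  = sample(c("low", "high"), 1000, TRUE),
                  amylase  = sample(c("low", "high"), 1000, TRUE),
                  proteins = sample(c("low", "high"), 1000, TRUE),
                  sodium   = sample(c("low", "high"), 1000, TRUE))
dat <- as.data.frame(raw == "high")

# Full model with all interactions, and a main-effects-only alternative
full_mod <- glm(high_bp ~ amylase * proteins * sodium, data = dat, family = binomial)
main_mod <- glm(high_bp ~ amylase + proteins + sodium, data = dat, family = binomial)

# Coefficient table: estimates on the log-odds scale, with z tests
summary(full_mod)$coefficients

# Likelihood-ratio test: do the interaction terms improve the fit?
anova(main_mod, full_mod, test = "Chisq")

# Odds ratios with Wald confidence intervals for the simpler model
exp(cbind(odds_ratio = coef(main_mod), confint.default(main_mod)))
```

With purely random sample data like this, no real associations are expected, so any term that looks significant here would be a chance finding.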
We can use this to explore which predictors (and interactions between predictors) are significant, and thereby simplify the model. We can also get the predictions by doing:
df$risk_high_bp <- predict(mod, type = 'response')
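In this case, because the model includes every interaction between the three binary predictors (it is saturated), we would get the same values in risk_high_bp as the grouped proportions calculated above.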
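One way to check that the fully interacted model reproduces the grouped proportions is to compute both on the same data and compare them. This is a sketch only; raw, lgl, mod_check, grouped_risk and model_risk are names I have made up for the illustration, and the tolerance is an assumption that allows for the iterative fitting in glm() stopping slightly short of the exact proportions.

```r
library(dplyr)

# Sample data as in the question
set.seed(1234)
raw <- tibble::tibble(high_bp  = sample(c("low", "high"), 1000, TRUE),
                      amylase  = sample(c("low", "high"), 1000, TRUE),
                      proteins = sample(c("low", "high"), 1000, TRUE),
                      sodium   = sample(c("low", "high"), 1000, TRUE))

# Grouped proportion, attached to every row in the original row order
grouped_risk <- raw %>%
  group_by(amylase, proteins, sodium) %>%
  mutate(risk_high_bp = mean(high_bp == "high")) %>%
  ungroup() %>%
  pull(risk_high_bp)

# Logistic regression with all interactions on the logical version of the data
lgl <- as.data.frame(raw == "high")
mod_check <- glm(high_bp ~ amylase * proteins * sodium, data = lgl, family = binomial)
model_risk <- unname(predict(mod_check, type = "response"))

# Compare the two sets of per-row risks, allowing for numerical error
all.equal(grouped_risk, model_risk, tolerance = 1e-6)
```

If you drop interaction terms to simplify the model, the predictions will no longer match the raw proportions exactly, which is the point of the simpler model: it smooths over noise in small groups.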