Relative frequency based on all possible factor combinations

I would like to analyse some determinants of high blood pressure, specifically whether having low or high levels of specific blood components influences the risk of high blood pressure. I have the following sample data:

set.seed(1234)
df <- tibble::tibble(high_bp = sample(c("low", "high"), 1000, TRUE), 
               amylase  = sample(c("low", "high"), 1000, TRUE), 
               proteins  = sample(c("low", "high"), 1000, TRUE),
               sodium = sample(c("low", "high"), 1000, TRUE))


  high_bp amylase proteins sodium
   <chr>   <chr>   <chr>    <chr> 
 1 high    low     high     low   
 2 high    high    high     high  
 3 low     high    high     high  
 4 high    low     high     low   
 5 high    low     low      low   
 6 high    low     low      high  
 ...

I want to reshape it so that I get a risk score for each possible combination of the three predictors, where the risk score is a simple relative frequency, i.e. for each row of the result:

risk_high_bp = (number of observations with high_bp == "high") / (total observations with that row's combination)

The result could look like this (only three of the possible combinations are shown here, with invented numbers for risk_high_bp; I would like all of them, and the order is not relevant):

result_example_small <- tibble::tibble(amylase  = c("low", "low", "high"),
               proteins  = c("low", "high", "high"),
               sodium = c("low", "low", "high"),
               risk_high_bp = c(0.05, 0.14, 0.4))

  amylase proteins sodium risk_high_bp
  <chr>   <chr>    <chr>         <dbl>
1 low     low      low            0.05
2 low     high     low            0.14
3 high    high     high           0.4

Any help is much appreciated. Thank you :)

Answer:

You can tabulate how often high BP is seen in association with the predictor variables by doing:

library(dplyr)

score_df <- df %>%
  group_by(amylase, proteins, sodium) %>%
  summarise(risk_high_bp = sum(high_bp == 'high')/n())

score_df
#> # A tibble: 8 x 4
#> # Groups:   amylase, proteins [4]
#>   amylase proteins sodium risk_high_bp
#>   <chr>   <chr>    <chr>         <dbl>
#> 1 high    high     high          0.457
#> 2 high    high     low           0.534
#> 3 high    low      high          0.558
#> 4 high    low      low           0.442
#> 5 low     high     high          0.484
#> 6 low     high     low           0.454
#> 7 low     low      high          0.470
#> 8 low     low      low           0.486
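
Since each score is a proportion within one combination, it may help to keep the group sizes next to it, because a proportion based on few observations is less reliable. The following is a minimal sketch that rebuilds the sample data so it runs on its own. The names `score_n_df`, `n_total` and `n_high` are introduced here for illustration and are not part of the original code.

```r
library(dplyr)

# Rebuild the sample data from the question
set.seed(1234)
df <- tibble::tibble(
  high_bp  = sample(c("low", "high"), 1000, TRUE),
  amylase  = sample(c("low", "high"), 1000, TRUE),
  proteins = sample(c("low", "high"), 1000, TRUE),
  sodium   = sample(c("low", "high"), 1000, TRUE)
)

# One row per combination: group size, number of high-BP cases, and their ratio
score_n_df <- df %>%
  group_by(amylase, proteins, sodium) %>%
  summarise(
    n_total      = n(),
    n_high       = sum(high_bp == "high"),
    risk_high_bp = n_high / n_total,
    .groups      = "drop"   # return an ungrouped tibble
  )

score_n_df
```

The `risk_high_bp` column is computed the same way as above; the two extra columns only show how many observations stand behind each proportion.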

You can then add these scores back to the original data using a left join:

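# Join on the three predictor columns (stating `by` avoids the "Joining, by = ..." message)
left_join(df, score_df, by = c("amylase", "proteins", "sodium"))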
#> # A tibble: 1,000 x 5
#>    high_bp amylase proteins sodium risk_high_bp
#>    <chr>   <chr>   <chr>    <chr>         <dbl>
#>  1 high    high    high     low           0.534
#>  2 high    low     low      low           0.486
#>  3 high    high    low      high          0.558
#>  4 high    low     high     high          0.484
#>  5 low     high    high     high          0.457
#>  6 high    high    low      high          0.558
#>  7 low     low     low      low           0.486
#>  8 low     low     low      low           0.486
#>  9 low     high    high     low           0.534
#> 10 high    high    high     low           0.534
#> # ... with 990 more rows

Notice that rows with the same three values for amylase, proteins and sodium get the same value for risk_high_bp. This number is the proportion of people with that particular combination of factors who also turned out to have high BP.
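
Because every row in a combination gets that combination's proportion, the same column can probably also be produced without a separate join, using a grouped `mutate()`. This is only a sketch of an alternative, not the method used above. It rebuilds the sample data so it runs on its own, and `joined` and `direct` are names introduced here for illustration.

```r
library(dplyr)

# Rebuild the sample data from the question
set.seed(1234)
df <- tibble::tibble(
  high_bp  = sample(c("low", "high"), 1000, TRUE),
  amylase  = sample(c("low", "high"), 1000, TRUE),
  proteins = sample(c("low", "high"), 1000, TRUE),
  sodium   = sample(c("low", "high"), 1000, TRUE)
)

# Route 1: summarise per combination, then join back onto the data
score_df <- df %>%
  group_by(amylase, proteins, sodium) %>%
  summarise(risk_high_bp = mean(high_bp == "high"), .groups = "drop")

joined <- left_join(df, score_df, by = c("amylase", "proteins", "sodium"))

# Route 2: compute the group proportion in place with a grouped mutate()
direct <- df %>%
  group_by(amylase, proteins, sodium) %>%
  mutate(risk_high_bp = mean(high_bp == "high")) %>%
  ungroup()

# Both routes keep the original row order, so the columns can be compared directly
all.equal(joined$risk_high_bp, direct$risk_high_bp)
```

The join is still the better fit when the table of scores is wanted on its own; the grouped `mutate()` is shorter when only the extra column is needed.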

Note, though, that this is an unusual way to assess predictors of a binary outcome. Typically we would use logistic regression here. That allows you to say something useful about each of the predictors, including which of them actually turned out to have an association with raised blood pressure, and how strong the effect was. It would also give a more believable prediction for each combination.

To do that, you could replace 'high' and 'low' with 1s and 0s, or equivalently TRUE and FALSE:

df <- as.data.frame(df == 'high')

Now run a logistic regression model:

mod <- glm(high_bp ~ amylase * proteins * sodium, data = df, family = binomial)
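
A fitted model like this can be inspected term by term. The sketch below rebuilds the sample data in base R so it runs on its own, prints the coefficient table, and compares the full model with a main-effects-only model using a likelihood-ratio test. The names `mod_main` and `lrt` are introduced here for illustration, and this comparison is only one of several reasonable ways to look for terms that could be dropped.

```r
# Rebuild the sample data and convert "high"/"low" to TRUE/FALSE
set.seed(1234)
df <- data.frame(
  high_bp  = sample(c("low", "high"), 1000, TRUE),
  amylase  = sample(c("low", "high"), 1000, TRUE),
  proteins = sample(c("low", "high"), 1000, TRUE),
  sodium   = sample(c("low", "high"), 1000, TRUE)
)
df <- as.data.frame(df == "high")

# Full model: all main effects plus all two- and three-way interactions
mod <- glm(high_bp ~ amylase * proteins * sodium, data = df, family = binomial)

# Simpler candidate: main effects only (hypothetical comparison model)
mod_main <- glm(high_bp ~ amylase + proteins + sodium, data = df, family = binomial)

# Estimate, standard error, z value and p-value for each term of the full model
coef(summary(mod))

# Likelihood-ratio test: do the interaction terms improve the fit?
lrt <- anova(mod_main, mod, test = "Chisq")
lrt
```

Since the sample data are generated at random, no real association should be expected here; with real data, a small p-value in `lrt` would suggest that at least one interaction is worth keeping.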

We can use this to explore which predictors (and interactions between predictors) are significant, and thereby simplify the model. We can also get the predictions by doing:

df$risk_high_bp <- predict(mod, type = 'response')

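In this case, because the model includes every main effect and every interaction between the three predictors, it has one parameter per combination, so the predicted risk_high_bp for each row is the same as the observed proportion calculated earlier. The predictions only start to differ once the model is simplified.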
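
This equivalence can be checked numerically. The sketch below rebuilds the sample data in base R so it runs on its own, fits the full-interaction model, and compares its predictions with the per-combination proportions. The names `pred` and `obs` are introduced here for illustration.

```r
# Rebuild the sample data and convert "high"/"low" to TRUE/FALSE
set.seed(1234)
df <- data.frame(
  high_bp  = sample(c("low", "high"), 1000, TRUE),
  amylase  = sample(c("low", "high"), 1000, TRUE),
  proteins = sample(c("low", "high"), 1000, TRUE),
  sodium   = sample(c("low", "high"), 1000, TRUE)
)
df <- as.data.frame(df == "high")

# Full-interaction logistic regression, as specified above
mod <- glm(high_bp ~ amylase * proteins * sodium, data = df, family = binomial)

# Predicted probability of high BP for every row
pred <- unname(predict(mod, type = "response"))

# Observed proportion of high BP within each row's combination
# (ave() applies mean() within groups by default)
obs <- ave(as.numeric(df$high_bp), df$amylase, df$proteins, df$sodium)

# Largest absolute difference; expected to be zero up to fitting tolerance
max(abs(pred - obs))
```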