A tricky one for you. I have two data frames, one that is a list of odd ratios (skipping the first as it is our predictor). See below:
Variable Name | Odds |
---|---|
Var2 | 0.87 |
Var3 | 1.42 |
Var4 | 2.10 |
Var5 | 0.56 |
Var6 | 1.01 |
The second is a list of subjects, with the variables as columns and whether it is present as a binary flag (0/1)
. See below:
Subject Name | Var 1 (Predictor) | Var 2 | Var 3 | Var 4 | Var 5 | Var 6 |
---|---|---|---|---|---|---|
Dog | 1 | 0 | 1 | 0 | 0 | 1 |
Cat | 1 | 0 | 0 | 0 | 1 | 1 |
Elephant | 0 | 0 | 0 | 0 | 0 | 0 |
Bear | 1 | 0 | 0 | 1 | 1 | 1 |
Jackal | 0 | 0 | 0 | 0 | 0 | 1 |
What I now need to do is add that x
number of new variables by their odds ratio if present. For example. Dog
has 2 present predictors. Var 3
and Var 6
so we want to multiple Var3=1.42 * Var6=1.01
= 1.43 leaving the others blank. For Jackal
we have one predictor in Var 6=1.01
leaving the others blank. Adding these in as individual variables and their total multiplication is prefered.
The number of variables may change from five, too six or seven, so specifying names for the function will not work. It needs to be based on the number of initial variables (-1 for the predictor).
Thus far I have tried writing a complex ifelse
statement but it doesn't work for the dynamic range. Something on matching names which get added in with a prefix / suffix? Or by position? I am genuinely stumped in doing this the most efficient way. Hope it's clear, happy to provide more detail if required / requested.
CodePudding user response:
Let us assume that your first data frame is called odds_df
and your second is called presence_df
. I have taken the data from your question and made reproducible versions of these at the bottom of this answer.
As you may have different numbers of columns representing variables in your presence / absence data frame, you will need some way of identifying which columns you are trying to match to your odds data. In your example data, this could be according to variable names, using string matching:
var_cols <- grep("^Var\\d $", names(presence_df))
or simply according to which columns you know to be those of interest:
var_cols <- 3:7
Or, if the first two columns are always the name and dependent variable, then a more general solution would be:
var_cols <- 3:ncol(presence_df)
In this case, any of these three ways of generating var_cols
will give you the vector c(3, 4, 5, 6, 7)
.
You now need to know which odds apply to which columns, so we match
the columns of our presence / absence data frame with the correct row in the odds data frame like this:
var_rows <- match(names(presence_df)[var_cols], odds_df$`Variable Name`)
Now we are in a position to get the vector of odds we need to apply to our columns:
odds_vec <- odds_df$Odds[var_rows]
To get the correct odds that should apply for each entry given its presence or absence, we need to remember that since odds are multiplicative, any absent variable should be given a value of 1, not 0, so that it does not affect the odds calculation. This means we should multiply each row of our presence / absence data by the log odds and exponentiate the result to get the actual odds.
We can do this row-wise using apply
:
res <- t(apply(presence_df[var_cols], 1, function(x) exp(x * log(odds_vec))))
res
#> Var2 Var3 Var4 Var5 Var6
#> [1,] 1 1.42 1.0 1.00 1.01
#> [2,] 1 1.00 1.0 0.56 1.01
#> [3,] 1 1.00 1.0 1.00 1.00
#> [4,] 1 1.00 2.1 0.56 1.01
#> [5,] 1 1.00 1.0 1.00 1.01
Now we can write this back into our presence / absence data frame like so:
presence_df[var_cols] <- res
presence_df
#> Subject Name Var 1 (Predictor) Var2 Var3 Var4 Var5 Var6
#> 1 Dog 1 1 1.42 1.0 1.00 1.01
#> 2 Cat 1 1 1.00 1.0 0.56 1.01
#> 3 Elephant 0 1 1.00 1.0 1.00 1.00
#> 4 Bear 1 1 1.00 2.1 0.56 1.01
#> 5 Jackal 0 1 1.00 1.0 1.00 1.01
The final step is to calculate the odds implied by the presence or absence variables, which is simply the row-wise product of these odds:
presence_df$odds <- apply(presence_df[var_cols], 1, prod)
presence_df
#> Subject Name Var 1 (Predictor) Var2 Var3 Var4 Var5 Var6 odds
#> 1 Dog 1 1 1.42 1.0 1.00 1.01 1.43420
#> 2 Cat 1 1 1.00 1.0 0.56 1.01 0.56560
#> 3 Elephant 0 1 1.00 1.0 1.00 1.00 1.00000
#> 4 Bear 1 1 1.00 2.1 0.56 1.01 1.18776
#> 5 Jackal 0 1 1.00 1.0 1.00 1.01 1.01000
It might also be helpful to convert the odds into probabilities, so that we can see the probability of the dependent variable being present given the predictor variables:
presence_df$prob <- presence_df$odds / (1 presence_df$odds)
presence_df
#> Subject Name Var 1 (Predictor) Var2 Var3 Var4 Var5 Var6 odds prob
#> 1 Dog 1 1 1.42 1.0 1.00 1.01 1.43420 0.5891874
#> 2 Cat 1 1 1.00 1.0 0.56 1.01 0.56560 0.3612672
#> 3 Elephant 0 1 1.00 1.0 1.00 1.00 1.00000 0.5000000
#> 4 Bear 1 1 1.00 2.1 0.56 1.01 1.18776 0.5429115
#> 5 Jackal 0 1 1.00 1.0 1.00 1.01 1.01000 0.5024876
** Data from question in reproducible format **
odds_df <- structure(list(`Variable Name` = c("Var2", "Var3", "Var4", "Var5",
"Var6"), Odds = c(0.87, 1.42, 2.1, 0.56, 1.01)), row.names = c(NA,
-5L), class = "data.frame")
presence_df <- structure(list(`Subject Name` = c("Dog", "Cat", "Elephant",
"Bear", "Jackal"), `Var 1 (Predictor)` = c(1L, 1L, 0L, 1L, 0L), Var2 = c(0L,
0L, 0L, 0L, 0L), Var3 = c(1L, 0L, 0L, 0L, 0L), Var4 = c(0L, 0L,
0L, 1L, 0L), Var5 = c(0L, 1L, 0L, 1L, 0L), Var6 = c(1L, 1L, 0L,
1L, 1L)), class = "data.frame", row.names = c(NA, -5L))