I have a set of regressors selected by an optimal variable selection algorithm (out of a total of 30 candidate regressor columns) run on a dataset that looks like the following:
> IVs_Selected_by_LASSO
$coefficients
[1] "X4" "X10" "X19" "X20" "X22" "X25" "X27" "X28"
And, I also have the set of actual regressors included in the structural equation describing that dataset, which initially looks like the following:
True_IVs
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1
X21 X22 X23 X24 X25 X26 X27 X28 X29 X30
1 0 1 0 0 1 1 1 1 0 1
dput(True_IVs)
structure(list(X1 = "0", X2 = "0", X3 = "0", X4 = "1", X5 = "0",
X6 = "0", X7 = "0", X8 = "0", X9 = "0", X10 = "1", X11 = "0",
X12 = "0", X13 = "1", X14 = "0", X15 = "0", X16 = "0", X17 = "0",
X18 = "0", X19 = "1", X20 = "1", X21 = "0", X22 = "1", X23 = "0",
X24 = "0", X25 = "1", X26 = "1", X27 = "1", X28 = "1", X29 = "0",
X30 = "1"), row.names = 1L, class = "data.frame")
However, using the answer to an intermediate question I asked whose answer I was hoping would allow me to finally ask this current question I am asking, I was able to transform the true structural regressors into a format that is able to be logically compared with the format IVs_Selected_by_LASSO is already in as follows:
True_Regressors <- names(True_IVs)[True_IVs == 1]
> True_Regressors
[1] "X4" "X10" "X13" "X19" "X20" "X22" "X25" "X26" "X27" "X28" "X30"
Me being an experienced Excel user, definitely with more experience there than in R, if I had columns or rows with this information in them, I would probably use either a SUMIF or an IF function (with an AUTOSUM at the bottom of the column as you do) to solve this, but I can't figure out how to do the same in R for the life of me.
For example, if I were to do the IF AUTOSUM solution and pretending that the output of printing True_IVs is in my A column in Excel with IVs_Selected_by_LASSO in the B column, I would use
=IF(A1 = B1, 1, 0)
And since A1 would be X4 and so would B1, if I put this IF function next to them in the C column, I would get a 1 in cell C1. And after applying the AUTOSUM function at the bottom of column C, I would get 8.
Thus, for this example, I would want to get a function which returns 8 because True_IVs contains all 8 of the variables selected by the LASSO Regression. Furthermore, for this example, the maximum number possible would be 11 since True_IVs contains 11 variable names, so clearly, IVs_Selected_by_LASSO can only hope to obtain a maximum of 11 matches.
CodePudding user response:
If you are trying to compute the number of elements in common to both IVs_Selected_by_LASSO$coefficients
and True_Regressors
then
length(intersect(IVs_Selected_by_LASSO$coefficients, True_Regressors))
intersect
does not require that the contents of the arguments be in the same order.
Note that there is an R package called ExcelFunctionsR and also if you google R SUMIF you will get some hits.
CodePudding user response:
You have multiple options as mentioned by others and myself in the comments, but the issue here seems to be a disconnect between Excel logic and R logic, so I will try to provide more detailed insight as a formal answer.
As you know, your excel code:
=IF(A1 = B1, 1, 0)
Will check the single cell in A1
to see if it is equal to the single cell B1
, and that can be copied down the line.
If take the vector of column names of your data frame True_IVs
(which has 30 columns) via names(True_IVs)
and think of it as all 30 column names a being copied and pasted into a column in excel, we can check those names against another vector of names - here IVs_Selected_by_LASSO$coefficients
. Except instead of one-cell-vs-one-cell we can check if the name appears in the whole vector by using %in%
:
names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients
Which checks if each value in names(True_IVs)
is in any of the values in IVs_Selected_by_LASSO$coefficients
. So, for instance, if the fist value in names(True_IVs)
is anywhere in IVs_Selected_by_LASSO$coefficients
, it will return TRUE
, else it will return FALSE
. Running this across the whole vector returns a boolean vector of length 30:
# [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
# [13] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE
# [25] TRUE FALSE TRUE TRUE FALSE FALSE
The values of which corresponds to if each element in the names(True_IVs)
is found anywhere in the IVs_Selected_by_LASSO$coefficients
vector.
Secondly, as noted in the comments, when summing R assumes TRUE == 1
and FALSE == 0
, so wrapping this in a sum
will return the total number included:
sum(names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients)
# [1] 8
This provides more flexibility, particularly if the values are out of order. For instance, if we scramble the values to a random order:
IVs_Selected_by_LASSO$coefficients <- sample(IVs_Selected_by_LASSO$coefficients, length(IVs_Selected_by_LASSO$coefficients))
We get the same result:
sum(names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients)
# [1] 8
There are of course multiple other ways to do this, for instance @G. Grothendieck mentions elegantly using length
and intersect
:
length(intersect(IVs_Selected_by_LASSO$coefficients, names(True_IVs))) # Thanks G. Grothendieck
# [1] 8
Or more inelegantly using length
with names(True_IVs)
instead of sum
:
length(names(True_IVs)[names(True_IVs) %in% IVs_Selected_by_LASSO$coefficients])
# [1] 8
Hope this helps!
CodePudding user response:
A rough recreation of your example and applying the sum()
function and %in%
operator:
> IVs_Selected_by_LASSO <- c("X4", "X10", "X19", "X20", "X22", "X25",
"X27", "X28")
> head(IVs_Selected_by_LASSO)
[1] "X4" "X10" "X19" "X20" "X22" "X25"
>
> True_IVs <- data.frame(X = c('X1', 'X2', 'X3', 'X4', 'X5', 'X6',
'X7', 'X8', 'X9', 'X10', 'X11', 'X12',
'X13', 'X14', 'X15', 'X16', 'X17',
'X18','X19', 'X20', 'X21', 'X22', 'X23',
'X24', 'X25', 'X26', 'X27', 'X28',
'X29','X30'),
Y = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1,
1, 1, 0, 1))
> head(True_IVs)
X Y
1 X1 0
2 X2 0
3 X3 0
4 X4 1
5 X5 0
6 X6 0
>
> sum(True_IVs$X %in% IVs_Selected_by_LASSO$X)
[1] 8