I have a data frame in R as follows:
set.seed(123)
df <- as.data.frame(matrix(rnorm(20*5,mean = 0,sd=1),20,5))
I want to find the percentage of times that the highest value of each row appears in each column, which I can do as follows:
A <- table(names(df)[max.col(df)])/nrow(df)
Then the percentage of times that the second highest value of each row appears in each column can be found as follows:
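df2 <- as.data.frame(t(apply(df,1,function(r) {
# mask the row maximum with a value below every other entry in the row
# (a fixed placeholder such as 0.001 would itself be picked as the maximum
# whenever the second highest value is smaller than it)
r[which.max(r)] <- min(r) - 1
return(r)})))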
B <- table(names(df2)[max.col(df2)])/nrow(df2)
How can I calculate the following in R?
C <- the percentage of times that the first and the second highest values appear in the first two columns of `df` simultaneously
CodePudding user response:
I would do it like this:
# compute reverse rank
df.rank <- ncol(df) - t(apply(df, 1, rank)) + 1
A <- colMeans(df.rank == 1)
B <- colMeans(df.rank == 2)
C <- mean(apply(df.rank[, 1:2], 1, prod)==2)
First I compute the reverse rank, which is analogous to using `decreasing=T` with `sort()` or `order()`. A and B are then rather straightforward. Please note that your original approach omits zeros for columns where no (second) maximum value appears, which may cause problems in later usage.
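To illustrate the point about omitted zeros, here is a minimal sketch on a made-up data frame (`small` and its values are hypothetical, not taken from the question) in which column V3 never holds the row maximum. It also shows one possible way to keep the zero with `table()` by declaring the factor levels up front:

```r
# Hypothetical example: V3 never holds the row maximum
small <- data.frame(V1 = c(3, 1), V2 = c(1, 3), V3 = c(0, 0))

# table() on plain column names silently drops V3
dropped <- table(names(small)[max.col(small)]) / nrow(small)
dropped

# Declaring all column names as factor levels keeps V3 with a share of 0
winners <- factor(names(small)[max.col(small)], levels = names(small))
kept <- table(winners) / nrow(small)
kept

# The rank-based version returns one entry per column by construction
small.rank <- ncol(small) - t(apply(small, 1, rank)) + 1
colMeans(small.rank == 1)
```

With `colMeans()` every column always gets an entry, so results from different data sets line up without extra bookkeeping.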
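For C, I take only the first two columns of the rank matrix and compute their product for every row. If the two largest values are in the first two columns, their reverse ranks are 1 and 2 (in either order), so the product has to be 2. Without ties the ranks are distinct integers, so no other pair of ranks gives a product of 2.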
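The product trick can be checked by hand on a small matrix. The sketch below uses a hypothetical 3-row matrix `m` (not from the question) and compares the product test with a more explicit test that the first two reverse ranks are exactly 1 and 2:

```r
# Hypothetical 3 x 4 example
m <- matrix(c(9, 8, 1, 2,    # two largest values in columns 1 and 2
              9, 1, 8, 2,    # two largest values in columns 1 and 3
              8, 9, 1, 2),   # two largest values in columns 2 and 1
            nrow = 3, byrow = TRUE)

# Reverse rank: 1 = largest value in the row
rev.rank <- ncol(m) - t(apply(m, 1, rank)) + 1

# Product test from the answer
prod.check <- apply(rev.rank[, 1:2], 1, prod) == 2

# Explicit alternative: the first two ranks are exactly {1, 2}
set.check <- apply(rev.rank[, 1:2], 1, function(r) all(sort(r) == c(1, 2)))

prod.check
set.check
mean(prod.check)
```

Rows 1 and 3 pass both tests and row 2 fails both, so the two checks agree here. The explicit version may be easier to generalize, for example to the top three values in the first three columns.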
Also, if ties might appear in your data set, you should consider selecting the appropriate `ties.method` argument for `rank`.
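As a small illustration of why this matters, the sketch below ranks a hypothetical row `x` containing a tie for the maximum and converts each result to a reverse rank in the same way as above:

```r
# Hypothetical row with a tie for the largest value
x <- c(5, 5, 3)

r.avg   <- rank(x)                         # default ties.method = "average"
r.first <- rank(x, ties.method = "first")
r.max   <- rank(x, ties.method = "max")

# Reverse ranks (1 = largest)
rev.avg   <- length(x) - r.avg + 1
rev.first <- length(x) - r.first + 1
rev.max   <- length(x) - r.max + 1

rev.avg    # both tied values get 1.5, so nothing equals 1
rev.first  # tie broken by position, exactly one value gets 1
rev.max    # both tied values get 1
```

With the default method a tied maximum is never counted in `df.rank == 1`, and the product test for C can no longer be relied on because ranks are not distinct integers. Which method is appropriate depends on how tied rows should be counted.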