How to sum all columns based on a condition in R to find a percentage

I'm trying to apply this condition to all columns at once: find only the values greater than 5, then divide by the length of the column. But it did not work.

Here is what I did:

x1 is just one column in my data; the real data has 100 columns.

#create data frame
df <- data.frame(x1 = c(7, 3, 1, 9, 12, 8),
                 x2 = c(7, 5, 6, 1, 4, 4))

result1 <- sum(df$x1 > 5) / length(df$x1)
# 7 is greater than 5
# 9 is greater than 5
# 12 is greater than 5
# 8 is greater than 5
# which means 4 values; then 4 / the total count, which is 6
print(result1)  # 0.6666667, i.e. about 66.7%

CodePudding user response:

Update

You can use colMeans on the logical condition, since the mean of a logical column is the proportion of TRUE values:

colMeans(df > 5)

#        x1        x2 
# 0.6666667 0.3333333
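
To see why this works: comparing a data frame with a number gives a logical matrix, and R treats TRUE as 1 and FALSE as 0 in arithmetic, so each column mean is the share of values above 5. Here is a minimal sketch using the df from the question; the names above, prop, manual and pct are only illustrative.

```r
# Data frame from the question
df <- data.frame(x1 = c(7, 3, 1, 9, 12, 8),
                 x2 = c(7, 5, 6, 1, 4, 4))

# df > 5 is a logical matrix with one column per column of df
above <- df > 5

# Proportion of values above 5, column by column
prop <- colMeans(above)

# The same numbers computed the long way, as in the question
manual <- colSums(above) / nrow(df)

# Expressed as percentages, rounded to one decimal place
pct <- round(100 * prop, 1)
print(pct)
#   x1   x2 
# 66.7 33.3 
```

Multiplying by 100 is the only extra step needed to go from the proportion to the percentage the question asks about.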

Or with dplyr:

library(dplyr)

df %>%
  summarise(across(everything(), ~ mean(.x > 5)))
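
Both versions above assume every column is numeric and has no missing values. The question doesn't say whether that holds for the 100 real columns, so treat the following as a sketch under that assumption; df2 and its id column are made up for illustration.

```r
# Hypothetical data with a missing value and a non-numeric column
df2 <- data.frame(x1 = c(7, 3, NA, 9, 12, 8),
                  x2 = c(7, 5, 6, 1, 4, 4),
                  id = letters[1:6])

# Keep only the numeric columns
num <- df2[sapply(df2, is.numeric)]

# NA > 5 is NA; na.rm = TRUE drops it, so x1 is divided by 5, not 6
prop_non_missing <- colMeans(num > 5, na.rm = TRUE)

# To divide by the full column length instead, count the TRUEs
# and divide by the number of rows
prop_all_rows <- colSums(num > 5, na.rm = TRUE) / nrow(num)
```

Which denominator is right depends on whether a missing value should count as "not greater than 5" or be left out of the calculation entirely.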

Original Answer

It's a little unclear what the expected output should be. To get the sum of all values greater than 5 in each column, we can first find the values that are greater than 5 (i.e., mydata > 5). Then we multiply the original data frame by that logical matrix using *, which coerces each logical to 1 if the condition is met and 0 otherwise, so values that fail the condition become 0. Finally, we take the sum of each column.

mydata <- mtcars[1:10, 1:5]

colSums(mydata * (mydata > 5))
    
#   mpg    cyl   disp     hp   drat 
# 203.7   46.0 2086.1 1228.0    0.0
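
The masking step is easier to follow on a single vector. A small sketch using x1 from the question; keep and masked are illustrative names, and the logical-subsetting line is an equivalent alternative rather than part of the original answer.

```r
x <- c(7, 3, 1, 9, 12, 8)   # x1 from the question

keep <- x > 5               # TRUE FALSE FALSE TRUE TRUE TRUE
masked <- x * keep          # 7 0 0 9 12 8

sum(masked)                 # 36
sum(x[x > 5])               # 36, same result via logical subsetting
```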

If you want the mean over the full number of rows (values that fail the condition count as 0), use colMeans with the same logic:

colMeans(mydata * (mydata > 5))

#   mpg    cyl   disp     hp   drat 
# 20.37   4.60 208.61 122.80   0.00

However, if you want to divide by the number of values greater than 5 instead, you could do something like this:

apply(mydata, 2, function(x)
  sum(x * (x > 5)) / sum(x > 5))
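
One thing to watch for with this last version: a column with no values above 5 (drat in these rows) gives 0 / 0, which is NaN. A hedged sketch of one way to handle that; mean_above is a hypothetical helper, not part of the original answer, and returning NA is just one reasonable choice.

```r
mydata <- mtcars[1:10, 1:5]

# Mean of the values greater than 5 in each column, as above
res <- apply(mydata, 2, function(x) sum(x * (x > 5)) / sum(x > 5))

# drat has no values above 5 in these rows, so its result is 0 / 0 = NaN
is.nan(res["drat"])

# Hypothetical helper: return NA when nothing exceeds the threshold,
# and skip missing values
mean_above <- function(x, threshold = 5) {
  hits <- x[!is.na(x) & x > threshold]
  if (length(hits) == 0) NA_real_ else mean(hits)
}

res2 <- sapply(mydata, mean_above)
```

For the other four columns res and res2 agree; only the empty case differs.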