R code - Why is my frequency table giving me wrong percentage numbers? I have reproducible code below

I have the below df. They are frequency counts:

pnb3 <- structure(list(Likelihood.to.Click.Freq = c(29L, 71L, 120L), 
    Likelihood.to.Enroll.Freq = c(30L, 84L, 106L), Likelihood.to.Click.1.Freq = c(54L, 
    90L, 108L), Likelihood.to.Enroll.1.Freq = c(55L, 109L, 88L
    ), Likelihood.to.Click_0.Freq = c(50L, 77L, 86L), Likelihood.to.Enroll_0.Freq = c(49L, 
    93L, 71L), Likelihood.to.Click_1.Freq = c(25L, 63L, 163L), 
    Likelihood.to.Enroll._0.Freq = c(26L, 90L, 135L), Likelihood.to.Click_2.Freq = c(63L, 
    74L, 94L), Likelihood.to.Enroll_1.Freq = c(61L, 95L, 75L), 
    Likelihood.to.Click_3.Freq = c(22L, 51L, 157L), Likelihood.to.Enroll._1.Freq = c(24L, 
    93L, 113L), Likelihood.to.Click_4.Freq = c(42L, 66L, 118L
    ), Likelihood.to.Enroll._2.Freq = c(39L, 90L, 97L), Likelihood.to.Click_5.Freq = c(25L, 
    47L, 157L), Likelihood.to.Enroll_2.Freq = c(26L, 75L, 128L
    ), Likelihood.to.Click_6.Freq = c(42L, 84L, 96L), Likelihood.to.Enroll_3.Freq = c(38L, 
    103L, 81L), Likelihood.to.Click_7.Freq = c(30L, 69L, 105L
    ), Likelihood.to.Enroll_4.Freq = c(28L, 88L, 88L), Likelihood.to.Click_8.Freq = c(29L, 
    57L, 140L), Likelihood.to.Enroll_5.Freq = c(27L, 90L, 109L
    ), Likelihood.to.Click_9.Freq = c(40L, 70L, 109L), Likelihood.to.Enroll_6.Freq = c(34L, 
    94L, 91L), Likelihood.to.Click_10.Freq = c(31L, 75L, 135L
    ), Likelihood.to.Enroll_7.Freq = c(32L, 93L, 116L)), class = "data.frame", row.names = c(NA, 
-3L))

But when I try to convert the counts to percentages, the last row is incorrect. It should be ~54/55 percent, but I am getting ~47/48 percent. I don't think it's a rounding error, as it's off by quite a bit. Basically, in each set of outputs one number comes out incorrect.

Here is the code I use to convert the frequency counts to percentages. Is there anything wrong with it? I know there are ways to do this with a function, but I wanted to break it down to see each step:

pnb4 <- pnb3 / (colSums(pnb3))
pnb5 <- pnb4 *100
pnb6 <- round(pnb5,1)

If you run it you'll notice the third % is off by quite a bit.

UPDATE: for example, once I run the above, the third row of the first column comes out at ~47/48%, but it should actually be about 54.5% (because 120/220 is about 0.545).

CodePudding user response：

The problem is that your code isn't vectorized in the way you want it to be. Your code takes the first value of column 1 and divides it by the column sum for column 1. Then it takes the second row of column 1 and divides it by the column sum for column 2 (which is still correct, because both column sums are the same, 220). But when it gets to the third row, it divides by the column sum for column 3 (i.e. 252), and that is not correct: 120/252 is about 47.6%, not 54.5%.
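To see the recycling concretely, here is a minimal sketch that uses only the first three columns of pnb3. The short column names (click, enroll, click1) and the object names d, sums and wrong are made up for illustration.

```r
# First three columns of pnb3, with shortened (hypothetical) names
d <- data.frame(click  = c(29L, 71L, 120L),
                enroll = c(30L, 84L, 106L),
                click1 = c(54L, 90L, 108L))

sums <- colSums(d)   # 220, 220, 252

# d / sums recycles the divisor down the rows of each column,
# so row 3 of every column is divided by 252
wrong <- d / sums
round(wrong$click * 100, 1)
# 13.2 32.3 47.6

# Dividing the column by its own total gives the expected share
round(d$click / sums[["click"]] * 100, 1)
# 13.2 32.3 54.5
```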

You can do:

library(dplyr)
pnb3 %>%
  mutate(across(everything(), ~round(./sum(.)*100, 1)))

Here's the result for the first few columns:

# A tibble: 3 x 26
  Likelihood.to.C~ Likelihood.to.E~ Likelihood.to.C~ Likelihood.to.E~
             <dbl>            <dbl>            <dbl>            <dbl>
1             13.2             13.6             21.4             21.8
2             32.3             38.2             35.7             43.3
3             54.5             48.2             42.9             34.9

CodePudding user response：

When you divide a data frame by a vector (as in pnb3 / colSums(pnb3)), R recycles the vector down the rows, column by column. It does not divide the first column by the first element, the second column by the second element, and so on. The only reason your result looks anywhere close to correct is that the column sums don't vary much.

We can see this more easily with a small example:

x = data.frame(a = 1:2, b = 101:102)
x
#   a   b
# 1 1 101
# 2 2 102
x / colSums(x)
#             a          b
# 1 0.333333333 33.6666667
# 2 0.009852217  0.5024631

The result is x$a / colSums(x) and x$b / colSums(x), but what you want is x$a / colSums(x)[1] and x$b / colSums(x)[2].

Better to apply a function to each column:

lapply(x, \(z) round(z / sum(z) * 100, 1)) |> as.data.frame()
#      a    b
# 1 33.3 49.8
# 2 66.7 50.2
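As a base R alternative to the per-column function above, sweep() and prop.table() can also divide each column by its own total. This is only a sketch on the same small x; the names pct and pct_m are made up for illustration.

```r
x <- data.frame(a = 1:2, b = 101:102)

# sweep() with MARGIN = 2 pairs each column with its own entry of colSums(x)
pct <- round(sweep(x, 2, colSums(x), "/") * 100, 1)
pct
#      a    b
# 1 33.3 49.8
# 2 66.7 50.2

# prop.table() does the same on a matrix; margin = 2 means "by column"
pct_m <- round(prop.table(as.matrix(x), margin = 2) * 100, 1)
```

Note that sweep() keeps the data frame, while prop.table() here returns a matrix, so wrap it in as.data.frame() if you need a data frame back.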