I'm trying to make a function to calculate percent difference for all pair combinations within groups


I need to calculate the percent difference between all pairs of values in column y: for example, the difference between B 1 and B 2, then the difference between B 1 and B 3, and so on for all combinations of B, then the same thing for all combinations of D.

Here is some example data...

structure(list(Levelname = c("B 1", "B 2", "B 3", 
"B 4", "D 1", "D 2", "D 3", "D 4"), y = c(0.679428655093332, 
1.07554328679719, 0.883000346050764, 0.791772867506205, 0.538143790501689, 
0.805122127560562, 0.591353204313314, 0.795225886492002), fill = c("midnightblue", 
"dodgerblue4", "steelblue3", "lightskyblue", "midnightblue", 
"dodgerblue4", "steelblue3", "lightskyblue"), species = c("White Grunt", 
"White Grunt", "White Grunt", "White Grunt", "White Grunt", "White Grunt", 
"White Grunt", "White Grunt")), row.names = c(NA, -8L), class = "data.frame")

My ideal output would be a dataframe with some sort of identifier like

Pair           Percent Difference
B 1 - B 2      45.142
B 1 - B 3      .....
B 1 - B 4      .....
B 2 - B 3      .....
B 2 - B 4      .....
B 3 - B 4      .....
D 1 - D 2      .....
D 1 - D 3      .....
D 1 - D 4      .....
D 2 - D 3      .....
D 2 - D 4      .....
D 3 - D 4      .....
where ..... are the percent differences

I don't care about the differences between B and D. I'm also trying to get better at functions, for loops, and the apply functions of R, so answers that use those (or a mix of them) would be great.

I tried to look at these answers but I couldn't figure them out:

Loops in R - Need to use index, anyway to avoid 'for'?
How can I calculate the percentage change within a group for multiple columns in R?

I calculated the 45.142 like this:

   |B1 − B2| / [(B1 + B2)/2] × 100
    = |0.67942865509333 − 1.0755433| / [(0.67942865509333 + 1.0755433)/2] × 100
    = 0.39611464490667 / (1.7549719550933/2) × 100
    = 0.39611464490667 / 0.87748597754667 × 100
    = 0.45142 × 100
    = 45.142% difference
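
That formula translates directly into a small R helper (a minimal sketch; pct_diff is just an illustrative name):

pct_diff <- function(a, b) abs(a - b) / ((a + b) / 2) * 100  # |a - b| over the mean of a and b, as a percentage
pct_diff(0.679428655093332, 1.07554328679719)  # ~45.142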

CodePudding user response:

Using tidyverse:

library(tidyverse)

df %>%
  group_by(grp = str_extract(Levelname, "\\w+")) %>%   # group by the letter prefix
  summarise(pair = combn(Levelname, 2, str_c, collapse = " - "),  # all pair labels per group
            perc_diff = combn(y, 2, function(x) 200*abs(diff(x))/sum(x)),  # percent difference per pair
            .groups = 'drop')



A tibble: 12 x 3
   grp   pair      perc_diff
   <chr> <chr>         <dbl>
 1 B     B 1 - B 2     45.1 
 2 B     B 1 - B 3     26.1 
 3 B     B 1 - B 4     15.3 
 4 B     B 2 - B 3     19.7 
 5 B     B 2 - B 4     30.4 
 6 B     B 3 - B 4     10.9 
 7 D     D 1 - D 2     39.8 
 8 D     D 1 - D 3      9.42
 9 D     D 1 - D 4     38.6 
10 D     D 2 - D 3     30.6 
11 D     D 2 - D 4      1.24
12 D     D 3 - D 4     29.4 
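
Since the question also asks about for loops and the apply family, here is a base-R sketch of the same grouped logic (assuming the data frame is named df as above; Percent_Difference is just an illustrative column name):

# split by the letter prefix of Levelname, then loop over every pair of rows
groups <- split(df, sub(" .*", "", df$Levelname))

res <- do.call(rbind, lapply(groups, function(g) {
  idx <- combn(nrow(g), 2)                    # each column is one pair of row indices
  out <- data.frame(Pair = character(ncol(idx)),
                    Percent_Difference = numeric(ncol(idx)))
  for (k in seq_len(ncol(idx))) {
    i <- idx[1, k]; j <- idx[2, k]
    out$Pair[k] <- paste(g$Levelname[i], g$Levelname[j], sep = " - ")
    out$Percent_Difference[k] <- abs(g$y[i] - g$y[j]) / ((g$y[i] + g$y[j]) / 2) * 100
  }
  out
}))
rownames(res) <- NULL                         # drop the "B.1", "B.2", ... row names rbind creates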

CodePudding user response:

We can use outer twice: once on the y values to do the calculation, and once on the Levelname values with paste to build the pair labels, taking the upper.tri entries in each case.

f <- \(x, y) abs(x - y)*100 / ((x + y) / 2)  ## your p_diff formula

p_diff <- outer(dt$y, dt$y, f) |>            # all pairwise percent differences
  {\(x) abs(x[upper.tri(x)])}() |>           # keep each unordered pair once
  round(3)

Pair <- outer(dt$Levelname, dt$Levelname, paste, sep = ' - ') |>
  {\(x) x[upper.tri(x)]}()                   # matching pair labels

res <- data.frame(Pair, p_diff)
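
Indexing with upper.tri() keeps every entry strictly above the diagonal, in column-major order, so each unordered pair appears exactly once. A quick illustration of that pattern on a toy matrix:

m <- outer(1:3, 1:3, paste)  # 3 x 3 matrix of "row col" strings
m[upper.tri(m)]
# [1] "1 2" "1 3" "2 3"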

Result

res
#         Pair p_diff
# 1  B 1 - B 2 45.142
# 2  B 1 - B 3 26.058
# 3  B 2 - B 3 19.662
# 4  B 1 - B 4 15.272
# 5  B 2 - B 4 30.393
# 6  B 3 - B 4 10.894
# 7  B 1 - D 1 23.208
# 8  B 2 - D 1 66.605
# 9  B 3 - D 1 48.532
# 10 B 4 - D 1 38.142
# 11 B 1 - D 2 16.934
# 12 B 2 - D 2 28.758
# 13 B 3 - D 2  9.227
# 14 B 4 - D 2  1.672
# 15 D 1 - D 2 39.751
# 16 B 1 - D 3 13.862
# 17 B 2 - D 3 58.095
# 18 B 3 - D 3 39.563
# 19 B 4 - D 3 28.981
# 20 D 1 - D 3  9.422
# 21 D 2 - D 3 30.615
# 22 B 1 - D 4 15.705
# 23 B 2 - D 4 29.968
# 24 B 3 - D 4 10.460
# 25 B 4 - D 4  0.435
# 26 D 1 - D 4 38.561
# 27 D 2 - D 4  1.237
# 28 D 3 - D 4 29.407
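
Note that this result also contains cross-group pairs such as B 1 - D 1, which the question says it doesn't need. If only within-group comparisons are wanted, one option (a sketch assuming the group is the letter prefix of Levelname) is to mask by group rather than using upper.tri alone:

grp  <- sub(" .*", "", dt$Levelname)   # "B" or "D" prefix
same <- outer(grp, grp, `==`)          # TRUE where both values share a group
keep <- same & upper.tri(same)         # within-group pairs, each counted once

p_diff <- round(outer(dt$y, dt$y, f)[keep], 3)
Pair   <- outer(dt$Levelname, dt$Levelname, paste, sep = ' - ')[keep]
res    <- data.frame(Pair, p_diff)     # 12 within-group rows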

Benchmark

I doubted that the tidy approach was faster, and I was right: in the benchmark below, run on data scaled up to 1,000 rows, the outer approach is almost 20 times faster.

f1 <- \() data.frame(p_diff = outer(dt$y, dt$y, f) |>
                       {\(x) abs(x[upper.tri(x)])}() |>
                       round(3),
                     Pair = outer(dt$Levelname, dt$Levelname, paste, sep = ' - ') |>
                       {\(x) x[upper.tri(x)]}())

library(dplyr); library(stringr)
f2 <- \() dt %>%
  group_by(grp = str_extract(Levelname, "\\w+")) %>%
  summarise(pair = combn(Levelname, 2, str_c, collapse = " - "),
            perc_diff = combn(y, 2, function(x) 200*abs(diff(x))/sum(x)),
            .groups = 'drop')

dt <- dt[sample(nrow(dt), 1e3, replace = TRUE), ]  # scale the data up to 1,000 rows

microbenchmark::microbenchmark(outer=f1(), tidyverse=f2(), times=3L)
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval cld
#     outer  236.7207  243.0496  306.1265  249.3785  340.8294  432.2804     3  a 
# tidyverse 4819.3476 4830.7364 4838.5051 4842.1251 4848.0839 4854.0427     3   b