So I need to calculate the percent difference of all combinations of values in column y
. For example the difference between B 1 and B 2. Then the Difference between B 1 and B 3, and so on for all combinations of B. Then the same thing for all combinations of D.
Here is some example data...
structure(list(Levelname = c("B 1", "B 2", "B 3",
"B 4", "D 1", "D 2", "D 3", "D 4"), y = c(0.679428655093332,
1.07554328679719, 0.883000346050764, 0.791772867506205, 0.538143790501689,
0.805122127560562, 0.591353204313314, 0.795225886492002), fill = c("midnightblue",
"dodgerblue4", "steelblue3", "lightskyblue", "midnightblue",
"dodgerblue4", "steelblue3", "lightskyblue"), species = c("White Grunt",
"White Grunt", "White Grunt", "White Grunt", "White Grunt", "White Grunt",
"White Grunt", "White Grunt")), row.names = c(NA, -8L), class = "data.frame")
My ideal output would be a dataframe with some sort of identifier like
Pair Percent Difference
B 1 - B 2 45.142
B 1 - B 3 .....
B 1 - B 4 .....
B 2 - B 3 .....
B 2 - B 4 .....
B 3 - B 4 .....
D 1 - D 2 .....
D 1 - D 3 .....
D 1 - D 4 .....
D 2 - D 3 .....
D 2 - D 4 .....
D 3 - D 4 .....
where ..... are the percent differences
I don't care about the differences between B and D. Also I'm trying to get better at functions, for
loops, and the apply
functions of r, so if answers can use those or a variety of those that would be great.
I tried to look at these answers but I couldn't figure it out...
Loops in R - Need to use index, anyway to avoid 'for'? How can I calculate the percentage change within a group for multiple columns in R?
The 45.142
I calculated using this
|B1−B2|/[(B1 B2)/2]×100=?
=|0.67942865509333−1.0755433|/[(0.67942865509333 1.0755433)/2]×100
=|−0.39611464490667|/[1.7549719550933/2]×100
=0.39611464490667/0.87748597754667×100
=0.45142×100
=45.142% difference
CodePudding user response:
Using tidyverse:
library(tidyverse)
df %>%
group_by(grp = str_extract(Levelname, "\\w "))%>%
summarise(pair = combn(Levelname, 2, str_c, collapse = " - "),
perc_diff = combn(y, 2, function(x) 200*abs(diff(x))/sum(x)),
.groups = 'drop')
A tibble: 12 x 3
grp pair perc_diff
<chr> <chr> <dbl>
1 B B 1 - B 2 45.1
2 B B 1 - B 3 26.1
3 B B 1 - B 4 15.3
4 B B 2 - B 3 19.7
5 B B 2 - B 4 30.4
6 B B 3 - B 4 10.9
7 D D 1 - D 2 39.8
8 D D 1 - D 3 9.42
9 D D 1 - D 4 38.6
10 D D 2 - D 3 30.6
11 D D 2 - D 4 1.24
12 D D 3 - D 4 29.4
CodePudding user response:
We can use outer
, calculations with the y
values and paste
with the Levelnames
, where we just use the upper.tri
in each case.
f <- \(x, y) abs(x - y)*100 / ((x y) / 2) ## your p_diff formula
p_diff <- outer(dt$y, dt$y, f) |>
{\(x) abs(x[upper.tri(x)])}() |>
round(3)
Pair <- outer(dt$Levelname, dt$Levelname, paste, sep=' - ')|>
{\(x) x[upper.tri(x)]}()
res <- data.frame(Pair, p_diff)
Result
res
# Pair p_diff
# 1 B 1 - B 2 45.142
# 2 B 1 - B 3 26.058
# 3 B 2 - B 3 19.662
# 4 B 1 - B 4 15.272
# 5 B 2 - B 4 30.393
# 6 B 3 - B 4 10.894
# 7 B 1 - D 1 23.208
# 8 B 2 - D 1 66.605
# 9 B 3 - D 1 48.532
# 10 B 4 - D 1 38.142
# 11 B 1 - D 2 16.934
# 12 B 2 - D 2 28.758
# 13 B 3 - D 2 9.227
# 14 B 4 - D 2 1.672
# 15 D 1 - D 2 39.751
# 16 B 1 - D 3 13.862
# 17 B 2 - D 3 58.095
# 18 B 3 - D 3 39.563
# 19 B 4 - D 3 28.981
# 20 D 1 - D 3 9.422
# 21 D 2 - D 3 30.615
# 22 B 1 - D 4 15.705
# 23 B 2 - D 4 29.968
# 24 B 3 - D 4 10.460
# 25 B 4 - D 4 0.435
# 26 D 1 - D 4 38.561
# 27 D 2 - D 4 1.237
# 28 D 3 - D 4 29.407
Benchmark
I doubted that the tidy approach was faster, and I was right. Here I provide a benchmark comparing the solutions so far. Accordingly, the outer
approach is almost 20 times faster.
f1 <- \() data.frame(p_diff=outer(dt$y, dt$y, f) |>
{\(x) abs(x[upper.tri(x)])}() |>
round(3),
Pair=outer(dt$Levelname, dt$Levelname, paste, sep=' - ')|>
{\(x) x[upper.tri(x)]}())
library(dplyr);library(stringr)
f2 <- \() dt %>%
group_by(grp = str_extract(Levelname, "\\w "))%>%
summarise(pair = combn(Levelname, 2, str_c, collapse = " - "),
perc_diff = combn(y, 2, function(x) 200*abs(diff(x))/sum(x)),
.groups = 'drop')
dt <- dt[sample(nrow(dt), 1e3, replace=T), ]
microbenchmark::microbenchmark(outer=f1(), tidyverse=f2(), times=3L)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# outer 236.7207 243.0496 306.1265 249.3785 340.8294 432.2804 3 a
# tidyverse 4819.3476 4830.7364 4838.5051 4842.1251 4848.0839 4854.0427 3 b