I'm trying to calculate the Gower distance with gower::gower_dist() between one subset of a nested list and each of the other subsets.
That is, I have a nested list in which each subset contains ten rows.
I would like to:
1. Calculate gower::gower_dist() between the first set of 10 rows and the second set, then between the first and the third, and so on.
2. Calculate the mean value for each comparison.
3. Order the results from highest to lowest to identify the comparison with the highest mean value.
A reproducible example:
list_to_split <- data.frame(rnorm(100), rnorm(100), rnorm(100))
names(list_to_split) <- c("var_1", "var_2", "var_3")
n <- 10
nr <- nrow(list_to_split)
nested_list <- split(list_to_split[, 1:3], rep(1:ceiling(nr / n), each = n, length.out = nr))
Below is a piece of the calculation I intend to do:
dt_1 <- list_to_split[c(1:10),]
dt_2 <- list_to_split[c(11:20),]
gower_test <- gower::gower_dist(dt_1, dt_2)
mean(gower_test[[1]])  # note: [[1]] selects only the first distance; mean(gower_test) would average all ten
> gower::gower_dist(dt_1, dt_2)
[1] 0.45988316 0.04906887 0.31952329 0.54794324 0.23139261 0.26743197 0.27649944 0.35229745 0.19163644 0.20118909
> mean(gower_test[[1]])
[1] 0.4598832
The example above only compares the first set with the second. I would like to do this for the entire list and test all combinations.
Answer:
library(tidyr)
library(purrr)
library(dplyr)
# create a nested data frame with one row per block of 10 rows
# (rows 1..10, 11..20, etc.)
df <- list_to_split %>%
  mutate(row_id = (row_number() - 1) %/% 10 + 1) %>%
  group_by(row_id) %>%
  nest()
# create cartesian product
crossing(a = df, b = df) %>%
  # compute gdist for each combo
  mutate(gdist = map2(a$data, b$data, gower::gower_dist)) %>%
  # compute avg value for each
  mutate(gavg = map_dbl(gdist, mean)) %>%
  # order
  arrange(-gavg)
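
For comparison, here is a minimal base R sketch of the same idea that works directly on the nested_list from the question. It is not the approach above, just an alternative under a few assumptions: set.seed() is added only to make the run repeatable, the names pairs, mean_gower, result, chunk_a and chunk_b are made up for illustration, and each pair of chunks is compared once (no self-comparisons, no mirrored pairs).

```r
library(gower)

set.seed(42)  # assumption: fixed seed, only so the sketch is repeatable

# Same setup as in the question
list_to_split <- data.frame(var_1 = rnorm(100), var_2 = rnorm(100), var_3 = rnorm(100))
n  <- 10
nr <- nrow(list_to_split)
nested_list <- split(list_to_split, rep(1:ceiling(nr / n), each = n, length.out = nr))

# All unordered pairs of chunk indices, one pair per row (45 rows for 10 chunks)
pairs <- t(combn(seq_along(nested_list), 2))

# Mean Gower distance for each pair of chunks
mean_gower <- apply(pairs, 1, function(p) {
  mean(gower::gower_dist(nested_list[[p[1]]], nested_list[[p[2]]]))
})

# Collect the results and sort from highest to lowest mean
result <- data.frame(chunk_a = pairs[, 1], chunk_b = pairs[, 2], mean_gower = mean_gower)
result <- result[order(result$mean_gower, decreasing = TRUE), ]
head(result)
```

If only the comparisons involving the first chunk are needed, result[result$chunk_a == 1, ] keeps those nine rows, still sorted by mean.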