I have a dataframe with 4 columns.
set.seed(123)
df <- data.frame(A = round(rnorm(1000, mean = 1)),
B = rpois(1000, lambda = 3),
C = round(rnorm(1000, mean = -1)),
D = round(rnorm(1000, mean = 0)))
I would like to compute the distances for every possible combination of my columns (A-B, A-C, A-D, B-C, B-D, C-D) at every row of my dataframe.
This would be the equivalent of doing df$A - df$B
for every combination.
Can we use the dist()
function to compute this efficiently as I have a very large dataset? I would like to then convert the dist object into a data.frame
to plot the results with ggplot2
.
Unless there is a good tidy
version of doing the above.
Many Thanks
The closest I got was doing the below, but I am not sure to what the column names refer to.
d <- apply(as.matrix(df), 1, function(e) as.vector(dist(e)))
t(d)
CodePudding user response:
dist
will compare every value in a vector to every other value in the same vector, so if you are looking to compare columns row-by-row, this is not what you are looking for.
If you just want to calculate the difference between all columns pairwise, you can do:
df <- cbind(df,
do.call(cbind, lapply(asplit(combn(names(df), 2), 2), function(x) {
setNames(data.frame(df[x[1]] - df[x[2]]), paste(x, collapse = ""))
})))
head(df)
#> A B C D AB AC AD BC BD CD
#> 1 0 1 -2 -1 -1 2 1 3 2 -1
#> 2 1 1 -1 1 0 2 0 2 0 -2
#> 3 3 1 -2 -1 2 5 4 3 2 -1
#> 4 1 3 0 -1 -2 1 2 3 4 1
#> 5 1 3 0 1 -2 1 0 3 2 -1
#> 6 3 3 1 0 0 2 3 2 3 1
Created on 2022-06-14 by the reprex package (v2.0.1)
CodePudding user response:
Using base r:
df_dist <- t(apply(df, 1, dist))
colnames(df_dist) <- apply(combn(names(df), 2), 2, paste0, collapse = "_")
If you really want to use a tidy-approach, you could go with c_across
, but this also removes the names, and is much slower if your data is huge