gfg_data <- data.frame(
year = c(2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022),
Timings = c(5, 6, 4, 2, 3, 4, 11, 13, 15, 14, 17, 12)
)
This is a much simpler dataset than the one I'm actually using. Essentially, I'd like to find out which years are most similar in terms of timings. So I'd like to be able to see that 2019 and 2020 are similar, and that 2021 and 2022 are similar. My original dataset has 500 variables, so it won't be as simple as looking through the data and noting down what is similar.
CodePudding user response:
Given a distance of 5 (exclusive) as the threshold for clustering the values, you can try igraph like below
library(igraph)
library(dplyr)  # needed for mutate() and the pipe

gfg_data %>%
  # connect rows whose Timings differ by less than 5, then label
  # each connected component as a group
  mutate(group = graph_from_adjacency_matrix(as.matrix(dist(Timings)) < 5, "undirected") %>%
    components() %>%
    membership())
which gives
year Timings group
1 2019 5 1
2 2019 6 1
3 2019 4 1
4 2020 2 1
5 2020 3 1
6 2020 4 1
7 2021 11 2
8 2021 13 2
9 2021 15 2
10 2022 14 2
11 2022 17 2
12 2022 12 2
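Since the question asks about years rather than individual rows, you can collapse this to one label per year. A small sketch, assuming the result of the pipe above has been saved as res (a name introduced here for illustration):

# res: the data frame above with the added `group` column;
# one row per (year, group) pair -- safe here since no year's rows split
unique(res[c("year", "group")])

which returns one row per year: 2019 and 2020 in group 1, 2021 and 2022 in group 2.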
If you already have the number of clusters in mind, say 2, you can use kmeans like below
transform(gfg_data, group = as.integer(factor(kmeans(Timings, 2)$cluster)))
year Timings group
1 2019 5 1
2 2019 6 1
3 2019 4 1
4 2020 2 1
5 2020 3 1
6 2020 4 1
7 2021 11 2
8 2021 13 2
9 2021 15 2
10 2022 14 2
11 2022 17 2
12 2022 12 2
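Note that kmeans picks its starting centers at random, so on less well-separated data the grouping (and the label numbering) can differ between runs; fix the seed for a reproducible result:

set.seed(1)  # any fixed seed; makes the random initialization repeatable
transform(gfg_data, group = as.integer(factor(kmeans(Timings, 2)$cluster)))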
CodePudding user response:
The question did not specify what "close" means, so we will use correlation. We see that 2019 and 2020 have a correlation of -0.5, 2019 and 2022 have a correlation of 0.99, and 2021 and 2022 have a correlation of -0.397.
# one column of Timings per year
m <- do.call("cbind", with(gfg_data, split(Timings, year)))
cor(m)
## 2019 2020 2021 2022
## 2019 1.0000000 -0.5000000 -0.5000000 0.9933993
## 2020 -0.5000000 1.0000000 1.0000000 -0.3973597
## 2021 -0.5000000 1.0000000 1.0000000 -0.3973597
## 2022 0.9933993 -0.3973597 -0.3973597 1.0000000
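To turn this similarity matrix into groups, one option (a sketch) is to convert correlation to a distance and cluster it:

# 1 - correlation: highly correlated years become close
cutree(hclust(as.dist(1 - cor(m))), k = 2)

Here that puts 2019 with 2022 and 2020 with 2021, consistent with the correlations above.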
Another possibility is the root of the sum of squared differences, i.e. the Euclidean distance
rms <- function(i, j) sqrt(sum((m[, i] - m[, j])^2))  # Euclidean distance between year columns
nc <- ncol(m)
matrix(outer(1:nc, 1:nc, Vectorize(rms)), nc, nc,
       dimnames = list(colnames(m), colnames(m)))
## 2019 2020 2021 2022
## 2019 0.000000 4.242641 14.352700 16.309506
## 2020 4.242641 0.000000 17.378147 20.099751
## 2021 14.352700 17.378147 0.000000 5.830952
## 2022 16.309506 20.099751 5.830952 0.000000
or the maximum absolute value of the difference
maxabs <- function(i, j) max(abs(m[, i] - m[, j]))  # largest single-timing gap
nc <- ncol(m)
out.maxabs <- outer(1:nc, 1:nc, Vectorize(maxabs))
dimnames(out.maxabs) <- list(colnames(m), colnames(m))
out.maxabs
##      2019 2020 2021 2022
## 2019    0    3   11   11
## 2020    3    0   11   14
## 2021   11   11    0    4
## 2022   11   14    4    0
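Both of these are genuine distance matrices, so either can be fed to hclust the same way, e.g.

cutree(hclust(as.dist(out.maxabs)), k = 2)

which groups 2019 with 2020 and 2021 with 2022, the split the question is after.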
CodePudding user response:
Another approach is hierarchical clustering, e.g. with hclust
cbind(gfg_data, grp = cutree(hclust(dist(gfg_data$Timings)), k=2))
year Timings grp
1 2019 5 1
2 2019 6 1
3 2019 4 1
4 2020 2 1
5 2020 3 1
6 2020 4 1
7 2021 11 2
8 2021 13 2
9 2021 15 2
10 2022 14 2
11 2022 17 2
12 2022 12 2
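With the real data (500 variables) the same idea scales: build a matrix with one row per year and one column per variable, then cluster the rows. A sketch under that assumption, shown on the toy data with an illustrative mean aggregation:

wide <- aggregate(. ~ year, gfg_data, mean)   # one row per year, one column per variable
rownames(wide) <- wide$year
# scale() so no single variable dominates the distances
cutree(hclust(dist(scale(wide[, -1, drop = FALSE]))), k = 2)

On the toy data this again yields the 2019/2020 vs 2021/2022 split.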