gfg_data <- data.frame(
year = c(2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022),
Timings = c(5, 6, 4, 2, 3, 4, 11, 13, 15, 14, 17, 12)
)
This is a much simpler dataset than the one I'm actually using. Essentially, I'd like to find out which years are most similar in terms of timings. So I'd like to be able to see that 2019 and 2020 are similar, and that 2021 and 2022 are similar. My original dataset has 500 variables, so it won't be as simple as looking through the data and noting down what is similar.
CodePudding user response:
Given a distance of 5 (exclusive) as the threshold for clustering the values, you can try igraph like below
library(igraph)
library(dplyr)  # needed for mutate() and the pipe

gfg_data %>%
  # connect rows whose Timings differ by less than 5, then label
  # each connected component as a group
  mutate(group = graph_from_adjacency_matrix(as.matrix(dist(Timings)) < 5, "undirected") %>%
    components() %>%
    membership())
which gives
year Timings group
1 2019 5 1
2 2019 6 1
3 2019 4 1
4 2020 2 1
5 2020 3 1
6 2020 4 1
7 2021 11 2
8 2021 13 2
9 2021 15 2
10 2022 14 2
11 2022 17 2
12 2022 12 2
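Since the question asks about years rather than individual rows, you can collapse this to one label per year. A small sketch, assuming the result of the pipe above has been saved as res (a name introduced here for illustration):

# res: the data frame above with the added `group` column;
# one row per (year, group) pair -- safe here since no year's rows split
unique(res[c("year", "group")])

which returns one row per year: 2019 and 2020 in group 1, 2021 and 2022 in group 2.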
If you already have the number of clusters in mind, say 2, you can use kmeans like below
transform(gfg_data, group = as.integer(factor(kmeans(Timings, 2)$cluster)))
year Timings group
1 2019 5 1
2 2019 6 1
3 2019 4 1
4 2020 2 1
5 2020 3 1
6 2020 4 1
7 2021 11 2
8 2021 13 2
9 2021 15 2
10 2022 14 2
11 2022 17 2
12 2022 12 2
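Note that kmeans picks its starting centers at random, so on less well-separated data the grouping (and the label numbering) can differ between runs; fix the seed for a reproducible result:

set.seed(1)  # any fixed seed; makes the random initialization repeatable
transform(gfg_data, group = as.integer(factor(kmeans(Timings, 2)$cluster)))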
CodePudding user response:
The question did not specify what "close" means, so we will use correlation. We see that 2019 and 2020 have a correlation of -0.5, 2019 and 2022 have a correlation of 0.99, and 2021 and 2022 have a correlation of -0.397.
# one column of Timings per year
m <- do.call("cbind", with(gfg_data, split(Timings, year)))
cor(m)
## 2019 2020 2021 2022
## 2019 1.0000000 -0.5000000 -0.5000000 0.9933993
## 2020 -0.5000000 1.0000000 1.0000000 -0.3973597
## 2021 -0.5000000 1.0000000 1.0000000 -0.3973597
## 2022 0.9933993 -0.3973597 -0.3973597 1.0000000
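To turn this similarity matrix into groups, one option (a sketch) is to convert correlation to a distance and cluster it:

# 1 - correlation: highly correlated years become close
cutree(hclust(as.dist(1 - cor(m))), k = 2)

Here that puts 2019 with 2022 and 2020 with 2021, consistent with the correlations above.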
Another possibility is the root of the sum of squared differences, i.e. the Euclidean distance
rms <- function(i, j) sqrt(sum((m[, i] - m[, j])^2))  # Euclidean distance between year columns
nc <- ncol(m)
matrix(outer(1:nc, 1:nc, Vectorize(rms)), nc, nc,
       dimnames = list(colnames(m), colnames(m)))
## 2019 2020 2021 2022
## 2019 0.000000 4.242641 14.352700 16.309506
## 2020 4.242641 0.000000 17.378147 20.099751
## 2021 14.352700 17.378147 0.000000 5.830952
## 2022 16.309506 20.099751 5.830952 0.000000
or the maximum absolute value of the difference
maxabs <- function(i, j) max(abs(m[, i] - m[, j]))  # largest single-timing gap
nc <- ncol(m)
out.maxabs <- outer(1:nc, 1:nc, Vectorize(maxabs))
dimnames(out.maxabs) <- list(colnames(m), colnames(m))
out.maxabs
##      2019 2020 2021 2022
## 2019    0    3   11   11
## 2020    3    0   11   14
## 2021   11   11    0    4
## 2022   11   14    4    0
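Both of these are genuine distance matrices, so either can be fed to hclust the same way, e.g.

cutree(hclust(as.dist(out.maxabs)), k = 2)

which groups 2019 with 2020 and 2021 with 2022, the split the question is after.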
CodePudding user response:
Another approach is hierarchical clustering, e.g. with hclust
cbind(gfg_data, grp = cutree(hclust(dist(gfg_data$Timings)), k=2))
year Timings grp
1 2019 5 1
2 2019 6 1
3 2019 4 1
4 2020 2 1
5 2020 3 1
6 2020 4 1
7 2021 11 2
8 2021 13 2
9 2021 15 2
10 2022 14 2
11 2022 17 2
12 2022 12 2
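With the real data (500 variables) the same idea scales: build a matrix with one row per year and one column per variable, then cluster the rows. A sketch under that assumption, shown on the toy data with an illustrative mean aggregation:

wide <- aggregate(. ~ year, gfg_data, mean)   # one row per year, one column per variable
rownames(wide) <- wide$year
# scale() so no single variable dominates the distances
cutree(hclust(dist(scale(wide[, -1, drop = FALSE]))), k = 2)

On the toy data this again yields the 2019/2020 vs 2021/2022 split.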