I'm looking for a method to classify a set of items based on their relative distance.
For example assume we have 4 cities and we know their relative distance:
|       | city1 | city2 | city3 | city4 |
|-------|-------|-------|-------|-------|
| city1 | 0     | 2.1   | 2.2   | 3.4   |
| city2 | 2.1   | 0     | 2.2   | 2.1   |
| city3 | 2.2   | 2.2   | 0     | 1.4   |
| city4 | 3.4   | 2.1   | 1.4   | 0     |
If we try to categorize these 4 cities into two categories, intuitively we'll end up with
(city1, city2) and (city3, city4)
by minimizing the distances within each group.
Based on my understanding of K-Means, that method works when the items have coordinates in a common feature space (say, positions along an axis). Here we only know the pairwise distances, not how they were computed, so there is no obvious way to decompose the distances into shared coordinates.
Is there a method to categorize these cities?
CodePudding user response:
You can try solving the task as a graph problem: let each city be a vertex and each road between two cities be an edge. Then you can use minimum-spanning-tree clustering: you start building the minimum spanning tree, but stop partway through the process, once you have the required number of clusters.
Let's do it with the help of Kruskal's algorithm:
- We start with 4 clusters: `{city1}, {city2}, {city3}, {city4}`.
- We take the shortest edge, `city3 - city4` (length `1.4`), merge its endpoints, and have 3 clusters: `{city1}, {city2}, {city3, city4}`.
- The next shortest edge is either `city1 - city2` or `city2 - city4` (both have length `2.1`). Depending on tie resolution we have 2 clusters: either `{city1, city2}, {city3, city4}` or `{city1}, {city2, city3, city4}`.
- Here (at 2 clusters) we stop and get either `{city1, city2}, {city3, city4}` or `{city1}, {city2, city3, city4}` as the clustering.
You can also use a slightly less efficient, but easier-to-code approach:
- Build the minimum spanning tree with any algorithm (Kruskal's, Prim's) using any library (say, `scipy`). Now you have all the cities in one cluster.
- Start dropping the longest edges from this tree: with the one longest edge dropped you'll get 2 clusters (subtrees); with the k longest edges removed, k+1 clusters.
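A sketch of this second approach using `scipy.sparse.csgraph` (in the dense input, zeros are treated as missing edges, which is fine here since all real distances are positive):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# Pairwise distance matrix from the question
D = np.array([
    [0.0, 2.1, 2.2, 3.4],
    [2.1, 0.0, 2.2, 2.1],
    [2.2, 2.2, 0.0, 1.4],
    [3.4, 2.1, 1.4, 0.0],
])

# Build the minimum spanning tree; nonzero entries are the tree edges
mst = minimum_spanning_tree(D).toarray()

# Drop the k longest tree edges to split the tree into k+1 clusters
k = 1
rows, cols = np.nonzero(mst)
order = np.argsort(mst[rows, cols])   # edge indices, shortest first
for i in order[-k:]:                  # the k longest edges
    mst[rows[i], cols[i]] = 0.0

# The remaining connected components are the clusters
n_clusters, labels = connected_components(mst, directed=False)
print(n_clusters, labels)
```

As with Kruskal's algorithm above, the `2.1` tie means the split can come out as either of the two clusterings, but `city3` and `city4` always share a cluster.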