Algorithm/method for grouping items based on their relative distance


I'm looking for a method to classify a set of items based on their relative distance.

For example, assume we have 4 cities and we know their pairwise distances:

        city1  city2  city3  city4
city1     0     2.1    2.2    3.4
city2    2.1     0     2.2    2.1
city3    2.2    2.2     0     1.4
city4    3.4    2.1    1.4     0

If we try to categorize these 4 cities into two categories, intuitively we'll end up with

(city1, city2) and (city3, city4)

by minimizing the distances within each group.

Based on my understanding of K-Means, that method works when the items live in a common coordinate space (e.g., positions along an X-axis). Here we don't know how the distances were computed, so there is no obvious way to decompose them into shared coordinates.

Is there a method to categorize these cities?

CodePudding user response:

You can try solving the task as a graph problem. Let each city be a vertex and each road between two cities be an edge. Then you can use minimum spanning tree clustering: you start building the minimum spanning tree, but stop partway through, once you have the required number of clusters.

Let's do it with the help of Kruskal's algorithm (a code sketch follows the steps below):

  1. We start from 4 clusters: {city1}, {city2}, {city3}, {city4}
  2. We take the shortest edge, city3 - city4 (1.4), merge its two endpoint clusters, and have 3 clusters: {city1}, {city2}, {city3, city4}
  3. We take the next shortest edge: it is either city1 - city2 or city2 - city4 (both have length 2.1); depending on tie resolution we get 2 clusters: either {city1, city2}, {city3, city4} or {city1}, {city2, city3, city4}
  4. Here (at 2 clusters) we stop and end up with either the {city1, city2}, {city3, city4} clustering or the {city1}, {city2, city3, city4} clustering
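
A minimal sketch of this procedure in Python (the function name kruskal_clusters and its union-find helper are mine, not from any library): it sorts all edges by length and keeps merging the clusters joined by the shortest remaining edge until the requested number of clusters is left.

    # Kruskal-style clustering: merge clusters along the shortest edges
    # until only n_clusters groups remain.
    def kruskal_clusters(dist, n_clusters):
        n = len(dist)
        parent = list(range(n))

        def find(i):
            # Union-find root lookup with path compression.
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # All edges (i, j) with i < j, shortest first.
        edges = sorted(
            (dist[i][j], i, j) for i in range(n) for j in range(i + 1, n)
        )

        clusters = n
        for d, i, j in edges:
            if clusters == n_clusters:
                break
            ri, rj = find(i), find(j)
            if ri != rj:          # edge connects two different clusters
                parent[ri] = rj   # merge them
                clusters -= 1

        # Group city indices by their cluster representative.
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    distances = [
        [0.0, 2.1, 2.2, 3.4],
        [2.1, 0.0, 2.2, 2.1],
        [2.2, 2.2, 0.0, 1.4],
        [3.4, 2.1, 1.4, 0.0],
    ]
    print(kruskal_clusters(distances, 2))
    # e.g. [[0, 1], [2, 3]] -- ties may give the other valid split
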

You can also use a slightly less efficient, but easier-to-code approach (a SciPy sketch follows the list below):

  1. Build the minimum spanning tree with any algorithm (Kruskal's, Prim's) using any library (say, scipy)
  2. Now you have all the cities in one cluster
  3. Start dropping the longest edges from this tree: with the one longest edge dropped you'll get 2 clusters (subtrees), with the k longest edges removed - k + 1 clusters.
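
A rough sketch of that approach, assuming SciPy is available (the variable names are mine): build the tree with scipy.sparse.csgraph.minimum_spanning_tree, zero out the k - 1 longest tree edges, and read the clusters off as connected components.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    distances = np.array([
        [0.0, 2.1, 2.2, 3.4],
        [2.1, 0.0, 2.2, 2.1],
        [2.2, 2.2, 0.0, 1.4],
        [3.4, 2.1, 1.4, 0.0],
    ])
    n_clusters = 2

    # Minimum spanning tree of the fully connected city graph.
    mst = minimum_spanning_tree(distances).toarray()

    # Drop the (n_clusters - 1) longest tree edges.
    rows, cols = np.nonzero(mst)
    order = np.argsort(mst[rows, cols])
    for idx in order[len(order) - (n_clusters - 1):]:
        mst[rows[idx], cols[idx]] = 0

    # Each remaining subtree is one cluster.
    n_found, labels = connected_components(mst, directed=False)
    print(labels)

With the example matrix this typically prints labels like [0 0 1 1], i.e. {city1, city2} and {city3, city4}, although ties among the edge lengths can produce the other valid split, just as in the Kruskal walkthrough above.
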