How do I find the largest cluster in this simple dataset?


I have data on users and their interests. Some users have more interests than others. The data looks like the example below.

How do I find the largest cluster of users with the most interests in common? Formally, I am trying to maximize (number of users in cluster * number of shared interests in cluster).


In the data below, the largest cluster is:

CORRECT ANSWER

Users: [1,2,3]

Interests: [2,3]

Cluster-value: 3 users x 2 shared interests = 6


DATA

User 1: {3,2}

User 2: {3,2,4}

User 3: {2,3,8}

User 4: {7}

User 5: {7}

User 6: {9}

How do I find the largest cluster of users with the most interests in common?

Here is a hypothetical data-generation process:

import random 


# Generate 300 unique random (user, interest) tuples
def generate_data():
  data = []
  while len(data) < 300:
    data_pt = {"user": random.randint(1, 100), "interest": random.randint(1, 50)}
    if data_pt not in data:
      data.append(data_pt)
  return data

def largest_cluster(data):
  return None  # TODO: this is the function I'm trying to implement


UPDATE: As somebody pointed out, the data was too sparse. In the real case, there would be more users than interests, so I have updated the data-generating process.

CodePudding user response:

This looks to me like the kind of combinatorial optimization problem that falls into the NP-hard complexity class, which would of course mean that it's intractable to find an exact solution for instances with more than ~30 users.

Dynamic Programming would be the tool you'd want to employ if you were to find a usable algorithm for a problem with an exponential search space like this (here the solution space is all 2^n subsets of users), but I don't see DP helping us here because of the lack of overlapping sub-problems. That is, for DP to help, we have to be able to use and combine solutions to smaller sub-problems into an overall solution in polynomial time, and I don't see how we can do that for this problem.

Imagine you have a solution for a size-k problem, using a limited subset of the users {u1, u2, ..., uk}, and you want to use that solution to find the new solution when you add another user u(k+1). The issue is that the solution set in the incrementally larger instance might not overlap at all with the previous solution (it may be an entirely different group of users/interests), so we can't effectively combine solutions to subproblems to get the overall solution. And if, instead of using just the single optimal solution for the size-k problem to reason about the size-(k+1) problem, you stored all possible user combinations from the smaller instance along with their scores, you could of course quite easily take set intersections of each group's interests with the new user's interests to find the new optimal solution. The problem with this approach is that the information you have to store would double with each iteration, yielding an exponential-time algorithm no better than the brute force solution. You run into similar problems if you try to base your DP on incrementally adding interests rather than users.
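To make that blow-up concrete, here is a minimal sketch of the combination-storing scheme just described (assuming a user -> interest-set dict; all names are mine, not from the question):

# Sketch of the "store every combination" incremental approach; `frontier`
# holds one entry per user subset, so it doubles with each user added --
# exponential space and time, no better than brute force.
def incremental_all_combinations(user_interests):
  frontier = [(frozenset(), None)]  # (user subset, shared interests so far)
  best = (frozenset(), frozenset(), 0)
  for user, interests in user_interests.items():
    new_entries = []
    for cluster, shared in frontier:
      merged = interests if shared is None else shared & interests
      new_cluster = cluster | {user}
      score = len(new_cluster) * len(merged)
      if score > best[2]:
        best = (new_cluster, merged, score)
      new_entries.append((new_cluster, merged))
    frontier.extend(new_entries)  # frontier doubles here
  return best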

So if you know you only have a few users, you can use the brute force approach: generate all user combinations, take the set intersection of each combination's interests, score it, and keep the max (see the sketch below). The best way to approach larger instances would probably be approximate solutions through search algorithms (unless there is a DP solution I don't see). You could iteratively add/subtract/swap users to improve the score and climb towards an optimum, or use a branch-and-bound algorithm that systematically explores all user combinations but stops exploring any user-subset branch with a null interest intersection (adding more users to that subset will still produce a null intersection). You might have a lot of user groups with null interest intersections, so this latter approach could be quite quick in practice by pruning off large parts of the search space, and if you ran it without a depth limit it would find the exact solution eventually.
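For small instances, the brute force version might look something like this (a sketch; function and variable names are mine, and it assumes the same user -> interest-set dict used further down):

from itertools import combinations

# Brute force: score every non-empty subset of users (only feasible for small n)
def brute_force_largest_cluster(user_interests):
  users = list(user_interests)
  best_cluster, best_interests, best_score = set(), set(), 0
  for size in range(1, len(users) + 1):
    for combo in combinations(users, size):
      shared = set.intersection(*(user_interests[u] for u in combo))
      score = len(combo) * len(shared)
      if score > best_score:
        best_cluster, best_interests, best_score = set(combo), shared, score
  return best_cluster, best_interests, best_score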

Branch-and-bound would work something like this:

def get_largest_cluster(data):
  # build a dict: user -> {set of that user's interests}
  user_interest_dict = {}
  for pt in data:
    user_interest_dict.setdefault(pt["user"], set()).add(pt["interest"])

  users = list(user_interest_dict)  # fixed list of users to iterate over
  best = {"cluster": set(), "interests": set(), "score": 0}

  # recursively generate and score user clusters
  def generate_cluster_scores(i, user_cluster, cluster_interests):
    cur_score = len(user_cluster) * len(cluster_interests)
    if cur_score > best["score"]:
      best["cluster"] = set(user_cluster)
      best["interests"] = set(cluster_interests)
      best["score"] = cur_score

    if i == len(users):
      return

    cur_user = users[i]
    cur_interests = user_interest_dict[cur_user]
    new_cluster_interests = cur_interests if not user_cluster else cluster_interests & cur_interests

    # generate the remaining subsets without and with cur_user
    generate_cluster_scores(i + 1, user_cluster, cluster_interests)
    if new_cluster_interests:  # bound the search here
      generate_cluster_scores(i + 1, user_cluster | {cur_user}, new_cluster_interests)

  generate_cluster_scores(0, set(), set())
  return best["cluster"], best["interests"], best["score"]
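As a quick sanity check, running this sketch on the question's sample data:

data = [{"user": u, "interest": i}
        for u, interests in {1: {3, 2}, 2: {3, 2, 4}, 3: {2, 3, 8},
                             4: {7}, 5: {7}, 6: {9}}.items()
        for i in interests]
print(get_largest_cluster(data))  # -> ({1, 2, 3}, {2, 3}, 6)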

You might be able to do more sophisticated bounding (for example, if you can calculate that the current cluster's score couldn't eclipse your current best score even if all the remaining users were added to the cluster and the interest intersection stayed the same, as sketched below), but checking for an empty interest intersection is simple enough. This works fine for 100 users and 50 interests, up to around 800 data points. You could also make it more efficient by iterating over whichever of |interests| and |users| is smaller (to generate fewer recursive calls/combinations), mirroring the logic for the case where interests are fewer. Also, you get more interesting clusters with fewer users/interests.
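A minimal sketch of that tighter bound, as a hypothetical upper_bound helper (not part of the original code) used inside generate_cluster_scores above:

# Hypothetical tighter bound: even if every remaining user joined the cluster
# and the interest intersection kept its current size, the branch could score
# at most (cluster size + remaining users) * interest count.
def upper_bound(i, user_cluster, cluster_interests, n_users):
  remaining = n_users - i
  return (len(user_cluster) + remaining) * len(cluster_interests)

# In generate_cluster_scores, before recursing with cur_user added:
#   if upper_bound(i, user_cluster, new_cluster_interests, len(users)) <= best["score"]:
#     skip this branch -- it cannot beat the current best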
