So, I'm working on a personal project that involves machine learning and I want to set up a training dataset and a test dataset such that the training dataset doesn't contaminate the test dataset.
I'm trying to predict winning teams in an online game.
I have 10,000 or so games. Each game has 10 players in it. Some players played in multiple games. I want to divide up the set of 10,000 into two distinct groups such that no player from group 1 played in a game in group 2.
What is an efficient algorithm to do this?
I could just try to brute force it. Come up with 2,000 of the 10,000 games and then check to see, for each game, that no player in the 2,000 games is in the 8,000 remaining games and, if the player is, iterate over the 10,000 games to find one that doesn't share a player with the remaining games.
But, I'm concerned that would take forever.
EDIT: The answer appears to be a breadth first search starting at a game and then getting all the games connected to it by shared players and doing that until the search ends, creating a cluster. Do that until no games are left and you have all the separate clusters.
CodePudding user response:
You can use Disjoint-set data structure to perform this task.
Algorithm:
ds // disjoint set object or data structure
for each round x:
for each player_a in round x:
for each player_b in round x:
ds.union(player_a, player_b)
Simply, perform a union on all the players who played in the same round for each round. This would create disjoint sets where each disjoint represents a group of players who have played strictly among themselves.
After applying the above algorithm,
- if you have exactly two disjoint sets then, you can say that players in set 1 did not play with any player in set 2.
- If you have only one set then, such partition is not possible.
- If more than two disjoint sets are present then, you can group some of the sets together until only 2 sets are remaining.
Considering 10,000 rounds and 10 players in each round, the above algorithm would roughly compute 106 * log2(106) operations which is approximately 2 * 107 operations. This would be easily computed within a second on any modern laptop/pc.