Is there an algorithm similar to the bloom filter, that allows you to:
- Compactly represent two (large) sets independently of each other and
- probabilistically check for disjointness between them using their lossy-compressed representations?
In the special case where one set only has a single same and you don't compress it, this problem reduces to probablistic set membership, for which one can consider the Bloom filter.
CodePudding user response:
OK, so here's a terrible answer just to get the discussion going:
You could compress both sets as Bloom filters. Choose a very large number of random samples from the larger set both sets are drawn from.
For each sample you pick, look it up in the bloom filters.
If both indicate "probable member", then you have found a "probable intersection" and the sets are "probably overlapping". Otherwise you declare that the sets are "probably not overlapping".
CodePudding user response:
The Apache datasketches library has interesting functionality:
https://datasketches.apache.org/docs/DistinctCountFeaturesMatrix.html
"Theta sketches" support intersection, and count operators.