Definitions
A parent record has the type 'P', an ancestor key, a date interval.
Its child record has the type 'C', an identical ancestor key, and has a date interval that matches or falls within its parent's interval.
- All records are unique
- Parent records can share the same ancestor key, but their date intervals cannot overlap
- A parent record can have many child records
Example
Parent records can be:
- P, 12345, (1000-01-01, 1000-12-31)
- P, 12345, (1001-01-01, 1001-12-31) // No overlapping dates, valid
Valid children for the first parent record can be
- C, 12345, (1000-01-01, 1000-12-31) // Matches on everything, valid
- C, 12345, (1000-05-05, 1000-09-09) // Matches on ancestor key, is within parent's date interval, valid
Problem
Given a randomized set of records with both parents and children, how can I efficiently categorize the set into different groups of one unique parent and its valid children based on both key and time intervals?
It is guaranteed that for every child, there is one and only one parent. but it is possible that a parent does not have any children.
Brute force solution
Identify all of the parent records in linear time. Then loop through them and pairwise match all of the other records in squared time.
Question
Is there a faster approach?
CodePudding user response:
The easiest way is to make a list of P records and a list of C records, and then sort both of them by (ancestor_key, interval.start)
Then you walk through the parents list, and for each parent, extract it's children from the child list. Because of the sorting, the parent and child lists will be in corresponding order, so the position of interest in both lists will move only forward.
Total complexity is O(n log n), dominated by the sorting.