Background: The following problem occurred to me when I was trying to come up with a hiring challenge for software engineers in my company. I quickly realized that it was probably too difficult since I couldn't come up with a good solution myself :-) Still, I would be interested to know if somebody else can think of an efficient algorithm.
Consider a worker who is deciding which of a selection of continuing education courses to take. Each course will teach them a skill, which will boost their salary by a certain known amount. The worker wants the maximum salary boost, but there are some constraints to consider: Each course takes a month to complete, and the worker can only take one course at a time. They only have K months available for their education. Furthermore, some courses have prerequisites and can only be taken if the worker has already completed all of the prerequisite courses. How can the worker find the curriculum that will give them the maximum raise?
More formally, consider a directed acyclic graph with N nodes, which have values v[0], ..., v[N-1] (all positive), and a list of M directed edges E = [(a[0],b[0]), ..., (a[M-1],b[M-1])]. For simplicity we can assume topological order (i.e. 0 <= a[i] < b[i] < N for i = 0, ..., M-1). What is the maximum sum of values of K nodes if we can only select a node if all of its ancestors in the DAG have been selected?
We can trivially solve this problem in O((M + K) * N^K) by looping over all size-K subsets and checking if all prerequisites are met. A simple Python implementation would be as follows:
import itertools

def maxSumNodes(v, E, K):
    # For each node, compute all of its direct ancestors in the DAG
    N = len(v)
    ancestors = [[] for i in range(N)]
    for a, b in E:
        ancestors[b].append(a)
    maxSumValues = 0
    for r in itertools.combinations(range(N), K):
        sumValues = sum(v[i] for i in r)
        nodesSelected = set(r)
        # A subset is feasible only if every selected node's ancestors are also selected
        if all(all(x in nodesSelected for x in ancestors[y]) for y in r):
            maxSumValues = max(maxSumValues, sumValues)
    return maxSumValues
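For reference, here is a small usage example (the course values and the single prerequisite edge below are made up for illustration):

# Course 2 (worth 20) requires course 0 (worth 1); course 1 (worth 5) has no
# prerequisites. With K = 2 months, the best curriculum is {0, 2}.
print(maxSumNodes([1, 5, 20], [(0, 2)], 2))  # prints 21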
However, this becomes prohibitively expensive if K is large (e.g. N = 1,000,000, K = 500,000). Is there a polynomial algorithm in N that works for any K? Alternatively, can it be proven that the problem is NP-hard?
CodePudding user response:
I found this algorithm, which only examines the k-sets whose requirements are satisfied:
class Node(val value: Int, val children: List<Node> = emptyList())

fun maximise(activeNodes: List<Node>, capacity: Int): Int {
    if (capacity == 0 || activeNodes.isEmpty()) return 0
    // Try each available course: take it, unlock its children, and recurse with one month less
    return activeNodes.maxOf { it.value + maximise(activeNodes - it + it.children, capacity - 1) }
}

val courses = listOf(Node(value = 1, children = listOf(Node(value = 20))), Node(value = 5))
val months = 2
maximise(courses, months)
(Building a DAG isn't an issue, so I'm just assuming my input is already in DAG form)
This algorithm will perform better than yours if there are lots of requirements. However, the worst case for this algorithm (no requirements) boils down to checking each possible k-set, meaning O(N^K).
The problem certainly looks NP-hard, but proving it is complicated. The closest I got to a proof is transforming it into a modified knapsack problem (which is NP-hard), as also mentioned by @user3386109 in the comments:
Start with a list of all paths in the DAG, with each path's combined value as its value and its length as its weight. Now start solving your knapsack problem, but whenever an item is picked,
- remove all subsets of that path
- modify all supersets of that path as follows: reduce their value by that path's value and their weight by that path's weight.
I think that makes this problem at least as hard as knapsack, but I can't prove it.
CodePudding user response:
This problem is NP-Complete. You can show that the problem of deciding, for a given vertex-weighted DAG and integers k and M, whether there exists a valid schedule with k tasks of total value at least M, is NP-Complete. The proof uses gadgets and a strategy borrowed from some of the first hardness proofs for scheduling of Garey and Johnson.
In fact, I'll prove the stronger statement that even restricting all weights to be 1 or 2 and the DAG to have a longest path and maximum in-degree of at most 2, the problem is NP-Hard.
First, the problem is in NP: Given a witness (schedule) of k tasks, we can test in linear time that the schedule is valid and has value at least M.
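For concreteness, a linear-time witness check could look like the following sketch in Python (the name is_valid_witness and the input format are my own, matching the (v, E) representation from the question, with the schedule given as an ordered list of task indices):

# Sketch: verify that `schedule` respects all prerequisites and reaches the target value.
# Runs in O(N + M) time, where M here denotes the number of DAG edges.
def is_valid_witness(v, E, schedule, target_value):
    prereqs = [[] for _ in range(len(v))]
    for a, b in E:
        prereqs[b].append(a)
    done = set()
    total = 0
    for task in schedule:
        if any(p not in done for p in prereqs[task]):
            return False  # a prerequisite has not been completed yet
        done.add(task)
        total += v[task]
    return total >= target_value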
You can give a reduction from CLIQUE: Suppose we're given an instance of CLIQUE, which is an undirected simple graph G with vertices V = {v1, v2, ..., vq} and edges E = {e1, e2, ..., er}, along with an integer p, and we're asked whether G has a clique of size p.
We transform this into the following DAG-scheduling instance:
- Our vertex set (tasks) is {V1, V2, ..., Vq} U {E1, E2, ..., Er}.
- All tasks Vi have value 1 for 1 <= i <= q.
- All tasks Ej have value 2 for 1 <= j <= r.
- There is an arc (Vi, Ej) in our DAG whenever vi is an endpoint of ej in G.
- k = p(p+1)/2
- M = p^2
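To make the construction concrete, here is a small Python sketch of the reduction (the function name clique_to_instance and the input format are my own; the graph is given as a vertex count and an undirected edge list, and the output uses the (v, E, K) representation from the question):

# Sketch: build the DAG-scheduling instance (v, dag_edges, k, M) from a CLIQUE
# instance with q vertices labelled 0..q-1, edge list `edges`, and clique size p.
def clique_to_instance(q, edges, p):
    v = [1] * q + [2] * len(edges)     # vertex tasks have value 1, edge tasks value 2
    dag_edges = []
    for j, (x, y) in enumerate(edges):
        ej = q + j                     # index of the edge task for edge (x, y)
        dag_edges.append((x, ej))      # Ej requires vertex task Vx ...
        dag_edges.append((y, ej))      # ... and vertex task Vy
    k = p * (p + 1) // 2
    M = p * p
    return v, dag_edges, k, M

Note that every edge task has in-degree exactly 2 and every directed path contains at most one arc, matching the restrictions claimed above.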
The precedence constraints/DAG edges are precisely the requirement that we cannot perform an edge task Ej until we've completed both vertex tasks corresponding to its endpoints. We have to show that there is a valid schedule of k tasks with value at least M in our DAG if and only if G has a clique of size p.
Suppose we have a valid schedule of k tasks with value at least M. Since M is p^2 and k = p(p+1)/2, one way of reaching this value is by completing at least p(p-1)/2 edge tasks, which have value 2, in addition to any p other tasks. In fact, we must complete at least this many edge tasks to reach value M:
Letting E_completed be the set of edge tasks in our schedule and V_completed be the set of vertex tasks in our schedule, we get that 2|E_completed| + |V_completed| >= p^2 and |E_completed| + |V_completed| = p(p+1)/2. Subtracting the second equation from the first gives |E_completed| >= p^2 - p(p+1)/2 = p(p-1)/2, and hence |V_completed| <= p, meaning we've completed at most p vertex tasks and thus at least p(p-1)/2 edge tasks.
Since G is a simple graph, for p(p-1)/2 edges to have both endpoints covered by vertices, we must use at least p vertices. So it must be that exactly p vertex tasks were completed, and thus that exactly p(p-1)/2 edge tasks were completed, which, by the precedence constraints, implies that those p corresponding vertices form a clique in G.
For proving the other direction, it's immediate that given a clique in G of size p, we can choose the p corresponding vertex tasks and the p(p-1)/2 corresponding edge tasks to get a schedule of k tasks of value M. This concludes the proof.
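As a quick sanity check, the reduction can be fed back into the brute force from the question (this snippet assumes the corrected maxSumNodes above and the clique_to_instance sketch from the construction): for the triangle K3 with p = 3 we get k = 6 and M = 9, and the optimum is exactly 9.

# Triangle graph K3: 3 vertices, 3 edges, and a clique of size p = 3.
v, dag_edges, k, M = clique_to_instance(3, [(0, 1), (0, 2), (1, 2)], 3)
print(k, M)                          # 6 9
print(maxSumNodes(v, dag_edges, k))  # 9, i.e. the schedule reaches value M = p^2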
Some closing notes on this fascinating problem and the winding journey of trying to solve it efficiently:
- CLIQUE is, in a sense, among the hardest of the NP-Complete problems, so a reduction from CLIQUE means that this problem inherits its hardness: the DAG scheduling problem is hard-to-approximate and fixed-parameter intractable, making it harder than Knapsack in that sense.
- Because the DAG scheduling only depends on reachability constraints, it seemed at first that an efficient solution could be possible. You can, in practice, take the transitive reduction of your graph, and then use a topological ordering where vertices are also ordered by their depth (longest path ending at a vertex).
- In practice, many interesting instances of this problem could be efficiently solvable (which is one reason I intuitively suspected the problem was not NP-Hard). If the DAG is deep and some vertices have many ancestors, rather than at most 2, we can do a depth-first search backwards through the topological order. A guess that one vertex is in our schedule forces all of its ancestors to also be in the schedule. This cuts the graph size considerably (possibly into many disconnected, easily solved components) and reduces k by the number of ancestors as well (see the sketch after this list).
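A minimal sketch of that ancestor-closure step, assuming the same (N, E, K) input format as the question (the function name forced_selection is my own):

# Sketch: if we guess that node `target` is in the schedule, every ancestor of
# `target` is forced into the schedule too. Returns the forced set, or None if
# it already exceeds the budget K.
def forced_selection(N, E, K, target):
    parents = [[] for _ in range(N)]
    for a, b in E:
        parents[b].append(a)
    forced, stack = {target}, [target]
    while stack:                       # DFS backwards along prerequisite edges
        node = stack.pop()
        for p in parents[node]:
            if p not in forced:
                forced.add(p)
                stack.append(p)
    return forced if len(forced) <= K else None

After committing to such a guess, the remaining budget shrinks to K - len(forced) and the search continues on the rest of the graph.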
Overall, a very interesting problem; I was unable to find the exact problem elsewhere, or even much information about the relevant structural properties of DAGs. However, CLIQUE can be used for many scheduling problems. The terminology and the use of vertex and edge tasks are adapted from Garey and Johnson's 1976 hardness proof (which I highly recommend reading) of a superficially different problem: scheduling an entire (unweighted) DAG where tasks have variable deadlines and unit processing time, and we only want to know whether we can make at most k tasks late for their deadlines.