Dependency Algorithm - find a minimum set of packages to install

Question

I'm working on an algorithm whose goal is to find a minimum set of packages that must be installed in order to install package "X".

I'll explain better with an example:

X depends on A and (E or C)
A depends on E and (H or Y)
E depends on B and (Z or Y)
C depends on (A or K)
H depends on nothing
Y depends on nothing
Z depends on nothing
K depends on nothing

The solution is to install: A E B Y.

Here is an image to describe the example:

Is there an algorithm to solve the problem without using a brute-force approach?

I've already read a lot about algorithms such as DFS, BFS, and Dijkstra. The problem is that these algorithms cannot handle the "OR" condition.

UPDATE

I don't want to use external libraries.

The algorithm doesn't have to handle circular dependencies.

UPDATE

One possible solution could be to calculate all possible paths from each vertex and, for each vertex on a possible path, do the same. So the possible paths for X would be (A E), (A C). For each element in those two paths we do the same: A = (E H), (E Y) / E = (B Z), (B Y), and so on. At the end we combine the possible paths of each vertex into sets and choose the set with the fewest elements.
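The enumeration described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the table `DEPS` and the helper names `install_sets` and `minimum_install_set` are made up for this sketch, the input must be acyclic, and the running time grows exponentially with the number of OR choices.

```python
from itertools import product

# Dependencies from the question: each package maps to a list of OR-groups,
# and every group must be satisfied by at least one of its members.
DEPS = {
    "X": [["A"], ["E", "C"]],
    "A": [["E"], ["H", "Y"]],
    "E": [["B"], ["Z", "Y"]],
    "C": [["A", "K"]],
    "B": [], "H": [], "Y": [], "Z": [], "K": [],
}

def install_sets(package, deps):
    """Yield every set of packages that satisfies `package` (acyclic input only)."""
    # Pick one alternative from every OR-group, in every possible way.
    for choice in product(*deps[package]):
        # For each chosen package, enumerate its own possible install sets.
        for combo in product(*(list(install_sets(p, deps)) for p in choice)):
            yield frozenset({package}).union(*combo)

def minimum_install_set(package, deps):
    """Return one install set with the fewest packages (brute force)."""
    return min(install_sets(package, deps), key=len)

best = minimum_install_set("X", DEPS)
print(sorted(best))
```

For the table in the question, the smallest set found is X together with A, E, B and Y. Because the sets are merged with a union, a package shared by several branches (Y here) is counted only once.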

What do you think?

Is a blue dependency required and *one* of the red dependencies required? — aioobe, May 28 '15 at 12:52
To get the smallest number of dependencies for a package, you need all the AND dependencies and as many OR dependencies as necessary. Here is another example: A = B, C, D|F|G, H|L. To get the fewest dependencies for A you need B, C, only one of [ D F G ] and one of [ H L ]. The way to choose among the ORs is to take the shortest path. — KLi, May 28 '15 at 12:57
Caution, your color code is insufficient, it doesn't tell you where to place the parenthesis. — Yves Daoust, May 28 '15 at 14:15
@KLi, I suggest to post your solution as an answer instead of placing it in the question. So it will be possible to comment that directly. Also I think, the picture doesn't help to understand your example at all; better to remove it. — dened, May 30 '15 at 08:20
You can try an approximation algorithm using a greedy (local) approach, with some local search if needed, by taking the dependency with the minimum number of sub-dependencies at each step. — Nikos M., May 30 '15 at 20:56
X depends on A and (E or C)... can it be (D or E or C)? What happens when A->B and B->A? What is your plan for avoiding deadlocks? The concept of a context-free grammar also has a set of rules and terminal symbols. You can check that out at https://www.youtube.com/watch?v=9XKUcm8au4U — blueray, May 30 '15 at 22:54
@Ahmmad Ismail as I've written I don't have to handle cycles. — KLi, May 31 '15 at 00:48

dened · Answer 1 · 2015-05-30T13:33:46.260

8

Unfortunately, there is little hope of finding an algorithm that is much better than brute force, because the problem is NP-hard (and, as an optimization problem, not NP-complete).

A proof of NP-hardness is that the minimum vertex cover problem (well known to be NP-hard and not NP-complete) is easily reducible to this problem:

Given a graph, create a package P_v for each vertex v of the graph. Also create a package X that "and"-requires (P_u or P_v) for each edge (u, v) of the graph. Find a minimum set of packages that must be installed in order to satisfy X. Then the vertices v whose packages P_v are in the installation set form a minimum vertex cover of the graph.
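To make the reduction concrete, here is a small Python sketch. The function names `vertex_cover_to_packages` and `smallest_install_set` and the example graph are assumptions made for illustration; the solver is deliberately a plain brute force, since the point of the reduction is that nothing fundamentally faster is expected.

```python
from itertools import combinations

def vertex_cover_to_packages(edges):
    """Build a dependency table in which X needs (P_u or P_v) for every edge."""
    deps = {"X": [[f"P_{u}", f"P_{v}"] for u, v in edges]}
    for u, v in edges:
        deps.setdefault(f"P_{u}", [])   # vertex packages have no dependencies
        deps.setdefault(f"P_{v}", [])
    return deps

def smallest_install_set(deps, root="X"):
    """Brute force: try candidate sets in order of increasing size."""
    others = sorted(p for p in deps if p != root)
    for size in range(len(others) + 1):
        for extra in combinations(others, size):
            chosen = {root, *extra}
            # Valid if every OR-group of every chosen package is satisfied.
            if all(any(alt in chosen for alt in group)
                   for pkg in chosen for group in deps[pkg]):
                return chosen

# A path graph 1 - 2 - 3 - 4 (hypothetical example input).
edges = [(1, 2), (2, 3), (3, 4)]
deps = vertex_cover_to_packages(edges)
cover = smallest_install_set(deps) - {"X"}
print(sorted(cover))
```

Every edge of the example graph has at least one endpoint whose package is in `cover`, and no single vertex covers all three edges, so the two packages returned correspond to a minimum vertex cover.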

edited May 30 '15 at 13:33

answered May 30 '15 at 07:14


This proof holds up even when circular dependencies are not allowed, and even when no more than two alternatives in OR-clause are allowed. – dened May 30 '15 at 10:47
Saying that minimum set (or vertex) cover is NP-hard but not NP-complete is correct, but likely to confuse people unnecessarily. All such discrete optimisation problems ("Find the minimum ...") can be trivially changed into equivalent decision problems ("Does there exist a ... with size <= k?") that *are* NP-complete. – j_random_hacker May 31 '15 at 18:40
@j_random_hacker, a problem that is not NP-complete cannot be changed into an equivalent NP-complete problem, because if there is an equivalent NP-complete problem for some problem, then that problem itself must be NP-complete... And explicitly stating that a problem is both NP-hard and not NP-complete is probably justified, as it tells us that a solution to the problem is not only hard to find, but also hard to verify. – dened May 31 '15 at 21:23
By "equivalent", I meant in the sense that an algorithm for one is poly-time-convertible into an algorithm for the other. E.g., to turn a minimisation problem into polynomially many decision problems, simply solve the decision problem for k=0, 1, 2, ... until a YES answer is returned. Then, to recover the solution, repeatedly delete 1 arbitrary item (i.e., 1 vertex for vertex cover) from the set of possibilities and solve the decision problem for k'=k-1: include the vertex in the solution iff the answer is NO. – j_random_hacker May 31 '15 at 22:02
@j_random_hacker, I'm aware that any discrete optimization problem can be solved by solving a series of decision problems. This might be useful in problem analysis, but it makes neither calculating nor verifying a solution easier, i.e. it's still impossible to change any non-NP-complete NP-hard problem into an NP-complete one. So your comment still doesn't make any sense to me... There is another reason to indicate that the problem is both NP-hard and not NP-complete: any NP-complete problem can be solved in polynomial time if suddenly P=NP, but this is not the case for NP-hard problems in general. – dened Jun 01 '15 at 06:53
The only difference I see between an NPC problem and an NPH problem that is not NPC but can be converted to a polynomial number of instances of an NPC problem, is that solutions to the latter type are *never* accompanied by a short certificate (because they depend on finding at least 1 NO-solution to an NPC problem). But is this "certificatelessness" such a big deal? Whenever an NPC problem instance happens to have the solution NO, there is no certificate for that either -- IOW, roughly "half" of the answers to NPC problems have no certificate! – j_random_hacker Jun 01 '15 at 11:02
@j_random_hacker, it can be a big deal. For example, it might be important to know that there is no efficient algorithm to check that a vertex cover you found is actually a minimum or not. – dened Jun 01 '15 at 13:12
Fair enough. There is a distinction, and your last point gives a consequence of practical importance, since it would be nice to have a way to quickly test heuristic solutions for optimality. If you edit to mention the poly-time-equivalence property I mentioned earlier, I'll certainly +1. – j_random_hacker Jun 01 '15 at 13:52
@j_random_hacker, but can you first tell me why you think this property is still important to mention? – dened Jun 01 '15 at 16:13
I think a *high quality* answer deserving of an upvote would clarify this, because in most working programmers' understanding, there is no (or no important) distinction between these two categories (if they know them at all). Even many CS papers talk loosely about "NP-complete" optimisation problems, with the understanding that they can be interconverted with NPC decision problems. You could certainly say that it doesn't need to be mentioned because it's not directly relevant to the OP's question, but IMHO that only means that not mentioning does not deserve a downvote. – j_random_hacker Jun 02 '15 at 10:35

score 2 · Answer 2 · answered May 28 '15 at 10:01

I didn't get the problem with "or" (the image is not loading for me). Here is my reasoning. Say we take a standard shortest-route algorithm such as Dijkstra's and use weights to find the best path. Taking your example, select the best Xr from the two options below:

Xr = X + Ar + Er
Xr = X + Ar + Cr

where Ar is the best option from the tree A = H (and its subsequent children) or A = Y (and its subsequent children).

The idea is to first assign a standard weight to each "or" option (the "and" options are not a problem), and then to repeat the process for each "or" option with its child nodes until no "or" options remain.

However, we first need to define what "best choice" means. Assume the criterion is the least number of dependencies, i.e. the shortest path. By the above logic we assign a weight of 1 to X. From there:

X = 1
X = A and (E or C), hence X = A1 + E1 or X = A1 + C1
A = H or Y; assuming H and Y are leaf nodes, A gets a final weight of 1
hence X = 1 + E1 or X = 1 + C1

Now for E and C:
E1 = B1 + Z1 or B1 + Y1. C1 = A1 or C1 = K1.
Assuming B1, Z1, Y1, A1 and K1 are leaf nodes:

E1 = 1 + 1 or 1 + 1. C1 = 1 or C1 = 1
i.e. E = 2 and C = 1

Hence
X = 1 + 2 or X = 1 + 1, so choose X => C as the best route.

Hope this clears it up. We also need to take care of cyclic dependencies such as X => Y => Z => X; here we may assign such nodes a weight of zero at the parent or leaf node level and take care of the dependency.
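One possible reading of this weighting idea, written as a recursive Python function, is sketched below. It is an interpretation rather than the exact procedure above: the table `DEPS` and the function `weight` are assumptions for illustration, every package costs 1, and each OR-group contributes the weight of its cheapest alternative.

```python
from functools import lru_cache

# Dependencies from the question: a list of OR-groups per package.
DEPS = {
    "X": [["A"], ["E", "C"]],
    "A": [["E"], ["H", "Y"]],
    "E": [["B"], ["Z", "Y"]],
    "C": [["A", "K"]],
    "B": [], "H": [], "Y": [], "Z": [], "K": [],
}

@lru_cache(maxsize=None)
def weight(package):
    """1 for the package itself plus, per OR-group, the cheapest alternative."""
    return 1 + sum(min(weight(alt) for alt in group) for group in DEPS[package])

print(weight("E"), weight("C"), weight("X"))
```

With these weights C (weight 2) looks cheaper than E (weight 3), so the per-branch rule prefers C. The weight of X comes out as 8, while the set X, A, E, B, Y from the question has only 5 packages: weights computed per branch count a shared package such as E or Y once for every branch that needs it.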

mrsachindixit


Sure, do update us on your findings. Also, the case where two or more options with the same weight are found needs to be covered. An additional criterion can be applied to choose in such a case. – mrsachindixit May 28 '15 at 11:32
Well, what if X depends on A and (B or C), and B depends on D and (F or G)? How can the algorithm handle the fact that, in order to take B as a dependency of A, you MUST also have D, since it is necessary for B? – KLi May 28 '15 at 14:35
That's why I included the Ar notation: Ar is the best option from the tree A = H (and its subsequent children) or A = Y (and its subsequent children). In your new case, X = A and (Br or Cr). Our idea is to assign a default weight of 1 to a root until its complete child tree has been resolved and calculated. After that, the root node gets the new weight instead of the original default weight. In the original question, E has the same behavior: X depends on A and (E or C), A depends on E and (H or Y), ... – mrsachindixit May 30 '15 at 11:02
So our algorithm is essentially "for each optional node, i.e. each "or" option, assign a default weight and then revise it once the actual weight of the optional node has been calculated". So there will be a fair amount of back and forth, i.e. looping/recursion, in the assign-and-revise cycle (or assign-flag and revise-de-flag, to be precise). Once we have calculated the weights and found our shortest path, we can use that path for any real-world purpose (say, for setting a build path or the download sequence of JARs). – mrsachindixit May 30 '15 at 11:02

Xceptional · Answer 3 · 2015-05-29T16:09:13.777

I actually think graphs are the appropriate structure for this problem. Note that A and (E or C) <==> (A and E) or (A and C). Thus, we can represent X = A and (E or C) with the following set of directed edges:

A <- K1
E <- K1
A <- K2
C <- K2
K1 <- X
K2 <- X

Essentially, we're just decomposing the logic of the statement and using "dummy" nodes to represent the ANDs.
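The decomposition can be sketched in Python as follows. The helper names `and_alternatives` and `dummy_edges` are hypothetical, and edges are written as (source, target) pairs, so the line "A <- K1" above corresponds to ("K1", "A").

```python
from itertools import product

def and_alternatives(groups):
    """Expand a conjunction of OR-groups into a list of pure-AND alternatives."""
    # dict.fromkeys removes duplicates inside one alternative, keeping order.
    return [tuple(dict.fromkeys(choice)) for choice in product(*groups)]

def dummy_edges(package, groups):
    """Return (source, target) edges using one dummy node K_i per alternative."""
    edges = []
    for i, alternative in enumerate(and_alternatives(groups), start=1):
        dummy = f"K{i}"
        edges.append((package, dummy))                     # package -> dummy AND node
        edges.extend((dummy, dep) for dep in alternative)  # dummy -> each required dep
    return edges

# X = A and (E or C)  <==>  (A and E) or (A and C)
edges = dummy_edges("X", [["A"], ["E", "C"]])
print(edges)
```

Note that the number of dummy nodes is the product of the group sizes, so a package with many OR-groups produces many alternatives.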

Suppose we decompose all the logical statements in this fashion (dummy Ki nodes for ANDS and directed edges otherwise). Then, we can represent the input as a DAG and recursively traverse the DAG. I think the following recursive algorithm could solve the problem:

Definitions:
Node u - Current Node.
S - The visited set of nodes.
children(x) - Returns the out neighbors of x.

Algorithm:

shortestPath u S =
if (u has no children) {
    add u to S
    return 1
} else if (u is a dummy node) {
    // AND node: both children are required
    (a, b) = children(u)
    if (a and b are in S) {
        return 0
    } else if (b is in S) {
        x = shortestPath a S
        add a to S
        return x
    } else if (a is in S) {
        y = shortestPath b S
        add b to S
        return y
    } else {
        x = shortestPath a S
        add a to S
        // visiting a may already have added b to S
        if (b is in S) return x
        else {
            y = shortestPath b S
            add b to S
            return x + y
        }
    }
} else {
    // OR node: keep the cheapest child
    min = Int.Max
    min_node = none
    for (x in children(u)) {
        if (x is not in S) {
            S_1 = S
            k = shortestPath x S_1
            if (k < min) { min = k; min_node = x }
        } else {
            min = 1
            min_node = x
        }
    }
    return 1 + min
}

Analysis: This is an entirely sequential algorithm that (I think) traverses each edge at most once.

score 2 · Answer 4 · answered Jun 10 '15 at 20:44

A lot of the answers here focus on how this is a theoretically hard problem due to its NP-hard status. While this means you will experience asymptotically poor performance exactly solving the problem (given current solution techniques), you may still be able to solve it quickly (enough) for your particular problem data. For instance, we are able to exactly solve enormous traveling salesman problem instances despite the fact that the problem is theoretically challenging.

In your case, one way to solve the problem is to formulate it as a mixed integer linear program with a binary variable x_i for each package i. A requirement such as "A requires (B or C or D) and (E or F) and (G)" becomes constraints of the form x_A <= x_B + x_C + x_D ; x_A <= x_E + x_F ; x_A <= x_G, and you can require that a package P be included in the final solution with x_P = 1. Solving such a model exactly is relatively straightforward; for instance, you can use the PuLP package in Python:

import pulp

# Each package maps to a list of OR-groups; every group needs at least one member.
# Note the trailing comma: ("A",) is a one-element tuple, ("A") is just a string.
deps = {"X": [("A",), ("E", "C")],
        "A": [("E",), ("H", "Y")],
        "E": [("B",), ("Z", "Y")],
        "C": [("A", "K")],
        "H": [],
        "B": [],
        "Y": [],
        "Z": [],
        "K": []}
required = ["X"]

# Variables: x[k] is 1 if package k is installed, 0 otherwise
x = pulp.LpVariable.dicts("x", deps.keys(), cat=pulp.LpBinary)

mod = pulp.LpProblem("Package_Optimization", pulp.LpMinimize)

# Objective: minimize the number of installed packages
mod += pulp.lpSum(x[k] for k in deps)

# Dependencies: if k is installed, at least one member of each group must be too
for k in deps:
    for dep in deps[k]:
        mod += x[k] <= pulp.lpSum(x[d] for d in dep)

# Include required packages
for r in required:
    mod += x[r] == 1

# Solve
mod.solve()
for k in deps:
    print("Package", k, "used:", x[k].value())

This outputs the minimal set of packages:

Package X used: 1.0
Package A used: 1.0
Package E used: 1.0
Package C used: 0.0
Package H used: 0.0
Package B used: 1.0
Package Y used: 1.0
Package Z used: 0.0
Package K used: 0.0

For very large problem instances, this might take too long to solve. You could either accept a potentially sub-optimal solution using a timeout (see here), or move from the default open-source solvers to a commercial solver such as Gurobi or CPLEX, which will likely be much faster.

Valentas · Answer 5 · 2015-05-31T07:22:19.187

1

To add to Misandrist's answer: your problem is ~~NP-complete~~ NP-hard (see dened's answer).

Edit: Here is a direct reduction of a Set Cover instance (U, S) to an instance of your "package problem": make each point z of the ground set U an AND requirement of X. Make each set in S that covers a point z an OR requirement of z. Then the solution of the package problem gives a minimum set cover.

Equivalently, you can ask which satisfying assignment of a monotone boolean circuit has fewest true variables, see these lecture notes.

edited May 31 '15 at 07:22

answered May 28 '15 at 11:13


Reducing a problem to Set cover doesn't show that it's NP-complete. Instead you need to reduce an NP-complete problem to this problem. For example, I can reduce 2-SAT to 3-SAT, but 2-SAT is easy and 3-SAT is NP-complete. – Paul Hankin May 28 '15 at 12:02
@Valentas, are you sure about the reduction? In this case the graph most likely forms a DAG. – aioobe May 28 '15 at 12:59
The OP did say "it doesn't have to handle cycles" – alexis May 28 '15 at 18:20
The constructions I described are directed acyclic graphs, just as in the picture in the question. – Valentas May 28 '15 at 18:41
The reduction is correct, but this problem is not NP-complete, because the minimum set cover problem itself is not NP-complete. Also, the reduction from minimum vertex cover gives a more general proof, as it requires only two OR-alternatives; see my [answer](http://stackoverflow.com/a/30543429/2266855). – dened May 31 '15 at 05:01

Paul · Answer 6 · 2015-06-01T12:14:25.133

Since the graph consists of two different types of edges (AND and OR relationship), we can split the algorithm up into two parts: search all nodes that are required successors of a node and search all nodes from which we have to select one single node (OR).

Nodes hold a package, a list of nodes that must be successors of this node (AND), a list of lists of nodes that can be successors of this node (OR), and a flag that marks the step of the algorithm at which the node was visited.

define node: package p , list required , listlist optional , 
             int visited[default=MAX_VALUE]

The main-routine translates the input into a graph and starts traversal at the starting node.

define searchMinimumP:
    input: package start , string[] constraints
    output: list

    //generate a graph from the given constraint
    //and save the node holding start as starting point
    node r = getNode(generateGraph(constraints) , start)

    //list all required nodes
    return requiredNodes(r , 0)

requiredNodes searches for all nodes that are required successors of a node (that are connected to n via AND-relation over 1 or multiple edges).

define requiredNodes:
    input: node n , int step
    output: list

    //generate a list of all nodes that MUST be part of the solution
    list rNodes
    list todo

    add(todo , n)

    while NOT isEmpty(todo)
        node next = remove(0 , todo)
        if NOT contains(rNodes , next) AND next.visited > step
            add(rNodes , next)
            next.visited = step
            //follow the AND-edges of this node as well
            addAll(todo , next.required)

    addAll(rNodes , optionalMin(rNodes , step + 1))

    for node r in rNodes
        r.visited = step

    return rNodes

optionalMin searches for the shortest solution among all possible solutions for optional neighbours (OR). This algorithm is brute-force (all possible selections for neighbours will be inspected).

define optionalMin:
    input: list nodes , int step
    output: list

    //find all possible combinations for selectable packages
    listlist optSeq
    for node n in nodes
        if NOT n.visited < step
            for list opt in n.optional
                add(optSeq , opt)

    //iterate over all possible combinations of selectable packages
    //for the given list of nodes and find the shortest solution
    list shortest
    int curLen = MAX_VALUE

    //search through all possible solutions (combinations of nodes)
    for list seq in sequences(optSeq)
        list subseq

        for node n in distinct(seq)
            addAll(subseq , requiredNodes(n , step + 1))

        if length(subseq) < curLen
            //mark all nodes of the old solution as unvisited
            for node n in shortest
                n.visited = MAX_VALUE

            curLen = length(subseq)
            shortest = subseq
        else
            //mark all nodes in this possible solution as unvisited
            //since they aren't used in the final solution (not at this place)
            for node n in subseq
                n.visited = MAX_VALUE

    for node n in shortest
        n.visited = step

    return shortest

The basic idea would be the following: Start from the starting node and search for all nodes that must be part of the solution (nodes that can be reached from the starting node by only traversing AND-relationships). Now for all of these nodes, the algorithm searches for the combination of optional nodes (OR) with the fewest nodes required.

NOTE: so far this algorithm isn't much better than brute force. I'll update as soon as I've found a better approach.

But what happens if one node is shared as a dependency by two other nodes? Take my example: A has two OR dependencies, Y and H. The shortest path is A E B Y and not A E B H, because A and E share the same dependency. In fact, if you choose H instead of Y, your shortest path is A E B H Y. How can you handle this condition? Hope you understand me. — KLi, May 28 '15 at 13:13
you're right, my solution only works for trees. I'll fix that — Paul, May 28 '15 at 13:37

saka1029 · Answer 7 · 2015-06-01T22:04:17.713

My code is here.

Scenario:

Represent the constraints.

X : A&(E|C)
A : E&(Y|N)
E : B&(Z|Y)
C : A|K

Prepare two variables target and result. Add the node X to target.

target = X, result=[]

Add single node X to the result. Replace node X with its dependencies in the target.

target = A&(E|C), result=[X]

Add single node A to the result. Replace node A with its dependencies in the target.

target = E&(Y|N)&(E|C), result=[X, A]

Single node E must be true, so (E|C) is always true. Remove it from the target.

target = E&(Y|N), result=[X, A]

Add single node E to the result. Replace node E with its dependencies in the target.

target = B&(Z|Y)&(Y|N), result=[X, A, E]

Add single node B to the result. Replace node B with its dependencies in the target.

target = (Z|Y)&(Y|N), result=[X, A, E, B]

There are no single nodes any more, so expand the target expression.

target = Z&Y|Z&N|Y&Y|Y&N, result=[X, A, E, B]

Replace Y&Y with Y.

target = Z&Y|Z&N|Y|Y&N, result=[X, A, E, B]

Choose the term that has the smallest number of nodes. Add all nodes in the term to the result.

target = , result=[X, A, E, B, Y]
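The walkthrough above can be sketched in Python as follows. This is not the linked code; the function `solve` and the clause representation (a list of sets, each set being one OR-clause) are assumptions for illustration. Unlike the last step of the scenario, the sketch completes every expanded term recursively before comparing sizes, so that dependencies of the chosen nodes are also counted.

```python
from itertools import product

# Constraints from the scenario; packages not listed have no dependencies.
DEPS = {
    "X": [{"A"}, {"E", "C"}],
    "A": [{"E"}, {"Y", "N"}],
    "E": [{"B"}, {"Z", "Y"}],
    "C": [{"A", "K"}],
}

def solve(target, result=frozenset()):
    """target: OR-clauses (sets) that must all hold; result: nodes chosen so far."""
    target = [set(c) for c in target]
    while True:
        # Clauses containing an already chosen node are satisfied: drop them.
        target = [c for c in target if not (c & result)]
        single = next((c for c in target if len(c) == 1), None)
        if single is None:
            break
        (node,) = single
        result = result | {node}               # add the single node to the result
        target.extend(DEPS.get(node, []))      # replace it with its dependencies
    if not target:
        return result
    # No single nodes left: expand the clauses into terms, complete each term,
    # and keep the smallest resulting set.
    best = None
    for term in product(*target):
        candidate = solve([{n} for n in set(term)], result)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best

print(sorted(solve([{"X"}])))
```

For the scenario's constraints this returns X, A, E, B and Y. The expansion step enumerates one term per combination of alternatives, which is where the exponential cost comes from.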

This algorithm appears to be correct, but it is highly inefficient. The bottleneck is the expansion step. Even when an input is not so large (e.g. 100 different `(A|B)`s), the program will not only run forever (and this is [expected](http://stackoverflow.com/a/30543429/2266855)), but also consume [yottabytes](http://en.wikipedia.org/wiki/Yottabyte) of memory for the expanded `target`. — dened, Jun 03 '15 at 07:52
@dened Yes, you are right. The problem is to find the term that has smallest number of nodes. I think there are shortcut algorithms. At least, there are algorithms finding the better (may not the best) solution in polynomial time. It is only my intuition. — saka1029, Jun 03 '15 at 11:37

Gentian Kasa · Answer 8 · 2015-06-03T15:01:51.163

I would suggest first transforming the graph into an AND-OR tree. Once that is done, you can search the tree for the best path (where you choose what "best" means: shortest, lowest memory footprint of the packages in the nodes, etc.).

A suggestion I'd make, given that the condition to install X would be something like install(X) = install(A) and (install(E) or install(C)), is to group the OR nodes (in this case E and C) into a single node, say EC, and transform the condition into install(X) = install(A) and install(EC).

Alternatively, based on the AND-OR tree idea, you could create a custom AND-OR graph using the grouping idea. That way you could use an adaptation of a graph traversal algorithm, which could be more useful in certain scenarios.

Yet another solution could be to use Forward Chaining. You'd have to follow these steps:

Transform (just re-writing the conditions here):

A and (E or C) => X

E and (H or Y) => A

B and (Z or Y) => E

into

(A and E) or (A and C) => X
(E and H) or (E and Y) => A
(B and Z) or (B and Y) => E

Set X as goal.
Insert B, H, K, Y, Z as facts.
Run Forward chaining and stop on the first occurrence of X (the goal). That should be the shortest way to achieve the goal in this case (just remember to keep track of the facts that have been used).
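A minimal forward-chaining sketch in Python, using only the rules and facts listed above, might look like this. The names `RULES`, `FACTS` and `forward_chain` are assumptions for illustration. The first derivation found depends on the order in which rules and alternatives are tried, so the returned set is a valid installation but is not guaranteed to be the smallest one.

```python
# Rules in "body => head" form; each head lists its alternative AND-bodies.
RULES = {
    "X": [("A", "E"), ("A", "C")],
    "A": [("E", "H"), ("E", "Y")],
    "E": [("B", "Z"), ("B", "Y")],
}
FACTS = ["B", "H", "K", "Y", "Z"]

def forward_chain(goal, rules, facts):
    """Fire rules until `goal` is derived; return the facts/packages it used."""
    used = {f: {f} for f in facts}          # support set behind each known item
    changed = True
    while changed and goal not in used:
        changed = False
        for head, bodies in rules.items():
            if head in used:
                continue
            for body in bodies:
                # A rule fires when every item of one of its bodies is known.
                if all(b in used for b in body):
                    used[head] = {head}.union(*(used[b] for b in body))
                    changed = True
                    break
    return used.get(goal)   # None if the goal cannot be derived

print(sorted(forward_chain("X", RULES, FACTS)))
```

Keeping a support set per derived item is one way to "keep track of the facts that have been used"; comparing the support sets of all alternatives, rather than stopping at the first, would be needed to find a minimum.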

Let me know if anything is unclear.

score 0 · Answer 9 · answered May 25 '15 at 00:33

This is an example of a Constraint Satisfaction Problem. There are Constraint Solvers for many languages, even some that can run on generic 3SAT engines, and thus be run on GPGPU.

score 0 · Answer 10 · edited Jun 20 '20 at 09:12

Another (fun) way to solve this problem is to use a genetic algorithm.

A genetic algorithm is powerful, but it has a lot of parameters and you have to find good values for them.

The genetic steps are the following:

a. Creation: create a number of random individuals, the first generation (for instance: 100).

b. Mutation: mutate a low percentage of them (for instance: 0.5%).

c. Rate: rate all the individuals (the rating is also called fitness).

d. Reproduction: select pairs of individuals (using their ratings) and create children (for instance: 2 children).

e. Selection: select parents and children to create a new generation (for instance: keep 100 individuals per generation).

f. Loop: go back to step "b" and repeat the whole process a number of times (for instance: 400 generations).

g. Pick: select an individual of the last generation with the maximum rating. That individual is your solution.

Here is what you have to decide:

Find a genetic code for your individuals

You have to represent a possible solution of your problem (called an individual) as a genetic code.

In your case, it could be a group of letters representing the nodes, which respects the OR and AND constraints.

For instance:

[ A E B Y ], [ A C K H ], [A E Z B Y] ...

Find a way to rate individuals

To know whether an individual is a good solution, you have to rate it so that you can compare it with other individuals.

In your case, this could be pretty easy: individual rating = total number of nodes - number of nodes in the individual

For instance:

[ A E B Y ] = 8 - 4 = 4

[ A E Z B Y] = 8 - 5 = 3

[ A E B Y ] has a better rating than [ A E Z B Y ].

Selection

Thanks to the individuals' ratings, we can select pairs of them for reproduction.

For instance, by using roulette wheel selection.

Reproduction

Take a pair of individuals and create some children (for instance 2) from them.

For instance:

Take a node from the first one and swap it with a node of the second one.

Make some adjustments to fit the "or, and" constraints.

[ A E B Y ], [ A C K H ] => [ A C E H B Y ], [ A E C K B Y]

Note: this is not a good way to reproduce, because the children are worse than the parents. Maybe we can swap a range of nodes.

Mutation

You just have to change the genetic code of the selected individual.

For instance:

Delete a node
Make some adjustments to fit the "or, and" constraints.

As you can see, it is not hard to implement, but a lot of choices have to be made to design it for a specific problem and to tune the different parameters (mutation percentage, rating system, reproduction system, number of individuals, number of generations, ...).
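As a rough illustration of these steps, here is a small genetic algorithm in Python. It is a simplified sketch with assumptions that differ from the description above: the genetic code stores one chosen alternative per OR-group (so every individual satisfies the constraints and no adjustment step is needed), mutation is applied to children, and all names and parameter values (`evolve`, `rate`, `packages`, population size 20, 40 generations) are made up for this example. Like any genetic algorithm, it is not guaranteed to return a minimum.

```python
import random

DEPS = {
    "X": [["A"], ["E", "C"]],
    "A": [["E"], ["H", "Y"]],
    "E": [["B"], ["Z", "Y"]],
    "C": [["A", "K"]],
    "B": [], "H": [], "Y": [], "Z": [], "K": [],
}
# One gene per OR-group: the index of the alternative that gets installed.
GENES = [(pkg, i) for pkg, groups in DEPS.items() for i in range(len(groups))]

def random_individual(rng):
    return {g: rng.randrange(len(DEPS[g[0]][g[1]])) for g in GENES}

def packages(individual, root="X"):
    """Decode a genetic code into the set of packages it installs."""
    installed, todo = set(), [root]
    while todo:
        pkg = todo.pop()
        if pkg not in installed:
            installed.add(pkg)
            todo += [group[individual[(pkg, i)]] for i, group in enumerate(DEPS[pkg])]
    return installed

def rate(individual):
    # Rating: total number of nodes minus number of installed nodes.
    return len(DEPS) - len(packages(individual))

def evolve(generations=40, size=20, mutation=0.05, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(size)]
    for _ in range(generations):
        # Roulette wheel: better-rated individuals are picked more often.
        weights = [rate(p) + 1 for p in population]
        children = []
        for _ in range(size):
            mum, dad = rng.choices(population, weights=weights, k=2)
            child = {g: rng.choice((mum[g], dad[g])) for g in GENES}  # reproduction
            if rng.random() < mutation:                                # mutation
                g = rng.choice(GENES)
                child[g] = rng.randrange(len(DEPS[g[0]][g[1]]))
            children.append(child)
        # Selection: keep the best-rated individuals among parents and children.
        population = sorted(population + children, key=rate, reverse=True)[:size]
    return packages(population[0])

print(sorted(evolve()))
```

Because the best individuals always survive selection, the final answer is never worse than the best individual of the first generation.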
