-1

I have the following mathematical and programming problem: I have a list of about 14000 items. I must choose 4 items a, b, c, d so that the error |a + b + c + d - pi| is minimal.

Going through all the options will take far too long. I am supposed to build a program (I'm doing it with a Python script) that will solve this problem.

Any ideas?

Edit: If it helps, the items are e^(k/10000) - 1 for every k between 0 and about 14400 (that will give pi)
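
For reference, a minimal sketch of how such a list might be built in Python. The inclusive upper bound `K_MAX = 14400` is an assumption, since the range is only given as "about 14400"; `items` and `K_MAX` are illustrative names.

```python
import math

# Assumption: k runs over the integers 0..14400 inclusive
# (the question only says "about 14400").
K_MAX = 14400

# items[k] = e^(k/10000) - 1
items = [math.exp(k / 10000) - 1 for k in range(K_MAX + 1)]
```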

Martijn Pieters
Gadol21

3 Answers

4

This is a variation of the Subset Sum Problem with a fixed subset size (you are facing the optimization version).
The decision version (is there a subset that sums exactly to pi?) is discussed thoroughly in the thread: Sum-subset with a fixed subset size.

In your problem (the optimization problem), if you can repeat an element more than once, it is easily solvable in O(n^2 log n) time with O(n^2) additional space as follows:

  1. Create an array of size O(n^2) containing all possible sums of pairs. Let it be arr.
  2. Sort arr - O(n^2 log n).
  3. For each element e of arr, binary search in arr to find the element closest to pi - e.
  4. Yield the two pairs that got the best result in step 3.

The complexity of step 3 is O(log n) per iteration, with n^2 iterations, so O(n^2 log n) in total.
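
The four steps can be sketched in Python roughly as follows. This is only an illustration under the stated assumption that elements may be repeated; the function name `closest_four_sum` and the returned `(error, indices)` shape are hypothetical choices, not part of the original description.

```python
from bisect import bisect_left
from itertools import combinations_with_replacement


def closest_four_sum(items, target):
    """Return (error, indices) of four items (repeats allowed) summing closest to target."""
    n = len(items)
    # Step 1: all pair sums, remembering which indices produced each sum.
    pairs = [(items[i] + items[j], i, j)
             for i, j in combinations_with_replacement(range(n), 2)]
    # Step 2: sort by pair sum.
    pairs.sort()
    sums = [p[0] for p in pairs]

    best = (float("inf"), None)
    # Step 3: for each pair sum s, binary search for the pair sum closest to target - s.
    for s, i, j in pairs:
        pos = bisect_left(sums, target - s)
        # The closest value is either just before or at the insertion point.
        for k in (pos - 1, pos):
            if 0 <= k < len(sums):
                err = abs(s + sums[k] - target)
                if err < best[0]:
                    best = (err, (i, j, pairs[k][1], pairs[k][2]))
    # Step 4: the two pairs that gave the best result.
    return best
```

For example, `closest_four_sum([1.0, 2.0, 5.0], 9.0)` finds an exact match using 1 + 1 + 2 + 5. Note that for n around 14000 the pair array has on the order of 10^8 entries, so plain Python lists of tuples are likely too heavy; the sketch is meant to show the steps on small inputs.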

Community
amit
  • Would you please prove that this is a variation of the subset sum problem? Or at least give a reference to such a proof? – mok Mar 09 '14 at 14:20
  • @mok I think what amit meant is that it is an instance of subset sum – Niklas B. Mar 09 '14 at 14:38
  • @NiklasB: There is a tricky difference, the question says that f(a)+f(b)+... - pi should be minimized, so obviously there are some similarities but this small difference may cause a huge difference, and if there isn't such a difference it should be proved. – mok Mar 09 '14 at 14:41
  • @mok the optimization problem subset sum asks you to minimize the difference between the sum of a subset and a given value. It's not precisely this but it's close enough. It's definitely Knapsack, but that one is a lot more general – Niklas B. Mar 09 '14 at 14:46
  • @NiklasB For sure this is not a proof at all, or even close to one (at least I don't prove something this way). Look: [6+75+89+226] is really far from pi, but f200(6) + f200(75) + f200(89) + f200(226) is the closest set to pi! And I agree to some extent with what you said about the knapsack. We can consider it as a knapsack with capacity pi. – mok Mar 09 '14 at 14:51
  • @mok [Quote from wikipedia](http://en.wikipedia.org/wiki/NP-hard): `If an optimization problem H has an NP-complete decision version L, then H is NP-hard.`. Note that the variation of subset sum I referred to is the general problem (without a fixed subset size), which is NP-hard, since the decision problem (subset sum) is NP-complete. With subsets of size 4 there is an O(n^4) solution, so the problem is polynomial. – amit Mar 09 '14 at 15:55
  • It feels like there should be a way to sort the pairs in `O(n^2)`, but I can't figure it out. If we can do that, the whole algorithm would be `O(n^2)` because we can get rid of the binary search and replace it with a second pointer – Niklas B. Mar 09 '14 at 16:12
2

You could sort the list of items, which requires O(n log n) time. Then, for each choice of a, b and c, you can find the best d using binary search.

You could also (possibly, depending on the data) cut the search tree by checking if the partial sum at any step is too large or too small to have a chance of becoming the correct solution.

By taking these two steps you should be able to reduce the runtime drastically.
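
A minimal sketch of both steps, assuming each item may be used at most once (the answer does not say either way). The name `closest_four_sorted` is hypothetical. The three nested loops make this roughly O(n^3 log n), so it is an illustration for small lists rather than something to run on 14000 items.

```python
from bisect import bisect_left


def closest_four_sorted(items, target):
    """Return (error, values) for four distinct positions summing closest to target."""
    a = sorted(items)
    n = len(a)
    best_err, best = float("inf"), None
    for i in range(n - 3):
        for j in range(i + 1, n - 2):
            for k in range(j + 1, n - 1):
                partial = a[i] + a[j] + a[k]
                # Pruning: a[k + 1] is the smallest usable d, and both terms only
                # grow with k, so once the smallest total overshoots the target by
                # at least the best error, no later k can improve on it.
                if partial + a[k + 1] - target >= best_err:
                    break
                # Binary search for d among the positions after k.
                pos = bisect_left(a, target - partial, k + 1)
                for m in (pos - 1, pos):
                    if k < m < n:
                        err = abs(partial + a[m] - target)
                        if err < best_err:
                            best_err, best = err, (a[i], a[j], a[k], a[m])
    return best_err, best
```

For example, `closest_four_sorted([1, 2, 3, 4, 10], 16)` returns `(0, (1, 2, 3, 10))`.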

riklund
  • 1
    all of this + making sure that the inner loops only run over the later indices like `for (int i=0; i – example Mar 09 '14 at 13:59
  • That will result in a cubic runtime which is too slow for this problem – Niklas B. Mar 09 '14 at 14:42
  • It will be O(n^3 log n), which is indeed slower than the subset sum solution (given that element repetition is allowed). – riklund Mar 09 '14 at 15:01
  • @riklund You can adapt both algorithms for the cases where repetition is allowed and where repetition is not allowed and get the same runtime :) – Niklas B. Mar 09 '14 at 19:39
1

I suggest using a genetic algorithm (GA) with these settings:

Chromosome : [a,b,c,d]

Fitness function: |f10000(a) + f10000(b) + f10000(c) + f10000(d) - π|

Crossover (Ch1,Ch2): Xover([a,b,c,d],[a',b',c',d']) -> [a,b,c',d'] , [a',b',c,d] *

Mutation (Ch) : Mutate([a,b,c,d]) -> [a,b',c,d] **

This problem is really easy for a GA to solve, and if you implement it you will find that it solves the problem in a short time.

* Choose the crossover point randomly.

** Replace one randomly selected gene in the chromosome with one of the possible points.

Note that in this post I only gave the key points. If you are not familiar with GAs at all, you could read up on the topic, and I will help you with more details.
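
A rough sketch of these settings in Python, with chromosomes stored as four indices into the item list. Everything beyond the chromosome, fitness, crossover and mutation described above is an assumption made for illustration: the function name `genetic_search`, keeping the better half of the population each generation, and the default population size, generation count and mutation rate. A GA is a heuristic, so the result is not guaranteed to be the optimum.

```python
import random


def genetic_search(items, target, pop_size=200, generations=300,
                   mutation_rate=0.3, seed=0):
    """Heuristic search for four indices whose items sum close to target."""
    rng = random.Random(seed)
    n = len(items)

    def fitness(ch):
        # Lower is better: distance of the four-item sum from the target.
        return abs(sum(items[g] for g in ch) - target)

    # Initial population of random chromosomes [a, b, c, d].
    pop = [[rng.randrange(n) for _ in range(4)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]  # assumption: keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, 4)  # random crossover point
            for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
                if rng.random() < mutation_rate:
                    # Mutation: replace one randomly chosen gene.
                    child[rng.randrange(4)] = rng.randrange(n)
                children.append(child)
        pop = (survivors + children)[:pop_size]
    best = min(pop, key=fitness)
    return fitness(best), best
```

Because the better half always survives unchanged, the best error in the population never gets worse from one generation to the next.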

mok
  • I love Genetic Algorithms, but I'm not as confident as you are that this is a good use of them. Your crossover might work, but the mutation will most likely decrease fitness. There is a lot of information we have about the problem that is not represented in your choice of chromosomes, resulting in an artificially bloated search space ([a,b,c,d] = [b,a,c,d] and other permutations). We can also solve a part of the problem trivially, and it is almost always a good idea with GAs to do so (sort the array and only store three numbers in the chromosomes, find the fourth with a binary search). – example Mar 09 '14 at 13:55
  • @example: I agree with you about the quality of these settings, but I really don't know how much he/she knows about GAs, so I should keep the solution as simple as possible until he/she asks for more or I at least get some feedback. BTW, about representing the domain knowledge inside the genotype, I agree with you that this is not the best choice (and I explained the reasons), but as you may know, the genotype is not the only place to use domain knowledge; in particular, you can take advantage of this knowledge in the fitness function. – mok Mar 09 '14 at 14:09