Need help with a word-packing algorithm

Question

I have a list of sub-lists of letters, where the number of letters in each sub-list can vary. The list and sub-lists are ordered. This structure can be used to produce words by choosing a number X, taking a letter from position X in every sub-list and concatenating them in order. If the number X is larger than the length of the sub-list, it would wrap around.

Given a set of words, I need to find a way to pack them into the smallest possible structure of this kind (i.e. with the shortest sub-lists). There would have to be as many sub-lists as the number of letters in the longest word, of course, and shorter words would be padded with blanks/spaces.

I am not a CS graduate so I apologize if the description of the problem is not entirely clear. To give a simple example: suppose I need to pack the words [ 'a ', 'an', 'if', 'is', 'in', 'on', 'of', 'i ']. I could use the following structure:

[  
    [ 'i', 'o', 'a' ],  
    [ 's', 'n', 'f', ' ' ]  
]

This would enable me to produce the following words:

0: is  
1: on  
2: af*  
3: i  
4: os*  
5: an  
6: if  
7: o *  
8: as*  
9: in  
10: of  
11: a

If you take position 10, for example, the word 'of' is generated by concatenating the letter at index 10 % 3 (= 1) from the first sub-list, with the letter at index 10 % 4 (= 2) from the second sub-list.
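
The lookup rule described above can be written out as a short sketch. The helper name `word_at` is made up for illustration; the structure is the two-list example from this question.

```python
# Illustrative sketch: read the word at position x from a list of sub-lists.
# `word_at` is a hypothetical helper name, not something from the question.

def word_at(structure, x):
    """Concatenate the letter at index x (wrapping around) from every sub-list."""
    return ''.join(sub[x % len(sub)] for sub in structure)

structure = [
    ['i', 'o', 'a'],
    ['s', 'n', 'f', ' '],
]

# Walk one full cycle of the structure (3 and 4 first line up again after 12 steps).
for x in range(12):
    print(x, repr(word_at(structure, x)))
```

Position 10 gives 'of', matching the walk-through above.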

My best attempt so far involves using a matrix of Hamming distances to place the most-"connected" words first, and then their closest neighbors, with the goal of minimizing the change with every insertion. This was an entirely intuitive attempt and I feel like there has to be a better/smarter way to solve this.
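
For reference, a distance matrix of this kind could be built roughly as follows. This is only one possible reading of the approach described above, not the actual code behind it: the names `hamming` and `distance_matrix`, and the choice to rank words by their total distance to all others, are assumptions.

```python
# Sketch only: pairwise Hamming distances between space-padded words.
# Function names and the "smallest total distance first" ordering are assumptions.

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(ca != cb for ca, cb in zip(a, b))

def distance_matrix(words):
    """Pad all words to the same width and compare every pair."""
    width = max(len(w) for w in words)
    padded = [w.ljust(width) for w in words]
    return [[hamming(a, b) for b in padded] for a in padded]

words = ['a ', 'an', 'if', 'is', 'in', 'on', 'of', 'i ']
matrix = distance_matrix(words)

# Words with the smallest total distance to all the others come first.
order = sorted(range(len(words)), key=lambda i: sum(matrix[i]))
print([words[i] for i in order])
```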

Clarification

This is a practical problem I am trying to solve and the constraints are (roughly) as follows:
1. The number of characters per sub-list should be in the area of 100 or less.
2. The keyspace should be as small as possible (i.e. the number of spurious words should be minimal). Roughly, a keyspace in the millions of options is borderline.

I don't know that a good solution is even possible for this. With the algorithm I have right now, for example, I can insert about 200 words (just random English words) in a keyspace of 1.5 million options. I'd like to do better than that.
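
To make the second constraint measurable, here is a rough sketch that counts how many positions in one full cycle spell a wanted word. It assumes "keyspace" means the number of positions before the structure repeats, which is the least common multiple of the sub-list lengths; `keyspace_size` and `count_hits` are hypothetical names.

```python
from functools import reduce
from math import gcd

def keyspace_size(structure):
    """Positions before the structure repeats: lcm of the sub-list lengths."""
    return reduce(lambda a, b: a * b // gcd(a, b), (len(s) for s in structure), 1)

def count_hits(structure, wanted):
    """Count positions in one full cycle that spell a wanted (padded) word."""
    wanted = set(wanted)
    hits = 0
    for x in range(keyspace_size(structure)):
        word = ''.join(sub[x % len(sub)] for sub in structure)
        if word in wanted:
            hits += 1
    return hits

structure = [['i', 'o', 'a'], ['s', 'n', 'f', ' ']]
wanted = ['a ', 'an', 'if', 'is', 'in', 'on', 'of', 'i ']
size = keyspace_size(structure)
print(size, count_hits(structure, wanted))  # 12 positions, 8 of them wanted
```

The brute-force loop is only practical for small keyspaces; it is meant for comparing candidate structures, not for building them.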

Do you need optimal solutions, or is a "good heuristic" enough? — Svante, Aug 17 '10 at 00:06
@Nikita: I came up with it... :) It's part of a project I'm working on. @Svante: As they say, the best is the enemy of the good. I'd be glad to hear any solution that improves on my current one. — szx, Aug 17 '10 at 01:04
Using the modulo function makes it possible to create "spurious" words (those you marked with asterisks). How are you dealing with that? Does it matter how many spurious words you create? — Dr. belisarius, Aug 17 '10 at 04:15
I don't mind spurious words... However, as I mentioned in a comment below, for practical reasons I need the distance between any words to be minimal. The more spurious/useless words I have, the less tightly packed the words I need are. — szx, Aug 17 '10 at 14:46
I don't understand. Do you mean distance in the keyspace or distance in the packed structure? I can pack 58,000 words into 464 characters with no unused characters and no characters repeated within a sublist, but the keyspace is gigantic. — aaronasterling, Aug 17 '10 at 18:45
just to clarify, gigantic keyspace = lots of spurious words. I do think that I could prove and am willing to conjecture that 'tightly packed structure' implies a 'large keyspace'. At least if you want to use congruences. — aaronasterling, Aug 17 '10 at 18:53
The average distance between two words in the keyspace should be minimal. If by the size of the structure you mean the number of characters in the sub-lists, my upper limit would be around 100 characters per sub-list. Again, this is for practical reasons and might be impossible, but is it? — szx, Aug 17 '10 at 19:53
The problem is that without some fairly fancy mathematics (ring theory), the absolute lower bound on the keyspace is given as the product of the lengths of the sublists. Even this is only attainable under some fairly strict conditions on the lengths involved. You would be better off sidestepping the need to sort through the keyspace and going with a maximal packing which I can undelete if you want. — aaronasterling, Aug 17 '10 at 21:38

Nikita Rybak · Answer 1 · 2010-08-17T01:34:56.343

score 3

Well, you said you're interested in sub-optimal solutions, so I'll give you one. It depends on the alphabet size. For example, for 26 letters the array size will be a little over 100 (regardless of the number of words to encode).

It's well-known that if you have two different prime numbers a and b and non-negative integers k and l (k < a, l < b), you can find a number n such that n % a == k and n % b == l.
For example, with (a = 7, b = 13, k = 6, l = 3) you can take n = 55: 55 % 7 == 6 and 55 % 13 == 3.

And the same holds for any number of distinct primes.
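
The statement above is the Chinese Remainder Theorem. A minimal sketch of finding such an n for any number of pairwise coprime moduli is shown here; `crt` is a hypothetical helper name, and the three-argument `pow` with a negative exponent needs Python 3.8 or later.

```python
from functools import reduce

def crt(remainders, moduli):
    """Return the unique n in [0, product of moduli) with n % m == r for each pair.

    Assumes the moduli are pairwise coprime.
    """
    total = reduce(lambda a, b: a * b, moduli, 1)
    n = 0
    for r, m in zip(remainders, moduli):
        partial = total // m
        # pow(partial, -1, m) is the inverse of partial modulo m (Python 3.8+)
        n += r * partial * pow(partial, -1, m)
    return n % total

n = crt([6, 3], [7, 13])
print(n, n % 7, n % 13)  # 55 6 3
```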

You can initialize arrays like this.

['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 29
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 31
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 37
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 41
...

Now, suppose your word is 'geek'. For it you need a number X such that X % 29 == 6, X % 31 == 4, X % 37 == 4, X % 41 == 10. And you can always find such an X, as was shown above.

So, if you have an alphabet of 26 letters, you can create a matrix of width 149 (see the list of primes) and encode any word with it.
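
As a small check of the 'geek' example, the sketch below builds the four padded arrays and searches for X one congruence at a time. The names `make_list`, `encode` and `decode` are hypothetical, and the step-by-step search is just one simple way to solve the congruences. It only handles four-letter lowercase words.

```python
import string

MODULI = [29, 31, 37, 41]  # one distinct prime per letter position, as above

def make_list(size):
    """'a'..'z' followed by enough 'z' padding to reach the given size."""
    return list(string.ascii_lowercase) + ['z'] * (size - 26)

lists = [make_list(m) for m in MODULI]

def encode(word):
    """Find X with X % MODULI[i] == alphabet index of word[i]."""
    x, step = 0, 1
    for ch, m in zip(word, MODULI):
        target = ord(ch) - ord('a')
        # Stepping by the product of the earlier moduli keeps the
        # congruences that are already satisfied intact.
        while x % m != target:
            x += step
        step *= m
    return x

def decode(x):
    """Read one letter from each array at index X modulo its length."""
    return ''.join(lst[x % len(lst)] for lst in lists)

x = encode('geek')
print(x, decode(x))
```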

edited Aug 17 '10 at 01:34

answered Aug 17 '10 at 01:29

Nikita Rybak


Great answer, but it brings me to a constraint that I didn't specify since I don't have a well-defined guideline for a "good enough" solution: Given a set of a couple of hundred words, the average distance between any two words needs to be minimal. How minimal? Ideally, the number of different positions divided by the size of the first array would need to be in the scale of hundreds or thousands. With this solution, the number of possible positions escalates very quickly, becoming impractical for words of six letters or more. – szx Aug 17 '10 at 02:42
@szx _Given a set of a couple of hundred words, the average distance between any two words needs to be minimal._ Can you clarify? I thought we don't choose the set to encode: the set is given. – Nikita Rybak Aug 20 '10 at 21:17
See my post + clarification above: the set of words is given (and will contain approximately a couple of hundred words). The letters that produce these words can be arranged in different configurations (i.e. different indexes, number of letters per sub-list) of varying "efficiency". By "the distance between two words" I mean the number of spurious words that separate words from the given set, which should be minimal, making the signal-to-noise ratio maximal. – szx Aug 21 '10 at 16:57

aaronasterling · Answer 2 · 2010-08-20T21:59:29.097

We can improve upon Nikita Rybak's answer by not actually making the lists a prime length but just associating a prime with each list. This allows us to not make the sub-lists any longer than necessary, hence keeping the primes smaller, which implies a smaller keyspace and more efficient packing. Using this method and the code below, I packed a list of 58,110 (lowercase) words into 464 characters. It's interesting to note that only the letters 'alex' appear as the 21st letter in a word. The keyspace was upwards of 33 digits, however. It is also not strictly necessary to use primes; the associated numbers just need to be pairwise coprime, so this could probably be reduced.

import itertools
import operator
from functools import reduce  # a builtin on Python 2, only in functools on Python 3

# lifted from Alex Martelli's post at http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python
def erat2( ):
    D = {  }
    yield 2
    for q in itertools.islice(itertools.count(3), 0, None, 2):
        p = D.pop(q, None)
        if p is None:
            D[q*q] = q
            yield q
        else:
            x = p + q
            while x in D or not (x&1):
                x += p
            D[x] = p


# taken from http://en.literateprograms.org/Extended_Euclidean_algorithm_(Python)
def eea(u, v):
    u1 = 1; u2 = 0; u3 = u
    v1 = 0; v2 = 1; v3 = v
    while v3 != 0:
        q = u3 // v3  # integer division, also on Python 3
        t1 = u1 - q * v1
        t2 = u2 - q * v2
        t3 = u3 - q * v3
        u1 = v1; u2 = v2; u3 = v3
        v1 = t1; v2 = t2; v3 = t3
    return u1, u2, u3

def assign_moduli(lists):
    # give every list the smallest not-yet-used prime that is at least its length
    used_primes = set()
    moduli = [0]*len(lists)
    for i, lst in enumerate(lists):
        for p in erat2():
            if len(lst) <= p and p not in used_primes:
                used_primes.add(p)
                moduli[i] = p
                break
    return moduli



class WordEncoder(object):
    def __init__(self):
        self.lists = [[]] # the list of primedlists
        self.words = {} # keys are words, values are number that retrieves word
        self.moduli = [] # coprime moduli that are used to assign unique keys to words

    def add(self, new_words):
        for word in new_words:
            word = word.rstrip() # trailing blanks are padding, not part of the word
            word_length, lists_length = len(word), len(self.lists)
            # make sure we have enough lists
            if word_length > lists_length:
                self.lists.extend([' '] for i in range(word_length - lists_length))
            # make sure that each letter is in the appropriate list
            for i, c in enumerate(word):
                if c in self.lists[i]: continue
                self.lists[i].append(c)
            self.words[word] = None
        # always recalculate the keys: a new letter can change the moduli, and
        # even when no letter was added the new words still need keys
        self._calculate_keys()
        return self.words

    def _calculate_keys(self):
        # we're going to be solving a lot of systems of congruences
        # these are all of the form x % self.moduli[i] == self.lists[i].index(word[i]) with word padded out to
        # len(self.lists). We will be using the Chinese Remainder Theorem to do this. We can do a lot of the calculations
        # once before we enter the loop because the numbers that we need are related to self.moduli[i] and not
        # the indexes of the necessary letters
        self.moduli = assign_moduli(self.lists)
        N  = reduce(operator.mul, self.moduli)
        e_lst = []
        for n in self.moduli:
            # s * (N // n) is 1 modulo n and 0 modulo every other modulus
            r, s, dummy = eea(n, N // n)
            e_lst.append(s * (N // n))
        lists_len = len(self.lists)
        # now we begin the actual recalculation
        for word in self.words:
            word += ' ' * (lists_len - len(word))
            coords = [self.lists[i].index(c) for i,c in enumerate(word)]
            key = sum(a*e for a,e in zip(coords, e_lst)) % N  # this solves the system of congruences
            self.words[word.rstrip()] = key

class WordDecoder(object):
    def __init__(self, lists):
        self.lists = lists
        self.moduli = assign_moduli(lists)

    def decode(self, key):
        coords = [key % modulus for modulus in self.moduli]
        return ''.join(pl[i] for pl, i in zip(self.lists, coords))    


with open('/home/aaron/code/scratch/corncob_lowercase.txt') as f:
    wordlist = f.read().split()

encoder = WordEncoder()
encoder.add(wordlist)

decoder = WordDecoder(encoder.lists)

for word, key in encoder.words.items():
    decoded = decoder.decode(key).rstrip()
    if word != decoded:
        print("{0} {1} {2}".format(word, decoded, key))
        print("max key size: {0}. moduli: {1}".format(reduce(operator.mul, encoder.moduli), encoder.moduli))
        break
else:
    print("it works")
    print("max key size: {0}".format(reduce(operator.mul, encoder.moduli)))
    print("moduli: {0}".format(encoder.moduli))
    for i, l in enumerate(encoder.lists):
        print("list {0} length: {1}, {2} - \"{3}\"".format(i, len(l), encoder.moduli[i] - len(l), ''.join(sorted(l))))
    print("{0} words stored in {1} characters".format(len(encoder.words), sum(len(l) for l in encoder.lists)))
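
For a quick experiment on a small word set, the same idea can be condensed into a few lines of Python 3. This is a sketch and not the code above: all function names are hypothetical, the moduli are picked as the smallest pairwise coprime numbers (not necessarily primes), and the keys are found by a simple step-wise search instead of the extended Euclidean algorithm.

```python
from math import gcd

def build_lists(words):
    """Collect, per letter position, the distinct characters that occur there."""
    width = max(len(w) for w in words)
    lists = [[] for _ in range(width)]
    for w in words:
        for i, c in enumerate(w.ljust(width)):
            if c not in lists[i]:
                lists[i].append(c)
    return lists

def coprime_moduli(lists):
    """Smallest modulus >= each list length that is coprime to all earlier ones."""
    moduli = []
    for lst in lists:
        m = max(len(lst), 1)
        while any(gcd(m, other) != 1 for other in moduli):
            m += 1
        moduli.append(m)
    return moduli

def key_for(word, lists, moduli):
    """Find a key whose residue modulo each modulus is the letter's list index."""
    x, step = 0, 1
    for c, lst, m in zip(word.ljust(len(lists)), lists, moduli):
        while x % m != lst.index(c):
            x += step  # stepping by earlier moduli keeps earlier letters fixed
        step *= m
    return x

def decode(key, lists, moduli):
    return ''.join(lst[key % m] for lst, m in zip(lists, moduli))

words = ['a', 'an', 'if', 'is', 'in', 'on', 'of', 'i']
lists = build_lists(words)
moduli = coprime_moduli(lists)
for w in words:
    k = key_for(w, lists, moduli)
    print(k, repr(decode(k, lists, moduli)))
```

On the eight words from the question this yields moduli 3 and 4, so every key is below 12.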

_but just associated a prime with the list_ But in szx's algorithm, number is divided by list length, not by another number associated with list. Do I get you right? — Nikita Rybak, Aug 20 '10 at 21:22
I fixed my post. I should have said 'just associating'. @szx didn't really provide an algorithm. I'm not sure what number you're referring to. — aaronasterling, Aug 20 '10 at 22:01
I quote. _"If the number X is larger than the length of the sub-list, it would wrap around"_ — Nikita Rybak, Aug 21 '10 at 14:24
Note, that in practice we don't need to store lists at all: we can easily determine character by the number and 'imaginary' index length without using additional memory. So, that makes the list of width 0 :) — Nikita Rybak, Aug 21 '10 at 14:26
@Nikita Rybeck, I would like to see that. As I understand it, that could only work if we assumed that every list was of the same size with the same contents which would just make the keyspace bigger. — aaronasterling, Aug 21 '10 at 20:23
@aaronsterling: unfortunately, a keyspace this big won't work. It'll be easier to understand why if I just describe the project: It's a physical art installation that consists of a gear train. Each gear corresponds to a sub-list, with the letters imprinted on the teeth. The goal is when rotated to certain positions, specific teeth (say, the ones pointing up on each gear) would make a word. By moving from position to position you can recreate a text (which hasn't been decided on yet, but will have several hundred unique words). There's a limit on the RPM, hence the keyspace issue. — szx, Aug 25 '10 at 14:23
Also, I tried the coprimes suggestion, but it didn't help much. — szx, Aug 25 '10 at 14:26
@szx. Unfortunately my books on algebra are all on a continent right now and I'm on an island. Talk to any mathematician that does even basic ring theory and they can at least point you in the right direction. — aaronasterling, Aug 25 '10 at 18:48

score 0 · Answer 3 · answered Aug 23 '10 at 16:42

I don't think I understand your problem completely, but I stumbled across prezip some time ago. Prezip is a way of compressing a sorted set of words by taking advantage of the fact that many words share a common prefix.

Since you're not referring to any sorting constraint, I would suggest creating a sorted set of the words that you want, then doing something similar to what prezip does. The result is a compressed and sorted set of words, which you can refer to by index.
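
To illustrate the shared-prefix idea, here is a minimal front-coding sketch. It is not prezip's actual file format, only the general technique; `front_encode` and `front_decode` are hypothetical names.

```python
import os  # os.path.commonprefix compares strings character by character

def front_encode(words):
    """Encode sorted words as (shared_prefix_length, remaining_suffix) pairs."""
    encoded, previous = [], ''
    for word in sorted(words):
        shared = len(os.path.commonprefix([previous, word]))
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded

def front_decode(encoded):
    """Rebuild the sorted word list from the (length, suffix) pairs."""
    words, previous = [], ''
    for shared, suffix in encoded:
        word = previous[:shared] + suffix
        words.append(word)
        previous = word
    return words

encoded = front_encode(['in', 'is', 'if', 'i'])
print(encoded)  # [(0, 'i'), (1, 'f'), (1, 'n'), (1, 's')]
print(front_decode(encoded))
```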

score 0 · Answer 4 · answered Aug 23 '10 at 16:52

I think you're looking for this http://en.wikipedia.org/wiki/Trie or this http://en.wikipedia.org/wiki/Radix_tree

Hope it helps.

answered Aug 23 '10 at 16:52

fortran


Jeez, is there any question tagged "algorithm" which doesn't get a "trie" response? Looks like "trie" is the new "jquery": it can solve anything :) – Nikita Rybak Aug 25 '10 at 20:46
@Nikita He's trying to efficiently store words, and that's one of the things a Trie is for: http://en.wikipedia.org/wiki/Trie#Dictionary_representation – fortran Aug 26 '10 at 07:30
MySQL is also used to efficiently store words, I wonder why nobody offered it :) And LZW too! – Nikita Rybak Aug 26 '10 at 16:10
Thanks, I actually did look at tries but I haven't quite figured out a way to apply them to my problem. It's not just about storing words efficiently, there are unique constraints imposed by this problem that I'm not quite sure how to solve. – szx Aug 29 '10 at 12:14
