9

I have been trying to port invRegex.py to a node.js implementation for a while, but I'm still struggling with it. I already have the regular expression parse tree thanks to the ret.js tokenizer and it works pretty well, but the actual generation and concatenation of all the distinct elements in a memory-efficient way is proving very challenging to me. To keep it simple, let's say I have the following regex:

[01]{1,2}@[a-f]

Feeding that to invRegex.py produces the following output (tabbified to take less space):

 0@a     0@b     0@c     0@d     0@e     0@f
00@a    00@b    00@c    00@d    00@e    00@f
01@a    01@b    01@c    01@d    01@e    01@f
 1@a     1@b     1@c     1@d     1@e     1@f
10@a    10@b    10@c    10@d    10@e    10@f
11@a    11@b    11@c    11@d    11@e    11@f

Considering I'm able to get each individual token and produce an array of all the valid individual outputs:

[01]{1,2} = function () {
    return ['0', '00', '01', '1', '10', '11'];
};

@ = function () {
    return ['@'];
};

[a-f] = function () {
    return ['a', 'b', 'c', 'd', 'e', 'f'];
};

I can compute the cartesian product of all the arrays and get the same expected output:

var _ = require('underscore');

function cartesianProductOf() {
    return _.reduce(arguments, function(a, b) {
        return _.flatten(_.map(a, function(x) {
            return _.map(b, function(y) {
                return x.concat([y]);
            });
        }), true);
    }, [ [] ]);
};

var tokens = [
    ['0', '00', '01', '1', '10', '11'],
    ['@'],
    ['a', 'b', 'c', 'd', 'e', 'f'],
];

var result = cartesianProductOf(tokens[0], tokens[1], tokens[2]);

_.each(result, function (value, key) {
    console.log(value.join(''));
});

The problem with this is that it holds all 36 values in memory. If I had a slightly more complicated regular expression, such as [a-z]{0,10}, it would hold 146813779479511 values in memory, which is totally infeasible. I would like to process this huge list in an asynchronous fashion, passing each generated combination to a callback and allowing me to interrupt the process at any sensible point I see fit, much like invRegex.py or this Haskell package. Unfortunately, I can't understand Haskell, and I don't know how to mimic Python's generator behavior in JavaScript either.

I tried running a couple of simple generator experiments in node 0.11.9 (with --harmony) like this one:

function* alpha() {
    yield 'a'; yield 'b'; yield 'c';
}

function* numeric() {
    yield '0'; yield '1';
}

function* alphanumeric() {
    yield* alpha() + numeric(); // what's the diff between yield and yield*?
}

for (var i of alphanumeric()) {
    console.log(i);
}

Needless to say the above doesn't work. =/

Banging my head against the wall here, so any help tackling this problem would be highly appreciated.


UPDATE: Here is a sample ret.js parse tree for b[a-z]{3}:

{
    "type": ret.types.ROOT,
    "stack": [
            {
                "type": ret.types.CHAR,
                "value": 98 // b
            },
            {
                "type": ret.types.REPETITION,
                "max": 3,
                "min": 3,
                "value": {
                    "type": ret.types.SET,
                    "not": false,
                    "set": [
                        {
                            "type": ret.types.RANGE,
                            "from": 97, // a
                            "to": 122   // z
                        }
                    ]
                }
            }
    ]
}

The SET / RANGE type should yield 26 distinct values, and the parent REPETITION type should take that previous value to the power of 3, yielding 17576 distinct combinations. If I was to generate a flattened out tokens array like I did before for cartesianProductOf, the intermediate flattened values would take as much space as the actual cartesian product itself.
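The counting argument above can be sketched roughly as follows. This is a minimal illustration, not the ret.js API: it assumes plain string type tags ('ROOT', 'CHAR', ...) instead of the real ret.types constants, ignores negated sets, groups and references, and the name countMatches is hypothetical.

```javascript
// Hypothetical helper: counts the outputs of a simplified ret.js-like tree
// without generating them.
// Assumption: node types are plain strings, not the real ret.types constants.
function countMatches(node) {
    switch (node.type) {
        case 'ROOT': // a sequence: multiply the counts of the children
            return node.stack.reduce(function (product, child) {
                return product * countMatches(child);
            }, 1);
        case 'CHAR':
            return 1;
        case 'RANGE':
            return node.to - node.from + 1;
        case 'SET': // alternatives inside [...]: add the counts (assumes no overlapping entries)
            return node.set.reduce(function (sum, child) {
                return sum + countMatches(child);
            }, 0);
        case 'REPETITION': // x{min,max}: sum of base^k for k = min..max
            var base = countMatches(node.value);
            var power = 1;
            var total = 0;
            var k;
            for (k = 0; k < node.min; k++) power *= base;
            for (k = node.min; k <= node.max; k++) {
                total += power;
                power *= base;
            }
            return total;
        default:
            throw new Error('Unsupported node type: ' + node.type);
    }
}

// Simplified tree for b[a-z]{3}
var tree = {
    type: 'ROOT',
    stack: [
        { type: 'CHAR', value: 98 },
        {
            type: 'REPETITION', min: 3, max: 3,
            value: { type: 'SET', not: false, set: [{ type: 'RANGE', from: 97, to: 122 }] }
        }
    ]
};

console.log(countMatches(tree)); // 17576
```

Counting only needs one number per node, which is exactly the property the generation step lacks when intermediate results are flattened into arrays.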

I hope this example explains better the problem I am facing.

Alix Axel
  • If it helps to understand the ret.js parse tree structure, I've coded [a recursive function](https://gist.github.com/alixaxel/8745c98740837def2285) that calculates the number of valid return values. – Alix Axel Dec 28 '13 at 14:27
  • `yield*` is like Python's `yield from`. Also "I don't know how to mimic the generator behavior in Python to Javascript either." what behavior specifically? – Benjamin Gruenbaum Dec 28 '13 at 20:00
  • @BenjaminGruenbaum: It's still not very clear to me what `yield from` is exactly but from what I understood it's a way for a generator to pipe its state methods to the inner iterators / generators; is this right? Tried it in Node with the example above and it throws an error, I suspect it's because the concatenation operator doesn't make sense there but I'm not sure. With plain `yield` the `console.log` outputs a single `[object Generator][object Generator]` string and not the actual values. – Alix Axel Dec 29 '13 at 00:10
  • @BenjaminGruenbaum: As for the behavior in Python, well, basically concatenating a single generator value with all the remaining generator values (without terminating prematurely any generator in the process). The Python code starts at `GroupEmitter.groupGen()` as a generator itself, but it also seems that this generator is creating / returning other generators inside it. I don't know how to do that - I can't even get the two generators above (`alpha` & `numeric`) that have the same number of generateable elements to return all 9 possible combinations. – Alix Axel Dec 29 '13 at 00:15
  • Alix, do you try to use a higher level promise library like Q? take a look in https://github.com/kriskowal/q – nico Dec 30 '13 at 15:59
  • @nico: I've looked at [co](https://github.com/visionmedia/co) and [wu](http://fitzgen.github.io/wu.js/) as modules worth investigating, but I'm a bit reluctant to pick either because I'm worried about the overhead it may bring. I'm looking for something that is memory efficient first and speedy second (trying to generate a couple thousand combinations in a couple of milliseconds). – Alix Axel Dec 30 '13 at 16:29
  • @nico: Also, I found out how to make [generators that yield other generators](http://stackoverflow.com/q/20834113/89771) natively in Node.js, the problem I'm facing now is how to combine these generators in a descendant manner (like the `cartesianProductOf` function) in order to return multiple concatenated results instead of just one - does that make sense? – Alix Axel Dec 30 '13 at 16:55
  • Yes you can, I see that you found your answer on the other post, please post the results about how the cartesian product performs when you're done :) – nico Dec 30 '13 at 20:13
  • Maybe the [algorithmic definition of `yield *`](http://wiki.ecmascript.org/doku.php?id=harmony:generators#delegating_yield) will help clarify how it works? To get your alphanumeric to work, you need to assign the result of the `yield`s then concatenate. Ex: `var x,y; x = yield* alpha(); y = yield* numeric(); return x + y;` – bishop Dec 31 '13 at 16:35
  • 1
    Or, you could use parentheses to clarify the `yield*` bind: `yield (yield* alpha()) + (yield* numeric());` – bishop Dec 31 '13 at 16:45
  • @bishop: Strangely enough, the outer yield doesn't seem to concatenate the `x` and `y` values, I end up with `a b c 0 1 NaN`. The good thing is that it works at least. – Alix Axel Dec 31 '13 at 23:29

6 Answers

5

I advise you to write iterator classes. They are easy to implement (basically they are state machines), they have a low memory footprint, they can be combined to build up increasingly complex expressions (please scroll down to see the final result), and the resulting iterator can be wrapped in an enumerator.

Each iterator class has the following methods:

  • first: initializes the state machine (first match)
  • next: proceeds to the next state (next match)
  • ok: initially true, but becomes false once 'next' proceeds beyond the last match
  • get: returns the current match (as a string)
  • clone: clones the object; essential for repetition, because each instance needs its own state

Start off with the most trivial case: a sequence of one or more characters that should be matched literally (e.g. /foo/). Needless to say this has only one match, so 'ok' will become false upon the first call of 'next'.

function Literal(literal) { this.literal = literal; }

Literal.prototype.first = function() { this.i = 0; };
Literal.prototype.next = function() { this.i++; };
Literal.prototype.ok = function() { return this.i == 0; };
Literal.prototype.get = function() { return this.literal; };
Literal.prototype.clone = function() { return new Literal(this.literal); };

Character classes ([abc]) are trivial too. The constructor accepts a string of characters; if you prefer arrays, that's easy to fix.

function CharacterClass(chars) { this.chars = chars; }

CharacterClass.prototype.first = function() { this.i = 0; };
CharacterClass.prototype.next = function() { this.i++; };
CharacterClass.prototype.ok = function() { return this.i < this.chars.length; };
CharacterClass.prototype.get = function() { return this.chars.charAt(this.i); };
CharacterClass.prototype.clone = function() { return new CharacterClass(this.chars); };

Now we need iterators that combine other iterators to form more complex regular expressions. A sequence is just two or more patterns in a row (like foo[abc]).

function Sequence(iterators) {
   if (arguments.length > 0) {
      this.iterators = iterators.length ? iterators : [new Literal('')];
   }
}
Sequence.prototype.first = function() {
   for (var i in this.iterators) this.iterators[i].first();
};
Sequence.prototype.next = function() {
   if (this.ok()) {
      var i = this.iterators.length;
      while (this.iterators[--i].next(), i > 0 && !this.iterators[i].ok()) {
         this.iterators[i].first();
      }
   }
};
Sequence.prototype.ok = function() {
   return this.iterators[0].ok();
};
Sequence.prototype.get = function() {
   var retval = '';
   for (var i in this.iterators) {
      retval += this.iterators[i].get();
   }
   return retval;
};
Sequence.prototype.clone = function() {
   return new Sequence(this.iterators.map(function(it) { return it.clone(); }));
};
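The loop in Sequence.prototype.next behaves like an odometer: the rightmost iterator advances, and whenever one runs out it is reset and the one to its left advances instead. Here is a minimal standalone sketch of that carry logic with the comma operator unrolled. The names makeCharIterator and advance are hypothetical and exist only for this illustration; they are not part of the classes above.

```javascript
// Hypothetical minimal iterator with the same first/next/ok/get protocol.
function makeCharIterator(chars) {
    var i = 0;
    return {
        first: function () { i = 0; },
        next: function () { i++; },
        ok: function () { return i < chars.length; },
        get: function () { return chars.charAt(i); }
    };
}

// Unrolled form of the while loop in Sequence.prototype.next.
function advance(iterators) {
    var i = iterators.length - 1;
    iterators[i].next();                      // step the rightmost position
    while (i > 0 && !iterators[i].ok()) {     // exhausted: reset and carry to the left
        iterators[i].first();
        i--;
        iterators[i].next();
    }
}

var its = [makeCharIterator('ab'), makeCharIterator('xy')];
its.forEach(function (it) { it.first(); });

var seen = [];
while (its[0].ok()) {                         // finished once the leftmost overflows
    seen.push(its.map(function (it) { return it.get(); }).join(''));
    advance(its);
}
console.log(seen.join(' ')); // ax ay bx by
```

The leftmost iterator is never reset, which is why Sequence.prototype.ok only has to look at iterators[0].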

Another way to combine iterators is the choice (a.k.a. alternatives), e.g. foo|bar.

function Choice(iterators) { this.iterators = iterators; }

Choice.prototype.first = function() {
   this.count = 0;
   for (var i in this.iterators) this.iterators[i].first();
};
Choice.prototype.next = function() {
   if (this.ok()) {
      this.iterators[this.count].next();
      while (this.ok() && !this.iterators[this.count].ok()) this.count++;
   }
};
Choice.prototype.ok = function() {
   return this.count < this.iterators.length;
};
Choice.prototype.get = function() {
   return this.iterators[this.count].get();
};
Choice.prototype.clone = function() {
   return new Choice(this.iterators.map(function(it) { return it.clone(); }));
};

Other regex features can be implemented by combining the existing classes. Class inheritance is a great way to do this. For example, an optional pattern (x?) is just a choice between the empty string and x.

function Optional(iterator) {
   if (arguments.length > 0) {
      Choice.call(this, [new Literal(''), iterator]);
   }
}
Optional.prototype = new Choice();

Repetition (x{n,m}) is a combination of Sequence and Optional. Because I have to inherit one or the other, my implementation consists of two mutually dependent classes.

function RepeatFromZero(maxTimes, iterator) {
   if (arguments.length > 0) {
      Optional.call(this, new Repeat(1, maxTimes, iterator));
   }
}
RepeatFromZero.prototype = new Optional();

function Repeat(minTimes, maxTimes, iterator) {
   if (arguments.length > 0) {
      var sequence = [];
      for (var i = 0; i < minTimes; i++) {
         sequence.push(iterator.clone());   // need to clone the iterator
      }
      if (minTimes < maxTimes) {
         sequence.push(new RepeatFromZero(maxTimes - minTimes, iterator));
      }
      Sequence.call(this, sequence);
   }
}
Repeat.prototype = new Sequence();

As I said earlier, an iterator can be wrapped into an enumerator. This is simply a loop that you can break whenever you want.

function Enumerator(iterator) {
   this.iterator = iterator;

   this.each = function(callback) {
      for (this.iterator.first(); this.iterator.ok(); this.iterator.next()) {
         callback(this.iterator.get());
      }
   };
}
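One way to let the callback itself stop the loop is to treat a return value of false as a stop signal. The sketch below assumes that convention (it is a choice made for this example, not something the Enumerator above does), and both StoppableEnumerator and the stand-in CharIterator are hypothetical names.

```javascript
// Hypothetical variant of Enumerator: the callback may return false to stop early.
function StoppableEnumerator(iterator) {
    this.iterator = iterator;
}
StoppableEnumerator.prototype.each = function (callback) {
    var it = this.iterator;
    for (it.first(); it.ok(); it.next()) {
        if (callback(it.get()) === false) {
            return false; // interrupted by the callback
        }
    }
    return true; // ran to completion
};

// Stand-in iterator over the characters of a string (same protocol as above).
function CharIterator(chars) { this.chars = chars; }
CharIterator.prototype.first = function () { this.i = 0; };
CharIterator.prototype.next = function () { this.i++; };
CharIterator.prototype.ok = function () { return this.i < this.chars.length; };
CharIterator.prototype.get = function () { return this.chars.charAt(this.i); };

var collected = [];
var finished = new StoppableEnumerator(new CharIterator('abcdef')).each(function (match) {
    collected.push(match);
    return collected.length < 3; // stop after three matches
});
console.log(collected.join(''), finished); // abc false
```

Because the iterator keeps its own state, a caller could also resume later by looping on ok()/next() again without calling first().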

Time to put it all together. Let's take some silly regular expression:

([ab]{2}){1,2}|[cd](f|ef{0,2}e)

Composing the iterator object is really straightforward:

function GetIterationsAsHtml() {

   var iterator = new Choice([
      new Repeat(1, 2,
         new Repeat(2, 2, new CharacterClass('ab'))),
      new Sequence([
         new CharacterClass('cd'),
         new Choice([
            new Literal('f'),
            new Sequence([
               new Literal('e'),
               new RepeatFromZero(2, new Literal('f')),
               new Literal('e')
            ])
         ])
      ])
   ]);

   var iterations = '<ol>\n';
   var enumerator = new Enumerator(iterator);
   enumerator.each(function(iteration) { iterations += '<li>' + iteration + '</li>\n'; });
   return iterations + '</ol>';
}

This yields 28 matches, but I will spare you the output.

My apologies if my code does not follow established software patterns, is not browser-compatible (it works OK on Chrome and Firefox) or suffers from poor OOP. I just hope it makes the concept clear.

EDIT: for completeness, and following OP's initiative, I implemented one more iterator class: the reference.

A reference (\1 \2 etc) picks up the current match of an earlier capturing group (i.e. anything in parentheses). Its implementation is very similar to Literal, in that it has exactly one match.

function Reference(iterator) { this.iterator = iterator; }

Reference.prototype.first = function() { this.i = 0; };
Reference.prototype.next  = function() { this.i++; };
Reference.prototype.ok    = function() { return this.i == 0; };
Reference.prototype.get   = function() { return this.iterator.get(); };
Reference.prototype.clone = function() { return new Reference(this.iterator); };

The constructor is given an iterator that represents the referenced subpattern. Taking (foo|bar)([xy])\2\1 as an example (yields fooxxfoo, fooyyfoo, barxxbar, baryybar):

var groups = new Array();

var iterator = new Sequence([
   groups[1] = new Choice([new Literal('foo'), new Literal('bar')]),
   groups[2] = new CharacterClass('xy'),
   new Reference(groups[2]),
   new Reference(groups[1])
]);

Capturing groups are specified as you build up the tree of iterator classes. I am still doing that manually here, but eventually you want this to be automated. That is just a matter of mapping your parse tree to a similar tree of iterator classes.

EDIT 2: here's a relatively simple recursive function that will convert a parse tree produced by ret.js into an iterator.

function ParseTreeMapper() {
    this.capturingGroups = [];
}
ParseTreeMapper.prototype.mapToIterator = function(parseTree) {
    switch (parseTree.type) {
        case ret.types.ROOT:
        case ret.types.GROUP:
            var me = this;
            var mapToSequence = function(parseTrees) {
                return new Sequence(parseTrees.map(function(t) {
                    return me.mapToIterator(t);
                }));
            };
            var group = parseTree.options ?
                new Choice(parseTree.options.map(mapToSequence)) : 
                mapToSequence(parseTree.stack);
            if (parseTree.remember) {
                this.capturingGroups.push(group);
            }
            return group;
        case ret.types.SET:
            return new CharacterClass(this.mapToCharacterClass(parseTree.set));
        case ret.types.REPETITION:
            return new Repeat(parseInt(parseTree.min), parseInt(parseTree.max), this.mapToIterator(parseTree.value));
        case ret.types.REFERENCE:
            var ref = parseInt(parseTree.value) - 1;
            return ref in this.capturingGroups ?
                new Reference(this.capturingGroups[ref]) :
                new Literal('<ReferenceOutOfRange>');
        case ret.types.CHAR:
            return new Literal(String.fromCharCode(parseTree.value));
        default:
            return new Literal('<UnsupportedType>');
    }
};
ParseTreeMapper.prototype.mapToCharacterClass = function(parseTrees) {
    var chars = '';
    for (var i in parseTrees) {
        var tree = parseTrees[i];
        switch (tree.type) {
            case ret.types.CHAR:
                chars += String.fromCharCode(tree.value);
                break;
            case ret.types.RANGE:
                for (var code = tree.from; code <= tree.to; code++) {
                    chars += String.fromCharCode(code);
                }
                break;
        }
    }
    return chars;
};

Usage:

var regex = 'b[a-n]{3}';
var parseTree = ret(regex);    // requires ret.js
var iterator = new ParseTreeMapper().mapToIterator(parseTree);

I put all components together in this demo: http://jsfiddle.net/Pmnwk/3/

Note: many regex syntax constructs are not supported (anchors, look-ahead, look-behind, recursion), but I guess it is already pretty much up to par with invRegex.py.

Ruud Helderman
  • Thanks for the exhaustive answer @Ruud, you deserve a bounty just for the effort! One question: do you have any idea how your implementation compares to [this one](https://gist.github.com/alixaxel/8f623f3e719dd9bec1d3#file-generate-js) in terms of performance (specially memory-wise)? – Alix Axel Jan 02 '14 at 01:09
  • 1
    @AlixAxel: Like any state machine, memory usage is constant (NOT proportional to the number of matches) because the iterator classes do not store matches; each instance just holds a simple counter to keep track of its progress. Please note that my test function (GetIterationsAsHtml) is NOT memory-optimized (it accumulates all matches into one big string 'iterations'). As for an approach with 'yield', I expect that to be equally economical in terms of memory usage. Yield has the benefit of keeping the code inside your iterator classes more readable, but browser support may be an issue. – Ruud Helderman Jan 02 '14 at 10:36
  • Outstanding! It even returns repetitions in lexical order -- fantastic work @Ruud, thank you so much. – Alix Axel Jan 04 '14 at 01:00
  • How do you write iterator classes ? perhaps a book/online resource to show the basics. Is it better to write your own classes or is there a library to use this ? – user568109 Jan 07 '14 at 09:57
  • 1
    Some programming languages have a predefined 'iterator' interface (http://en.wikipedia.org/wiki/Iterator_pattern ), but due to its dynamic semantics, javascript requires no formal interface definition. There may be libraries for typical implementations, e.g. iterators as wrappers for arrays/dictionaries, but so far I have found none. My particular case was pretty specific, so I wrote it all from scratch, which wasn't that hard; my answer contains the complete implementation. To learn more, GoF (http://en.wikipedia.org/wiki/Design_Patterns_(book) ) has in-depth coverage of the iterator pattern. – Ruud Helderman Jan 07 '14 at 11:00
  • @Ruud: Could you write a little code that takes the ret.js parse tree and composes the iterator object per your example? – Alix Axel Jan 07 '14 at 12:51
  • @AlixAxel: Working on that... for a sneak preview, see http://jsfiddle.net/Pmnwk/1/ – Ruud Helderman Jan 08 '14 at 19:38
  • @AlixAxel: see *edit 2* in my answer above. I also refactored class `Sequence`, making it robust against zero-length sequences (originally, these would produce an infinite repetition). – Ruud Helderman Jan 11 '14 at 18:23
  • 1
    @user568109: I did find two javascript libraries after all: [wu.js](http://fitzgen.github.io/wu.js/), [goog.iter](http://docs.closure-library.googlecode.com/git/closure_goog_iter_iter.js.html) – Ruud Helderman Jan 11 '14 at 18:44
  • @Ruud: Sorry for the long delay, but I could only take a closer look at this now. Awesome work, I'm starting another bounty to reward your effort. Thank you. =) – Alix Axel Jan 17 '14 at 01:45
  • @AlixAxel: Thank you, most generous! If you need any additional help, you know where to find me. – Ruud Helderman Jan 17 '14 at 22:20
  • @Ruud: Actually, could you refactor or explain what 'this.iterators[--i].next(), i > 0 && !this.iterators[i].ok()' in 'Sequence.prototype.next()' is doing? The reason I'm asking is because I'm trying to port your code to Go (node.js is slower than what I was expecting) and I don't know what the statement before the comma is doing. – Alix Axel Jan 18 '14 at 07:47
  • 1
    Ah yes, the _comma_ operator and inlined decrement were abandoned in _Go_. Anyway, my _while_ loop is equivalent to `i--; this.iterators[i].next(); while (i > 0 && !this.iterators[i].ok()) { this.iterators[i].first(); i--; this.iterators[i].next(); }` PS for further _Go_ issues, you may want to open up a new question, keeping the current thread dedicated to javascript. – Ruud Helderman Jan 18 '14 at 21:16
  • @Ruud: Yeah, I just wanted to understand the comma operator there. =) – Alix Axel Jan 23 '14 at 05:40
2

Here's a version that makes a function for each part of the input and composes all of them to produce a function that'll generate each regex result and feed it into that argument:

//Takes in a list of things, returns a function that takes a function and applies it to
// each Cartesian product. then composes all of the functions to create an
// inverse regex generator.

function CartesianProductOf() {
    var args = arguments;
    return function(callback) {
        Array.prototype.map.call(args, function(vals) {
            return function(c, val) {
                vals.forEach(function(v) {
                    c(val + v);
                });
            };
        }).reduce(function(prev, cur) {
            return function(c, val) {
                prev(function(v){cur(c, v)}, val);
            };
        })(callback, "");
    };
}      

Modified to work off a parse tree (copied a little code from here):

//Takes in a list of things, returns a function that takes a function and applies it to
// each Cartesian product.

function CartesianProductOf(tree) {
    var args = (tree.type == ret.types.ROOT)? tree.stack :
                ((tree.type == ret.types.SET)? tree.set : []);

    return function(callback) {
        var funs = args.map(function(vals) {
            switch(vals.type) {
                case ret.types.CHAR:
                    return function(c, val) {
                        c(val + vals.value);
                    };
                case ret.types.RANGE:
                    return function(c, val) {
                        for(var i=vals.from; i<=vals.to; i++) {
                            c(val+String.fromCharCode(i));
                        }
                    };
                case ret.types.SET:
                     return function(c, val) {
                         CartesianProductOf(vals)(function(i) {c(val+i)});
                     };
/*                   return function(c, val) {
                        vals.set.forEach(function(v) {
                            c(val + v);
                        });
                    };        */
                case ret.types.REPETITION:
                    var tmp = CartesianProductOf(vals.value);

                    if(vals.max == vals.min) {
                        return fillArray(function(c, val) {
                            tmp(function(i){c(val+i);}); //Probably works?
                        }, vals.max);
                    } else {
                        return fillArray(function(c, val) {
                            tmp(function(i){c(val+i);});
                        }, vals.min).concat(fillArray(function(c, val) {
                            c(val);
                            tmp(function(i){c(val+i);});
                        }, vals.max-vals.min));
                    }
                default:
                    return function(c, val) {
                        c(val);
                    };
            }
        }).reduce(function(prev, cur) { //Flatten array.
            return prev.concat(cur);
        }, []);

        if(tree.type == ret.types.ROOT) //If it's a full tree combine all the functions.
            funs.reduce(function(prev, cur) { //Compose!
                return function(c, val) {
                    prev(function(v){cur(c, v)}, val);
                };
            })(callback, "");
        else                          //If it's a set call each function.
            funs.forEach(function(f) {f(callback, "")}); 
    };
}

function fillArray(value, len) {
    var arr = [];
    for (var i = 0; i < len; i++) {
        arr.push(value);
    }
    return arr;
}

If you're alright with a less functionalish, more C-esque solution:

function helper(callme, cur, stuff, pos) {
    if(pos == stuff.length) {
        callme(cur);
    } else 
        for(var i=0; i<stuff[pos].length; i++) {
            helper(callme, cur+stuff[pos][i], stuff, pos+1);
        }
}

function CartesianProductOf(callback) {
    helper(callback, "", Array.prototype.slice.call(arguments, 1), 0);
}
cactus1
  • The first snippet works, but the other two don't. How easy would it be to make the first `helper` function traverse the tokens in a descending direction (as opposed to flattened out)? I'm gonna review my code and see if I can change it to make it work that way, either way - I'll post additional feedback soon. Thanks BTW. – Alix Axel Dec 31 '13 at 06:50
  • Not flattened as in like "[[[[["a", "b"], "c", "d"], ["e", 'f"]]]", where you'd end up with "ace", "bce", "ade", "dbe", "acf", "bcf", etc..? – cactus1 Dec 31 '13 at 08:16
  • Or would you end up with "ae", "af", "be", "bf", "ce", "cf", "de", "df"? – cactus1 Dec 31 '13 at 08:23
  • Something like `[['0' -> ['', '0', '1'], '1' -> ['', '0', '1']], '@', ['a', 'b', 'c', 'd', 'e', 'f']]` (where `->` represents a parent/child nesting level) would be more representative. The parse tree comes from ret.js, you can see a simple output [in here](https://github.com/fent/ret.js/issues/1). – Alix Axel Dec 31 '13 at 09:11
  • How's the second sample look? You can add in more cases in the switch statement to extend it to work with other types. – cactus1 Dec 31 '13 at 14:11
  • The second example looks promising, I'm gonna expand it to work with additional token types and see how it performs. Am I right in thinking that each token type values will have to be computed *n* times, where *n* is the number of total parent token combinations? – Alix Axel Dec 31 '13 at 23:37
  • Um, could you be more specific about what you mean with "token type values"? – cactus1 Jan 01 '14 at 00:27
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/44256/discussion-between-cactus1-and-alix-axel) – cactus1 Jan 01 '14 at 00:44
  • I am mostly concerned with the REPETITION type. It's pretty obvious that, for instance, I have to generate a `a..z` array for a RANGE type - but if that token is enclosed in another `{5}` REPETITION token, will I need to generate all the possible 36^5 (60466176) values and hold them in an array as well or is there a better way? Also important: if this token is iterated more than once, will this massive generation occur every time? Sorry if this is not clear enough, I'm trying to cook an example to show you (also take a look at my updated question). – Alix Axel Jan 01 '14 at 15:00
  • Nope, you can just create an array of five functions that produce the proper range output for that token, and then just flatten that before you get to the reduce() step. Like [abc]e[0-9]{3}f => [function(), function(), [function(), function(), function()], function()]. So repetition can be as efficient as *n*-number references to the same token function. – cactus1 Jan 01 '14 at 15:27
  • Also, you don't actually need to create an array for the RANGE type, you can just create a for loop that'll iterate over the proper range of characters – cactus1 Jan 01 '14 at 15:29
  • Fixed a couple of parse errors in it, still getting a `TypeError: Reduce of empty array with no initial value`. – Alix Axel Jan 01 '14 at 16:12
  • I just realized the set case isn't working correctly; it'll try to concatenate all the cases instead of running the input function on each element. Gonna have to change the program flow a bit to use a forEach when the input's a set and the current reduce when it's not. What's the input case that's causing it? The same as in the question? – cactus1 Jan 01 '14 at 16:18
  • Right now trying it with `CartesianProductOf(ret('@a{3}'))(console.log);`. – Alix Axel Jan 01 '14 at 16:19
  • Ah, I see the problem, it's because of the first line of the function that doesn't work when it's given a repetition token. Recursion's proving tricky.. – cactus1 Jan 01 '14 at 16:20
  • Ah, no, it's because the tree's under tree.options instead of tree.stack, so it couldn't find it. Haven't fixed the code because I'm not sure when it'll be under stack and when it'll be under options. Also, wanna move this to a gist or something? I feel like this isn't using SO correctly. – cactus1 Jan 01 '14 at 16:37
  • Sorry for the long delay, I've finally managed to work out an [implementation with generators](https://gist.github.com/alixaxel/8f623f3e719dd9bec1d3#file-generate-js), there are still some rough edges and a nasty bug with the capturing references but everything else is working. One thing that I would like to change is the order in which REPETITION yields results: right now it's returning `'a' 'b' 'aa' 'ab' 'ba' 'bb'`, and I would like it to return in lexicographic order (`'a' 'aa' 'ab' 'b' 'ba' 'bb'`) - do you have any ideas how to do that? – Alix Axel Jan 02 '14 at 01:05
1

How about this:

var tokens = [
    ['0', '00', '01', '1', '10', '11'],
    ['@'],
    ['a', 'b', 'c', 'd', 'e', 'f'],
];

function cartesianProductOf(arr, callback) {
  var cur = [];
  var f = function(i) {
    if (i >= arr.length) {
      callback(cur.join(''));
      return
    }
    for (var j=0; j<arr[i].length; j++) {
      cur[i] = arr[i][j];
      f(i+1);
    }
  };
  f(0);
}

cartesianProductOf(tokens, function(str) { console.log(str); });
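If generators are available (ES2015 `function*`, which the question tried under node --harmony), the same recursion can be written so that the consumer pulls one combination at a time and can stop at any point. This is a minimal sketch; the name lazyCartesianProduct is hypothetical.

```javascript
// Hypothetical generator version: yields one joined combination at a time.
function* lazyCartesianProduct(arrays, index, prefix) {
    index = index || 0;
    prefix = prefix || '';
    if (index >= arrays.length) {
        yield prefix;
        return;
    }
    for (var j = 0; j < arrays[index].length; j++) {
        // yield* delegates to the nested generator and re-yields each of its values
        yield* lazyCartesianProduct(arrays, index + 1, prefix + arrays[index][j]);
    }
}

var tokens = [
    ['0', '00', '01', '1', '10', '11'],
    ['@'],
    ['a', 'b', 'c', 'd', 'e', 'f']
];

var firstFour = [];
for (var combination of lazyCartesianProduct(tokens)) {
    firstFour.push(combination);
    if (firstFour.length === 4) break; // combinations past this point are never generated
}
console.log(firstFour.join(' ')); // 0@a 0@b 0@c 0@d
```

Only the chain of suspended generator frames (one per array) is alive at any time, so memory use depends on the number of arrays rather than on the size of the product.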
Caleb
  • Very nice and clean, just have one question: could this be made to work with a n-dimensional array [as seen here](http://stackoverflow.com/q/20834113/89771) (also just [like invPython](https://code.google.com/p/pythonxy/source/browse/src/python/_pyparsing/DOC/examples/invRegex.py#50) does it)? I'm going to hack a bit on my code and I'll get back to you, but it would be awesome if I didn't have to flatten the token array beforehand. – Alix Axel Dec 31 '13 at 06:54
1

It sounds like you're asking for Lazy Cartesian Product: you do want the Cartesian product, but you don't want to compute them all beforehand (and consume all that memory). Said another way, you want to iterate through the Cartesian product.

If that's right, have you checked out this Javascript implementation of the X(n) formula? With that, you can either iterate over them in natural order <<0,0,0>, <0,0,1>, <0,1,0>, ...> or choose an arbitrary position to calculate.

It seems like you can just do:

// move through them in natural order, without precomputation
lazyProduct(tokens, function (token) { console.log(token); });

// or...
// pick ones out at random
var LP = new LazyProduct(tokens);
console.log(LP.item(7));   // the 7th from the list, without precompute
console.log(LP.item(242)); // the 242nd from the list, again without precompute

Surely I must be missing something...? Generators are simply overkill given the X(n) formula.
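The idea of computing the n-th combination directly can be sketched as a mixed-radix decode: treat n as a number whose digits are indexes into each array. This is an illustration only, not the linked implementation. The name nthCombination is hypothetical, indexes are zero-based here, and every array is assumed to be non-empty.

```javascript
// Hypothetical helper: returns the n-th (zero-based) combination without
// generating the ones before it. The last array varies fastest.
// Assumption: n is a non-negative integer and every array is non-empty.
function nthCombination(arrays, n) {
    var picked = new Array(arrays.length);
    for (var i = arrays.length - 1; i >= 0; i--) {
        var len = arrays[i].length;
        picked[i] = arrays[i][n % len]; // the current "digit"
        n = Math.floor(n / len);        // move on to the next position
    }
    if (n > 0) throw new RangeError('index out of range');
    return picked.join('');
}

var tokens = [
    ['0', '00', '01', '1', '10', '11'],
    ['@'],
    ['a', 'b', 'c', 'd', 'e', 'f']
];

console.log(nthCombination(tokens, 0));  // 0@a
console.log(nthCombination(tokens, 7));  // 00@b
console.log(nthCombination(tokens, 35)); // 11@f
```

Iterating n from 0 upwards visits the product in natural order, and the only state to keep between calls is n itself.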

Update
Into a JSFiddle I have placed an instrumented version of the lazyProduct code, a sample tokens array of arrays, and a call to lazyProduct with those tokens.

When you run the code without modification, you'll see it generates the 0@a, etc. output expected from the sample tokens array. I think the link explains the logic pretty well, but in summary: if you uncomment the instrumentation in lazyProduct, you'll notice that there are two key variables, lens and p. lens holds the precomputed length of each array (in the array of arrays) passed in. p is a stack that holds the current path up to where you are (e.g., if you're at "1st array 3rd index, 2nd array 2nd index, and 3rd array 1st index", p represents that), and this is what's passed into your callback function.

My callback function just does a join on the arguments (per your OP example), but again these are just the corresponding values in p mapped to the original array of arrays.

If you profile this further, you'll see the footprint needed to build the Cartesian product is limited to just what you need to call your callback function. Try it on one of your worst case tokens to see.
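
For readers who cannot open the fiddle, the roles of lens and p can be sketched roughly as below. This is an illustration written for this thread, not the code from the linked page or the JSFiddle: lazyProductSketch and idx are made-up names, and the odometer-style loop is just one assumed way to walk the product.

```javascript
// Hypothetical sketch of an odometer-style lazy product.
// lens: length of each input array, computed once.
// p:    the current combination, handed to the callback as its arguments.
function lazyProductSketch(sets, callback) {
    var lens = sets.map(function (s) { return s.length; });

    if (lens.length === 0 || lens.indexOf(0) !== -1) {
        return 0; // nothing to emit
    }

    var idx = lens.map(function () { return 0; }); // one counter per array
    var p = [];
    var count = 0;

    while (true) {
        for (var i = 0; i < sets.length; i++) {
            p[i] = sets[i][idx[i]];
        }

        callback.apply(null, p);
        count++;

        // Advance the odometer: the rightmost counter moves fastest.
        var k = sets.length - 1;
        while (k >= 0 && ++idx[k] === lens[k]) {
            idx[k] = 0;
            k--;
        }

        if (k < 0) {
            return count; // every counter wrapped around: done
        }
    }
}

lazyProductSketch([['0', '1'], ['@'], ['a', 'b']], function () {
    console.log(Array.prototype.join.call(arguments, ''));
});
```

The only state kept between callbacks is lens, idx and p, each as long as the number of input arrays, which is why the footprint stays small even when the product is huge.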

Update 2
I coded about 75% of an approach based on the Cartesian product. My API took a ret.js parse tree, converted that to RPN, then generated sets of sets to pass into an X(n) calculator. Using @Ruud's example of ([ab]{2}){1,2}|[cd](f|ef{0,2}e), this would be generated:

new Option([
    new Set([
        new Set(new Set(['a','b']), new Set(['a','b'])),
        new Set(new Set(['a','b']), new Set(['a','b']))
    ]),
    new Set(
        ['c', 'd'],
        new Option([
            new Set(['f']),
            new Set(['e']),
            new Set(['e','f']),
            new Set(new Set(['e','f']), new Set(['e', 'f']))
        ])
    )
])

The tricky parts were nested options (buggy) and inverse character classes & back-references (unsupported).

This approach was getting brittle, and really the Iterator solution is superior. Converting from your parse tree into that should be pretty straightforward. Thanks for the interesting problem!

bishop
  • This reminds me of [factoradics](http://en.wikipedia.org/wiki/Factorial_number_system), and it may well be exactly what I need. Let me play with it for a bit and I'll get back to you. Thanks, very interesting site you linked to. =) – Alix Axel Dec 31 '13 at 23:24
  • I've been staring at the code for a while trying to understand how it would fit my needs, but given a regex like `[a-z]{10}` I would still need to precompute and hold in memory 26^10 values, isn't that right? Only after I would be able to pass that on to `lazyProduct` - or is there another way I am missing? – Alix Axel Jan 01 '14 at 00:19
  • Nope, the only precomputation is the calculation of the length of the arrays (in the array of arrays). See my update for further details. – bishop Jan 01 '14 at 03:13
  • I guess what I'm trying to say is that since ret.js produces an n-dimensional parse tree structure I would need to flatten it to a bi-dimensional array first in order to use this method. The problem of flattening it first is that I would effectively be generating nearly all the possible combinations and that would take a huge amount of memory. I guess my question is a little misleading when I mention the `tokens` array: it is possible to do it that way, but it isn't feasible in terms of memory consumption. Does that make sense? I'll try to post an example of the ret.js parse tree in a bit. – Alix Axel Jan 01 '14 at 13:36
  • Here's the [ret.js parse tree for `a|b[a-z]{3}`](http://pastie.org/8591592). If I were to flatten and expand it to a bi-dimensional array, it would look like `[['a', 'baaa', ..., 'bzzz']]` and contain 17577 elements (effectively all of the generatable elements). Something like `[['a'], ['b'], ['a', ..., 'z'], ['a', ..., 'z'], ['a', ..., 'z']]` would be way shorter but it would yield invalid outputs (those prefixed with an `a`). – Alix Axel Jan 01 '14 at 14:36
  • Ok, I get that the ret.js parse tree is not `tokens`. But I am not seeing that as the issue in the OP. You say: "I'm able to get each individual token and produce an array of all the valid individual outputs... I can compute the Cartesian product... [t]he problem being [the Cartesian Product] holds all values in memory... I would like to [pass] each generated combination to a callback." I'm pretty sure Lazy Cartesian Product solves that problem, the OP. Now it sounds like there's a new problem -- is that a new question or an update -- not sure of protocol when there's a bounty in play? – bishop Jan 01 '14 at 14:58
  • It came out wrong then, sorry. Basically I meant to say that I *could* do it that way, but that I *shouldn't* because it's not efficient in terms of memory consumption: if you take the previous example I gave into consideration, flattening out the array would produce just as many array elements as the total lazy evaluation would (the only difference being that I would save 1 byte on 17576 elements), which is negligible given the whole context. As for the bounty, I'm happy to start another one... – Alix Axel Jan 01 '14 at 15:09
  • So, in simplest terms, is your question: "I have a ret.js parse tree. I want to generate all possible resulting strings, but I do not want to store the entire Cartesian product. Solution must use Javascript that's valid for node.js v0.11.9." ??? – bishop Jan 01 '14 at 15:13
  • Yes, that is it (didn't want to clutter the question with ret.js specifics). I'm not expecting a complete solution either; originally I was looking for an example involving generators in a way similar to how invRegex.py does it, but like you said before I'm not sure if generators are the best way to go now. – Alix Axel Jan 01 '14 at 15:21
  • Ok, I'm close to a solution for the reformulated problem per the previous comment. I'll be away from my terminal tomorrow, so maybe can post Friday. It's different from all the other responses so far, so you'll have a wide pick! Happy New Year! – bishop Jan 02 '14 at 02:44
  • Looks like you have a solution that fits your needs, so I'm bowing out of this. :) I'll update my answer with my approach. – bishop Jan 03 '14 at 21:37
  • Oh, my solution still has its bugs (references and ordering of repetitions) and quirks (compatible with only ES6). I was holding off to check your update, but thanks for the effort anyway! Truly appreciate it (and the X(n) formula might come in handy sometime soon). =) – Alix Axel Jan 04 '14 at 00:48
0

Just want to share what I came up with, using generators and based off invRegex.py:

var ret = require('ret');

var tokens = ret('([ab]) ([cd]) \\1 \\2 z');
var references = [];

capture(tokens);
// console.log(references);

for (var string of generate(tokens)) {
    console.log(string);
}

function capture(token) {
    if (Array.isArray(token)) {
        for (var i = 0; i < token.length; ++i) {
            capture(token[i]);
        }
    }

    else {
        if ((token.type === ret.types.ROOT) || (token.type === ret.types.GROUP)) {
            if ((token.type === ret.types.GROUP) && (token.remember === true)) {
                var group = [];

                if (token.hasOwnProperty('stack') === true) {
                    references.push(function* () {
                        yield* generate(token.stack);
                    });
                }

                else if (token.hasOwnProperty('options') === true) {
                    for (var generated of generate(token)) {
                        group.push(generated);
                    }

                    references.push(group);
                }
            }

            if (token.hasOwnProperty('stack') === true) {
                capture(token.stack);
            }

            else if (token.hasOwnProperty('options') === true) {
                for (var i = 0; i < token.options.length; ++i) {
                    capture(token.options[i]);
                }
            }

            return true;
        }

        else if (token.type === ret.types.REPETITION) {
            capture(token.value);
        }
    }
}

function* generate(token) {
    if (Array.isArray(token)) {
        if (token.length > 1) {
            for (var prefix of generate(token[0])) {
                for (var suffix of generate(token.slice(1))) {
                    yield prefix + suffix;
                }
            }
        }

        else {
            yield* generate(token[0]);
        }
    }

    else {
        if ((token.type === ret.types.ROOT) || (token.type === ret.types.GROUP)) {
            if (token.hasOwnProperty('stack') === true) {
                token.options = [token.stack];
            }

            for (var i = 0; i < token.options.length; ++i) {
                yield* generate(token.options[i]);
            }
        }

        else if (token.type === ret.types.POSITION) {
            yield '';
        }

        else if (token.type === ret.types.SET) {
            for (var i = 0; i < token.set.length; ++i) {
                var node = token.set[i];

                if (token.not === true) {
                    if ((node.type === ret.types.CHAR) && (node.value === 10)) {
                    }
                }

                yield* generate(node);
            }
        }

        else if (token.type === ret.types.RANGE) {
            for (var i = token.from; i <= token.to; ++i) {
                yield String.fromCharCode(i);
            }
        }

        else if (token.type === ret.types.REPETITION) {
            if (token.min === 0) {
                yield '';
            }

            for (var i = token.min; i <= token.max; ++i) {
                var stack = [];

                for (var j = 0; j < i; ++j) {
                    stack.push(token.value);
                }

                if (stack.length > 0) {
                    yield* generate(stack);
                }
            }
        }

        else if (token.type === ret.types.REFERENCE) {
            // console.log(references); // debug output, would be mixed into the generated strings
            if (references.hasOwnProperty(token.value - 1)) {
                yield* references[token.value - 1]();
                // yield references[token.value - 1]().next().value;
            }

            else {
                yield '';
            }
        }

        else if (token.type === ret.types.CHAR) {
            yield String.fromCharCode(token.value);
        }
    }
}

I still haven't figured out how to implement capturing groups / references and the values yielded in the REPETITION token type are not generated in lexicographic order yet, but other than that it works.
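
One possible way to make references yield only the value currently captured, rather than every option of the group, is to record each group's latest string in a shared array and have the reference read it back. The sketch below shows the idea on a simplified, hypothetical tree ('char', 'alt', 'seq', 'group', 'ref'), not on the ret.js token format; generateSketch, ch and captures are names invented for this example, and group indexes are assumed to be assigned in advance.

```javascript
// Hypothetical mini-AST generator; NOT the ret.js token format.
// captures[n] always holds the string group n produced most recently.
function* generateSketch(node, captures) {
    captures = captures || [];

    switch (node.type) {
    case 'char':
        yield node.value;
        break;
    case 'alt':
        for (var option of node.options) {
            yield* generateSketch(option, captures);
        }
        break;
    case 'seq':
        if (node.items.length === 0) {
            yield '';
            break;
        }
        for (var prefix of generateSketch(node.items[0], captures)) {
            var rest = { type: 'seq', items: node.items.slice(1) };
            for (var suffix of generateSketch(rest, captures)) {
                yield prefix + suffix;
            }
        }
        break;
    case 'group':
        for (var captured of generateSketch(node.node, captures)) {
            captures[node.index] = captured; // remember the current match
            yield captured;
        }
        break;
    case 'ref':
        // A single value: whatever the group holds right now.
        yield captures[node.index] !== undefined ? captures[node.index] : '';
        break;
    }
}

function ch(value) { return { type: 'char', value: value }; }

// Roughly ([ab])([cd])\1\2, with group indexes fixed up front.
var tree = { type: 'seq', items: [
    { type: 'group', index: 1, node: { type: 'alt', options: [ch('a'), ch('b')] } },
    { type: 'group', index: 2, node: { type: 'alt', options: [ch('c'), ch('d')] } },
    { type: 'ref', index: 1 },
    { type: 'ref', index: 2 }
]};

for (var s of generateSketch(tree)) {
    console.log(s); // acac, adad, bcbc, bdbd
}
```

This works because the suffix generator is only created and advanced after the prefix has been yielded, so a reference always runs after its group has stored the current value.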

Alix Axel
  • By the way, commented on your gist. – cactus1 Jan 03 '14 at 04:00
  • 1
    Good idea to push generators up the references array, but you should do so in advance, not while you are busy generating the matches. Capturing groups are numbered according to their *lexical* position; this is static and should be determined by the parser, not by the generator. By the way, in case you are still interested: I expanded my earlier answer with references too. And that was really easy. Sorry, that was a shameless plug for my own solution. ;-) – Ruud Helderman Jan 03 '14 at 21:28
  • @Ruud: Indeed, I actually noticed the exact same problem and came up with [a updated version](https://gist.github.com/alixaxel/061302f57838e92e07e6), nonetheless the `yield* references[]...` is still producing all possible options instead of the actual last generated one. Still no idea how to fix it, I will check your latest version, looks very promising! – Alix Axel Jan 04 '14 at 00:45
  • 1
    I see what you mean; I guess `yield* references[token.value - 1]();` should be replaced by something that returns just the generator's *current* value, but I could find no such a method or property on MDN; anybody got any idea? Anyway, DIY iterators give you just that little bit more control, so I'm glad you liked my approach. Please let me know if you need any more assistance. – Ruud Helderman Jan 04 '14 at 08:43
0

There are already plenty of good answers here, but I specifically wanted the generator part to work, which it did not for you. It seems that you were trying to do this:

// the alphanumeric part (alpha() and numeric() are generators defined elsewhere)
for (var x of alpha()) for (var y of numeric()) console.log(x + y);

// or as a generator itself, like you wanted
function* alphanumeric() {
    for (var x of alpha()) for (var y of numeric()) yield x + y;
}
// iterating over it
for (var i of alphanumeric()) {
    console.log(i);
}

Output:

a0
a1
b0
b1
c0
c1

You can use this for the Cartesian product required in regex matching.
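
For completeness, a self-contained version might look like the sketch below. The bodies of alpha() and numeric() are assumptions inferred from the output above (a, b, c and 0, 1). It also shows that the loop can be interrupted at any point, and values that are never requested are never generated.

```javascript
// Assumed stand-ins for the two generators used above.
function* alpha() { yield* ['a', 'b', 'c']; }
function* numeric() { yield* ['0', '1']; }

// Lazy Cartesian product of the two generators.
function* alphanumeric() {
    for (var x of alpha()) {
        for (var y of numeric()) {
            yield x + y;
        }
    }
}

// Stop after three values; the remaining combinations are never computed.
var seen = [];
for (var s of alphanumeric()) {
    seen.push(s);
    if (seen.length === 3) {
        break;
    }
}

console.log(seen.join(' ')); // a0 a1 b0
```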

user568109