6

Suppose I have a list of events, for example A, D, T, H, U, A, B, F, H, ....

What I need is to find the frequent patterns that occur in this complete sequence. Traditional algorithms like Apriori or FP-Growth cannot be used here because they require separate itemsets (transactions), and I cannot break this stream into smaller sets.

Any idea which algorithm would work for me?


EDIT

For example, take the sequence A, D, T, H, U, A, D, T, H, T, H, U, A, H, T, H with min_support = 2.

The frequent patterns will be:

Of length 1 --> [A, D, T, H, U]
Of length 2 --> [AD, DT, TH, HU, UA, HT]
Of length 3 --> [ADT, DTH, THU, HUA, HTH]
Of length 4 --> [ADTH, THUA]
No frequent sequences of length 5 or more
maraca
Haris
  • I think the question is far too broad, but as a first guess, you might want to have a look at [iSAX](http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html) – Marco13 Oct 18 '15 at 11:25
  • I just want to find frequent patterns of all lengths in that one large stream. I could not find anything on the Internet after searching a lot. – Haris Oct 18 '15 at 11:28
  • ["String" compression](https://en.wikipedia.org/wiki/Lossless_compression#General_purpose) algorithms try to capitalise on (at least locally) predictable non-uniformity in sequence probability. – greybeard Nov 09 '15 at 09:20
  • @greybeard, I didn't get you completely. Can you explain a little more, please? – Haris Nov 09 '15 at 16:03
  • Far as I remember, J.A. Storer was the one introducing the no(ta)tion of "text contraction" using _Original Pointer Macros_ (OPM), _External Pointer Macros_, the combination thereof, and _Compress Pointer Macros_ (EPM, OEPM, CPM) - the optimal use of all of which has been proven to be intractable. (Macro: (start, length)). Of the variations and restrictions, using original pointers in one direction only allowed a linear solution starting at the other end; information about possible targets coming from a suffix tree. (It's been a couple of decades, a suffix array might be to-day's choice.) – greybeard Nov 10 '15 at 09:08
  • Can you please edit your question and add some sample data and the frequent patterns in it? – displayName Nov 12 '15 at 20:05
  • @displayName, Edited. – Haris Nov 13 '15 at 05:20
  • It would be interesting to know what the sizes and limits are. Is the stream very long? How much memory is available for sequence storage? How many passes are acceptable? The LZ compression algorithm produces a data structure for repeating sequences; however, it is tuned for a sliding window. – mksteve Nov 15 '15 at 15:19

4 Answers

2

You can try the Aho-Corasick algorithm, with a wildcard and/or just with all substrings. Aho-Corasick is basically a finite state machine: it needs a dictionary, but then it finds multiple patterns in the search string very fast. You can build the finite state machine from a trie plus a breadth-first search. Here is a nice example with an animation: http://blog.ivank.net/aho-corasick-algorithm-in-as3.html. So you basically need two steps: build the finite state machine, then search the string.
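
Below is a minimal sketch of those two steps in plain JavaScript (my own illustration, not code from the linked article; the helper names are made up). It assumes the candidate patterns (the dictionary) are already known, inserts them into a trie, adds the failure links with a breadth-first search, and then scans the stream once while counting every match.

// Build the automaton: a trie whose nodes get failure links via BFS.
function buildAhoCorasick(patterns) {
    var root = { children: {}, fail: null, output: [] };
    // Step 1: insert every pattern into the trie.
    patterns.forEach(function (pattern) {
        var node = root;
        for (var i = 0; i < pattern.length; i++) {
            var c = pattern[i];
            if (!node.children[c]) {
                node.children[c] = { children: {}, fail: null, output: [] };
            }
            node = node.children[c];
        }
        node.output.push(pattern);
    });
    // Step 2: breadth-first search to set the failure links.
    var queue = [];
    Object.keys(root.children).forEach(function (c) {
        root.children[c].fail = root;
        queue.push(root.children[c]);
    });
    while (queue.length) {
        var current = queue.shift();
        Object.keys(current.children).forEach(function (c) {
            var child = current.children[c];
            var fallback = current.fail;
            while (fallback && !fallback.children[c]) {
                fallback = fallback.fail;
            }
            child.fail = (fallback && fallback.children[c]) ? fallback.children[c] : root;
            // Matches ending at the failure node also end here.
            child.output = child.output.concat(child.fail.output);
            queue.push(child);
        });
    }
    return root;
}

// Scan the stream once and count how often each pattern occurs.
function countMatches(stream, patterns) {
    var root = buildAhoCorasick(patterns);
    var counts = {};
    patterns.forEach(function (pattern) { counts[pattern] = 0; });
    var node = root;
    for (var i = 0; i < stream.length; i++) {
        var c = stream[i];
        while (node !== root && !node.children[c]) {
            node = node.fail;
        }
        node = node.children[c] || root;
        node.output.forEach(function (pattern) { counts[pattern]++; });
    }
    return counts;
}

console.log(countMatches('ADTHUADTHTHUAHTH', ['ADT', 'DTH', 'THU', 'HUA', 'TH']));
// { ADT: 2, DTH: 2, THU: 2, HUA: 2, TH: 4 }

For this question you would still have to choose which candidate substrings to put into the dictionary, e.g. all substrings up to some maximum length, or only those that survive a first counting pass.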

Gigamegs
  • It's very close to building a *suffix tree* for all the possible substrings, and then using that to check for patterns later. Actually, that is what I am considering. – Haris Nov 13 '15 at 05:08
0

You can generate all possible substrings, e.g.:

A
AD
ADT
ADTH
...
D
DT
DTH
...

Now the question is: does the order of elements in the smaller substrings matter?

If not, you can try running standard association mining algorithms.

If yes, then the order matters in the whole sequence and its subsequences, which makes this a signal-processing or time-series problem. But even if the order matters, we can continue analyzing this way, with all substrings: we can try matching them, exactly or fuzzily, and so on.
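
As a small illustration of the enumeration described above (my own sketch; it keeps the events in an array of tokens, so multi-character event names would also work):

var events = ['A', 'D', 'T', 'H', 'U', 'A', 'B', 'F', 'H'];
var allSubstrings = [];
// Every contiguous run events[start .. end - 1] is one candidate substring.
for (var start = 0; start < events.length; start++) {
    for (var end = start + 1; end <= events.length; end++) {
        allSubstrings.push(events.slice(start, end).join(''));
    }
}
console.log(allSubstrings); // A, AD, ADT, ..., D, DT, DTH, ...

For a sequence of n events this yields n(n + 1) / 2 substrings, in line with the roughly n^2 figure mentioned in the comments below.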

dimm
  • Won't that take a lot of time for a very big sequence? Just generating all possible substrings will take exponential time. – Haris Oct 18 '15 at 16:00
  • There are n^2 substrings. I think it's feasible. – dimm Oct 19 '15 at 09:26
  • That seems feasible, but I need to store each sequence with its frequency of occurrence to select the optimal one. – Haris Oct 21 '15 at 10:02
0

That is a particular variation of frequent itemset mining, known as sequential pattern mining.

If you look for this topic, you will find literally dozens of algorithms.

There are GSP, SPADE, PrefixSpan, and many more.
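
For contiguous patterns in a single sequence, as in the question's example, the level-wise idea behind these algorithms can be boiled down to a very small sketch (my own simplification, not an implementation of GSP, SPADE or PrefixSpan): only look for patterns of length k + 1 while some pattern of length k is still frequent, because a longer contiguous pattern can never occur more often than its own prefix.

// Level-wise mining of contiguous patterns (a hypothetical helper, not a library API).
function minePatterns(events, minSupport) {
    var results = [];
    // Count every window of length k in the event list.
    function countByLength(k) {
        var counts = {};
        for (var i = 0; i + k <= events.length; i++) {
            var key = events.slice(i, i + k).join('');
            counts[key] = (counts[key] || 0) + 1;
        }
        return counts;
    }
    var k = 1;
    var frequent;
    do {
        var counts = countByLength(k);
        frequent = Object.keys(counts).filter(function (p) {
            return counts[p] >= minSupport;
        });
        if (frequent.length) {
            results.push(frequent); // frequent patterns of length k
        }
        k++;
    } while (frequent.length);
    return results;
}

var events = ['A', 'D', 'T', 'H', 'U', 'A', 'D', 'T', 'H', 'T', 'H', 'U', 'A', 'H', 'T', 'H'];
console.log(minePatterns(events, 2));

For the question's sequence with min_support = 2 this stops after length 4, because no window of length 5 occurs twice.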

Has QUIT--Anony-Mousse
0

Here's a simple algorithm (in JavaScript) that will generate a count of all substrings.

Keep a count of substring occurrences in a dictionary. Iterate over every possible substring in the stream, and if it is already in the dictionary, increment it, otherwise add it with a value of 1.

var stream = 'FOOBARFOO';
var substrings = {};            // substring -> number of occurrences
var minimumSubstringLength = 2;

// j is the start index and i the (exclusive) end index of each substring,
// so only substrings of at least minimumSubstringLength are counted.
for (var i = 1; i <= stream.length; i++) {
    for (var j = 0; j <= i - minimumSubstringLength; j++) {
        var substring = stream.substring(j, i);
        substrings[substring] ? substrings[substring]++ : substrings[substring] = 1;
    }
}

Then use a sorting algorithm to order the dictionary by its values.
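
For example (a minimal sketch; the min_support filter is my addition, mirroring the threshold used in the question):

// Keep only substrings that occur at least minSupport times,
// then sort them by their counts, most frequent first.
var minSupport = 2;
var ranked = Object.keys(substrings)
    .filter(function (s) { return substrings[s] >= minSupport; })
    .sort(function (a, b) { return substrings[b] - substrings[a]; });
console.log(ranked); // for 'FOOBARFOO' this keeps 'FO', 'FOO' and 'OO', each occurring twice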

James Brierley
  • Yes, that's already been suggested, but I want something more efficient than brute force. – Haris Nov 09 '15 at 16:23