6

Suppose I have a list of events, for example A, D, T, H, U, A, B, F, H, ....

What I need is to find the frequent patterns that occur in this complete sequence. Traditional algorithms like Apriori or FP-Growth cannot be used here because they require separate itemsets (transactions), and I cannot break this stream into smaller sets.

Any idea which algorithm would work for me?


EDIT

For example, take the sequence A, D, T, H, U, A, D, T, H, T, H, U, A, H, T, H with min_support = 2.

The frequent patterns will be:

Of length 1 --> [A, D, T, H, U]
Of length 2 --> [AD, DT, TH, HU, UA, HT]
Of length 3 --> [ADT, DTH, THU, HUA, HTH]
Of length 4 --> [ADTH, THUA]
No frequent sequences of length 5 or more
maraca
Haris
  • I think the question is far too broad, but as a first guess, you might want to have a look at [iSAX](http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html) – Marco13 Oct 18 '15 at 11:25
  • I just want to find frequent patterns of all lengths in that one large stream. I could not find anything on the Internet after searching a lot. – Haris Oct 18 '15 at 11:28
  • ["String" compression](https://en.wikipedia.org/wiki/Lossless_compression#General_purpose) algorithms try to capitalise on (at least locally) predictable non-uniformity in sequence probability. – greybeard Nov 09 '15 at 09:20
  • @greybeard, I didn't get you completely. Can you explain a little more, please? – Haris Nov 09 '15 at 16:03
  • Far as I remember, J.A. Storer was the one introducing the no(ta)tion of "text contraction" using _Original Pointer Macros_ (OPM), _External Pointer Macros_, the combination thereof, and _Compress Pointer Macros_ (EPM, OEPM, CPM) - the optimal use of all of which has been proven to be intractable. (Macro: (start, length)). Of the variations and restrictions, using original pointers in one direction only allowed a linear solution starting at the other end; information about possible targets coming from a suffix tree. (It's been a couple of decades, a suffix array might be to-day's choice.) – greybeard Nov 10 '15 at 09:08
  • Can you please edit your question and add some sample data and the frequent patterns in it? – displayName Nov 12 '15 at 20:05
  • @displayName, Edited. – Haris Nov 13 '15 at 05:20
  • It would be interesting to know what the sizes and limits are. Is the stream very long? How much memory is available for sequence storage? How many passes are acceptable? The LZ compression algorithm produces a data structure for repeating sequences; however, it is tuned for a sliding window. – mksteve Nov 15 '15 at 15:19

4 Answers

2

You can try the Aho-Corasick algorithm, with a wildcard and/or just with all substrings. Aho-Corasick is basically a finite state machine: it needs a dictionary, but then it finds multiple patterns in the search string very fast. You can build the finite state machine from a trie plus a breadth-first search. Here is a nice example with an animation: http://blog.ivank.net/aho-corasick-algorithm-in-as3.html. So you basically need two steps: build the finite state machine, then search the string.
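
Below is a minimal sketch of those two steps in plain JavaScript (my own illustration, not code from the linked article; the helper names are made up). It assumes the candidate patterns (the dictionary) are already known, inserts them into a trie, adds the failure links with a breadth-first search, and then scans the stream once while counting every match.

// Build the automaton: a trie whose nodes get failure links via BFS.
function buildAhoCorasick(patterns) {
    var root = { children: {}, fail: null, output: [] };
    // Step 1: insert every pattern into the trie.
    patterns.forEach(function (pattern) {
        var node = root;
        for (var i = 0; i < pattern.length; i++) {
            var c = pattern[i];
            if (!node.children[c]) {
                node.children[c] = { children: {}, fail: null, output: [] };
            }
            node = node.children[c];
        }
        node.output.push(pattern);
    });
    // Step 2: breadth-first search to set the failure links.
    var queue = [];
    Object.keys(root.children).forEach(function (c) {
        root.children[c].fail = root;
        queue.push(root.children[c]);
    });
    while (queue.length) {
        var current = queue.shift();
        Object.keys(current.children).forEach(function (c) {
            var child = current.children[c];
            var fallback = current.fail;
            while (fallback && !fallback.children[c]) {
                fallback = fallback.fail;
            }
            child.fail = (fallback && fallback.children[c]) ? fallback.children[c] : root;
            // Matches ending at the failure node also end here.
            child.output = child.output.concat(child.fail.output);
            queue.push(child);
        });
    }
    return root;
}

// Scan the stream once and count how often each pattern occurs.
function countMatches(stream, patterns) {
    var root = buildAhoCorasick(patterns);
    var counts = {};
    patterns.forEach(function (pattern) { counts[pattern] = 0; });
    var node = root;
    for (var i = 0; i < stream.length; i++) {
        var c = stream[i];
        while (node !== root && !node.children[c]) {
            node = node.fail;
        }
        node = node.children[c] || root;
        node.output.forEach(function (pattern) { counts[pattern]++; });
    }
    return counts;
}

console.log(countMatches('ADTHUADTHTHUAHTH', ['ADT', 'DTH', 'THU', 'HUA', 'TH']));
// { ADT: 2, DTH: 2, THU: 2, HUA: 2, TH: 4 }

For this question you would still have to choose which candidate substrings to put into the dictionary, e.g. all substrings up to some maximum length, or only those that survive a first counting pass.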

Gigamegs
  • It's very close to building a *suffix tree* for all the possible substrings, and then using that to check for patterns later. Actually, that is what I am considering. – Haris Nov 13 '15 at 05:08
0

You can generate all possible substrings, e.g.:

A
AD
ADT
ADTH
...
D
DT
DTH
...

Now the question is: does the order of elements in the smaller substrings matter?

If not, you can try running standard association mining algorithms.

If yes, then the order matters in the whole sequence and its subsequences, which makes this a signal-processing or time-series problem. But even if the order matters, we can continue analyzing this way, with all substrings: we can try matching them, exactly or fuzzily, and so on.
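
As a small illustration of the enumeration described above (my own sketch; it keeps the events in an array of tokens, so multi-character event names would also work):

var events = ['A', 'D', 'T', 'H', 'U', 'A', 'B', 'F', 'H'];
var allSubstrings = [];
// Every contiguous run events[start .. end - 1] is one candidate substring.
for (var start = 0; start < events.length; start++) {
    for (var end = start + 1; end <= events.length; end++) {
        allSubstrings.push(events.slice(start, end).join(''));
    }
}
console.log(allSubstrings); // A, AD, ADT, ..., D, DT, DTH, ...

For a sequence of n events this yields n(n + 1) / 2 substrings, in line with the roughly n^2 figure mentioned in the comments below.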

dimm
  • Won't that take a lot of time for a very big sequence? Just generating all possible substrings will take exponential time. – Haris Oct 18 '15 at 16:00
  • There are n^2 substrings. I think it's feasible. – dimm Oct 19 '15 at 09:26
  • That seems feasible, but I need to store each sequence with its frequency of occurrence to select the optimal one. – Haris Oct 21 '15 at 10:02
0

That is a particular variation of frequent itemset mining, known as sequential pattern mining.

If you look for this topic, you will find literally dozens of algorithms.

There are GSP, SPADE, PrefixSpan, and many more.
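
For contiguous patterns in a single sequence, as in the question's example, the level-wise idea behind these algorithms can be boiled down to a very small sketch (my own simplification, not an implementation of GSP, SPADE or PrefixSpan): only look for patterns of length k + 1 while some pattern of length k is still frequent, because a longer contiguous pattern can never occur more often than its own prefix.

// Level-wise mining of contiguous patterns (a hypothetical helper, not a library API).
function minePatterns(events, minSupport) {
    var results = [];
    // Count every window of length k in the event list.
    function countByLength(k) {
        var counts = {};
        for (var i = 0; i + k <= events.length; i++) {
            var key = events.slice(i, i + k).join('');
            counts[key] = (counts[key] || 0) + 1;
        }
        return counts;
    }
    var k = 1;
    var frequent;
    do {
        var counts = countByLength(k);
        frequent = Object.keys(counts).filter(function (p) {
            return counts[p] >= minSupport;
        });
        if (frequent.length) {
            results.push(frequent); // frequent patterns of length k
        }
        k++;
    } while (frequent.length);
    return results;
}

var events = ['A', 'D', 'T', 'H', 'U', 'A', 'D', 'T', 'H', 'T', 'H', 'U', 'A', 'H', 'T', 'H'];
console.log(minePatterns(events, 2));

For the question's sequence with min_support = 2 this stops after length 4, because no window of length 5 occurs twice.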

Has QUIT--Anony-Mousse
0

Here's a simple algorithm (in JavaScript) that will generate a count of all substrings.

Keep a count of substring occurrences in a dictionary. Iterate over every possible substring in the stream, and if it is already in the dictionary, increment it, otherwise add it with a value of 1.

var stream = 'FOOBARFOO';
var substrings = {};            // substring -> number of occurrences
var minimumSubstringLength = 2;

// j is the start index and i the (exclusive) end index of each substring,
// so only substrings of at least minimumSubstringLength are counted.
for (var i = 1; i <= stream.length; i++) {
    for (var j = 0; j <= i - minimumSubstringLength; j++) {
        var substring = stream.substring(j, i);
        substrings[substring] ? substrings[substring]++ : substrings[substring] = 1;
    }
}

Then use a sorting algorithm to order the dictionary by its values.
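
For example (a minimal sketch; the min_support filter is my addition, mirroring the threshold used in the question):

// Keep only substrings that occur at least minSupport times,
// then sort them by their counts, most frequent first.
var minSupport = 2;
var ranked = Object.keys(substrings)
    .filter(function (s) { return substrings[s] >= minSupport; })
    .sort(function (a, b) { return substrings[b] - substrings[a]; });
console.log(ranked); // for 'FOOBARFOO' this keeps 'FO', 'FOO' and 'OO', each occurring twice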

James Brierley
  • Yes, that's already been suggested, but I want something more efficient than brute force. – Haris Nov 09 '15 at 16:23