
We are given a pattern string 'foo' and a source string 'foobaroofzaqofom', and we need to find all occurrences of the pattern in the source with its letters in any order. For the given example the solution looks like: ['foo', 'oof', 'ofo'].

I have a solution, but I'm not sure that it is the most efficient one:

  1. Create a hash_map of the pattern's characters, where each key is a character and each value counts how often that character occurs in the pattern. For the given example it would be {f: 1, o: 2}.
  2. Scan the source string; when a character from the hash_map is found, try to find all the remaining characters of the pattern directly after it.
  3. If all characters are found, that window is a match; if not, move forward.

Here is an implementation in C++:

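#include <set>
#include <string>
#include <unordered_map>
using namespace std;

set<string> FindSubstringPermutations(const string& s, const string& p)
{
    set<string> result;
    // An empty or over-long pattern cannot match; this also keeps the
    // unsigned size arithmetic below from wrapping around.
    if (p.empty() || p.size() > s.size())
        return result;

    // Count how often each character occurs in the pattern.
    unordered_map<char, int> um;
    for (auto ch : p)
        um[ch] += 1;

    for (size_t i = 0; i + p.size() <= s.size(); ++i)
    {
        auto it = um.find(s[i]);
        if (it != um.end())
        {
            // Work on a fresh copy of the counters for this window.
            decltype (um) um_c = um;
            um_c[s[i]] -= 1;
            for (size_t t = (i + 1); t < i + p.size(); ++t)
            {
                auto it = um_c.find(s[t]);
                if (it == um_c.end())
                    break;
                else if (it->second == 0)
                    break;
                else
                    it->second -= 1;
            }

            // All counters at zero: the window used every pattern character.
            int sum = 0;
            for (auto c : um_c)
                sum += c.second;

            if (sum == 0)
                result.insert(s.substr(i, p.size()));
        }
    }

    return result;
}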

The complexity is near O(n); I don't know how to calculate it more precisely.

So the question: is there a more efficient solution? Using a hash_map feels like a bit of a hack, and I think there may be a more efficient solution using simple arrays and flags for the found elements.

fryme

2 Answers


You could use an order-invariant hash algorithm that works with a sliding window to optimize things a bit.

An example of such a hash algorithm could be

int hash(string s){
    int result = 0;

    for(int i = 0; i < s.length(); i++)
        result += s[i];

    return result;
}

This algorithm is a bit over-simplistic and is rather horrible in every respect except performance (i.e. in distribution and number of possible hash values), but that isn't too hard to change.

The advantage with such a hash-algorithm would be:

hash("abc") == hash("acb") == hash("bac") == ...

and using a sliding-window with this algorithm is pretty simple:

string s = "abcd";

// Sliding the window "abc" -> "bcd" drops 'a' and adds 'd'.
hash(s.substr(0, 3)) + 'd' - 'a' == hash(s.substr(1, 3));

These two properties of such hashing approaches allow us to do something like this:

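#include <algorithm>
#include <numeric>
#include <string>

// Order-invariant hash: the sum of the character values.
int hash(const std::string& s){
    return std::accumulate(s.begin(), s.end(), 0);
}

// Roll the window one step: drop slideOut, add slideIn.
int slideHash(int oldHash, char slideOut, char slideIn){
    return oldHash - slideOut + slideIn;
}

// Exact check; thanks to short-circuiting it only runs when the hashes match.
bool isPermutation(const std::string& a, const std::string& b){
    return a.size() == b.size() && std::is_permutation(a.begin(), a.end(), b.begin());
}

// Returns the index of the first permuted occurrence of pattern in s, or -1.
int findPermuted(const std::string& s, const std::string& pattern){
    const std::size_t m = pattern.length();
    if(m > s.length())
        return -1;

    int patternHash = hash(pattern);
    int slidingHash = hash(s.substr(0, m));

    if(patternHash == slidingHash && isPermutation(pattern, s.substr(0, m)))
        return 0;

    for(std::size_t i = 0; i + m < s.length(); i++){
        slidingHash = slideHash(slidingHash, s[i], s[i + m]);

        if(patternHash == slidingHash && isPermutation(pattern, s.substr(i + 1, m)))
            return static_cast<int>(i + 1);
    }

    return -1;
}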

This is basically an altered version of the Rabin-Karp algorithm that works for permuted strings. The main advantage of this approach is that fewer strings actually have to be compared. This matters especially here, since the comparison itself (checking whether one string is a permutation of another) is already quite expensive.
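
The permutation check itself does not have to be costly. A minimal sketch, assuming single-byte characters: count the characters of one string in a plain array and subtract the characters of the other. The name `isPermutationByCounts` is made up for this illustration and is not part of the code above.

```cpp
#include <array>
#include <string>

// Hypothetical helper: true when a and b contain the same characters
// with the same multiplicities, in any order.
bool isPermutationByCounts(const std::string& a, const std::string& b) {
    if (a.size() != b.size())
        return false;

    std::array<int, 256> counts{};   // one counter per possible byte value
    for (unsigned char c : a)
        ++counts[c];
    for (unsigned char c : b)
        if (--counts[c] < 0)         // b has more of c than a does
            return false;

    return true;                     // equal lengths and no surplus: same multiset
}
```

This runs in time linear in the string length, and it rejects pairs such as "abc" and "bbb" that the character-sum hash cannot tell apart.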

NOTE:
The above code is only meant as a demonstration of an idea. It aims to be easy to understand rather than fast and shouldn't be used directly.

EDIT:
The above "implementation" of an order-invariant rolling hash algorithm shouldn't be used, since it performs extremely poorly in terms of data distribution. This kind of hash has a few inherent restrictions: the only thing the hash can be generated from is the actual value of the characters (no indices!), and those values need to be accumulated using a reversible operation.

A better approach would be to map each character to a prime (don't use 2!!!, since 2 has no multiplicative inverse modulo a power of two). Since all operations are modulo 2^(8 * sizeof(hashtype)) (integer overflow), we need to generate a table of the multiplicative inverses modulo 2^(8 * sizeof(hashtype)) for all used primes. I won't cover generating these tables, as there are plenty of resources available on that topic here already.

The final hash would then look like this:

map<char, int> primes = generatePrimTable();
map<int, int> inverse = generateMultiplicativeInverses(primes);

unsigned int hash(string s){
    unsigned int hash = 1;
    for(int i = 0; i < s.length(); i++)
        hash *= primes[s[i]];

    return hash;
}

unsigned int slideHash(unsigned int oldHash, char slideOut, char slideIn){
    return oldHash * inverse[primes[slideOut]] * primes[slideIn];
}

Keep in mind that this solution works with unsigned integers.
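
To make the idea concrete, here is a rough, self-contained sketch under a few assumptions that are not in the answer: the hash type is a 32-bit unsigned integer, only 'a'..'z' get a prime (the first 26 odd primes), every other character maps to 1, and the inverse is computed on the fly with Newton's iteration instead of being read from a precomputed table. All names (`primeOf`, `inverseMod2_32`, `sameLetters`, `findPermutedPrime`) are made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: first 26 odd primes for 'a'..'z'; 2 is skipped on purpose.
static const std::uint32_t kOddPrimes[26] = {
    3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
    47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103};

// Characters outside 'a'..'z' map to 1 and so do not change the hash.
std::uint32_t primeOf(char c) {
    return (c >= 'a' && c <= 'z') ? kOddPrimes[c - 'a'] : 1u;
}

// Multiplicative inverse of an odd value modulo 2^32 (Newton's iteration).
std::uint32_t inverseMod2_32(std::uint32_t a) {
    std::uint32_t x = a;      // a * a == 1 (mod 8) for odd a: 3 correct bits
    for (int i = 0; i < 5; ++i)
        x *= 2u - a * x;      // each step doubles the number of correct bits
    return x;
}

// Hypothetical exact check, run only when the hashes agree.
bool sameLetters(std::string a, std::string b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return a == b;
}

// Returns the start index of every window of s that is a permutation of p.
std::vector<std::size_t> findPermutedPrime(const std::string& s, const std::string& p) {
    std::vector<std::size_t> hits;
    const std::size_t m = p.size();
    if (m == 0 || m > s.size())
        return hits;

    std::uint32_t target = 1, rolling = 1;
    for (char c : p)
        target *= primeOf(c);                      // wraps modulo 2^32
    for (std::size_t i = 0; i < m; ++i)
        rolling *= primeOf(s[i]);

    for (std::size_t i = 0; ; ++i) {
        if (rolling == target && sameLetters(s.substr(i, m), p))
            hits.push_back(i);
        if (i + m >= s.size())
            break;
        rolling *= inverseMod2_32(primeOf(s[i]));  // slide the old character out
        rolling *= primeOf(s[i + m]);              // slide the new character in
    }
    return hits;
}
```

With the strings from the question, `findPermutedPrime("foobaroofzaqofom", "foo")` should report the windows starting at 0, 6 and 12. In real use the inverses would more likely come from a lookup table, as described above, rather than being recomputed at every step.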

Paul
  • Your hash function is absolutely wrong: ABC and BBB have the same value. More generally, you can never hash a big string into a small integer without collisions. And after all, hashing does not always do the magic (at least in the worst case): resolving the collisions with buckets may still leave many strings in one bucket. – Saeed Amiri Jan 11 '17 at 15:53
  • Also, it is not O(n) at all; it is O(nm), where n is the size of the original string and m is the size of the smaller string. This answer is a perfect example of using a good idea in the wrong place. A simple brute-force algorithm is also O(nm). – Saeed Amiri Jan 11 '17 at 15:59
  • @SaeedAmiri I've mentioned it quite a few times in the post: that hash algorithm should demonstrate an idea and be easy to understand, not actually get used. Arguing that a hashing algorithm doesn't work because there are collisions is a bit of a weird claim. As for `O(n)`, I have absolutely no clue where you got that claim from, but definitely not from my answer. As for the approach itself, this is a variation of the Rabin-Karp algorithm, which is designed to find a short pattern within a long string, so I don't see why this should not be a case for using it. – Paul Jan 11 '17 at 16:27
  • @SaeedAmiri As for `O(nm)`: even Boyer-Moore doesn't get beyond `O(nm)` as worst case. Oh and last but not least: there's another [answer](http://stackoverflow.com/a/41593241/4668606) that uses the precise same approach with a proper hash-algo. – Paul Jan 11 '17 at 16:28
  • The point is that you didn't even demonstrate the main idea. Take a look at your algorithm again: you are calling 's.substring(i + 1, pattern.length()' O(n) times. That means the algorithm, or the idea you suggested, is O(nm), while the whole technique of Rabin-Karp is to avoid checking all such substrings. Otherwise it is not wise to go to so much trouble only to end up with an algorithm that doesn't work properly and has the same running time as a trivial correct algorithm. – Saeed Amiri Jan 11 '17 at 17:15
  • BTW I'll check the other answer as well – Saeed Amiri Jan 11 '17 at 17:17
  • @SaeedAmiri well, it's not made entirely clear in the answer, but the substring won't even be called most of the time. Probably it's just that I don't know any languages that don't have this feature, but I simply used boolean short-circuiting. So the `s.substring(i + 1, ...)` won't be called unless the hashes match. I'll make that a bit more clear. – Paul Jan 11 '17 at 17:20
  • Aha I see your point, I read it as an algorithm and didn't think about possible short circuit. Now per your edit it is more clear. – Saeed Amiri Jan 11 '17 at 17:26
  • BTW, my suggestion is at least at the end, once you explained the idea, mention a good hash function. – Saeed Amiri Jan 11 '17 at 17:27
  • @SaeedAmiri I didn't think about this to be honest. I'm just that used to using short-circuiting that I do it without even thinking about it. Yeah, I guess it'd be the minimum to mention a hashing-algo that could actually be used here. I'll edit. – Paul Jan 11 '17 at 17:29

A typical rolling hash function for anagrams

  • using product of primes
  • This will only work for relatively short patterns
  • The hash values for almost all normal words will fit into a 64-bit value without overflow.
  • Based on this anagram matcher
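
As far as I can tell, the "short patterns" restriction comes from the division used to slide a character out of the hash: it only undoes the multiplication exactly if the product never wrapped around. A small sketch of a check for that, using the same letter-to-prime table as the program below; the function name `productFitsIn64Bits` is made up for this illustration.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Same letter-to-prime table as primes26 in the program below.
static const unsigned char kPrimes26[26] = {
    5, 71, 79, 19, 2, 83, 31, 43, 11, 53, 37, 23, 41,
    3, 13, 73, 101, 17, 29, 7, 59, 47, 61, 97, 89, 67};

// Hypothetical helper: true when the product of primes for word
// fits into an unsigned 64-bit value without wrapping around.
bool productFitsIn64Bits(const std::string& word) {
    std::uint64_t val = 1;
    for (unsigned char ch : word) {
        std::uint64_t prime = 1;                 // non-letters contribute 1
        if (ch >= 'A' && ch <= 'Z') prime = kPrimes26[ch - 'A'];
        else if (ch >= 'a' && ch <= 'z') prime = kPrimes26[ch - 'a'];

        // Multiplying would exceed the 64-bit range: report overflow.
        if (val > std::numeric_limits<std::uint64_t>::max() / prime)
            return false;
        val *= prime;
    }
    return true;
}
```

For example, "foo" fits easily, while ten consecutive 'q' characters (prime 101) already exceed 64 bits.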

/* braek; */
/* 'foobaroofzaqofom' */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

typedef unsigned long long HashVal;
static HashVal hashchar (unsigned char ch);
static HashVal hashmem (void *ptr, size_t len);

unsigned char primes26[] =
{ 5,71,79,19,2,83,31,43,11,53,37,23,41,3,13,73,101,17,29,7,59,47,61,97,89,67, };
/*********************************************/
static HashVal hashchar (unsigned char ch)
{
HashVal val=1;

if (ch >= 'A' && ch <= 'Z' ) val = primes26[ ch - 'A'];
else if (ch >= 'a' && ch <= 'z' ) val = primes26[ ch - 'a'];

return val;
}

static HashVal hashmem (void *ptr, size_t len)
{
size_t idx;
unsigned char *str = ptr;
HashVal val=1;

if (!len) return 0;
for (idx = 0; idx < len; idx++) {
        val *= hashchar ( str[idx] );
        }

return val;
}
/*********************************************/


unsigned char buff [4096];
int main (int argc, char **argv)
{
size_t patlen,len,pos,rotor;
int ch;
HashVal patval;
HashVal rothash=1;

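/* a missing or empty pattern would crash below (argv[1], % patlen) */
if (argc < 2 || !argv[1][0]) { fprintf(stderr, "usage: %s pattern < text\n", argv[0]); return 1; }
patlen = strlen(argv[1]);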
patval = hashmem( argv[1], patlen);
// fprintf(stderr, "Pat=%s, len=%zu, Hash=%llx\n", argv[1], patlen, patval);

for (rotor=pos=len =0; ; len++) {
        ch=getc(stdin);
        if (ch == EOF) break;

        if (ch < 'A' || ch > 'z') { pos = 0; rothash = 1; continue; }
        if (ch > 'Z' && ch < 'a') { pos = 0; rothash = 1; continue; }
                /* remove old char from rolling hash */
        if (pos >= patlen) { rothash /= hashchar(buff[rotor]); }
                /* add new char to rolling hash */
        buff[rotor] = ch;
        rothash *= hashchar(buff[rotor]);

        // fprintf(stderr, "%zu: [rot=%zu]pos=%zu, Hash=%llx\n", len, rotor, pos, rothash);

        rotor = (rotor+1) % patlen;
                /* matched enough characters ? */
        if (++pos < patlen) continue;
                /* correct hash value ? */
        if (rothash != patval) continue;
                /* len is the offset of the last character of the match */
        fprintf(stdout, "Pos=%zu\n", len);
        }

return 0;
}

Output/result:


$ ./a.out foo < anascan.c
Pos=21
Pos=27
Pos=33

Update: for people who don't like the product of primes, here is a taxicab-number style sum-of-cubes (plus additional histogram check) implementation. This is also supposed to be 8-bit clean. Note that the cubes are not necessary; it works equally well with squares, or with just the sum (the final histogram check will then have some more work to do).


/* braek; */
/*  'foobaroofzaqofom' */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

typedef unsigned long long HashVal;
static HashVal hashchar (unsigned char ch);
static HashVal hashmem (void *ptr, size_t len);

/*********************************************/
static HashVal hashchar (unsigned char ch)
{
HashVal val=1+ch;

return val*val*val;
}

static HashVal hashmem (void *ptr, size_t len)
{
size_t idx;
unsigned char *str = ptr;
HashVal val=1;

if (!len) return 0;
for (idx = 0; idx < len; idx++) {
        val += hashchar ( str[idx] );
        }

return val;
}
/*********************************************/
int main (int argc, char **argv)
{
size_t patlen,len,rotor;
int ch;
HashVal patval;
HashVal rothash=1;
unsigned char *patstr;
unsigned pathist[256] = {0};
unsigned rothist[256] = {0};
unsigned char cycbuff[1024];

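/* a missing or empty pattern would crash below (argv[1], % patlen) */
if (argc < 2 || !argv[1][0]) { fprintf(stderr, "usage: %s pattern < text\n", argv[0]); return 1; }
patstr = (unsigned char*) argv[1];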
patlen = strlen((const char*) patstr);
patval = hashmem( patstr, patlen);

for(rotor=0; rotor < patlen; rotor++) {
        pathist [ patstr[rotor] ] += 1;
        }
fprintf(stderr, "Pat=%s, len=%zu, Hash=%llx\n", argv[1], patlen, patval);

for (rotor=len =0; ; len++) {
        ch=getc(stdin);
        if (ch == EOF) break;

                /* remove old char from rolling hash */
        if (len >= patlen) {
                rothash -= hashchar(cycbuff[rotor]);
                rothist [ cycbuff[rotor] ] -= 1;
                }
                /* add new char to rolling hash */
        cycbuff[rotor] = ch;
        rothash += hashchar(cycbuff[rotor]);
        rothist [ cycbuff[rotor] ] += 1;

        // fprintf(stderr, "%zu: [rot=%zu], Hash=%llx\n", len, rotor, rothash);

        rotor = (rotor+1) % patlen;
                /* matched enough characters ? (len is the index of the char just read) */
        if (len + 1 < patlen) continue;
                /* correct hash value ? */
        if (rothash != patval) continue;
                /* correct histogram? */
        if (memcmp(rothist,pathist, sizeof pathist)) continue;
                /* report the 0-based start of the matching window */
        fprintf(stdout, "Pos=%zu\n", len + 1 - patlen);
        }

return 0;
}
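
The histogram check can also stand on its own, without any hash, which is close to the "simple arrays" idea from the question. A minimal C++ sketch, assuming single-byte characters: keep one array with the difference between the window's counts and the pattern's counts, and track how many entries are non-zero, so each step costs constant time. The name `findPermutationsByCounts` is made up for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Returns the start index of every window of s that is a permutation of p.
std::vector<std::size_t> findPermutationsByCounts(const std::string& s, const std::string& p) {
    std::vector<std::size_t> hits;
    const std::size_t m = p.size();
    if (m == 0 || m > s.size())
        return hits;

    // diff[c] = (occurrences of c in the window) - (occurrences of c in p)
    std::array<int, 256> diff{};
    std::size_t nonZero = 0;              // number of entries of diff that are not 0

    auto add = [&](unsigned char c, int delta) {
        if (diff[c] == 0) ++nonZero;      // entry is about to leave zero
        diff[c] += delta;
        if (diff[c] == 0) --nonZero;      // entry came back to zero
    };

    for (unsigned char c : p)
        add(c, -1);
    for (std::size_t i = 0; i < m; ++i)
        add(static_cast<unsigned char>(s[i]), +1);
    if (nonZero == 0)
        hits.push_back(0);

    for (std::size_t i = m; i < s.size(); ++i) {
        add(static_cast<unsigned char>(s[i]), +1);      // character entering the window
        add(static_cast<unsigned char>(s[i - m]), -1);  // character leaving the window
        if (nonZero == 0)
            hits.push_back(i - m + 1);
    }
    return hits;
}
```

For the strings from the question, `findPermutationsByCounts("foobaroofzaqofom", "foo")` should return the start positions 0, 6 and 12. Since every character is added once and removed once, the running time is linear in the length of the source string plus the pattern, independent of the alphabet size.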

wildplasser
  • If you provide an algorithm that uses multiplication of primes and only works for small patterns, then what is the purpose of your algorithm? A naive algorithm is much faster both in theory and in practice. – Saeed Amiri Jan 11 '17 at 17:23
  • That's what I said: it only works for reasonable sized search terms, like words in a text. And naive may be faster in practice, but not in theory. Mine is O(N), naive is O(N*M), with M the number of permutations in the search term. – wildplasser Jan 11 '17 at 17:58
  • Reasonable size: when was the last time you used a word of 20 or more characters in a text? And the performance of mult/div is not that bad nowadays. – wildplasser Jan 11 '17 at 20:02
  • I don't think a pattern is necessarily a single word. In fact it can be a set of keywords. Also, if you assume words of a small size like 20, then the naive algorithm is already O(n). BTW, your approach, except for your hash function, is not very bad. – Saeed Amiri Jan 11 '17 at 21:57