A possible algorithm for determining whether two strings are anagrams of one another?

Question

I have this idea (in C) for checking whether two strings of ASCII letters are anagrams of one another:

Check if the strings are the same length.
Check if the sum of the ASCII values of all chars is the same for both strings.
Check if the product of the ASCII values of all chars is the same for both strings.

I believe that if all three checks pass, then the strings must be anagrams of one another. However, I can't prove it. Can someone help me prove or disprove that this works?

Thanks!

Proof: solve the system of the two equations. It's overspecified. If it has a solution, then it has to be trivial. — , Feb 06 '13 at 21:31
It is not that trivial, because the number of parameters is not constant. If I prove it for 3 parameters, that doesn't mean it will work for 7 parameters. At least, I have no idea how to do so. — Alex Goltser, Feb 06 '13 at 21:37
"trivial" means all-zero in this context. I was not telling you off. — , Feb 06 '13 at 21:38
It's definitely not overspecified. The question he's asking is totally legit: namely, is it underspecified? If not, prove it. — thang, Feb 06 '13 at 22:12
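
The three-step check proposed in the question can be sketched as follows. This is a minimal Python sketch, not the C version the question has in mind, and the function name `passes_proposed_check` is made up for illustration. It only implements the test as stated; it does not show that passing the test implies the strings are anagrams.

```python
from math import prod


def passes_proposed_check(a: str, b: str) -> bool:
    """Return True if a and b pass the three tests from the question.

    Hypothetical helper: the name does not come from the original post.
    """
    # Test 1: same length.
    if len(a) != len(b):
        return False
    codes_a = [ord(ch) for ch in a]
    codes_b = [ord(ch) for ch in b]
    # Test 2: same sum of character codes.
    if sum(codes_a) != sum(codes_b):
        return False
    # Test 3: same product of character codes.
    return prod(codes_a) == prod(codes_b)


print(passes_proposed_check("listen", "silent"))  # True: real anagrams always pass
print(passes_proposed_check("listen", "silenz"))  # False: the sums differ
```

Real anagrams always pass, since reordering the characters changes neither the length, the sum, nor the product. The open question is the converse.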

templatetypedef · Accepted Answer · 2013-02-06T21:58:28.927

I wrote a quick program to brute-force search for conflicts and found that this approach does not always work. The strings ABFN and AAHM have the same ASCII sum and product, but are not anagrams of one another. Their ASCII sum is 279 and ASCII product is 23,423,400.

There are a lot more conflicts than this. My program, searching over all length-four strings, found 11,737 conflicts.

For reference, here's the C++ source code:

#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

int main() {
  /* Sparse 2D table where used[sum][prod] is either nothing or is a string
   * whose characters sum to "sum" and whose product is "prod".
   */
  map<int, map<int, string> > used;

  /* List of all usable characters in the string. */
  vector<char> usable;
  for (char ch = 'A'; ch <= 'Z'; ch++) {
    usable.push_back(ch);
  }
  for (char ch = 'a'; ch <= 'z'; ch++) {
    usable.push_back(ch);
  }

  /* Brute-force search over all possible length-four strings.  To avoid
   * iterating over anagrams, the search only explores strings whose letters
   * are in increasing ASCII order.
   */
  for (int a = 0; a < usable.size(); a++) {
    for (int b = a; b < usable.size(); b++) {
      for (int c = b; c < usable.size(); c++) {
        for (int d = c; d < usable.size(); d++) {
          /* Compute the sum and product. */
          int sum  = usable[a] + usable[b] + usable[c] + usable[d];
          int prod = usable[a] * usable[b] * usable[c] * usable[d];

          /* See if we have already seen this. */
          if (used.count(sum) &&
              used[sum].count(prod)) {
            cout << "Conflict found: " << usable[a] << usable[b] << usable[c] << usable[d] << " conflicts with " << used[sum][prod] << endl;
          }

          /* Update the table. */
          used[sum][prod] = string() + usable[a] + usable[b] + usable[c] + usable[d];
        }
      }
    }
  }
}

Hope this helps!

This looks like C++; it certainly does not look like C. And four nested for() loops don't look sexy to me. — wildplasser, May 31 '13 at 22:18
@wildplasser- My apologies - I didn't notice that this was tagged as C (I just took it as an algorithmic question). I also agree that it would be better to do this using exhaustive recursion or some other technique, but I was looking for a dead simple counterexample and hoped that this program would find one. — templatetypedef, May 31 '13 at 22:20
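
The counterexample from the accepted answer can be checked directly. The sketch below (Python, with a hypothetical helper name `ascii_sum_and_product`) recomputes both quantities. The products agree because both strings factor as 2^3 · 3^2 · 5^2 · 7 · 11 · 13^2: ABFN is 65 · 66 · 70 · 78 and AAHM is 65 · 65 · 72 · 77.

```python
from math import prod


def ascii_sum_and_product(word):
    """Return (sum, product) of the character codes of `word`."""
    codes = [ord(ch) for ch in word]
    return sum(codes), prod(codes)


print(ascii_sum_and_product("ABFN"))  # (279, 23423400)
print(ascii_sum_and_product("AAHM"))  # (279, 23423400)
print(sorted("ABFN") == sorted("AAHM"))  # False: the strings are not anagrams
```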

score 5 · Answer 2 · edited Apr 13 '17 at 12:19

Your approach does not work in general. I can't explain why, because I don't fully understand it, but there are different sets, at least of cardinality 3, that have the same sum and product: https://math.stackexchange.com/questions/38671/two-sets-of-3-positive-integers-with-equal-sum-and-product

G. Bach · answered Feb 06 '13 at 21:50

This is really cool! However, the fact that these sets exist isn't immediately a counterexample to the approach, since those sets might not have numbers in them that are valid ASCII letters. – templatetypedef Feb 06 '13 at 21:57
You're right, it's very well possible that there are intervals of integers for which it is never possible to choose two distinct sets with the specified properties, which would imply that shifting the encoding to such an interval would make the OP's approach viable. Seems doubtful though :) – G. Bach Feb 06 '13 at 22:00
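
One way to explore the point raised in this thread is to search a chosen range of integers for distinct multisets with equal sum and product. The sketch below is an assumption-laden illustration: `find_collisions` is a made-up helper, and it reports each later multiset against the first one seen with the same (sum, product) pair. For small integers it finds, for example, {2, 8, 9} and {3, 4, 12}, which both have sum 19 and product 144. The same function can be pointed at the ASCII codes of the letters to see whether three-letter collisions exist there; no result is claimed here for that case.

```python
from itertools import combinations_with_replacement
from math import prod
import string


def find_collisions(values, length):
    """Return pairs of distinct multisets with equal sum and product.

    Hypothetical helper. Each multiset is a non-decreasing tuple of
    `length` elements drawn (with repetition) from `values`.
    """
    first_seen = {}
    collisions = []
    for combo in combinations_with_replacement(sorted(values), length):
        key = (sum(combo), prod(combo))
        if key in first_seen:
            collisions.append((first_seen[key], combo))
        else:
            first_seen[key] = combo
    return collisions


small = find_collisions(range(1, 13), 3)
print(((2, 8, 9), (3, 4, 12)) in small)  # True: 2+8+9 = 3+4+12 and 2*8*9 = 3*4*12

# The same search restricted to the codes of A-Z and a-z.
letter_codes = [ord(ch) for ch in string.ascii_letters]
print(len(find_collisions(letter_codes, 3)))
```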

wildplasser · Answer 3 · 2013-02-07T00:03:27.513

The letters a-z and A-Z are used to index an array of 26 primes, and the product of these primes is used as a hash value for the word. Equal product <--> same letters.

(The order of the hash values in the primes26[] array in the fragment below is based on the letter frequencies in the Dutch language, in an attempt to minimise the expected product.)

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define COUNTOF(a) (sizeof (a)/ sizeof (a)[0])

typedef unsigned long long HashVal;
HashVal hashmem (char *str, size_t len);

unsigned char primes26[] =
{
5,71,79,19,2,83,31,43,11,53,37,23,41,3,13,73,101,17,29,7,59,47,61,97,89,67,
};

struct anahash {
        struct anahash *next;
        unsigned freq;
        HashVal hash;
        char word[1];
        };

struct anahash *hashtab[1024*1024] = {NULL,};
struct anahash *new_word(char *str, size_t len);
struct anahash **hash_find(struct anahash *wp);

/*********************************************/

HashVal hashmem (char *str, size_t len)
{
size_t idx;
HashVal val=1;

if (!len) return 0;
for (idx = 0; idx < len; idx++) {
        char ch = str[idx];
        if (ch >= 'A' && ch <= 'Z' ) val *= primes26[ ch - 'A'];
        else if (ch >= 'a' && ch <= 'z' ) val *= primes26[ ch - 'a'];
        else continue;
        }
return val;
}

struct anahash *new_word(char *str, size_t len)
{
struct anahash *wp;
if (!len) len = strlen(str);

wp = malloc(len + sizeof *wp );
if (!wp) return NULL;   /* allocation failed; the caller skips this word */
wp->hash = hashmem(str, len);
wp->next = NULL;
wp->freq = 0;
memcpy (wp->word, str, len);
wp->word[len] = 0;
return wp;
}

struct anahash **hash_find(struct anahash *wp)
{
unsigned slot;
struct anahash **pp;

slot = wp->hash % COUNTOF(hashtab);

for (pp = &hashtab[slot]; *pp; pp= &(*pp)->next) {
        if ((*pp)->hash < wp->hash) continue;
        if (strcmp( wp->word, (*pp)->word ) > 0) continue;
        break;
        }
return pp;
}

char buff [16*4096];
int main (void)
{
size_t pos,end;
struct anahash *wp, **pp;
HashVal val = 0;   /* last hash printed; initialised so the first comparison is well defined */

memset(hashtab, 0, sizeof hashtab);

while (fgets(buff, sizeof buff, stdin)) {
        for (pos=0; pos < sizeof buff && buff[pos]; ) {
                for(end = pos; end < sizeof buff && buff[end]; end++ ) {
                        if (buff[end] < 'A' || buff[end] > 'z') break;
                        if (buff[end] > 'Z' && buff[end] < 'a') break;
                        }
                if (end > pos) {
                        wp = new_word(buff+pos, end-pos);
                        if (!wp) {pos=end; continue; }
                        pp = hash_find(wp);
                        if (!*pp) *pp = wp;
                        else if ((*pp)->hash == wp->hash
                         && !strcmp((*pp)->word , wp->word)) free(wp);
                        else { wp->next = *pp; *pp = wp; }
                        (*pp)->freq +=1;
                        }
                pos = end;
                for(end = pos; end < sizeof buff && buff[end]; end++ ) {
                        if (buff[end] >= 'A' && buff[end] <= 'Z') break;
                        if (buff[end] >= 'a' && buff[end] <= 'z') break;
                        }
                pos = end;
                }
        }
for (pos = 0;  pos < COUNTOF(hashtab); pos++) {
        if (!hashtab[pos]) continue;   /* skip empty slots */

        for (pp = &hashtab[pos]; (wp = *pp) != NULL; pp = &wp->next) {
                if (val != wp->hash) {
                        fprintf (stdout, "\nSlot:%zu:\n", pos );
                        val = wp->hash;
                        }
                fprintf (stdout, "\t%llx:%u:%s\n", wp->hash, wp->freq, wp->word);
                }
        }

return 0;
}

Won't this be subject to integer overflows for reasonably-sized strings? — templatetypedef, Feb 06 '13 at 22:44
When I created it six months ago I stress tested it with a few 100K words and found no sign of overflow. (BTW: you could always retest possible collisions) In most cases, overflow will not cause collisions (64 bits is a lot of hash space !) , but would fold around to a value that is not reachable by other paths. (omitting the 2 would be a possibility, giving faster foldover but possibly fewer collisions) — wildplasser, Feb 06 '13 at 23:01
On second thought: omitting the 2 would only generate odd numbers. Maybe then shifting right by one could cure this. — wildplasser, Feb 11 '13 at 23:30
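
The overflow concern in these comments disappears in a language with arbitrary-precision integers. The following is a minimal Python sketch of the same prime-product idea, not a port of the C program above. As an assumption for illustration, it assigns the first 26 primes to a..z in alphabetical order instead of the Dutch-frequency order used in primes26[], and the names `PRIME_FOR` and `prime_signature` are made up. By unique factorisation, two words get the same product exactly when they contain the same letters with the same multiplicities.

```python
import string

# First 26 primes, assigned to a..z in alphabetical order.
# Assumption: this ordering is for illustration only; the answer's table
# orders the primes by Dutch letter frequency instead.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
PRIME_FOR = dict(zip(string.ascii_lowercase, PRIMES))


def prime_signature(word: str) -> int:
    """Multiply one prime per letter; case is ignored, non-letters are skipped."""
    signature = 1
    for ch in word.lower():
        if ch in PRIME_FOR:
            signature *= PRIME_FOR[ch]
    return signature


print(prime_signature("Listen") == prime_signature("Silent"))  # True
print(prime_signature("ABFN") == prime_signature("AAHM"))      # False
# Ten copies of the letter mapped to 101 already exceed 64 bits.
print(prime_signature("z" * 10) > 2**64 - 1)                   # True
```

The last line shows why a fixed 64-bit product can wrap around for long words, which is the case the comments above discuss.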

score 4 · Answer 4 · edited Aug 12 '18 at 12:22

Thanks for such a great question! Instead of trying to disprove your proposition altogether, I spent some time trying to find ways to augment it so that it becomes true. I suspect that if the standard deviations are also equal, then the two strings must be anagrams. Instead of testing that far, I ran a simpler test and have not found a counterexample yet. Here is what I tested:

In addition to the conditions you mentioned before,

the square root of the sum of the squares of the ASCII values must be equal:

I used the following Python program. I have no complete proof, but maybe my response will help. Anyway, take a look.

from math import sqrt


class Nothing:

  def equalString( self, strA, strB ):
    # Running product, sum, and sum of squares of the character codes.
    prodA, prodB = 1, 1
    sumA, sumB = 0, 0
    geoA, geoB = 0, 0

    for a in strA:
      i = ord( a )
      prodA *= i
      sumA += i
      geoA += ( i ** 2 )
    geoA = sqrt( geoA )

    for b in strB:
      i = ord( b )
      prodB *= i
      sumB += i
      geoB += ( i ** 2 )
    geoB = sqrt( geoB )

    # All three quantities must agree.
    return prodA == prodB and sumA == sumB and geoA == geoB

  def compareStrings( self ):
    first, last = ord( 'A' ), ord( 'z' )
    for a in range( first, last + 1 ):
      for b in range( a, last + 1 ):
        for c in range( b, last + 1 ):
          for d in range( c, last + 1 ):
            # strB is strA reversed, so every pair tested here is a true anagram.
            strA = chr( a ) + chr( b ) + chr( c ) + chr( d )
            strB = chr( d ) + chr( c ) + chr( b ) + chr( a )

            if not self.equalString( strA, strB ):
              print( "%s and %s should be equal.\n" % ( strA, strB ) )

    print( "Done" )


if __name__ == "__main__":
  Nothing().compareStrings()

I also tested length-five strings. — Konsol Labapen, Feb 07 '13 at 00:51
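
A search for non-anagram pairs that still agree on all three quantities could look like the sketch below. It is only an illustration: `find_augmented_collisions` is a made-up helper, and it compares the integer sum of squares instead of its square root, which is an equivalent test that avoids floating point. For strings of three characters no such pair can exist, because the sum, the sum of squares, and the product fix the three elementary symmetric polynomials and therefore the multiset of characters. For four or more characters the three quantities no longer pin the multiset down, so the search is worth running; no result is claimed here.

```python
from itertools import combinations_with_replacement
from math import prod
import string


def find_augmented_collisions(values, length):
    """Pairs of distinct multisets agreeing on sum, product and sum of squares.

    Hypothetical helper. Each multiset is a non-decreasing tuple of
    `length` elements drawn (with repetition) from `values`.
    """
    first_seen = {}
    collisions = []
    for combo in combinations_with_replacement(sorted(values), length):
        key = (sum(combo), prod(combo), sum(v * v for v in combo))
        if key in first_seen:
            collisions.append((first_seen[key], combo))
        else:
            first_seen[key] = combo
    return collisions


letter_codes = [ord(ch) for ch in string.ascii_letters]

# Three characters: the three quantities determine the multiset.
print(find_augmented_collisions(letter_codes, 3))  # []

# Four characters: print up to five colliding pairs, if any are found.
for first, second in find_augmented_collisions(letter_codes, 4)[:5]:
    print("".join(map(chr, first)), "".join(map(chr, second)))
```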

score 1 · Answer 5 · answered Feb 06 '13 at 22:08

If you don't mind modifying the strings, sort each of them and compare the two signatures.

user448810

I don't think this answers the question. While this absolutely works, the question is specifically about the proposed algorithm. – templatetypedef Feb 06 '13 at 22:11
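
For comparison, the sort-and-compare idea from this answer can be sketched in a few lines. The function names are made up for illustration. The Python version sorts copies, so the inputs are not modified, and the counting variant is an alternative of my own that avoids the sort altogether.

```python
from collections import Counter


def is_anagram_sorted(a: str, b: str) -> bool:
    """Compare the sorted character sequences (the 'signatures')."""
    return sorted(a) == sorted(b)


def is_anagram_counted(a: str, b: str) -> bool:
    """Compare per-character counts; no sorting needed."""
    return Counter(a) == Counter(b)


print(is_anagram_sorted("listen", "silent"))  # True
print(is_anagram_sorted("ABFN", "AAHM"))      # False
print(is_anagram_counted("ABFN", "AAHM"))     # False
```

Unlike the sum-and-product test, both variants reject ABFN and AAHM, because they compare the actual multisets of characters.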
