
I have a lot of strings, and I need to check how many pairs contain the same characters.

Currently, my strategy is to create an `int[128] chars` array and, for each character in the string, increment the corresponding count in `chars`. At the end, each index of `chars` corresponds to a character code, and the value at that index is the number of times that character occurs in the string.

I'd then hash chars, say by using Java's Arrays.hashCode() function.
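To make the approach concrete, here is a minimal sketch of it, assuming that "same characters" means the same character counts (anagrams) and that all input is ASCII. The names `CountKeyPairs`, `CountKey`, and `countPairs` are made up for illustration. The hash is computed once per string and `equals()` falls back to `Arrays.equals()` to resolve collisions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class CountKeyPairs {
    // Wraps the count array so it can serve as a HashMap key.
    static final class CountKey {
        private final int[] counts = new int[128]; // assumes ASCII codes 0..127
        private final int hash;

        CountKey(String s) {
            for (int i = 0; i < s.length(); i++) {
                counts[s.charAt(i)]++;             // single pass over the string
            }
            hash = Arrays.hashCode(counts);        // cached, computed once
        }

        @Override
        public int hashCode() {
            return hash;
        }

        @Override
        public boolean equals(Object o) {
            // Full comparison, so hash collisions never merge different groups.
            return o instanceof CountKey && Arrays.equals(counts, ((CountKey) o).counts);
        }
    }

    // Counts unordered pairs of strings that have identical character counts.
    static long countPairs(Iterable<String> lines) {
        Map<CountKey, Integer> groups = new HashMap<>();
        long pairs = 0;
        for (String line : lines) {
            int seen = groups.merge(new CountKey(line), 1, Integer::sum);
            pairs += seen - 1; // the new string pairs with each earlier group member
        }
        return pairs;
    }
}
```

For example, the input "abc", "cab", "bca", "xyz" has one group of three anagrams, which gives three pairs.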

Is there a more efficient way to approach this? I tried XOR-ing the characters of each string together in the same loop that builds chars. That works, but it is terribly slow on my assignment's test cases; I suspect they are designed to defeat a simple XOR hash function. Are there any efficient hash functions that work?
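A small sketch of why a plain XOR hash may behave badly here. I don't know what the actual test cases contain, so this only shows the general weakness; `XorHashDemo` and `xorHash` are illustrative names. XOR-ing 7-bit ASCII codes can only ever produce a value from 0 to 127, so there are at most 128 distinct hash values no matter how many strings there are, and any character that occurs an even number of times cancels out.

```java
final class XorHashDemo {
    // XOR of all character codes: order-independent, but very collision-prone.
    static int xorHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Characters occurring an even number of times cancel each other.
        System.out.println(xorHash("aa"));   // 0
        System.out.println(xorHash("bb"));   // 0
        System.out.println(xorHash("abab")); // 0
        // Anagrams do get the same hash, as required.
        System.out.println(xorHash("abc") == xorHash("cba")); // true
    }
}
```

With so few possible hash values, most lookups end up comparing full count arrays against many unrelated entries, which would explain the slowdown.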

Fabian
  • Can you give some sample input, along with the exact output you would want to see for that input? – Tim Biegeleisen Oct 17 '18 at 06:47
  • As far as I understand your problem, your `hash`-function has to be commutative. Have you tried to hash each `String` and add up the single hash-values? – Turing85 Oct 17 '18 at 06:48
  • @TimBiegeleisen I don't have access to some of the test cases, but the requirements state that each line is an arbitrary length of ASCII characters (excluding the newline character). – Fabian Oct 17 '18 at 08:08
  • @Turing85 Wouldn't that be the same as my XOR for each character except that I add instead of XOR? – Fabian Oct 17 '18 at 08:09
  • @Fabian I don't know. I do not know your code. The easiest way would be to make a performance analysis (given you actually have a performance problem). – Turing85 Oct 17 '18 at 08:27
  • Depending on the limits on string length, you might just replace the int[] array with a char[] array, or even a byte[] array. Smaller arrays mean fewer bytes to hash which could result in a modest improvement in speed. – President James K. Polk Oct 17 '18 at 20:25
  • Are you certain that your alphabet is limited to 128 different characters? Also, don't assume that if `hash(A) == hash(B)`, then `A == B`. That's not how hashes work. The number of unique 128-integer arrays is many times larger than the largest possible 64-bit hash value, so by the [Pigeonhole principle](https://en.wikipedia.org/wiki/Pigeonhole_principle) you're guaranteed that there are multiple arrays that hash to the same value. – Jim Mischel Oct 18 '18 at 04:53
  • Yup, because it's guaranteed to be within the ASCII range of character codes (0 to 127). Yeah, I'm using Arrays.equals() to compare in the event of a collision. – Fabian Oct 18 '18 at 04:57
  • The search term is *anagram* :: https://stackoverflow.com/a/11117236/2235885 – joop Oct 19 '18 at 09:47

1 Answer


Sort the characters in each string. That is to say, you first destroy all of the order information. After that a standard HashMap should suffice.
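A minimal sketch of this idea, assuming that strings count as a match when they are anagrams of each other. `SortedKeyPairs`, `sortedKey`, and `countPairs` are illustrative names, not part of any library.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SortedKeyPairs {
    // Canonical form: the same characters in sorted order.
    static String sortedKey(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Counts unordered pairs of strings that share a canonical form.
    static long countPairs(Iterable<String> lines) {
        Map<String, Integer> groups = new HashMap<>();
        long pairs = 0;
        for (String line : lines) {
            int seen = groups.merge(sortedKey(line), 1, Integer::sum);
            pairs += seen - 1; // pairs formed with earlier strings in the same group
        }
        return pairs;
    }
}
```

The sorted string is as long as the original, so the key costs memory proportional to the input, whereas a count array has a fixed size regardless of string length.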

Tom Hawtin - tackline
  • This works well for short to moderate-length strings; for very long strings, an int[128] might be best. – Peter Lawrey Oct 17 '18 at 07:38
  • That'd require multiple passes to sort (using counting sort for instance), and another pass to hash, so it'd be slower than my current method which requires only a single pass. – Fabian Oct 17 '18 at 08:11