6

We have a requirement to read/write more than 10 million strings to a file, and we do not want duplicates in the file. Since each string is flushed to the file as soon as it is read, we are not keeping the strings in memory.

We cannot rely on hashCode() because of collisions: two different strings can share a hash code, so a distinct string might wrongly be treated as a duplicate. Two other approaches I found while googling:

1. Use a message digest algorithm like MD5 - but it might be too costly to compute and store.

2. Use a checksum algorithm. [I am not sure whether this produces a unique key for a string - can someone please confirm?]

Is there any other approach available? Thanks.

praveen
  • Can the file be sorted and dedupped post creation? – jason saldo Jun 14 '10 at 13:15
  • MD5 is actually a checksum algorithm. Two different strings may have the same checksum though. – Tedil Jun 14 '10 at 13:20
  • you aren't going to get collisions with a REAL hash code like SHA1 or the other SHA variants. MD5 __IS__ a hash code. Checksum codes are for making sure the data isn't corrupt; they won't help you with uniqueness. –  Jun 15 '10 at 04:46

6 Answers

7

If you're okay with a microscopic risk of collisions, you could use some hash function such as MD5, as you suggest, and rely on the hashes.

Another alternative, possibly with a larger memory footprint, is to store the already encountered strings in a trie (a special type of tree).


Update: Yet another alternative would be to use a Bloom filter. This still relies on hashing, but it can be tuned to have an arbitrarily small probability of false positives.
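
A minimal sketch of the hash-based idea in Java, assuming the input is read line by line and that keeping one 16-byte digest per string (plus set overhead) in memory is acceptable; the file names and the choice of MD5 are placeholders:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashSet;
    import java.util.Set;

    public class HashDedupWriter {

        public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Only the fixed-size digests are kept in memory, never the strings themselves.
            Set<ByteBuffer> seen = new HashSet<>();

            try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("output.txt"), StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Wrap the digest so the set compares it by content, not by reference.
                    ByteBuffer digest = ByteBuffer.wrap(md5.digest(line.getBytes(StandardCharsets.UTF_8)));
                    if (seen.add(digest)) {   // true only the first time this digest is seen
                        out.write(line);
                        out.newLine();
                    }
                }
            }
        }
    }

The memory cost per string is bounded regardless of how long the strings are, at the price of the (vanishingly small) chance that two different strings share an MD5 digest.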

aioobe
  • what do you mean by adding collision-lists for each value? – Abhinav Sarkar Jun 14 '10 at 13:23
  • trie _is_ a tree, a prefix tree – unbeli Jun 14 '10 at 13:25
  • @abhin4v. sorry. stupid idea. @unbeli. you're right of course. updated. – aioobe Jun 14 '10 at 13:27
  • +1 for trie. Also I don't think collision lists are a bad idea, but they've already stated they don't want to keep the whole thing in memory. – corsiKa Jun 14 '10 at 14:22
  • yep. that's what I realized, and why I removed the suggestion :-) for a brief moment in time, I thought that only the strings that had collisions had to be stored, but obviously all strings would have had to be stored. (-: – aioobe Jun 14 '10 at 14:28
  • a trie again would mean storing the strings in memory, which might not work for a very large number of strings. – praveen Jun 14 '10 at 16:02
  • A trie is efficient in the sense that similar strings will share memory (as opposed to, say, a naive hash-set solution). – aioobe Jun 14 '10 at 18:13
6

Storing 10 million strings in memory is indeed a lot, so I understand the reason to write them to a file immediately instead of storing them in e.g. a TreeSet<String> first. But where would you store the 10 million unique numerical keys that you want to compare against? If you want the key to be unique and numerical (and digits have a much smaller base/radix than letters), you can't make the key shorter than the string itself already is, so you won't save any memory. At best you could use data compression like GZIP, but that would only add a lot of overhead. MD5 is also inappropriate, since two different strings can yield the same hash.

I really see no better solution for this than using a decent RDBMS (SQL database) in which you mark the column as UNIQUE and handle the constraint violation accordingly. An RDBMS is highly optimized for this kind of task.
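
For what it's worth, a rough sketch of that idea with plain JDBC; the JDBC URL, table and column names are placeholders (here an embedded H2 database is assumed to be on the classpath), and any RDBMS that enforces UNIQUE constraints will do:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class DbDedup {

        public static void main(String[] args) throws SQLException {
            // Placeholder JDBC URL; swap in whatever database you have available.
            try (Connection con = DriverManager.getConnection("jdbc:h2:./dedup")) {
                try (Statement st = con.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS strings (value VARCHAR(1000) UNIQUE)");
                }
                try (PreparedStatement insert = con.prepareStatement("INSERT INTO strings (value) VALUES (?)")) {
                    for (String s : new String[] {"foo", "bar", "foo"}) { // stand-in for the real input stream
                        try {
                            insert.setString(1, s);
                            insert.executeUpdate();   // succeeds only for strings not inserted before
                        } catch (SQLException duplicate) {
                            // Assumed to be the UNIQUE constraint firing; production code should
                            // check the SQLState / vendor error code before silently skipping.
                        }
                    }
                }
            }
        }
    }

The deduplicated file can then be produced at the end with a single SELECT over the table.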

If you really can't consider a database, then you need to re-read the file and check for an existing entry before each write/flush. Maybe not very fast, but certainly memory efficient.

BalusC
  • actually we thought that if we could generate a unique number, then we could use a bitmap vector in memory to mark seen strings and avoid duplicates – praveen Jun 14 '10 at 16:01
  • That would still not make it more memory efficient than using a `TreeSet`. – BalusC Jun 14 '10 at 16:07
1

There is no way to make a function that produces a key which is both unique and shorter than the string itself.
There are data structures which can solve your task. A B-tree might fit if your data is large enough. Depending on the nature of your input, there might be more effective ways.

unbeli
1

Reliably removing duplicates is pretty much as difficult as sorting the file. As another answer indicates, there is no guaranteed way of precisely detecting duplicates without keeping a full copy of each string in memory, which seems to be exactly what you're trying to avoid.

You could keep an in-memory or on-disk index of hashcodes, and use these to retrieve actual strings from file storage for comparison, but this would essentially duplicate what a database would be able to do for you.

An alternative is to post-process the file once it's complete. The UNIX sort command is pretty good with large files (see How could the UNIX sort command sort a very large file?), so I'd expect the standard UNIX command-line approach to work reasonably well:

    sort my-file-of-strings.txt | uniq > my-filtered-file-of-strings.txt

(Note that the file has to be sorted before being passed to uniq, since uniq only removes adjacent duplicates.)

If you haven't got these tools (or equivalents) available, then you can always try implementing some variant of an external merge sort yourself.
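
If you do go down that road, a rough sketch of the external approach (placeholder chunk size and plain lexicographic ordering assumed): sort fixed-size chunks in memory, spill each chunk to a temporary file, then merge the sorted chunks with a priority queue while skipping repeats:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.TreeSet;

    public class ExternalDedup {

        private static final int CHUNK_SIZE = 500_000; // strings per in-memory chunk; tune to the available heap

        // Phase 1: read the input in chunks, sort (and locally dedup) each chunk, spill it to a temp file.
        static List<Path> splitIntoSortedChunks(Path input) throws IOException {
            List<Path> chunkFiles = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                TreeSet<String> chunk = new TreeSet<>(); // sorted and duplicate-free within the chunk
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() >= CHUNK_SIZE) {
                        chunkFiles.add(writeChunk(chunk));
                        chunk.clear();
                    }
                }
                if (!chunk.isEmpty()) {
                    chunkFiles.add(writeChunk(chunk));
                }
            }
            return chunkFiles;
        }

        static Path writeChunk(TreeSet<String> chunk) throws IOException {
            Path file = Files.createTempFile("dedup-chunk", ".txt");
            try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                for (String s : chunk) {
                    out.write(s);
                    out.newLine();
                }
            }
            return file;
        }

        // Phase 2: k-way merge of the sorted chunk files, writing each distinct string exactly once.
        static void mergeDroppingDuplicates(List<Path> chunkFiles, Path output) throws IOException {
            PriorityQueue<ChunkCursor> queue = new PriorityQueue<>();
            List<BufferedReader> readers = new ArrayList<>();
            try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                for (Path file : chunkFiles) {
                    BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
                    readers.add(reader);
                    String first = reader.readLine();
                    if (first != null) {
                        queue.add(new ChunkCursor(first, reader));
                    }
                }
                String previous = null;
                while (!queue.isEmpty()) {
                    ChunkCursor smallest = queue.poll();
                    if (!smallest.value.equals(previous)) { // duplicates are adjacent in sorted order
                        out.write(smallest.value);
                        out.newLine();
                        previous = smallest.value;
                    }
                    String next = smallest.reader.readLine();
                    if (next != null) {
                        queue.add(new ChunkCursor(next, smallest.reader));
                    }
                }
            } finally {
                for (BufferedReader reader : readers) {
                    reader.close();
                }
            }
        }

        // The smallest unread line of one chunk file, ordered by that line.
        private static final class ChunkCursor implements Comparable<ChunkCursor> {
            final String value;
            final BufferedReader reader;

            ChunkCursor(String value, BufferedReader reader) {
                this.value = value;
                this.reader = reader;
            }

            public int compareTo(ChunkCursor other) {
                return value.compareTo(other.value);
            }
        }
    }

Usage would be along the lines of mergeDroppingDuplicates(splitIntoSortedChunks(input), output); like sort | uniq, the price is that the output ends up sorted rather than in the original order.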

Jon Moore
  • i like the post-process approach. let me find if something applicable could be found for windows boxes. – praveen Jun 14 '10 at 16:45
  • And `sort -u` can do this on its own. There is probably a GNU version of `sort` for Windows....yup: http://gnuwin32.sourceforge.net/packages/coreutils.htm – Kevin Brock Jun 15 '10 at 05:37
0

If the strings are from a fixed pool of possible strings (N), then you can use minimal perfect hashing to create an array 0...N-1. A zero in the slot determined by the perfect hash function means the string has not been seen so far.
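
A minimal sketch of that idea, where perfectHash is a hypothetical placeholder for whatever minimal perfect hash function you generate up front from the fixed pool of N possible strings; the bit set itself is tiny (a bit over 1 MB for N = 10 million):

    import java.util.BitSet;

    public class PerfectHashDedup {

        private final BitSet seen;

        public PerfectHashDedup(int poolSize) {
            this.seen = new BitSet(poolSize); // one bit per possible string in the pool
        }

        /** Returns true the first time a string is offered, false on every later (duplicate) offer. */
        public boolean firstTimeSeen(String s) {
            int slot = perfectHash(s);        // maps each string of the fixed pool to a unique index 0..N-1
            if (seen.get(slot)) {
                return false;                 // already encountered: a duplicate
            }
            seen.set(slot);
            return true;
        }

        // Hypothetical placeholder: a real minimal perfect hash function would be generated
        // in advance from the known pool of N possible strings.
        private int perfectHash(String s) {
            throw new UnsupportedOperationException("generate this from the fixed string pool");
        }
    }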

Otherwise, short of using a lot of memory or one of the solutions suggested so far, the only truly correct approach is to re-read the file before deciding whether to write the string to it.

You could do this as efficiently as possible by memory mapping portions of the file.

Adrian Regan
0

I really think the best solution is - as someone else already suggested - to use a database.

If for some reason you cannot use a database, you can still use a hash code. Sure, there will be collisions. Just add some code so that when you detect a duplicate hash code, your program checks the file to determine whether it is a genuine duplicate or a collision.
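
A rough sketch of that scheme, with placeholder file names: the set of int hash codes stays comparatively small, and the slow re-read of the output file only happens when a hash code has already been seen:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.Set;

    public class HashCodeThenVerify {

        public static void main(String[] args) throws IOException {
            Path output = Paths.get("unique-strings.txt");
            Set<Integer> seenHashes = new HashSet<>(); // 10 million ints: far smaller than the strings themselves

            try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Fast path: a hash code never seen before cannot belong to a duplicate.
                    if (seenHashes.add(line.hashCode()) || !existsInFile(output, line)) {
                        out.write(line);
                        out.newLine();
                        out.flush(); // so the verification pass below sees everything written so far
                    }
                }
            }
        }

        // Slow path, only on hash-code collisions: scan the file written so far for a genuine duplicate.
        private static boolean existsInFile(Path file, String candidate) throws IOException {
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.equals(candidate)) {
                        return true;
                    }
                }
            }
            return false;
        }
    }

How often the slow path runs depends on how evenly String.hashCode() spreads your particular input over the 32-bit range.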

emory