
I am wondering about the best way to store a huge number of strings and check them for duplicates.

We have to think about our priorities:

  • duplicate check speed
  • inserting new string time
  • storage space on hard disk
  • random access time

What is the best solution when our targets are fast duplicate checking and fast insertion of new strings (random access time and storage space don't matter)? I am thinking about an SQL database, but which DBMS is best for this? If we use an SQL DB like MySQL, which storage engine would be best? (Of course, we have to exclude the MEMORY engine because of the amount of data.)

Piotr Müller
  • Can you elaborate on what you mean by "random access time"? If your data is a set of strings, the only operations you'll have are "add", "contains" and "delete". – Nicolas Repiquet Apr 13 '12 at 10:05
  • If you tell us more about your problem we can help you better. For example, if you are taking in strings at runtime, but few enough that you could fit them into memory, the best way would be to store them in a list/hash/array and only add items if they're not already present, then write out the array at the end (if you need it after runtime; again, elaborate). – deed02392 Apr 13 '12 at 10:22
  • are you trying to assemble a collection of distinct strings, filter duplicates, or what? what is the aim? in particular: what is the expected fractional amount of duplicates? do you expect almost everything to be a duplicate, or is that a rare event? do you want to add all new values to the database? – andrew cooke Apr 14 '12 at 01:31
  • The main target is to check whether a given string (hash) has already been checked, i.e. searching for a duplicate. The duplicate rate is about 1-3 per 10 checks. All new values should be added to the database (possibly as a hash). – Piotr Müller Apr 23 '12 at 08:47

3 Answers


Use a hash function on the input string. The output hash would be the primary key/id of the record.

Then you can check if the DB has this hash/id/primary key:

  • If it doesn't: this is a new string; add a new record containing the string, with the hash as its id.
  • If it does: check whether the string from the loaded record is the same as the input string.
    • If the string is the same: it is a duplicate.
    • If the string is different: this is a collision. Use a collision resolution scheme to resolve it. (A couple of examples below.)
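
A minimal sketch of this check-and-insert flow, using Python's sqlite3 module and SHA-256 purely as stand-ins (the table layout, database choice, and hash choice are assumptions for illustration, not requirements):

```python
import hashlib
import sqlite3

conn = sqlite3.connect("strings.db")
conn.execute("CREATE TABLE IF NOT EXISTS strings (hash BLOB PRIMARY KEY, value TEXT)")

def check_and_insert(s: str) -> bool:
    """Return True if s is a duplicate, False if it was newly inserted."""
    h = hashlib.sha256(s.encode("utf-8")).digest()
    row = conn.execute("SELECT value FROM strings WHERE hash = ?", (h,)).fetchone()
    if row is None:
        # New string: store it under its hash.
        conn.execute("INSERT INTO strings (hash, value) VALUES (?, ?)", (h, s))
        conn.commit()
        return False
    if row[0] == s:
        return True  # same hash, same string: a genuine duplicate
    # Same hash, different string: a collision; resolve with one of the schemes below.
    raise NotImplementedError("hash collision: apply a resolution scheme")
```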

You will have to consider which hash function/scheme/strength to use based on speed, the expected number of strings, and your hash-collision requirements/guarantees.

A couple of ways to resolve collisions:

  • Use a 2nd hash function to come up with a new hash in the same table.
  • Mark the record (e.g. with NULL) and repeat with a stronger 2nd hash function (with a wider domain) on a secondary "collision" table. On query, if the record is marked as collided (e.g. NULL), do the lookup again in the collision table. You might also want to use dynamic perfect hashing to ensure that this second table does not have further collisions.
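
A rough in-memory sketch of the second scheme, purely illustrative: two Python dicts stand in for the main table and the secondary "collision" table, the built-in `hash()` stands in for a fast primary hash, and SHA-256 for the stronger second hash (all of these are assumptions, not prescriptions):

```python
import hashlib

primary = {}     # fast hash -> stored string, or None if the slot is marked as collided
collisions = {}  # strong hash -> stored string (the secondary "collision" table)

def fast_hash(s: str) -> int:
    return hash(s)  # stand-in for a fast but weaker primary hash

def strong_hash(s: str) -> bytes:
    return hashlib.sha256(s.encode("utf-8")).digest()  # stronger hash, wider domain

def contains(s: str) -> bool:
    h = fast_hash(s)
    if h not in primary:
        return False
    stored = primary[h]
    if stored is None:  # marked record: look again in the collision table
        return collisions.get(strong_hash(s)) == s
    return stored == s

def insert(s: str) -> None:
    h = fast_hash(s)
    stored = primary.get(h)
    if h in primary and stored is None:
        # Slot already marked as collided: go straight to the collision table.
        collisions[strong_hash(s)] = s
    elif stored is not None and stored != s:
        # Fresh collision: mark the slot and move both strings to the collision table.
        collisions[strong_hash(stored)] = stored
        collisions[strong_hash(s)] = s
        primary[h] = None
    else:
        primary[h] = s
```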

Of course, depending on how persistent this needs to be, and on how much memory / how many strings you expect, you could do this without a database, directly in memory, which would be a lot faster.
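
For the in-memory variant, Python's built-in set already does the hashing and collision resolution internally, so the whole duplicate check collapses to a few lines (a sketch, assuming everything fits in RAM):

```python
seen = set()

def is_duplicate(s: str) -> bool:
    # Set membership handles hashing and collisions for us.
    if s in seen:
        return True
    seen.add(s)
    return False
```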

Preet Kukreti
  • A hash as primary key? How do you handle collisions? – Nicolas Repiquet Apr 13 '12 at 09:59
  • Why use a primary key at all? A simple table with a 'hash' column (non-unique) and a 'value' column (containing the string), plus a clustered index on the 'hash' column, will make "select 1 from string where hash = 'hash' and value = '...'" blazing fast and quite simple, at the cost of somewhat slower insertions. – Nicolas Repiquet Apr 13 '12 at 10:19
  • @NicolasRepiquet the main motivation for a primary key is that the DB then enforces the concept of a collision. Without it, you could insert multiple records with identical hashes and different strings, and the DB would allow it. This might be desirable if you are using [separate chaining](http://en.wikipedia.org/wiki/Hash_table#Separate_chaining), but otherwise might not be. Correct me if I'm wrong though – Preet Kukreti Apr 13 '12 at 10:35
  • "Without it, you could insert multiple records with identical hashes and different strings, and the DB would allow it." Isn't that exactly what we want? The hash is there to narrow the set of strings we have to compare, not to be a unique identifier. That's my point of view, of course :) – Nicolas Repiquet Apr 13 '12 at 10:48
  • @NicolasRepiquet The model you are thinking of is basically the DB equivalent of the [separate chaining](http://en.wikipedia.org/wiki/Hash_table#Separate_chaining) strategy of hash collision resolution. In a pathological/worst case, imagine all strings being mapped to the same hash (by sheer coincidence, or a bad hash function). In that worst case the hashing provides no benefit. – Preet Kukreti Apr 13 '12 at 10:57
  • Any "hash based" solution provides no benefit if the hash function always returns the same hash, doesn't it? I upvoted you anyway; your solution is basically what I was thinking of. – Nicolas Repiquet Apr 13 '12 at 11:02
  • @NicolasRepiquet No, that pathological/worst case is based on a scenario of many collisions, not on the hash function being designed to always return the same value. With separate chaining, performance degrades with more collisions; the more collision sets you have, the more likely you are to need a linear search over all items in a collision chain. But yes, with a small chain it might be faster than doing a second lookup, because all the data might be paged into memory. As the chains get larger (due to more collisions) it becomes a different story... – Preet Kukreti Apr 13 '12 at 11:12

You may want to consider a NoSQL solution:

  • Redis, and the use cases it is commonly used to solve.
  • memcached, and the various comparisons between memcached and Redis.
  • Membase/Couchbase, which counts OMGPOP's Draw Something as one of its success stories; there are also comparisons between Redis and Membase.
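
For illustration, a Redis set gives you the check and the insert in a single round trip via SADD. A minimal sketch using the redis-py client (the key name "strings:seen" and the local default-port server are assumptions):

```python
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

def is_duplicate(s: str) -> bool:
    # SADD returns 1 if the member was newly added, 0 if it already
    # existed, so one command both checks and inserts.
    return r.sadd("strings:seen", s) == 0
```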

Some questions:

  • how large is the set of strings?
  • will the application be read heavy or write heavy? or both?
  • how often would you like data to be persisted to disk?
  • is there an "N most recent strings" requirement?

Hope this helps.

user799188

Generate a suffix tree to store the strings. Ukkonen's algorithm, as presented in http://www.daimi.au.dk/~mailund/slides/Ukkonen-2005.pdf, gives some insight into how to build a suffix tree. There are a number of ways to store the suffix tree, but once it is generated, the lookup time is very low.
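
Ukkonen's construction is fairly involved; as a much simpler sketch of the same family of ideas (tree-shaped storage with lookup time proportional to the string's length rather than the collection's size), here is a plain trie used for exact-match duplicate checking. Note this is a basic trie, not a suffix tree:

```python
class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.terminal = False   # True if a stored string ends at this node

def check_and_insert(root: TrieNode, s: str) -> bool:
    """Insert s into the trie; return True if it was already present."""
    node = root
    for ch in s:
        node = node.children.setdefault(ch, TrieNode())
    was_present = node.terminal
    node.terminal = True
    return was_present
```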

rush00121