Trivial task - complex solution?

Question

There is a trivial problem:

assign uniqueidentifier to any externalId
do not overwrite the uniqueidentifier once it is assigned - just return existing uniqueidentifier

Imagine a table

  ExternalId | Guid
--------------------------------
   some1     | accf-0334-dfdf-....

Now, the twist is the scale. We want billions of externalIds to be mapped like this and we need to be able to assign these identifiers fast (thousands/sec)

We started of with a simple SQL Server table but it was not performing well. We moved the same schema to a Cassandra ColumnFamily - the writes are super fast and its sharded but: before writing we have to read (to make sure the externalId is not assigned already) so we hit the read seek I/O limit again.

Hashing (to determine uniqueidentifier) is unfortunately not possible as we already have hundreds of millions assigned. Caching is problematic because in most cases we are assigning a 'brand new externalId' so it wouldn't be in the database at all.

Does anybody have any suggestions for the solution here?

If you use SQL Server and cluster on `ExternalID` it should be fast enough for your purposes. Were there any indexes in your test table? — JNK, Oct 26 '11 at 17:53
@JNK im probably not thinking about this right, but if they are constantly adding more values, won't indexing slow them down as its recalculating? — Chris, Oct 26 '11 at 17:58
@Chris - see my answer below. I think the index is the fastest way to deal with this. — JNK, Oct 26 '11 at 18:00
@Piotr - How are you receiving these keys? If you were to run a solution that wasn't ACID compliant (read: much faster), what would happen if there was a catastrophic error? Could you re-run over the keys that you've lost, or are they coming in real time? I think a full fledged RDMS is a lot of overhead for a simple key/value mapping, but you can't beat SQL's out of the box durability. — Mike M., Oct 26 '11 at 18:17
@MikeM. - I think since he needs to not duplicate keys he will probably need ACID, since eventual consistency won't cut it. — JNK, Oct 26 '11 at 18:27

score 1 · Answer 1 · answered Oct 26 '11 at 18:00

Use SQL server, and create your table like so:

UNIQUE Cluster on ExternalID
Default value on Guid for NEWID()
Make sure the unique clustered index on ExternalID has IGNORE_DUP_KEY = ON

With this scenario, you always do the same two (super-fast) operations:

1 - Insert the ExternalID
2 - Query the GUID for the ExternalID

It won't allow duplicates, but they won't throw an error. It will also be a highly optimized seek because of the clustered index.

You will need to rebuild the index frequently because you will get a high degree of fragmentation over time (since you are clustering on a non-incremental varchar) but it should meet your other requirements.

score 0 · Answer 2 · answered Oct 26 '11 at 18:09

This is an interesting problem, and honestly, beyond the scope of my knowledge. However, I found it interesting, and stumbled across this link - http://blogs.msdn.com/b/miah/archive/2008/02/17/sql-if-exists-update-else-insert.aspx

It seems like this method skips the initial full table scan, which should increase performance. This isn't the best answer, but perhaps you can customize the general idea to use for your specific DB implementation. (I've never heard of Cassandra, so it looks like I need to do some research.) Even if it doesn't, it might give you some ideas (I hope). Don't know if you have tried something like this already, but I thought I'd share the link. I wish you the best of luck.

Trivial task - complex solution?

2 Answers2