367

I have an application that uses GUID as the Primary Key in almost all tables and I have read that there are issues about performance when using GUID as Primary Key. Honestly, I haven't seen any problem, but I'm about to start a new application and I still want to use the GUIDs as the Primary Keys, but I was thinking of using a Composite Primary Key (The GUID and maybe another field.)

I'm using a GUID because they are nice and easy to manage when you have different environments such as "production", "test" and "dev" databases, and also for migrating data between databases.

I will use Entity Framework 4.3 and I want to assign the Guid in the application code, before inserting it in the database. (i.e. I don't want to let SQL generate the Guid).

What is the best practice for creating GUID-based Primary Keys, in order to avoid the supposed performance hits associated with this approach?

abatishchev
VAAA
  • The issue is not supposed. If your PK is clustered then almost every insert has the potential to cause a page split. In modern versions of SQL Server this was "fixed" with NEWSEQUENTIALID(), but this loses the benefit of being able to calculate it beforehand. I strongly recommend you read up on GUIDs elsewhere as this is way too broad a question and will likely solicit a religious battle that will go on for hours... – Aaron Bertrand Aug 13 '12 at 16:09
  • I'd also add that the word **server** is ambiguous in *I want to assign the Guid on the* **server** *side (dont want to let SQL to create the GUID)*. – Erik Philips Aug 13 '12 at 16:11
  • This question has similarities to this "sql-server-guid-sort-algorithm-why" http://stackoverflow.com/questions/7810602/sql-server-guid-sort-algorithm-why – Clinton Ward Aug 13 '12 at 16:27
  • @AaronBertrand You would know. The issue is supposed, not because it's not measurable, but because it's negligible. – John Mar 17 '21 at 10:55

9 Answers

529

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

  1. the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

  2. the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but it doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
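
Rather than taking the fragmentation claim on faith, you can measure it on your own table. The following is a minimal sketch, assuming a table named dbo.MyTable (a placeholder) in the current database and permission to query the dynamic management views:

```sql
-- Report fragmentation for every index on one table in the current database.
-- dbo.MyTable is a placeholder name; substitute your own table.
SELECT  i.name                          AS index_name,
        ps.index_type_desc,
        ps.avg_fragmentation_in_percent,
        ps.page_count
FROM    sys.dm_db_index_physical_stats(
            DB_ID(),                     -- current database
            OBJECT_ID(N'dbo.MyTable'),   -- table to inspect
            NULL,                        -- all indexes
            NULL,                        -- all partitions
            'LIMITED'                    -- cheapest scan mode
        ) AS ps
JOIN    sys.indexes AS i
        ON  i.object_id = ps.object_id
        AND i.index_id  = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;
```

Running this after a batch of inserts, once with a random GUID clustering key and once with an INT IDENTITY clustering key, should show whether the difference matters at your data volume.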

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: about 26.7 MB vs. 106.8 MB - and that's just on a single table!
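
The arithmetic behind these figures can be reproduced with a short query. This is only a sketch: it counts the bytes of the key itself (4 for INT, 16 for UNIQUEIDENTIFIER) and ignores row and page overhead, and the variable names are made up for illustration.

```sql
DECLARE @rows bigint = 1000000;   -- rows in the table
DECLARE @nc_indexes int = 6;      -- nonclustered indexes that each carry the clustering key

SELECT  k.key_type,
        -- key bytes stored once per row in the base table
        CAST(@rows * k.key_bytes / 1048576.0 AS decimal(10, 2)) AS base_table_mb,
        -- key bytes repeated in every nonclustered index entry
        CAST(@rows * k.key_bytes * @nc_indexes / 1048576.0 AS decimal(10, 2)) AS nc_indexes_mb,
        -- base table plus all nonclustered indexes
        CAST(@rows * k.key_bytes * (1 + @nc_indexes) / 1048576.0 AS decimal(10, 2)) AS total_mb
FROM    (VALUES ('INT', 4), ('GUID', 16)) AS k(key_type, key_bytes);
```

Changing @rows or @nc_indexes shows how quickly the gap grows for larger tables or more heavily indexed ones.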

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.

Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:

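CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
 MyINT INT IDENTITY(1,1) NOT NULL
 -- ... add more columns as needed ...
);

-- The GUID is the logical primary key, but explicitly NOT the clustered index
ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID);

-- The small, ever-increasing INT defines the physical ordering of the table
CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT);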

Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED.

This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!

openshac
marc_s
  • This is a great answer, one thing I'd mention is that being able to generate the key before insert is frequently useful. Using "newsequentialid()" can help with the clustering, but that requires an additional round-trip to SQL. So another benefit of the "surrogate key" approach is that you can generate new ids, client-side, with fewer index fragmentation concerns. – Andrew Theken Feb 26 '14 at 15:15
  • Just curious. Would storing the `GUID` as a `char(32)` or `char(36)` PK solve this? Why / why not? – Fred Lackey Jul 15 '14 at 12:52
  • @FredLackey: no - same problem - since the key is totally random, excessive index fragmentation will happen. Just don't do it. – marc_s Jul 15 '14 at 13:41
  • The way I read this is that having both a non clustered uniqueidentifier column and the int identity column, FK's should also be uniqueidentifier? If you do that, when would you actually use the identity column directly, or would you not? – pinkfloydx33 Nov 01 '14 at 12:50
  • @marc_s If my GUID pk is non-clustered and I use this to join my tables (for portability across databases) then the clustered int index is redundant right? Do you foresee any problem with having no unique clustered index and doing as I describe? – A_L Feb 06 '15 at 11:54
  • @A_L: yes, a table without a clustering index is a **heap** - and that's really really bad for many reasons. Don't just toss your clustered index - it's important on so many levels! – marc_s Feb 06 '15 at 12:52
  • Little question, should the GUID now be used on joins, or the int id? My instinct tells me the GUID should be used, but I fail to see a technical problem using the int id... – Nicolas Belley Jun 27 '15 at 13:33
  • @NicolasBelley: the `int` is probably a bit more efficient, since it's 4x smaller in sheer size ... – marc_s Jun 27 '15 at 21:47
  • @marc_s but in a replication scenario, if the int column is identity, shouldn't we use the GUID since the int column can repeat itself across devices? – Nicolas Belley Jun 28 '15 at 11:51
  • Great information, but as with most things, the right choice depends upon the needs of your application. I do however feel like the article would be more balanced with discussion of the GuidComb strategy. The "cheap space isn't the point" article also has some great information, but I'm not sure the test scenarios are optimal for application developers. Using Identity over generating the key app side invariably leads to round-trips. I think comparisons around typical DDD object graph persistence scenarios would help give a more accurate picture to help with the decision process. – Derek Greer Jul 22 '15 at 21:44
  • This is an old thread, but might I add: don't just use a useless arbitrary INT as the clustering key. Use something useful like an incrementing date that is actually searched on, that has some relation to the data you're storing. You only get one clustering key, and if you choose the right one you'll get good performance – Nick.McDermaid Jul 09 '16 at 08:42
  • @marc_s although outdated, I totally agree with Nick.McDermaid, but maybe I am missing something?! I am very curious about your opinion and what Nick.McDermaid has to say!! – Kip ei Sep 25 '17 at 13:25
  • @Kipei: the main issue is the ***IF***: if you have such a natural value - then yes, you can use it as a primary key. ***BUT***: values like `DATETIME` for instance are **NOT** useful for a clustering key, since they have a 3.33ms accuracy only, and thus duplicates can exist. So in such a case, you **still** need an `INT IDENTITY` instead - therefore, I typically use that by default, since from my 20+ years of experience, a really usable *natural key* hardly ever really exists .... – marc_s Sep 25 '17 at 13:27
  • I am currently working on an `EF6 Code first` web application (SQL Server 2012 or 2017) and want to apply DDD concepts, so I need a unique key in advance, before inserting into the DB. Many recommend `UUID`, but I'm afraid of the performance issue. Could you please help me make the right decision? Should I use a GUID instead of an auto-increment key? – Anyname Donotcare Jun 26 '18 at 09:52
  • @AnynameDonotcare: I'd still strongly recommend using an auto-increment `INT` or `BIGINT`. What makes you think you *must know* the ID value before saving?? I fail to see how DDD or any other design approach mandates this...... – marc_s Jun 26 '18 at 11:05
  • `I've learned that I should keep my model in a valid state and when creating a new instance of a class it's recommended to put all required attributes as constructor parameters.` So I converted all my IDs to `GUID` instead of `int`. Should I use both of them? One as auto-increment to solve the technical issue and the GUID to solve the DDD issue? – Anyname Donotcare Jun 26 '18 at 11:10
  • @marc_s I've seen this method of adding a surrogate int key to cluster on discussed before. But I can't find anywhere which shows the performance gain in having a surrogate key clustered index over using a heap. Do you have any links to benchmark data? I know everyone says a heap is bad and performs worse, but also it seems there are many opinions out there. – Dale K Jul 05 '19 at 02:58
  • @DaleBurrell: check out https://www.sqlskills.com/blogs/kimberly/the-clustered-index-debate-continues/ - and any blog post by Kimberly Tripp for that matter - she's the "Queen of indexing" and I'm pretty sure there's performance testing numbers in her blog somewhere to show this very clearly – marc_s Jul 05 '19 at 03:36
  • Thanks @marc_s - I have read most of her stuff and didn't see data to back that up. But will look again. on your link she says "Oh – and if you arbitrarily add a column to use for clustering (maybe not as the primary key) that can help" - but doesn't expand on that. – Dale K Jul 05 '19 at 03:42
64

I've been using GUIDs as PKs since 2005. In this distributed database world, it is absolutely the best way to merge distributed data. You can fire and forget merge tables without all the worry of ints matching across joined tables. GUID joins can be copied without any worry.

This is my setup for using GUIDs:

  1. PK = GUID. GUIDs are indexed similar to strings, so high row tables (over 50 million records) may need table partitioning or other performance techniques. SQL Server is getting extremely efficient, so performance concerns are less and less applicable.

  2. PK Guid is NON-Clustered index. Never cluster index a GUID unless it is NewSequentialID. But even then, a server reboot will cause major breaks in ordering.

  3. Add ClusterID Int to every table. This is your CLUSTERED Index... that orders your table.

  4. Joining on ClusterIDs (int) is more efficient, but I work with 20-30 million record tables, so joining on GUIDs doesn't visibly affect performance. If you want max performance, use the ClusterID concept as your primary key & join on ClusterID.

Here is my Email table...

CREATE TABLE [Core].[Email] (
    [EmailID]      UNIQUEIDENTIFIER CONSTRAINT [DF_Email_EmailID] DEFAULT (NEWSEQUENTIALID()) NOT NULL,
    [EmailAddress] NVARCHAR (50)    CONSTRAINT [DF_Email_EmailAddress] DEFAULT ('') NOT NULL,
    [CreatedDate]  DATETIME         CONSTRAINT [DF_Email_CreatedDate] DEFAULT (GETUTCDATE()) NOT NULL,
    [ClusterID]    INT NOT NULL IDENTITY,
    CONSTRAINT [PK_Email] PRIMARY KEY NONCLUSTERED ([EmailID] ASC)
);
GO

CREATE UNIQUE CLUSTERED INDEX [IX_Email_ClusterID] ON [Core].[Email] ([ClusterID]);
GO

CREATE UNIQUE NONCLUSTERED INDEX [IX_Email_EmailAddress] ON [Core].[Email] ([EmailAddress] ASC);

Dale K
Robert J. Good
  • Could you explain the PK_Email constraint? Why you have ... NonClustered(EmailID ASC) instead of ...Nonclustered(ClusterID ASC) ? – Phil Sep 02 '17 at 15:47
  • You bet. Two main things going on with indexes: 1. Clustered on ClusterID - Orders your table on disk (0% fragmentation). 2. NonClustered on EmailID - Indexes the EmailID field to speed up GUID ID lookups. A GUID field lookup behaves string-ish, so an EmailID lookup would be slow without the index. – Robert J. Good Sep 03 '17 at 16:28
  • @RobertJ.Good I've seen this method discussed before i.e. adding a surrogate int key to cluster on. But I can't find anywhere which shows the performance gain in having a surrogate key clustered index over using a heap. Do you have any links to benchmark data? – Dale K Jul 05 '19 at 02:55
  • Hi @DaleBurrell, the clustered index is to prevent table fragmentation. Performance gain happens as the table naturally grows in order on disk, with low fragmentation. – Robert J. Good Aug 20 '19 at 20:10
  • @RobertJ.Good Is that a web application? What are you using in urls/hrefs? guid or int? – dariol Dec 05 '19 at 09:34
  • @dariol There are security implications, so drop the newsequentialid() and expose a NewId() Guid if no other choice (definitely not the Int.) I'd recommend a claims based and/or token approach, or even brute-force encryption for any identifiers. In short, avoid exposing any Ids, and avoid any value that can be guessed, or worse +1 to find the next record. – Robert J. Good Dec 23 '19 at 23:29
  • @RobertJ.Good when you mention "In this distributed database world, it is absolutely the best way to merge distributed data." do you mean you eventually merge the records to a master database? Wondering what happens to the ClusterID then, how do you handle duplicates once you merge the "source"? – jfrobishow Jan 28 '20 at 22:40
11

I am currently developing a web application with EF Core and here is the pattern I use:

All my classes (tables) have an int PK and FK. I then have an additional column of type Guid (generated by the C# constructor) with a non clustered index on it.

All the joins of tables within EF are managed through the int keys while all the access from outside (controllers) are done with the Guids.

This solution avoids showing the int keys in URLs while keeping the model tidy and fast.
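
At the database level, the pattern could look roughly like the sketch below. The table and column names (dbo.Customer, dbo.CustomerOrder, PublicId) are hypothetical, and the unique constraint on the Guid column is an assumption on top of the plain non-clustered index described above.

```sql
-- Hypothetical tables: int keys for joins, a GUID only for access from outside.
CREATE TABLE dbo.Customer (
    CustomerId int IDENTITY(1,1) NOT NULL,
    PublicId   uniqueidentifier  NOT NULL,   -- assigned by the application
    Name       nvarchar(100)     NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerId),
    CONSTRAINT UQ_Customer_PublicId UNIQUE NONCLUSTERED (PublicId)
);

CREATE TABLE dbo.CustomerOrder (
    CustomerOrderId int IDENTITY(1,1) NOT NULL,
    CustomerId      int               NOT NULL,   -- int FK used for joins
    PublicId        uniqueidentifier  NOT NULL,   -- assigned by the application
    CONSTRAINT PK_CustomerOrder PRIMARY KEY CLUSTERED (CustomerOrderId),
    CONSTRAINT UQ_CustomerOrder_PublicId UNIQUE NONCLUSTERED (PublicId),
    CONSTRAINT FK_CustomerOrder_Customer
        FOREIGN KEY (CustomerId) REFERENCES dbo.Customer (CustomerId)
);

-- A request arrives carrying the customer's GUID (example value only);
-- the lookup uses the GUID index, the join uses the int keys.
DECLARE @publicId uniqueidentifier = '6F9619FF-8B86-D011-B42D-00C04FC964FF';

SELECT  o.PublicId AS OrderPublicId
FROM    dbo.Customer      AS c
JOIN    dbo.CustomerOrder AS o
        ON o.CustomerId = c.CustomerId
WHERE   c.PublicId = @publicId;
```

Only the PublicId values ever leave the database layer, so the int keys stay an internal detail.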

Dale K
EricImhauser
3

This link says it better than I could and helped in my decision making. I usually opt for an int as a primary key, unless I have a specific need not to and I also let SQL server auto-generate/maintain this field unless I have some specific reason not to. In reality, performance concerns need to be determined based on your specific app. There are many factors at play here including but not limited to expected db size, proper indexing, efficient querying, and more. Although people may disagree, I think in many scenarios you will not notice a difference with either option and you should choose what is more appropriate for your app and what allows you to develop easier, quicker, and more effectively (If you never complete the app what difference does the rest make :).

https://web.archive.org/web/20120812080710/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html

P.S. I'm not sure why you would use a Composite PK or what benefit you believe that would give you.

JustinStolle
Matt
  • Totally agree!! But that means that if I have a GUID as PK or a Composite PK with GUID and other field is going to be the same right? – VAAA Aug 13 '12 at 16:24
  • The PK (index) would be made up of the two columns, but unless you have some business specific reason for doing this, it seems unnecessary. – Matt Aug 13 '12 at 16:29
  • BTW this question is one of the most polarizing and debated questions out there and therefore extremely difficult to get an answer for that you will feel 100% comfortable with. Either method comes with trade-offs, so good luck :) – Matt Aug 13 '12 at 16:43
2

If you use a GUID as the primary key and create a clustered index on it, then I suggest using NEWSEQUENTIALID() as the default value for that column.
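
A minimal sketch of that setup, using a made-up table name (dbo.SequentialGuidDemo). Note that NEWSEQUENTIALID() can only be used in a DEFAULT constraint, so the value is generated by the server rather than by the application; the OUTPUT clause returns it in the same statement.

```sql
-- Hypothetical demo table: clustered GUID primary key with a sequential default.
CREATE TABLE dbo.SequentialGuidDemo (
    Id      uniqueidentifier NOT NULL
            CONSTRAINT DF_SequentialGuidDemo_Id DEFAULT NEWSEQUENTIALID(),
    Payload nvarchar(50)     NOT NULL,
    CONSTRAINT PK_SequentialGuidDemo PRIMARY KEY CLUSTERED (Id)
);

-- Insert without supplying Id and read back the generated values.
INSERT INTO dbo.SequentialGuidDemo (Payload)
OUTPUT inserted.Id, inserted.Payload
VALUES (N'first'), (N'second');
```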

Dale K
AnandPhadke
1

Having sequential IDs makes it a LOT easier for a hacker or data miner to compromise your site and data. Keep that in mind when choosing a PK for a website.

Zonus
  • Can you provide any logic or evidence to back up this claim? I'm struggling to see how a sequential id might compromise security. – jonaglon Jan 28 '20 at 10:03
  • Sure, if you know the ID numbers are integers you can guess records in a DB sequentially. So if you query a single item, you can say that the next item is pk + 1. If you have random GUIDs, it will not follow a pattern. It would be nearly impossible to query other records than the one you previously queried (and know the PK). – Zonus Jan 28 '20 at 15:29
  • If a hacker can query your database you're already compromised, I fail to see how sequential id's make the situation worse. – jonaglon Jan 29 '20 at 09:37
  • No. That is not true. I do a lot with pen testing and am well known for catching hackers. Do I always use GUIDs and not int, no. But if I need to protect data I will rely on data techniques to protect as well as programming and there's a LOT of reasons to do this. Take this older example. When you see a URL like www.domain.com/edit/1012 that makes me cringe. The 1012 is the key of the record and can be switched out. With reactive sites this changes a bit but can still be seen. And it's complex to protect records at that point. Know your data and protect what needs to be protected. – Zonus Jan 30 '20 at 14:32
  • If a user can switch out 1012 for another number and see data they shouldn't then there is a very serious security issue, that issue isn't caused by the primary key choice but it is exacerbated by it. I do take your point, thank you for spelling it out. – jonaglon Jan 30 '20 at 15:27
  • I highlight this as a requirement for any app that deals with hypersensitive data like HIPAA or SOX. Relying on only programming as a security restraint is dangerous. It's best to use multiple methods when protecting sensitive data. – Zonus Jan 30 '20 at 15:50
  • You may use a GUID to locate a record at the web page, that is not the PK of the table. Using query parameter in a website should not define how you structure your DB schema. The PK has nothing to do with input and parameters in UI or backend system. – Panos Roditakis Jan 30 '20 at 21:37
0

Most of the time a GUID should not be used as the primary key for a table, because it really hurts the performance of the database. Here are some useful links on the performance impact of GUIDs and on using them as a primary key:

  1. https://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/
  2. https://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/
0

Well, if your data never reaches millions of rows, you are fine. If you ask me, I never use a GUID as a database identity column of any kind, PK included, even if you force me to design with a shotgun at my head.

Using a GUID as the primary key is a definitive scaling stopper, and a critical one. I recommend you check the database identity and sequence options. A sequence is table-independent and may provide a solution for your needs (MS SQL has sequences).
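
As a rough sketch of the sequence option (SQL Server 2012 or later), with made-up object names (dbo.DemoIdSequence, dbo.SequenceDemo): the key can either come from a column default or be fetched first, so that it is known before the INSERT. Fetching it still costs a round trip to the server.

```sql
-- Hypothetical sequence, independent of any single table.
CREATE SEQUENCE dbo.DemoIdSequence AS bigint START WITH 1 INCREMENT BY 1;

CREATE TABLE dbo.SequenceDemo (
    Id      bigint       NOT NULL
            CONSTRAINT DF_SequenceDemo_Id DEFAULT (NEXT VALUE FOR dbo.DemoIdSequence),
    Payload nvarchar(50) NOT NULL,
    CONSTRAINT PK_SequenceDemo PRIMARY KEY CLUSTERED (Id)
);

-- Option 1: let the default assign the key.
INSERT INTO dbo.SequenceDemo (Payload) VALUES (N'default-assigned');

-- Option 2: fetch the key first, so it is known before the INSERT.
DECLARE @id bigint;
SET @id = NEXT VALUE FOR dbo.DemoIdSequence;

INSERT INTO dbo.SequenceDemo (Id, Payload) VALUES (@id, N'pre-assigned');
```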

If your tables start reaching some dozens of millions of rows, e.g. 50 million, you will not be able to read/write information at acceptable speeds, and even standard database index maintenance would become impossible.

Then you need to use partitioning to be scalable up to half a billion or even 1-2 billion rows. Adding partitioning along the way is not the easiest thing: all read/write statements must include the partition column (full app changes!).

These numbers (50 million and 500 million) are of course for light select usage. If you need to select information in a complex way and/or have lots of inserts/updates/deletes, those could even be 1-2 million and 50 million instead, for a very demanding system. If you also add factors like full recovery model, high availability and no maintenance window, common for modern systems, things become extremely ugly.

Note at this point that the int limit of 2 billion looks bad, but int is 4 times smaller than a GUID and is a sequential type of data; small size and sequential order are the #1 factors for database scalability. And you can use bigint, which is half the size of a GUID but still sequential. Sequential order is what is really critically important, even more important than size, when it comes to many millions or a few billions of rows.

If the GUID is also clustered, things are much worse. Each newly inserted row is stored at an effectively random physical position.

Even when it is just a column, not the PK or part of the PK, indexing it causes trouble from a fragmentation perspective.

Having a GUID column is perfectly OK, like any varchar column, as long as you do not use it as part of the PK or, in general, as a key column to join tables. Your database must have its own PK elements and filter and join data using them; filtering by a GUID afterwards is also perfectly OK.

-1

Another reason not to expose an Id in the user interface is that a competitor can see your Id incrementing over a day or other period and so deduce the volume of business you are doing.

MikeW