Exploring the Right Fit: Choosing Primary Keys for Your Database
Recently, I had the opportunity to design and integrate a simple chat system. While sketching the architecture, I spent considerable time contemplating the choices for primary keys. This topic seems to resurface repeatedly, and upon reading other articles, I realised that it goes deeper than I had previously assumed. Although I am not entirely new to data design and its various pitfalls, I had never fully appreciated the profound impact of primary key selection on system performance and scalability. In this article, I want to share my thoughts on the choices I made and the reasoning behind them.
It's important to note that I am working with Postgres. While I believe most of this article can be applied to other databases, I will occasionally ignore this crucial detail and discuss the arguments at a higher level. However, keep in mind that some arguments may not be applicable outside of Postgres.
My work focused on integrating a new feature into an existing mature system, which means there is already a significant amount of context for the participating entities. The following figure and assumptions provide a simplified snapshot of the mental model I used during the design process. By sharing my thought process and the considerations that went into selecting primary keys, I hope to contribute to the ongoing discussion surrounding this aspect of database design.

Several key factors influence the choice of primary keys:
- High volume of events: The system is expected to handle a high volume of events with unpredictable bursts.
- Event ordering: Consumers expect to receive events in a specific order.
- Room-based retrieval: Most consumers will attempt to retrieve events for one or more specific rooms.
- Partial data retrieval: Consumers will often retrieve data partially, for example restricted to a time range.
- Distributed system: The system operates in a distributed environment, where complete events are constructed outside of the database.
- Access frequency: Older events will be accessed less frequently compared to newer events.
Based on these factors, I make the following assumptions:
- Table partitioning: The tables will be partitioned (vertically or horizontally) depending on the infrastructure constraints (a brief sketch of this follows the list).
- Table clustering: The tables might be clustered based on read/write metrics to optimise performance.
- Time-sortable primary key: The primary key must be sortable by time to facilitate efficient querying and ordering of events.
- External primary key generation: The primary key won't be (solely) generated at the database level, as events are constructed in a distributed system.
- Indexing performance: The indexing performance should be acceptable even during high bursts, and minimising storage overhead would be preferable.
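To make these assumptions a bit more tangible, here is a minimal sketch of what such a table could look like in Postgres. The table and column names are hypothetical and the monthly granularity is just an example; note that Postgres requires the partition key to be part of the primary key, which is why created_at appears in it.

```sql
-- Hypothetical, simplified events table: range-partitioned by time,
-- with a primary key that is sortable by time (e.g. a UUIDv7, discussed later).
CREATE TABLE chat_events (
    id         uuid        NOT NULL,
    room_id    bigint      NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    jsonb       NOT NULL,
    PRIMARY KEY (id, created_at)          -- partition key must be part of the PK
) PARTITION BY RANGE (created_at);

-- One partition per month; older partitions can be moved to cheaper storage.
CREATE TABLE chat_events_2024_06 PARTITION OF chat_events
    FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');
```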
Retrospective on Primary Key Data Types: Formalising Criteria
After gathering my thoughts and information about typical primary key data types for the above design, I looked back at the problem and tried to formalise the criteria I inherently considered. This retrospective aims to clarify what I was looking for in primary keys. Note that I am specifically focusing on sufficient criteria rather than necessary ones like uniqueness and irreducibility, which I assume to be implied.
Simplicity
The complexity of the primary key should be manageable and take the audience into account. While complexity can be hidden, it's important to ensure that the primary key remains understandable and usable for database administrators and developers. A subset of this characteristic is the appearance of primary keys, which is relevant in specific cases. For instance, in the case of a URL shortener, a 128-bit UUID may wrap across lines and be difficult to copy-paste due to dashes, while a simple integer does not have this problem.
Predictability
Predictability is considered from a security and business-metrics standpoint. A predictable primary key can expose sensitive information or business metrics, which may be undesirable. This is often brought up with auto-incrementing integers: when exposed, it's not only easy to guess the next key, but the keys also reveal how many came before.
Scalability
A scalable primary key should not require a coordinator, as this can become a bottleneck. The primary key should support horizontal scaling efficiently. The broader aspect of scalability is often discussed in terms of the synchronicity of key generation: clients essentially have to queue to obtain the next key, which can become a scaling problem in a highly contended distributed setup.
Size
The size of the primary key influences performance and the efficiency of secondary indexes. Smaller keys generally perform better and consume less storage. I want to emphasise that storage is more nuanced than the primary key simply occupying a certain number of bytes: since secondary indexes reference the primary key, its size ripples through the general performance of the system. Generally, smaller is better, but not always, for example when collisions and randomness have to be considered.
Index Performance
Sequential ordering of primary keys can enhance index performance. However, this depends on the generation method. For instance, generating keys in a non-sequential order (e.g. creating 2 before 1) can degrade the performance of B-Trees. UUIDs, in particular, can perform poorly with large datasets due to their random nature.
Collisions
The frequency and impact of collisions are critical. In high-frequency systems, even a low collision probability can be disastrous. For example, in a high-frequency payment system handling billions of events per day, a collision probability of 1 in 10⁹ per event could still result in multiple collisions per day, which is unacceptable.
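As a back-of-the-envelope check (my own arithmetic with an assumed throughput, not a figure from any specification): with a per-event collision probability $p$ and $n$ events per day, the expected number of collisions per day is roughly

$$
\mathbb{E}[\text{collisions per day}] \approx n \cdot p = 10^{10} \cdot 10^{-9} = 10 .
$$

The point is that a probability that sounds vanishingly small per event stops being small once it is multiplied by a very large event count.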
Adoption
The database should natively support the chosen primary key data type. I want to avoid storing keys in an inefficient format (e.g. as strings) and ensure that the database and client libraries can handle validation and ordering correctly. Additionally, it's beneficial to use well-supported and battle-tested client libraries.
With these criteria in mind, I can now move forward to discuss different types of primary keys, evaluating candidates.
Natural Identifiers: The Intuitive Primary Key?
When I first started with data design, I learned that natural primary keys are advantageous because they satisfy many requirements without introducing the ambiguity of artificial keys. However, in my experience, they rarely occur in a useful manner. In this specific case, using a timestamp with sufficient accuracy as a primary key seemed attractive initially. However, from a semantic perspective, the timestamp alone does not uniquely identify an event, even when setting aside collisions and accuracy. What I mean by this is that even if we used a sufficiently high-resolution timestamp according to the requirements, a unique event would always be a combination of user and time. The assumption here is that a user cannot cause multiple events simultaneously. This is a plausible assumption, and the consequences of an unlikely collision are negligible. On the other hand, the likelihood of two events occurring at the "exact" same time from different users is much higher. Without delving into mathematical estimates, which can be just a numbers game without plausible assumptions, the timestamp alone seems unsuitable as a primary key. Before completely discarding the timestamp and moving to a composite key, it's important to note that using a timestamp alone has other difficulties as well. For example, when letting the database generate the key, a transaction can generate the same timestamp for every row in the transaction if the default value is set to now(). Considering these factors, it becomes clear that relying solely on a timestamp as a primary key may not be the best approach. Instead, exploring composite keys or alternative strategies that incorporate additional unique identifiers, such as user information, could provide a more robust and semantically meaningful solution for uniquely identifying events in the system.
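This transaction behaviour is easy to verify in psql. Here is a small illustration of my own (not from the original design), contrasting now(), which is frozen at transaction start, with clock_timestamp(), which keeps advancing:

```sql
BEGIN;
SELECT now() AS txn_time, clock_timestamp() AS wall_clock;
SELECT pg_sleep(1);
-- txn_time is identical to the first call, wall_clock has moved on
SELECT now() AS txn_time, clock_timestamp() AS wall_clock;
COMMIT;
```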
The next logical step seems to be combining columns to create a unique primary key, which is generally referred to as a composite key. While this approach maintains the simplicity of a natural key in terms of semantics, it introduces a significant amount of complexity when working with the key. I have never had a pleasant experience working with composite keys, even when completely disregarding the performance implications, so I rarely use them at all. Nevertheless, I don't want to dismiss the option simply based on personal preference.
In my case, a possible composite key could be a combination of user, room, and timestamp, under the assumptions that a user can only be in one room at a time and, as mentioned earlier, a single user cannot produce multiple events simultaneously. However, even simple tasks become complicated when working with the resulting data. A prime example is foreign key relationships, where every referencing table has to include all columns of the composite key, as sketched below. Additionally, maintaining composite keys can be time-consuming, since adding new columns may require altering the primary key, which is not directly possible. I believe it's already evident why I personally do not favour composite keys.
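To illustrate the foreign-key point, here is a hypothetical sketch (table and column names invented for this example) of how the composite key ripples into a referencing table:

```sql
-- Composite natural key on the events table.
CREATE TABLE events (
    user_id    bigint      NOT NULL,
    room_id    bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb,
    PRIMARY KEY (user_id, room_id, created_at)
);

-- Every child table has to carry all three columns just to reference one event.
CREATE TABLE reactions (
    event_user_id    bigint      NOT NULL,
    event_room_id    bigint      NOT NULL,
    event_created_at timestamptz NOT NULL,
    emoji            text        NOT NULL,
    FOREIGN KEY (event_user_id, event_room_id, event_created_at)
        REFERENCES events (user_id, room_id, created_at)
);
```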
Lastly, let's consider what could be called a hybrid surrogate key, which is a mix of natural and artificial keys. This involves adding a column outside of the relevant domain but constructing the artificial key using components of the natural columns. While I have encountered such keys, I believe they combine the worst of both worlds. Query semantics get lost, and the disadvantages of both natural and surrogate keys are inherited. The only immediate advantage is the potential for size reduction, which is not mandatory and depends on how the key is constructed and stored. I have to admit that I don't have extensive experience with such keys, so there might be advantages I am not aware of.
The Case for Auto-Increments
A well-known source of artificial primary keys are auto-incrementing integers. In my career this was the most prominent type I came across when working with relational data; however, in recent years my work has shifted much more towards distributed architectures, where other types are increasingly employed.
In Postgres I tend to use a BIGSERIAL to implement auto-incrementing keys, but there are other techniques using sequences as well. There are obvious reasons to choose an integer: regarding my criteria above, integers are very simple, both to implement and to understand. Adoption is as good as it can get, as there is no database without a concept of natural numbers, at least none that I'm aware of. Furthermore, they are sortable, and broadly speaking there is not much work involved in generating an integer or comparing two integers. In terms of size there is the unspoken rule to only ever use a 64-bit BIGINT, since the overhead of dealing with overflows and the resulting migrations when using a 32-bit integer is not worth it. For illustration, let's assume my system receives 256 events per second. At that rate, a signed 32-bit integer runs out in roughly three months, while a signed 64-bit integer lasts for over a billion years. Collision is not really applicable as a criterion here, since by design serials are usually generated at the database. This implies that they have to be orchestrated to ensure monotonicity. As an interesting side note, it's not guaranteed that serials increase in the smallest step of one at a time; rolled-back transactions, for example, can introduce gaps.
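For completeness, a minimal sketch of both spellings in Postgres; the table is hypothetical, and GENERATED ... AS IDENTITY is the standard-SQL flavoured alternative to the older BIGSERIAL shorthand:

```sql
-- Modern identity column backed by an implicit sequence.
CREATE TABLE messages (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    body text NOT NULL
);

-- The older, equivalent shorthand:
-- CREATE TABLE messages (id BIGSERIAL PRIMARY KEY, body text NOT NULL);
```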
A popular argument against serials is semi-technical, or can at least be seen as two arguments. The more technical aspect is the security argument, which is about the predictability of identifiers. When someone can guess identifiers based on what data is visible, even in an abstract, imaginary system, it's clear that the more people know, the better they can exploit the system. However, when your strongest defence is obscurity, you're probably screwed anyway. While I think this is a valid argument, and pentesters surely love to find these identifiers, the defence could be shifted to areas other than data design.
The other argument derived from this is a business perspective and goes hand in hand with the leakage of implicit information. When your identifiers increase with your data, you naturally expose characteristics of your dataset when handing out rows. For example, when I get back comment 950 from a blog API, I might conclude that at least 949 comments existed before it. This is obviously undesirable in certain businesses, especially under high competition. However, there are techniques to obfuscate these numbers so that clients cannot conclude anything from the identifiers they see. While to this day I'm still waiting for a use case where such a strategy would be a good fit, they are worth mentioning. One clever technique is the use of Knuth's multiplicative integer hash, which maps your integer keys onto seemingly random integers in a reversible way, so that no domain information is retained and exposed.
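Here is a hedged sketch of that idea directly in Postgres (my own illustration, not something from the chat system): 2654435761 is a commonly used odd multiplier close to 2³²/φ, and because it is odd it has a modular inverse modulo 2³², so the original key can always be recovered.

```sql
-- Scramble sequential ids into apparently random 32-bit values before exposing them.
SELECT id,
       (id * 2654435761) % 4294967296 AS public_id
FROM generate_series(1, 5) AS id;
```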
My main concern for the chat design was one characteristic of scalability, along with some other requirements. Firstly, we run into the classic distributed-systems problem that the primary key requires one central orchestrator. We rely on incrementing a number to achieve uniqueness, which is a stateful operation: in order to increment, you have to know your predecessor, and it's hard to increment concurrently for multiple competing parties. Therefore, identifier generation either happens centrally at the database or involves some other complex orchestration. This is especially a problem when the system demands that multiple different parts should be able to construct a valid row without blocking each other.
The Case for Random Identifiers
While I like to argue about random identifiers in the abstract, it's easier to follow along when I refer directly to the most prominent example, which is arguably the Universally Unique Identifier (UUID), or Globally Unique Identifier (GUID), as defined by RFC 9562. When I refer to UUIDs in this section, I'm referring to version four, since it's the version that is solely about the concept of randomness. Other versions of the standard employ other concepts; for example, another variant that is often ridiculed is version one. It includes the MAC address of the generating host to guarantee collision-free identifiers, which takes the argument of exposed information to another level.
Typically, UUIDs are represented as hexadecimal numbers in a layout with braces or hyphens, where the first digit in the third group is fixed and denotes the version number. Postgres' output function uses the hyphenated layout; technically, however, UUIDs are just 128-bit labels.

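To see the canonical layout straight from Postgres, you can simply generate a version-four value (gen_random_uuid() is built in since Postgres 13; on older versions it comes from the pgcrypto extension). The example value in the comment is made up:

```sql
SELECT gen_random_uuid() AS id;
-- e.g. 6f1c2a9e-8d3b-4c5f-9a10-2b7e4d6c8f01
-- (the '4' right after the second hyphen marks the version)
```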
In my concrete case, there are a couple of benefits when using UUIDs. First of all, it's a popular and long-existing standard, thus it's widely adopted and well known. RFC 9562 includes a list of 16 similar alternatives which were taken into account when preparing the specification. Most of these identifiers are concerned with a particular problem, like trying to reduce storage size while keeping reasonable collision probability, semantic ordering, or better randomness. To not exceed the size of this article, I want to provide a short introduction to some alternatives in order to shed some light on what I mean by "improving on something". Starting with Cuid and Nanoid, which are popular alternatives in the realm of web development. On the one hand, Cuid has higher entropy when generating identifiers, thus reducing the probability of collisions. Nanoid, on the other hand, is focused on formatting and random generation by using a different layout and character alphabet. This reduces the chance of collisions and, due to its simple format without special characters, makes the identifier better for certain use cases. Nanoid's layout is designed to work well when included in URLs, and its alphabet can be customised per use case. This flexibility allows for optimising the generated identifiers to meet specific requirements, such as ensuring URL compatibility, improving human readability, or making it easier to copy and paste the identifier by omitting characters like dashes. There is an interesting article that discusses why the layout of unique identifiers matters and how it can be tailored to specific use cases, which you can read here.
Generally speaking, the biggest benefit of random identifiers is that they can be generated anywhere and do not require coordination. The direct cost of this is space: to achieve sufficient entropy without coordination, more bits are required than when simply counting up sequentially, which is why UUIDs are comparatively large. However, for most use cases this is a largely theoretical argument. When I examined my micro benchmarks, I concluded that unless I'm designing a massive-scale, billion-dollar data-grave, I can safely use UUIDs without significant performance concerns.
The main argument against using purely random identifiers in databases is related to how databases organise data and the locality of random values. The primary key is the determining factor in how the database "treats" the corresponding data. Some databases, such as MySQL (InnoDB), store the data itself in primary key order. In Postgres, the closest equivalent is clustering a table on an index with the CLUSTER command, which reorders the data once but does not maintain that order afterwards. When working with data stored in order, inserting random values inevitably causes problems at larger scale: random placement leads to fragmentation, constant reorganisation of pages, and significant I/O overhead, because the engine has to keep rearranging data to maintain the order dictated by the primary key. As a result, the database performs more disk seeks and read/write operations, which can significantly impact performance, especially as the dataset grows.
Even when data is not stored in order, random insertion is not optimal for indexing. The B-Tree data structure, commonly used for indexes, relies on keeping its pages balanced and reasonably full to maintain its beneficial characteristics. Monotonic inserts are friendly to this structure: when keys arrive in monotonically increasing or decreasing order, new keys are always appended to the rightmost or leftmost leaf node, because each new key is greater than (or less than) all existing keys. The tree's structure remains mostly unchanged, and pages only need to be split when that single edge leaf becomes full, which is relatively infrequent and affects only a small portion of the tree. Random inserts, on the other hand, land on arbitrary leaf pages: pages all over the tree fill up and split, keys are redistributed, the working set of hot pages grows far beyond what fits in cache, and performance decreases. While it's not mandatory to use a B-Tree index for primary keys, it's the default choice for a reason. B-Tree indexes offer beneficial characteristics, such as efficient range queries and the ability to enforce unique constraints, which are important for primary keys and should not be traded off lightly.
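A quick, hedged way to make this visible on your own machine (numbers will differ by version and configuration, and part of the gap simply comes from uuid being twice as wide as bigint) is to fill two indexes, one with monotonic keys and one with random keys, and compare their sizes:

```sql
CREATE TABLE seq_keys  (id bigint PRIMARY KEY);
CREATE TABLE rand_keys (id uuid   PRIMARY KEY);

INSERT INTO seq_keys  SELECT g                 FROM generate_series(1, 1000000) AS g;
INSERT INTO rand_keys SELECT gen_random_uuid() FROM generate_series(1, 1000000);

-- The randomly filled index typically ends up considerably larger, because random
-- inserts leave leaf pages only partially full after splits, while monotonic
-- inserts pack the rightmost pages densely.
SELECT pg_size_pretty(pg_relation_size('seq_keys_pkey'))  AS seq_index,
       pg_size_pretty(pg_relation_size('rand_keys_pkey')) AS rand_index;
```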
In summary, it's worth noting that random keys, such as UUIDs, can have performance implications in databases. However, before diving into the details of these implications, it's essential to evaluate whether the system can handle potential performance issues and whether performance, specifically insert performance, will be a concern at all. In my concrete case, the scale of the system does not justify a more sophisticated analysis. I simply benchmarked and tested how far I could push the performance using UUIDs in a lab environment and concluded that it would be sufficient for the intended use cases. In addition, in my scenario the system extends beyond the web, occasional collisions are tolerable, and Postgres already has built-in support for UUIDs.
Moving Beyond Random Values
Randomness implies that order, sorting, and comparing values cannot be guaranteed in a semantic way. Sure, you can sort a UUID lexicographically when considering its text representation, but not by any characteristic like time or space. Thus, without secondary indexes, the primary key is not very flexible when considering query requirements. These limitations are well known and summarised in RFC 9562, which makes the RFC a good resource to get an idea of the evolution of these identifiers. The given problems and some other UUID-specific aspects led to the specification of newer versions solving some of the prior problems. Specifications like the Universally Unique Lexicographically Sortable Identifier (ULID) were considered in the development of RFC 9562. Version seven of UUID serves as a good example of an identifier that combines both order and randomness, making it suitable for use cases where both properties are desired.
Before diving into identifier variants I want to point out that I often use the terms order or sorting in this article very broadly. However, when looking at the details, this problem domain has a more precise notion of ordering, termed k-sorting. A sequence is considered k-sorted when each element is at most k positions away from its correctly sorted position. In other words, if an element in a k-sorted sequence were to be sorted, it would need to move no more than k positions to the left or right to reach its final sorted location. This means that when I state an identifier is sortable or has an order, it's often only "roughly sortable". The concept of k-sorting is particularly relevant when considering identifiers that include a timestamp component. Due to the potential for clock drift and limited precision, the order of the identifiers may not be perfect. However, they will still be roughly sorted, with each element being close to its correct position in the sequence. This is sufficient for many use cases where strict ordering is not required, but having a general sense of order is beneficial.
This section presents a straightforward concept: the identifier should remain distributed while addressing the issues associated with randomness. A simple initial approach is to make the identifier sortable and quasi-monotonic for inserts by prepending a deterministic value that can still be produced in a distributed fashion, such as the creation timestamp. Ignoring the layout details, UUID version seven, sketched below, is a popular variant of this concept.

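Since I can't reproduce the layout figure here, here is a hedged Postgres sketch of the idea instead: take 16 random bytes and overwrite the first 48 bits with the Unix-epoch timestamp in milliseconds. This is my own illustration and not a spec-complete UUIDv7 (the version and variant bits are not set correctly); recent Postgres releases and extensions ship proper generators, which should be preferred.

```sql
SELECT encode(
         overlay(
           uuid_send(gen_random_uuid())   -- 16 random bytes
           PLACING substring(             -- 48-bit millisecond timestamp
             int8send((extract(epoch FROM clock_timestamp()) * 1000)::bigint) FROM 3)
           FROM 1 FOR 6
         ),
         'hex'
       )::uuid AS uuid_v7_like;
```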
Interestingly, the specification describes some parameters to tune which are worth discussing. While the first 48 bits are a Unix-epoch timestamp, the remaining 74 bits are filled with random values. However, the specification suggests different methods depending on the use case: either monotonicity is more important, or randomness. If the use case demands more entropy, use more of the space for randomness or stronger algorithmic guarantees; if it demands stronger monotonicity, use a randomly seeded counter. Taking into account the previously discussed drawbacks, the tradeoffs repeat: more randomness means less performance, while increasingly strict monotonicity improves performance but introduces predictability. This seems to be the central point of discussion when looking at identifiers like Cuid2, whose GitHub repository emphasises the argument against monotonicity. Another close relative is ULID, which partly inspired version seven of UUID. When comparing ULID and UUID, it appears that while both introduce a time-based component, UUID aims to be more rigid in its specification; ULID has some open issues regarding the distributed generation function and gaps in the specification. So, generally speaking, I think it's advisable to use UUID over the various ULID implementations.
Building upon the concept of compositional identifiers, we can explore designs that prioritise distributed, concurrent generation across multiple hosts. These variants include a machine identifier instead of, or in combination with, a timestamp component. This approach is exemplified by Snowflake identifiers. They are more compact than UUIDs, have a different structure, and are composed of three main parts:

- Timestamp (41 bits): Represents the milliseconds since a custom epoch (e.g. November 4th, 2010). This provides a time-based component that ensures the IDs are roughly sortable, but is potentially more compact depending on the epoch chosen.
- Worker (10 bits): Represents the unique identifier of the worker (or machine) that generated the identifier. This allows multiple machines to generate identifiers in parallel without collisions.
- Sequence number (12 bits): Represents a sequence number that increments for every identifier generated within the same millisecond by the same worker. This ensures that multiple identifiers generated by the same worker within the same millisecond are unique.
This structure differs from UUIDs in several ways. First, Snowflake identifiers do not include a purely random component. Instead, they combine a timestamp, a worker identifier, and a sequence number to generate unique identifiers that are sortable and can be generated in a distributed manner. Second, they are more compact, using only 64 bits compared to the 128 bits used by UUIDs (a small sketch of the bit composition follows the list below). The combination of timestamp, worker identifier, and sequence number in Snowflake provides several advantages:
- Monotonic sequence: The timestamp component ensures that IDs are roughly sortable, as they are generated in a monotonically increasing order.
- Semantic order: The timestamp also provides a semantic meaning to the order of the identifier, as they represent the creation time of the entity.
- Distributed guarantees: The worker component allows multiple machines to generate identifiers in parallel without collisions.
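To make the bit layout concrete, here is a hedged sketch of composing such an identifier with plain bit operations; the epoch is Twitter's original custom epoch, and the worker id and sequence number are made-up values:

```sql
SELECT (
         -- milliseconds since the custom epoch, shifted past worker + sequence bits
         (((extract(epoch FROM clock_timestamp()) * 1000)::bigint - 1288834974657) << 22)
         | (42 << 12)   -- 10-bit worker id
         | 7            -- 12-bit sequence number within the same millisecond
       ) AS snowflake_id;
```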
This compact and structured form is particularly useful in large-scale distributed systems, such as Twitter or Discord, where they have been adopted. In this context, it's often important to consider the implications of time precision and clock drift of machines. Perfect synchronisation and unlimited precision may not be available to the average system. However, for most people, including myself, this is a theoretical consideration, as we are not working on extremely large-scale software systems at the time of writing.
Before ending this section, I want to provide some inspiration on how the prior identifiers can be extended without a specific database-related problem in mind. A well-known identifier format, which many readers may be familiar with, is the one used by Stripe. Consider the example of a customer object from their documentation, where the identifier carries a type-specific prefix such as cus_.

The central idea of this design, from what I know, is the clarity or transparency of the identifier: a customer identifier can be immediately associated with the customer entity without any deeper knowledge of the domain. However, this concept can stay compatible with UUID, or whatever identifier it's based on, when the prefix is not incorporated into the database layer. A good example of a generalisation of this concept is typeid.
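As a minimal sketch of keeping the prefix out of the database layer (the table, column, and prefix here are hypothetical, loosely imitating the Stripe style rather than reproducing it), the prefix can simply be rendered at query or API time:

```sql
-- Store a plain UUID, expose a prefixed, dash-free identifier to clients.
SELECT 'cus_' || replace(id::text, '-', '') AS public_id
FROM customers
LIMIT 1;
```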
A Useful Trick Along the Way
As a final reflection on identifier design, I want to briefly explore some considerations unique to my initially proposed chat system. While deciding to use a variant of UUID was relatively straightforward, I encountered a challenge when enabling optimistic updates. To provide a seamless user experience where messages appear instantly without waiting for server confirmation, generating identifiers on the client side becomes necessary. However, this approach introduces potential complexities and safety concerns.
To address this issue, I devised a mechanism where the server generates a separate primary key while still allowing the client to use UUIDs for optimistic updates. When a user sends a new message, the client generates a UUID and optimistically displays the message in the chat interface. The client sends the UUID along with the message data to the server. The server then creates the message record using the client-provided UUID and generates a separate primary key for the message, storing both the UUID and the primary key in the database record.
Upon successful creation of the message record, the server sends the generated primary key back to the client as part of the response. The client stores the primary key alongside the UUID for future reference. For subsequent actions related to the message, such as edits or deletions, the client sends the UUID together with the primary key to the server. The server uses the primary key to quickly locate the message record in the database and verifies that the provided UUID matches the stored one. If there is a match, the server proceeds with the requested action. If there is no match, the server rejects the request, indicating a potential data inconsistency or tampering attempt.
To enable fast lookups based on the UUID, I implemented a hash index on the server that maps UUIDs to their corresponding primary keys. This hash index allows for efficient O(1) lookups, enabling the server to quickly retrieve the primary key associated with a given UUID. By keeping the hash index in memory, the server can perform these lookups without accessing the disk, minimising latency. However, I acknowledge that the hash index introduces an additional overhead during message insertion. The server needs to generate the primary key and store it alongside the UUID, which may slightly increase the latency of insert operations. Nevertheless, I consider this trade-off acceptable given the benefits of data integrity and efficient lookups.
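For reference, here is a hedged sketch of the corresponding table (names are hypothetical and simplified). The in-memory UUID-to-key map described above lives in the application; as a database-level alternative one could also put a hash index on the client-supplied UUID, since it is only ever queried by equality:

```sql
CREATE TABLE messages (
    id          uuid        PRIMARY KEY,       -- server-generated key
    client_uuid uuid        NOT NULL UNIQUE,   -- client-generated, for optimistic updates
    room_id     bigint      NOT NULL,
    body        text        NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Optional: an equality-only lookup path by the client UUID
-- (the UNIQUE constraint above already creates a B-Tree index, so this is illustrative).
CREATE INDEX messages_client_uuid_hash ON messages USING hash (client_uuid);
```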
This approach aims to leverage the advantages of using UUIDs on the client for optimistic updates while ensuring data integrity and efficient lookups on the server side. The server generates a separate primary key, maintains a hash index for fast UUID-to-primary key lookups, and validates the UUID-primary key pair on subsequent requests. Although there is an insert performance footprint, I believe the overall benefits of this approach outweigh the drawbacks in this context.
Some Final Thoughts
After writing this article I realise why so much content in this domain tends to be very superficial and vague. While it seems simple at first sight, the choice of the right primary key comes with a surprising number of tradeoffs, variants, technical details, and implications. However, this article is pretty much a dressed-up version of my very fragmented notes on the topic, and I hope it provides added value for others who are looking to gain a deeper understanding of primary key selection. The goal was to go beyond the typical high-level advice and dive into the nuances and considerations that factor into choosing an appropriate primary key for a given database table or entity. By examining the different types of keys, their pros and cons, and the scenarios they are best suited for, readers should come away with a more comprehensive view of this fundamental aspect of database design.
Do not hesitate to contact me if you find something wrong; I tried to be as precise as possible within the format of a blog post. For that, connect with me on mastodon or x.com.
Further Readings & Resources
RFC 9562: Universally Unique IDentifiers (UUIDs)
Identity Crisis: Sequence v. UUID as Primary Key
GitHub – ulid/spec: The canonical spec for ulid
The Problem with Using a UUID Primary Key in MySQL – PlanetScale
GitHub – paralleldrive/cuid2: Next generation guids. Secure, collision-resistant ids optimized for…
Why we chose NanoIDs for PlanetScale's API – PlanetScale
GitHub – twitter-archive/snowflake: Snowflake is a network service for generating unique ID numbers…
GitHub – jetify-com/typeid: Type-safe, K-sortable, globally unique identifier inspired by Stripe…