Unique IDs... Which is the best?

When thinking of generating unique IDs, the very first approach that comes to mind is using auto-incrementing integer IDs in traditional databases. This works well for small systems or systems with a single database. However, you cannot use this method when building large-scale distributed applications.

The reason is that independent database nodes have no built-in way to coordinate what the next sequence number should be. Each node generates its own sequence (e.g., 1, 2, 3), so the same ID ends up pointing to different records across the system.

Multi-Master Replication

We can implement a variation of the auto-increment feature where each node starts from its own offset and, instead of increasing the next ID by 1, increases it by k, where k is the total number of nodes in the distributed system.

Because every node starts at a different offset and each new ID differs by k from the node's previous one, IDs generated on two different nodes never collide. With k = 3, for example, node 1 produces 1, 4, 7, node 2 produces 2, 5, 8, and node 3 produces 3, 6, 9. This solves the collision problem of the basic auto-increment approach. However, this method still has several issues:

It does not scale well when a server is added or removed (as k would need to change).
IDs do not increase chronologically across the entire system.
It is hard to scale across multiple data centers.
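
The stride idea can be sketched in a few lines. This is a minimal illustration, not a database feature: the function name node_ids and the cluster size k = 3 are assumptions, and each node is assumed to start at its own offset in 1..k.

```python
from itertools import count, islice


def node_ids(node_index: int, k: int):
    """Yield IDs for one node: start at its own offset, then step by k."""
    if not 1 <= node_index <= k:
        raise ValueError("node_index must be in 1..k")
    return count(start=node_index, step=k)


k = 3  # hypothetical cluster size
# First four IDs handed out by each node.
per_node = {n: list(islice(node_ids(n, k), 4)) for n in range(1, k + 1)}
all_ids = [i for ids in per_node.values() for i in ids]

for node, ids in per_node.items():
    print(f"node {node}: {ids}")
```

Note how adding a fourth node would require changing k on every existing node, which is exactly the first issue listed above.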

Ticket Server

A Ticket Server is another way of generating unique IDs. This approach, used by services like Flickr and Etsy, uses a centralized, dedicated server to generate globally unique primary keys.

It is relatively easy to implement and works well for small- to medium-scale systems. However, since it is centralized, it creates a single point of failure (SPOF): any issue with the ticket server can halt or bring down the whole system. You could run several ticket servers, but that brings its own challenges, such as keeping them synchronized.
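
To make the idea concrete, here is a minimal in-process sketch of a ticket server. It is not how Flickr or Etsy implement theirs; the TicketServer class is a hypothetical stand-in in which a lock plays the role of the single central database, and four threads play the role of application servers.

```python
import threading


class TicketServer:
    """Hypothetical in-process stand-in for a central ticket server."""

    def __init__(self, start: int = 1) -> None:
        self._next = start
        self._lock = threading.Lock()

    def next_id(self) -> int:
        # Every caller goes through the same lock, so no ID is issued twice.
        with self._lock:
            ticket = self._next
            self._next += 1
            return ticket


server = TicketServer()
issued = []


def worker() -> None:
    for _ in range(1000):
        issued.append(server.next_id())


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(issued), len(set(issued)))
```

Every request funnels through one place, which is what makes the IDs unique and also what makes the server a single point of failure.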

UUID (Universally Unique Identifier) v4

A UUID is a 128-bit number used to identify information in computer systems. UUIDs have a very low probability of collision. An example is 1e653ffd-5b98-4648-8571-5848edb0b7fe.

It consists of a 128-bit (16-byte) block of data, where 6 bits are reserved for version and variant flags, and the remaining 122 bits are for random data.
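
The layout can be checked with Python's standard uuid module. The sketch below parses the example UUID from above, reads the 4 version bits and 2 variant bits directly from the integer, and estimates the collision probability with the usual birthday approximation. The helper name collision_probability and the figure of one billion IDs are assumptions chosen for illustration.

```python
import math
import uuid

u = uuid.UUID("1e653ffd-5b98-4648-8571-5848edb0b7fe")

version = (u.int >> 76) & 0xF    # 4 version bits
variant = (u.int >> 62) & 0b11   # 2 variant bits (binary 10 for standard UUIDs)
random_bits = 128 - 4 - 2        # what is left for random data


def collision_probability(n: int, bits: int = 122) -> float:
    """Birthday approximation: chance of at least one collision among n IDs."""
    return -math.expm1(-n * (n - 1) / (2 * 2**bits))


print(version, bin(variant), random_bits)
print(collision_probability(10**9))  # tiny, on the order of 1e-19
```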

Since UUID v4 is random, it can be generated without coordination between server nodes, thus avoiding synchronization issues. This makes it highly scalable, as each node can generate IDs independently. However, this approach still has a few issues:

They are usually handled as 36-character strings rather than integers, which is less efficient to store and compare than a 64-bit integer.
UUID v4 is purely random, meaning the IDs do not increase chronologically and are therefore not time-sortable.
Their 128-bit size is larger than a 64-bit integer, which can impact storage and index size.

UUID v4 provides good random data distribution. (Other random ID formats include Nano ID and CUID2.) By "random data distribution," we mean that if many UUID v4 IDs are generated and inserted, they will be scattered randomly throughout the database's index. This is good for distributed (hash-based) databases but can lead to index fragmentation and increased disk I/O in traditional single-database (B-Tree) systems.

❄️ Twitter Snowflake ID

Twitter Snowflake is a time-based unique ID generation algorithm developed by X (formerly Twitter). It was designed to address the problem of generating unique IDs for distributed systems at scale.

Snowflake was designed to overcome the shortcomings of other methods. For example, Multi-Master Replication is hard to scale, and Ticket Servers create a single point of failure. While UUID v4 avoids these issues, its random nature means it isn't time-sortable.

The Snowflake ID is a 64-bit ID designed for high-throughput, distributed systems. It consists of:

Sign Bit (1 bit): Always 0, so the ID stays positive when stored as a signed 64-bit integer. It is reserved for future use.
Timestamp (41 bits): Milliseconds since a custom epoch. Twitter's default epoch is 1288834974657 (Nov 04, 2010). This is the most important part, as the timestamp ensures that IDs are sortable by time. 41 bits provide for ~69 years of IDs.
Datacenter ID (5 bits): Allows for 2^5 = 32 datacenters.
Machine ID (5 bits): Allows for 2^5 = 32 machines per datacenter.
Sequence number (12 bits): A counter for IDs generated within the same millisecond on the same machine. It is reset to 0 every millisecond and allows for 2^12 = 4096 IDs per millisecond, per machine.

Datacenter and machine IDs are typically chosen at startup and fixed. Any accidental change in these values can lead to ID conflicts.
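
The bit layout above can be sketched as follows. This is a minimal illustration under stated assumptions, not Twitter's implementation: the class name SnowflakeGenerator, the decode helper, and the choice to raise an error when the clock moves backwards are all ours.

```python
import threading
import time

EPOCH_MS = 1288834974657          # Twitter's default epoch, from the text
DATACENTER_BITS = 5
MACHINE_BITS = 5
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1                           # 4095
MACHINE_SHIFT = SEQUENCE_BITS                                     # 12
DATACENTER_SHIFT = SEQUENCE_BITS + MACHINE_BITS                   # 17
TIMESTAMP_SHIFT = SEQUENCE_BITS + MACHINE_BITS + DATACENTER_BITS  # 22


class SnowflakeGenerator:
    """Hypothetical generator following the 1 + 41 + 5 + 5 + 12 layout."""

    def __init__(self, datacenter_id: int, machine_id: int) -> None:
        if not 0 <= datacenter_id < (1 << DATACENTER_BITS):
            raise ValueError("datacenter_id must fit in 5 bits")
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 5 bits")
        self.datacenter_id = datacenter_id
        self.machine_id = machine_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    @staticmethod
    def _now_ms() -> int:
        return time.time_ns() // 1_000_000

    def next_id(self) -> int:
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                # Assumption: refuse to issue IDs rather than risk duplicates.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0  # new millisecond, counter restarts
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << TIMESTAMP_SHIFT)
                | (self.datacenter_id << DATACENTER_SHIFT)
                | (self.machine_id << MACHINE_SHIFT)
                | self._sequence
            )


def decode(snowflake_id: int) -> dict:
    """Split an ID back into its four fields."""
    return {
        "timestamp_ms": (snowflake_id >> TIMESTAMP_SHIFT) + EPOCH_MS,
        "datacenter_id": (snowflake_id >> DATACENTER_SHIFT) & 0x1F,
        "machine_id": (snowflake_id >> MACHINE_SHIFT) & 0x1F,
        "sequence": snowflake_id & MAX_SEQUENCE,
    }


gen = SnowflakeGenerator(datacenter_id=3, machine_id=7)
first = gen.next_id()
print(first, decode(first))
```

Because the timestamp sits in the highest bits, comparing two IDs as plain integers compares their creation times first, which is what makes Snowflake IDs time-sortable.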

ULID (Universally Unique Lexicographically Sortable Identifier)

ULID is also a 128-bit (16-byte) ID. Like Twitter Snowflake and UUID v7, it leads with a timestamp.

It uses a 48-bit timestamp (Unix epoch in milliseconds), followed by 80 bits of randomness.
It is lexicographically sortable and has time-based locality.
It is not a standard UUID: it does not set the UUID version and variant bits, and its text form is 26 Base32 characters, so migration can be difficult.
It can be generated offline without coordination.
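
A ULID can be sketched with nothing but the standard library. The details below that go beyond this post (80 random bits after the timestamp, and a 26-character Crockford Base32 text form) come from the ULID specification; the function names encode_ulid and new_ulid are our own.

```python
import os
import time

# Crockford Base32 alphabet: no I, L, O, or U. It is in ASCII order,
# so comparing encoded strings matches comparing the underlying numbers.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms: int, randomness: bytes) -> str:
    """Pack a 48-bit timestamp and 80 random bits into 26 Base32 characters."""
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes (80 bits)")
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits, the top 2 are always zero
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def new_ulid() -> str:
    """Generate a ULID for the current time, with no coordination needed."""
    return encode_ulid(time.time_ns() // 1_000_000, os.urandom(10))


print(new_ulid())
```

Since the timestamp occupies the leading characters, sorting ULIDs as plain strings sorts them by creation time at millisecond resolution.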

If you want a time-sortable ID that is native and specification-compliant (it is defined in RFC 9562), you should use UUID v7.

UUID (Universally Unique Identifier) v7

UUID v7 uses a 48-bit timestamp, 74 bits for randomness, and 6 bits for version and variant. The timestamp is based on the Unix epoch in milliseconds (not seconds). It is designed for high-load databases and distributed systems.
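
The layout can be assembled by hand and wrapped in Python's uuid.UUID type. This is a sketch under assumptions rather than a reference implementation: the function name uuid7_sketch is ours, and all 74 non-timestamp bits are filled with random data (the specification also allows using part of them as a counter).

```python
import os
import time
import uuid
from typing import Optional


def uuid7_sketch(timestamp_ms: Optional[int] = None) -> uuid.UUID:
    """Build a version 7 UUID: 48-bit timestamp, version, variant, 74 random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp must fit in 48 bits")
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 are used
    rand_a = rand >> 68                # top 12 bits
    rand_b = rand & ((1 << 62) - 1)    # low 62 bits
    value = (
        (timestamp_ms << 80)   # bits 127..80: Unix time in milliseconds
        | (0x7 << 76)          # bits 79..76: version 7
        | (rand_a << 64)       # bits 75..64: random
        | (0b10 << 62)         # bits 63..62: variant
        | rand_b               # bits 61..0: random
    )
    return uuid.UUID(int=value)


earlier = uuid7_sketch(1_700_000_000_000)
later = uuid7_sketch(1_700_000_000_001)
print(earlier, later, sep="\n")
```

Two IDs created in different milliseconds sort by creation time both as UUID values and as strings. Within the same millisecond, this sketch leaves the order to chance.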

UUID v7 is best for systems that require records to be stored in the order they were created, such as logs, database indexes, and audit trails. UUIDs are supported as a native type by many databases; alternatively, a Binary(16) data type can be used to store them.

Which ID is the best?

Ultimately, we are left with two kinds of IDs: random and time-based sortable. The best choice depends on the kind of database you are using.

B-Tree based Databases (MySQL, PostgreSQL, etc.): Time-based IDs are best. These databases keep index keys in sorted order. Random IDs are bad here because each insert lands on an arbitrary index page, causing page splits, fragmentation, and extra disk I/O that slow down INSERT operations. Sequential IDs are fast because they are simply appended to the end of the index.
Hash-Based Distributed Databases (DynamoDB, Cassandra, etc.): Random IDs (UUID v4) are the safest choice. These databases "distribute" data by mapping the partition key to a server (node). Time-based IDs can cause write hotspots whenever placement preserves key order or relies on a time-derived prefix (for example, range-based sharding or a time bucket used as the partition key): all new IDs share a similar prefix (the timestamp) and land on the same node, making one server do all the work. Random IDs spread the load evenly regardless of the placement scheme.
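
The hotspot effect is easiest to see when placement follows the leading bits of the key. The sketch below is a deliberately simplified model and not how any particular database routes writes: it assumes a hypothetical 4-node cluster that picks a node from the top 2 bits of a 63-bit key, and compares Snowflake-style time-based IDs with random ones.

```python
import random
from collections import Counter

EPOCH_MS = 1288834974657
NOW_MS = 1_700_000_000_000  # fixed "current time" so the run is repeatable


def time_based_id(now_ms: int, sequence: int) -> int:
    """Snowflake-style key: timestamp in the high bits, counter in the low bits."""
    return ((now_ms - EPOCH_MS) << 22) | sequence


def node_for(key: int) -> int:
    """Hypothetical order-preserving placement: top 2 bits of a 63-bit key."""
    return key >> 61


rng = random.Random(42)
time_keys = [time_based_id(NOW_MS + i // 100, i % 100) for i in range(10_000)]
random_keys = [rng.getrandbits(63) for _ in range(10_000)]

time_counts = Counter(node_for(k) for k in time_keys)
random_counts = Counter(node_for(k) for k in random_keys)

print("time-based:", dict(time_counts))
print("random:    ", dict(random_counts))
```

In this model every time-based key lands on a single node, while the random keys split roughly evenly across all four.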

The quest for the "perfect" unique ID reveals a core principle of system design: there is no single best solution, only the right one for the job. As we've seen, the choice between a random and a time-sortable ID is a critical decision that depends on your database architecture. Choosing a random ID for a B-Tree database can lead to crippling fragmentation, while using a sequential ID on a distributed, partitioned database can create debilitating hotspots. The best ID, therefore, is the one that aligns with how your data is stored and distributed.
