How can you safely send a message to a specific receiving party and guarantee that nobody else can eavesdrop or alter the contents? This is a fundamental problem for communication systems. For much of the traffic across the internet, the solution involves Transport Layer Security (TLS) and Public Key Infrastructure (PKI).

Together these components provide two essential security functions:

  • Encryption–making messages unreadable and tamperproof to anyone but the intended recipient
  • Authentication–allowing users to prove their identity to each other.

In this article, I explain the core ideas and illustrate them with a simple example: implementing a simple PKI hierarchy to enable TLS-secured traffic between the components of a three-node distributed SQL database cluster.

Key pairs

The fundamental mechanism of TLS is a pair of asymmetric cryptographic keys, usually referred to as a ‘key pair’ for short.

A private key is generated by a cryptographically secure pseudo-random number generator, meaning that it is unguessable for all practical purposes (so it must be both long enough and chosen in an unpredictable way). A public key is computed from the private key in such a way that:

  1. Text encrypted with the public key can be decrypted with the private key. Very usefully, this means that anyone who has the public key can use it to encrypt a message and send it to you, knowing* that you alone can decrypt it with the private key*.
  2. The private key can be used to compute a cryptographic signature of another file, a special hash that can be validated with the public key. Very usefully, this means that anyone who receives a message or file that was cryptographically signed with your private key (verified with the corresponding public key) knows* it must have come from you*.

* This knowledge rests on the all-important assumption that only you have the private key. The entire system depends on you keeping your private key private while sharing your public key. If the private key is stolen, you can be impersonated. One implication is that for the key pair to be considered cryptographically secure, the private key must be unguessable even given the public key. As computers become more powerful, the bar for what is ‘unguessable’ is continually raised. The Internet Engineering Task Force (IETF), which originally designed the TLS protocol, continues to update its standards for secure TLS cipher suites. See the IETF’s Recommendations for Secure Use of Transport Layer Security (TLS) and Datagram Transport Layer Security (DTLS).
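The two properties above can be sketched with textbook RSA in a few lines of Python. This is a toy under loud assumptions: the primes are small, publicly known Mersenne primes, there is no padding, and the names `encrypt`, `decrypt`, `sign`, and `verify` are illustrative rather than taken from any library. It requires Python 3.8 or later and must never be used to protect real data.

```python
import hashlib

# Toy key pair built from two well-known Mersenne primes.
# INSECURE: real keys use large, secret, randomly chosen primes and padding.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q                            # modulus, part of both keys
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (needs the secret primes)

PUBLIC_KEY = (N, E)     # safe to share with anyone
PRIVATE_KEY = (N, D)    # must never leave its owner


def encrypt(message: bytes, public_key: tuple) -> int:
    """Property 1: anyone holding the public key can encrypt."""
    n, e = public_key
    m = int.from_bytes(message, "big")
    if m >= n:
        raise ValueError("message too long for this toy modulus")
    return pow(m, e, n)


def decrypt(ciphertext: int, private_key: tuple) -> bytes:
    """Only the private key holder can recover the message."""
    n, d = private_key
    m = pow(ciphertext, d, n)
    return m.to_bytes((m.bit_length() + 7) // 8, "big")


def digest(data: bytes, n: int) -> int:
    """Hash the data and reduce it so it fits under the modulus."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


def sign(data: bytes, private_key: tuple) -> int:
    """Property 2: only the private key holder can produce this value."""
    n, d = private_key
    return pow(digest(data, n), d, n)


def verify(data: bytes, signature: int, public_key: tuple) -> bool:
    """Anyone holding the public key can check the signature."""
    n, e = public_key
    return pow(signature, e, n) == digest(data, n)


ciphertext = encrypt(b"hello node 2", PUBLIC_KEY)
signature = sign(b"release-1.0", PRIVATE_KEY)
```

Decrypting `ciphertext` with `PRIVATE_KEY` returns the original bytes, and `verify` accepts `signature` only for the exact data that was signed.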

Asymmetric encryption, employing a key pair, can be contrasted with symmetric encryption, where the same key is used to encrypt and decrypt data. This is similar to a metal key in a mechanical lock-and-key system, used both to open and close a lock. Symmetric encryption is much more efficient than asymmetric encryption, but cannot be used to communicate across a network unless all parties know the key. With asymmetric encryption, by contrast, each party can share their public key, allowing anyone to send them messages that only they can decrypt with their private key.

The TLS protocol employs both asymmetric and symmetric encryption to combine the advantages of each:

  • First, asymmetric encryption employing the public/private key pairs is used to securely exchange random data that is used to create shared session keys.
  • Session keys are then used by both parties to (symmetrically) encrypt and decrypt data for the remainder of the session, and then discarded.

Since a session key is created with information shared using asymmetric encryption, and is used only for the duration of a session, TLS combines the best of both asymmetric encryption (the ability to establish an encrypted communication channel without having to have first securely shared a key) and symmetric encryption (computational efficiency).
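The two-phase flow described above can be sketched as follows. Everything here is an illustrative assumption: the server key pair is toy RSA built from public Mersenne primes, the "cipher" is a SHA-256 keystream XOR with an HMAC tag, and one key serves both purposes. Real TLS uses vetted ciphers such as AES-GCM, derives separate keys, and in current versions agrees on the shared secret with an ephemeral Diffie-Hellman exchange. The sketch requires Python 3.8 or later.

```python
import hashlib
import hmac
import secrets

# Server's toy key pair. INSECURE: illustration only.
P, Q = 2**61 - 1, 2**89 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))


def derive_session_key(shared_secret: int) -> bytes:
    """Both sides turn the exchanged random value into a symmetric key."""
    return hashlib.sha256(shared_secret.to_bytes(32, "big")).digest()


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data with a hash-derived keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(a ^ b for a, b in zip(data[offset:offset + 32], block))
    return bytes(out)


def seal(key: bytes, plaintext: bytes):
    """Encrypt, then attach a tag so tampering can be detected."""
    ciphertext = keystream_xor(key, plaintext)
    return ciphertext, hmac.new(key, ciphertext, hashlib.sha256).digest()


def open_sealed(key: bytes, ciphertext: bytes, tag: bytes) -> bytes:
    expected = hmac.new(key, ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message was altered in transit")
    return keystream_xor(key, ciphertext)


# Phase 1 (asymmetric): the client sends random data under the server's public key.
client_secret = secrets.randbelow(N)
on_the_wire = pow(client_secret, E, N)
server_secret = pow(on_the_wire, D, N)      # only the server can recover it

# Phase 2 (symmetric): both sides now hold the same session key.
client_key = derive_session_key(client_secret)
server_key = derive_session_key(server_secret)
ciphertext, tag = seal(client_key, b"SELECT 1")
```

The server can call `open_sealed(server_key, ciphertext, tag)` to recover the message; the session key would be thrown away when the session ends.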

Keys and identities

Encryption is powerful and important, but without authentication, it’s not very useful: if you don’t know who is sending you messages in the first place, it’s less useful to know that nobody else is tampering with the contents or eavesdropping on the conversation.

With key pairs, messages are not just encrypted: the encryption can be tied to an identity, provided you know who holds the private key that matches the public key you use to send the message. You can encrypt a message for a specific party with their public key, knowing that only the holder of the corresponding private key can decrypt it. You can also sign a message in a way that shows it was sent by you, the holder of the private key, and that can be verified by anyone who has access to the public key.

However, the ability of a key pair to authenticate a particular human, website, database node, or other party depends on the security of the key. If you encrypt a message with a public key, you know that only a holder of the matching private key can decrypt it. But even if the holder of the public key identifies themselves as your friend, your bank, your employer, or a government, on what basis can you trust that the key is actually held by the party you want to reach, rather than an imposter? If a private key is lost or stolen, it can potentially be used to impersonate the original holder, unless other systems are in place to rotate compromised credentials and keep them in accurate correspondence with identity.

The requirements for these ‘other systems’ to be secure are not simple or easy. They make up a complex set of social and technological problems which are collectively addressed today by the internet as a whole, as well as organizations across the world, by a mix of careful practices, social relationships, and digital technologies. This mixed solution is known as Public Key Infrastructure (PKI).

At its core, PKI is a hierarchy of cryptographically-backed trust relationships between several kinds of party:

  • Subscribers wish to prove their identity to others so they can offer secure, reliable services.
  • Relying parties wish to confidently identify and securely connect with subscribers.
  • Certificate Authorities (CAs) are responsible for verifying the identity of subscribers (which can include subordinate certificate authorities).

In practice, connections can be either one-sided, meaning that only one party must prove its identity before an encrypted session is established, or two-sided (or mutual), meaning that both must prove their identities. In one-sided TLS authentication, only the server must have a CA-signed key pair; clients can access information without any authentication, using a private key that is randomly generated for the session. This works well when the client needs no authentication, for example, for a public read-only website or API. It also works when the user has another way to authenticate, such as a username/password combination, or a Single Sign-On (SSO) flow.

In mutually authenticated TLS connections, each party must have a key pair issued by a CA that the other party trusts. This is commonly employed for internal components within a distributed system (such as nodes within a database or application server cluster), and for long-lived clients.
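As a rough sketch of how the two modes differ in configuration, the following uses Python’s standard `ssl` module. The function names and the commented-out file paths are hypothetical; a real deployment would load its own key pair and CA certificates at the marked lines.

```python
import ssl


def make_server_context(mutual: bool) -> ssl.SSLContext:
    """Build a server-side TLS context for one-sided or mutual authentication."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # The server always presents its own CA-signed certificate (hypothetical paths):
    # ctx.load_cert_chain("server.crt", "server.key")
    if mutual:
        # Mutual TLS: reject any client that cannot present a certificate
        # signed by a CA the server trusts.
        ctx.verify_mode = ssl.CERT_REQUIRED
        # ctx.load_verify_locations("client-ca.crt")
    else:
        # One-sided TLS: clients stay anonymous at the TLS layer and may
        # authenticate some other way (password, SSO) or not at all.
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def make_client_context() -> ssl.SSLContext:
    """Client side: always verify the server's certificate and hostname."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # PROTOCOL_TLS_CLIENT defaults to CERT_REQUIRED with hostname checking on.
    # ctx.load_verify_locations("server-ca.crt")
    # For mutual TLS the client would also load its own key pair:
    # ctx.load_cert_chain("client.crt", "client.key")
    return ctx


one_sided = make_server_context(mutual=False)
two_sided = make_server_context(mutual=True)
client = make_client_context()
```

In both modes the client verifies the server; the only difference is whether the server also demands a certificate from the client.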

Certificates, signing, trust, and authority

The core mechanism of PKI is the PKI certificate, also commonly known as a “security certificate”, “digital certificate” or “TLS certificate” (because it is used in TLS), or abbreviated “cert”. In TLS the X.509 certificate format is used.

A PKI certificate is a file containing the following:

  • A) A public key to be used for encryption/decryption and to validate cryptographic signatures.
  • B) Some metadata about:
    • the party that allegedly holds the corresponding private key and who therefore is the only one capable of decrypting messages encrypted with A, most importantly at least one name, such as a domain name registered in the Domain Name System (DNS). The certificate essentially functions as a badge or nametag, allowing the holder to claim to be the named party.
    • the party signing the certificate, and the certificate authority (if any) that signed its public certificate
    • the cryptographic signature created with the certificate authority’s private key
  • C) A list of actions the holder of the certificate’s private key can be trusted to perform, if the certificate is trusted.

On its own, such a digital certificate is of no more value than a paper certificate. Perhaps less value, as it can be neither scribbled upon nor burned. However, if signed by a trusted certificate authority, a digital certificate can then be used as an indicator that the entity with the corresponding private key is also to be trusted.

Signing here means generating a hash of data (such as a certificate) that can be verified with the public key, using a function that is difficult to spoof without the corresponding private key. So how does that help? This is the point where the boundaries between computing systems and social systems become very murky. By signing a certificate (and anyone with a private key can do this) a party is acting as a “certificate authority”; they are in effect asserting the validity of the identity claim being made by the certificate holder.

The premise of PKI is that if I present you with a certificate that says I am a certain person, have a certain organizational affiliation, have a certain security role or status defined in some externally defined system, etc., then you can believe that if you trust the certificate authority. So certificate authorities become responsible for maintaining accurate records about who is who, and communicate this to relying parties by adding their signature of a party’s public key to the public certificate. Furthermore, one certificate authority may certify that another party is authorized to issue certificates on its behalf, acting as a subordinate CA. So the entire system can be described as a hierarchy of authority and trust, cryptographically documented with the mechanism of certificate signing.
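To make certificate signing concrete, here is a minimal sketch of a CA issuing a certificate and a relying party checking it. It is a toy under stated assumptions: the `Certificate` class is a simplified stand-in for X.509, the names `issue` and `is_signed_by` are hypothetical, and the keys are textbook RSA built from public Mersenne primes (insecure, Python 3.8 or later).

```python
import hashlib
from dataclasses import dataclass, replace

E = 65537  # public exponent shared by the toy keys


def make_key_pair(p: int, q: int):
    """Return (public_key, private_key). INSECURE toy RSA."""
    n = p * q
    return (n, E), (n, pow(E, -1, (p - 1) * (q - 1)))


def digest(data: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


@dataclass(frozen=True)
class Certificate:
    subject: str        # who the certificate claims to identify
    issuer: str         # which CA vouches for that claim
    public_key: tuple   # the subject's public key
    usages: tuple       # what the holder may be trusted to do
    signature: int = 0  # the issuer's signature over the fields above

    def to_be_signed(self) -> bytes:
        fields = (self.subject, self.issuer, self.public_key, self.usages)
        return repr(fields).encode()


def issue(subject, public_key, usages, issuer, issuer_private_key) -> Certificate:
    """The CA asserts the identity claim by signing the certificate body."""
    n, d = issuer_private_key
    cert = Certificate(subject, issuer, public_key, usages)
    return replace(cert, signature=pow(digest(cert.to_be_signed(), n), d, n))


def is_signed_by(cert: Certificate, issuer_public_key: tuple) -> bool:
    """A relying party needs only the CA's public key to check the claim."""
    n, e = issuer_public_key
    return pow(cert.signature, e, n) == digest(cert.to_be_signed(), n)


ca_public, ca_private = make_key_pair(2**61 - 1, 2**89 - 1)
bank_public, _bank_private = make_key_pair(2**107 - 1, 2**127 - 1)

bank_cert = issue("bank.example", bank_public, ("server-auth",), "Example CA", ca_private)
forged = replace(bank_cert, subject="evil.example")  # attacker edits the name
```

`is_signed_by(bank_cert, ca_public)` holds, while the edited copy fails the same check because the signature no longer matches its contents.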

The Web PKI

Internally, organizations must maintain their own trust architectures, deciding which parties (such as individual persons) should have access to which network, computing, data, financial, and physical resources, and using certificates or other means to authenticate identity and establish encryption. This will be explored with an example in the case study below. On the public internet, certificate authority providers such as IdenTrust, DigiCert, and Let’s Encrypt play the role of root CAs (or “trust anchors”) for the entire system. What makes them trustworthy? In practice, just the fact that they are trusted by the parties that distribute hardware and software packages (such as operating system distributions and browsers) that come pre-loaded with trust stores.

This is how your web browser or mobile app knows that it’s actually talking to your bank, rather than an imposter. When you visit the bank’s website, it presents a public certificate signed by a CA. If your browser trusts the CA, then it considers the website to be authenticated.

Makers of browsers such as Mozilla and Google actively run CA evaluation programs to maintain a list of trusted (and distrusted) CAs, which lets their browsers determine whether or not to trust the certificate presented by a website when you visit. The certificate must belong to a chain of certificate authorities that ultimately roots in a trusted CA. For an application that must perform certificate authentication, such as a browser or a database, trusting a CA means having its public certificate included in a list or directory known as the ‘trust store’.

A trust store is a collection of public certificates for trusted CAs, that is, CAs whose signed certificates will be accepted for purposes of identity verification. When you use a hardware device or software package that comes loaded with a trust store, you are trusting the judgment of the company that selected the CAs to add to the package’s trust store.
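The chain-of-trust check can be sketched as a walk from the presented certificate up to a CA found in the trust store. This is a simplified illustration, not how any particular browser or database does it: the `Cert` class, the dictionary-based trust store, and the toy RSA keys (public Mersenne primes, Python 3.8 or later) are all assumptions, and real validation also checks validity dates, permitted usages, names, and revocation.

```python
import hashlib
from dataclasses import dataclass, replace

E = 65537  # public exponent shared by the toy keys


def make_key_pair(p: int, q: int):
    """Return (public_key, private_key). INSECURE toy RSA."""
    n = p * q
    return (n, E), (n, pow(E, -1, (p - 1) * (q - 1)))


def digest(data: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    public_key: tuple
    signature: int = 0

    def body(self) -> bytes:
        return repr((self.subject, self.issuer, self.public_key)).encode()


def issue(subject, public_key, issuer, issuer_private_key) -> Cert:
    n, d = issuer_private_key
    cert = Cert(subject, issuer, public_key)
    return replace(cert, signature=pow(digest(cert.body(), n), d, n))


def signed_by(cert: Cert, issuer_public_key: tuple) -> bool:
    n, e = issuer_public_key
    return pow(cert.signature, e, n) == digest(cert.body(), n)


def chain_is_trusted(chain: list, trust_store: dict) -> bool:
    """chain is ordered leaf first. Each cert must be signed by the next one,
    and the last cert must be signed by a CA present in the trust store."""
    if not chain:
        return False
    for cert, issuer_cert in zip(chain, chain[1:]):
        if cert.issuer != issuer_cert.subject:
            return False
        if not signed_by(cert, issuer_cert.public_key):
            return False
    anchor_key = trust_store.get(chain[-1].issuer)
    return anchor_key is not None and signed_by(chain[-1], anchor_key)


root_public, root_private = make_key_pair(2**31 - 1, 2**61 - 1)
sub_public, sub_private = make_key_pair(2**89 - 1, 2**107 - 1)
site_public, _site_private = make_key_pair(2**13 - 1, 2**17 - 1)

sub_cert = issue("Example Sub CA", sub_public, "Example Root CA", root_private)
site_cert = issue("bank.example", site_public, "Example Sub CA", sub_private)

trust_store = {"Example Root CA": root_public}  # what the vendor shipped
```

With this trust store, the chain `[site_cert, sub_cert]` validates; with an empty trust store, the same chain is rejected even though every signature in it is internally consistent.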

It is up to each hardware or software vendor to decide which root CAs to include in trust stores they distribute, and ultimately the end user decides which vendors to trust. CAs must comply with formalized industry standard baseline requirements to maintain good standing with vendors.

Revoking trust

Over time, employment and business relationships change, computing systems are deployed and destroyed, and even the most carefully-hidden passwords and private keys may be accidentally leaked or intentionally stolen. For all of these reasons, in order to maintain its trustworthiness, a certificate authority (CA) must be able to revoke guarantees it has issued in the form of signed certificates when those guarantees no longer hold.

The main solutions to this problem are:

  • Issuing only short-lived certificates, which are effectively revoked by discontinuing their rotation.
  • Certificate Revocation Lists (CRLs) allow relying parties to check if a subscriber’s certificate has been revoked.
  • The Online Certificate Status Protocol (OCSP) allows relying parties to check the status of certificates in real time.

Each strategy has pros and cons in terms of security and operational overhead.
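The first two strategies can be sketched as a single acceptance check on the relying party’s side. The `IssuedCert` class, the 24-hour lifetime, and the set of revoked serial numbers are hypothetical simplifications; a real CRL is a signed document fetched from the CA, and OCSP would replace the local lookup with a query to the CA’s responder.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IssuedCert:
    serial: int
    not_before: datetime
    not_after: datetime


def issue_short_lived(serial: int, now: datetime,
                      lifetime: timedelta = timedelta(hours=24)) -> IssuedCert:
    """Short-lived certs expire on their own unless they are re-issued."""
    return IssuedCert(serial, now, now + lifetime)


def is_acceptable(cert: IssuedCert, now: datetime, revoked_serials: set) -> bool:
    """Reject certificates that are revoked or outside their validity window."""
    if cert.serial in revoked_serials:
        return False
    return cert.not_before <= now <= cert.not_after


issued_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
cert = issue_short_lived(serial=1001, now=issued_at)
crl = {2002}  # serial numbers the CA has revoked (hypothetical)
```

An hour after issuance the certificate is accepted; two days later it is rejected without any action by the CA, and adding serial `1001` to the revocation set rejects it immediately.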

Certificates and TLS are powerful tools, provided that the trusted certificate authorities (and the infrastructure they exist in) are carefully protected according to the principle of least privilege, that certificates are issued only for the systems that should have them, and that the private keys involved are protected against exfiltration from the machines using them.

Case Study: TLS/PKI in CockroachDB

CockroachDB is a distributed SQL database, meaning it runs in parallel on a cluster of multiple physical (or virtual) computers (referred to as nodes), all of which must constantly exchange messages across a network to maintain data consistency and balance workloads. This is a nice example to illustrate TLS and PKI concepts because there are two kinds of communication: within-cluster messaging between nodes of the database as they synchronize their state and balance workloads, and requests to the cluster from outside clients, e.g. making SQL requests. For a more hands-on discussion, check out [this article] in which I walk through implementing an example database PKI using HashiCorp Vault.

In a CockroachDB cluster, each node must be able to initiate HTTP requests to any of the others. In a normal, secure operating mode, these requests must be TLS encrypted with mutual authentication. Therefore, each node must have a private key/public certificate pair, with the public certificates having all been signed by the same certificate authority (CA); we’ll refer to this as the node CA. A node’s trust store is the set of CA public certificates contained in the directory specified by the --certs-dir argument when the node is started using cockroach start. For each CA public certificate in the trust store, the node will accept all valid certificates signed by the CA or any CA subordinate to it.

The cluster must also receive traffic from database clients, such as applications that need to access the data, and database admins who need to perform various functions. The client must authenticate the server with a certificate signed by a trusted CA, and the server must authenticate the client, either with its own certificate, or with another method, such as username/password, or with a Single Sign-On (SSO) token via Okta or another identity provider. App servers typically use certificates, whereas human users (admins) should use SSO when possible for the additional security benefits. If the client is to use mutual authentication, the client (e.g. an app server) must have a private key/public certificate pair, where the public certificate is signed by a CA trusted by the nodes, i.e. the CA’s public certificate must be in each node’s trust store.
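As an illustration of the client side of mutual authentication, the sketch below builds a PostgreSQL-style connection URL that points a client at its CA certificate and key pair. The helper name, host, database, and user are hypothetical; the `sslmode`, `sslrootcert`, `sslcert`, and `sslkey` parameters are the standard PostgreSQL client options, and the file names assume the layout that the `cockroach cert` commands produce in a certs directory.

```python
from urllib.parse import urlencode


def cockroach_client_url(user: str, host: str, database: str,
                         certs_dir: str, port: int = 26257) -> str:
    """Build a connection URL for certificate-based (mutual TLS) login."""
    params = {
        # verify-full: check the server cert's signature chain AND its hostname
        "sslmode": "verify-full",
        # CA certificate the client uses to authenticate the server
        "sslrootcert": f"{certs_dir}/ca.crt",
        # The client's own certificate and private key, presented to the nodes
        "sslcert": f"{certs_dir}/client.{user}.crt",
        "sslkey": f"{certs_dir}/client.{user}.key",
    }
    query = urlencode(params, safe="/")  # keep path separators readable
    return f"postgresql://{user}@{host}:{port}/{database}?{query}"


url = cockroach_client_url("app", "db.example.internal", "bank", "certs")
```

A client that authenticates with a password or SSO instead would keep `sslmode` and `sslrootcert` (it still needs to authenticate the server) and drop the `sslcert` and `sslkey` parameters.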

Note that this CA does not need to be the same as the node CA. Both node-node and client-cluster traffic use TLS, but can, and in most cases should, depend on independent PKI hierarchies, i.e. they should use key pairs signed by different CAs. To understand why, consider the different contexts:

First consider traffic between nodes within the database cluster. The nodes only ever need to receive traffic from other nodes, so in fact there is no need for internode traffic to be rooted in a PKI hierarchy that extends beyond the cluster. One way to implement this is to generate the certificates on a CA you create yourself within the network of the database cluster, on a dedicated CA jumpbox; while this is not appropriate in every situation, it is maximally secure in the sense that no certs signed by the node CA ever need to leave the infrastructure where they will be used. (Infrastructure here could refer to hardware servers, but more often to Google Cloud Projects or Amazon Virtual Private Clouds, potentially spanning geographical regions of physical infrastructure.) A certificate issued by the node CA that identifies the holder as a node will allow any CockroachDB node to join the cluster, meaning that if a bad actor obtained such a certificate they could add a new node to the cluster, potentially opening up a back door to a variety of exploits. Therefore, it makes sense to limit the scope of the CA to the minimum: just provisioning node certs for the cluster.

In contrast, a client certificate allows a database client to execute SQL statements as a given database user, specified on the certificate. Clients used by different applications may vary in level of trust and corresponding level of access to data, and may operate across the internet. In this case, certificates may best be signed and provisioned by subordinate CAs within local organizations and/or infrastructure, with the subordinate CAs sharing a common signing CA.

Operators who deploy distributed systems (such as CockroachDB databases) must often provision and manage certificates on each node, implementing their own PKI security. This entails ensuring that credentials are carefully controlled, monitoring for signs of compromise, and mitigating the impact of potential credential leaks. Authorization for issuing credentials is particularly critical, and this includes issuing private key/public certificate pairs for servers and clients. Unmitigated compromise of either of these can have devastating business impact.