Securely Managing Credentials in the Amazon Cloud

Posted by David Panofsky on Fri 22 April 2016

The Problem

Any time a security credential (e.g. a database password) is needed to identify one computer system to another, there is the problem of how to securely store that credential on the client side of the connection. On the server side the password is usually stored as a one-way hash, but on the client side it needs to be accessible in plaintext for the authentication mechanism to function. Even if we assume that the client system is secure and decide we can store the credential in plaintext on disk, there is still the question of how to populate that disk in the first place.

In the past, when dealing with a relatively small number of computer systems, SysOps could type the credential in at install time and not worry too much about it, but in today's world, where most DevOps engineers are responsible for hundreds, thousands, or more machines, this becomes entirely infeasible. Even in small environments, this approach requires that engineers know the credentials, which raises all sorts of security concerns and presents a huge problem as people join or leave the organization.

Flawed Workarounds

Do nothing and trust your boundaries (enterprise GitHub, VPC, host-based firewalls, etc.)

The most simplistic approach is to ignore the problem and rely upon the assumption that your systems, firewalls, developers, and code repository are inherently secure and isolated enough to mitigate problems. I call this the "See no evil, hear no evil" approach, though others may more affectionately call it the "YOLO" strategy. Passwords are stored in plaintext on servers and in the source code. Some services even forego authentication altogether (Memcached, anyone?), relying on the assumption that firewalls and NACLs will ensure that only trusted services will ever be able to connect to them in the first place. This approach works amazingly well...until it doesn't, at which point it comes back to haunt you. The nightmare scenario for organizations employing this scheme is a disgruntled former employee with a copy of the code on their personal laptop.

Bake a key into an AMI and use encrypted userdata

An obvious improvement is to encrypt all sensitive information and distribute the ciphertext instead of the plaintext. Without access to the decryption key, the encrypted data is useless. The difficulty becomes where to store the master key used to decrypt this data. One potential solution is to bake the decryption key into all systems provisioned in your environment. That way, ciphertext can be distributed to the systems via any number of techniques and the systems handle their own decryption of the data. Trouble comes when you need to change the master key: in addition to changing and re-encrypting all your credentials, you need to rebuild your servers or somehow inject the new master key into all of them. It also becomes very difficult to protect that master key. A breach of any system on the network means that the master key is suspect, and any time somebody leaves the organization, you must assume they might have saved a copy of that key.

Deal with it in Configuration Management (Ansible Vault, Chef Encrypted Data Bags, Puppet Hiera-Eyaml, etc.)

Another common approach is to use the secrets management tools built into existing configuration management systems. Essentially, the credentials are encrypted and stored within the configuration management system; the ciphertext is decrypted while a service is being configured and injected into configuration files on the servers. On the plus side, the number of places where the master key is needed is greatly reduced, so when that master key needs to be changed, there are only a few places where the change must occur. Also, these tools are built into the configuration management system, which is likely being used to manage everything in the environment, not just secrets. The downside of this setup is that the master key still needs to be entered somewhere to gain access to the credentials. In an environment where development and local environments are built using configuration management, all of your developers potentially need access to the master key. Any environment using autoscaling infrastructure will either need to bake plaintext credentials into server images or have a central server with the master key inject credentials into servers at launch. The master key is needed in fewer places, but it is not gone.

Enterprise Approach

In an enterprise datacenter, this problem is often addressed using a delegated trust system such as Kerberos. The specifics of how these systems function are beyond the scope of this article, but a simplified description follows. These systems rely on a centralized authentication service. Nodes running in the environment are registered into the central system at provisioning time. Subsequent node-to-node communication is authenticated by delegating the trust back to the original registration of the nodes themselves. Users of these systems are also registered into the central system, which means that, for example, when user X needs to access resource A, which in turn needs to connect to resource B, all of this communication can be authenticated via delegation back to the original registration with the central system. The drawbacks of such a solution include the complexity of initial configuration and the reliance upon the central service: if the central service is unavailable, no client-server or server-server communication can be initiated. For larger organizations, this system can make sense, but for smaller development teams, the maintenance overhead and server provisioning bottlenecks can be prohibitive.

Proposed Solution - AWS KMS

According to the product details page, "AWS Key Management Service gives you centralized control over the encryption keys used to protect your data. You can create, rotate, disable, delete, define usage policies for, and audit the use of encryption keys used to encrypt your data." It is similar to (and based upon) hardware security modules in that master keys are never exposed outside of the service itself. Instead, data to be encrypted or decrypted is passed over SSL to the service for processing. The KMS API allows one to encrypt, decrypt, and re-encrypt data using either a master key or auto-generated single-use data keys. Using data keys enhances the security of the system because each key protects only a single piece of data, limiting the scope of any brute-force attack.
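
As a concrete sketch using boto3 (the AWS SDK for Python), encrypting and decrypting a small secret is a pair of API calls; the key alias below is a hypothetical placeholder:

```python
import boto3

kms = boto3.client("kms", region_name="us-east-1")

# Encrypt a small secret (up to 4 KB) under a master key.
# "alias/app-secrets" is a hypothetical key alias.
ciphertext = kms.encrypt(
    KeyId="alias/app-secrets",
    Plaintext=b"s3cr3t-db-password",
)["CiphertextBlob"]

# Decrypt it later; the master key itself never leaves KMS.
plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]

# For larger payloads, generate a single-use data key, encrypt the
# data locally, and persist only the encrypted copy of the data key.
data_key = kms.generate_data_key(KeyId="alias/app-secrets",
                                 KeySpec="AES_256")
local_key = data_key["Plaintext"]        # use for local encryption, then discard
stored_key = data_key["CiphertextBlob"]  # safe to persist alongside the data
```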

Encryption Context

As explained in the AWS Security Blog post How to Protect the Integrity of Your Encrypted Data by Using AWS Key Management Service and EncryptionContext, it is important to ensure that data encrypted with KMS includes the context in which it will be used. The EncryptionContext is a set of key-value pairs; for example, a setting which represents the production django service's database password might be sent to KMS with an EncryptionContext naming PROD_DJANGO_DB_PASSWORD. The same EncryptionContext must be passed to KMS at decryption time, and it will be logged - in plaintext - in AWS' CloudTrail. Using an EncryptionContext helps ensure that data cannot be tampered with or maliciously repurposed, since decryption requires the correct context. Imagine a situation where somebody modified the codebase to assign the PROD_DJANGO_DB_PASSWORD encrypted blob to the LEET_HAXOR_EMAIL_BODY variable and wrote a routine to send off an email the next time secrets are parsed. Without the EncryptionContext, KMS would happily decrypt the blob, and the server would assign and use this variable. With a properly configured EncryptionContext, KMS would try to decrypt the variable and fail because 'PROD_DJANGO_DB_PASSWORD' != 'LEET_HAXOR_EMAIL_BODY'. Additionally, the LEET_HAXOR_EMAIL_BODY string would end up in CloudTrail, which should help raise the alarm.
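
A minimal sketch of that scenario with boto3, assuming the same hypothetical key alias and a context key named "setting":

```python
import boto3
from botocore.exceptions import ClientError

kms = boto3.client("kms")
context = {"setting": "PROD_DJANGO_DB_PASSWORD"}

# Bind the ciphertext to its intended use at encryption time.
blob = kms.encrypt(
    KeyId="alias/app-secrets",  # hypothetical alias
    Plaintext=b"s3cr3t",
    EncryptionContext=context,
)["CiphertextBlob"]

# The exact same context must be supplied at decryption time...
kms.decrypt(CiphertextBlob=blob, EncryptionContext=context)

# ...and a mismatched context fails, leaving the bogus context
# string behind in CloudTrail for auditors to find.
try:
    kms.decrypt(
        CiphertextBlob=blob,
        EncryptionContext={"setting": "LEET_HAXOR_EMAIL_BODY"},
    )
except ClientError as err:
    print("decryption refused: %s" % err)
```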

Do we trust AWS?

Whenever we consider using a service like KMS, we must decide whether we trust the correctness of its implementation. If we were to use an open-source solution, we could validate the architecture ourselves, assuming we have the in-house expertise and time to do so. With KMS, we don't even have the luxury of looking under the hood, which may leave some engineers uncomfortable. I, however, am happy to delegate the responsibility of secure and reliable architectural design to Amazon. They can and do hire experts in the field to create services such as KMS, and the service has passed several third-party certification procedures.

But what about the master key?

As mentioned above, a master key never leaves the KMS service, so how can we delegate encryption/decryption rights to users or systems? As with other AWS services, permissions in KMS are defined in Amazon's Identity and Access Management (IAM) service. Permissions are fine-grained, so it is possible to grant permission to decrypt using a specific key to a specific user or group. The real magic of KMS comes when we grant decrypt permissions to an IAM instance profile rather than to users or groups. This delegates the privilege to a group of hosts in EC2, and all access is made using automatically generated temporary access keys. No key ever needs to be baked or loaded into a server's configuration; AWS' SDKs and client utilities handle the details for you. Essentially, access to the master key in this setup is delegated based on a server's membership in a particular instance role.
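
As an illustrative sketch, the policy attached to an instance role might grant nothing but kms:Decrypt on a single key; the account number, key ARN, and role name here are placeholders:

```python
import json

import boto3

iam = boto3.client("iam")

# Allow nothing but Decrypt with one specific key.
decrypt_only = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["kms:Decrypt"],
        "Resource": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    }],
}

iam.put_role_policy(
    RoleName="prod-django-instance-role",
    PolicyName="kms-decrypt-only",
    PolicyDocument=json.dumps(decrypt_only),
)
```

Any EC2 instance launched with the matching instance profile picks up temporary credentials automatically, and boto3 finds them without any configuration on the host.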

Role Separation

This setup allows for easy separation of roles. For example, developers can be granted the ability to encrypt and decrypt data with a particular key, while servers in one instance profile are granted access to a key for decryption only. Release engineers could be given the ability to encrypt data using the "production" key, but not to decrypt with that same key. It's a compliance officer's dream come true.
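
The complementary grant for release engineers might look like the following sketch (again, all names are placeholders): encrypt with the production key, but never decrypt.

```python
import json

import boto3

iam = boto3.client("iam")

# Release engineers may create new production ciphertexts but can
# never read existing ones back.
encrypt_only = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["kms:Encrypt"],
        "Resource": "arn:aws:kms:us-east-1:111122223333:key/PROD-KEY-ID",
    }],
}

iam.put_group_policy(
    GroupName="release-engineers",
    PolicyName="kms-encrypt-only",
    PolicyDocument=json.dumps(encrypt_only),
)
```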

Key Rotation

KMS can rotate master keys automatically every year (once rotation is enabled for a key), or you can rotate them manually on demand. Previous master keys are retained in case old data needs to be decrypted, but any data encrypted going forward will use the new keys. There is even a re-encrypt function which takes a ciphertext encrypted with an old key and returns the data re-encrypted with the latest key.
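
A sketch of the re-encrypt call with boto3; the stored blob location, encryption context, and destination alias are all placeholders:

```python
import boto3

kms = boto3.client("kms")

# Load a previously stored ciphertext (location is illustrative).
with open("PROD_DJANGO_DB_PASSWORD.blob", "rb") as f:
    old_blob = f.read()

# Re-encrypt server-side under the new master key; the plaintext
# never leaves KMS.
context = {"setting": "PROD_DJANGO_DB_PASSWORD"}
new_blob = kms.re_encrypt(
    CiphertextBlob=old_blob,
    SourceEncryptionContext=context,
    DestinationEncryptionContext=context,
    DestinationKeyId="alias/app-secrets-2016",
)["CiphertextBlob"]
```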

Usage Logging

Amazon KMS is fully integrated with CloudTrail, AWS' API logging service. All calls to KMS, including the CreateKey, Encrypt, and Decrypt actions, generate entries in the CloudTrail log files. These log entries include the IAM user identity as well as the plaintext EncryptionContext used in any Encrypt or Decrypt action. This extensive logging allows auditing of both legitimate and illegitimate usage of the service. Alarms could be configured to fire whenever a KMS decrypt action fails, with details about what the user was trying to do when the failure occurred. If the system using KMS was designed to lazily decrypt ciphertext on first use, these audit logs could even be used to determine which secrets are unused and can be safely removed.
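
As a rough sketch, recent Decrypt calls can be pulled from CloudTrail's lookup API and scanned for failures or unexpected contexts; the field access below assumes CloudTrail's standard record format:

```python
import json

import boto3

cloudtrail = boto3.client("cloudtrail")

# Fetch recent KMS Decrypt calls from CloudTrail.
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "Decrypt"},
    ],
    MaxResults=50,
)["Events"]

for event in events:
    record = json.loads(event["CloudTrailEvent"])
    params = record.get("requestParameters") or {}
    print(event.get("Username"),
          params.get("encryptionContext"),
          record.get("errorCode"))  # errorCode is set when the call failed
```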

Credential Rotation (treat users as ephemeral)

Changing passwords becomes fairly easy in this solution. First, a new password is generated and encrypted using the correct key and EncryptionContext. Next, the old ciphertext is replaced by the new ciphertext anywhere it is used. Finally, the password is updated on the backing service (e.g. the database) and dependent services are reloaded. The new ciphertext is decrypted as needed and new connections are established using the new password. The biggest headache is coordinating this change across a distributed architecture to ensure the new password is not used before it is valid and the old password is not used after. For organizations which can schedule maintenance windows or tolerate short partial outages, updating passwords this way can be relatively painless.

I'd actually propose a slightly different approach where both the user and password are treated as ephemeral. When it's time to change credentials, the procedure begins by creating a new user-password pair in the backing service. The new user is granted all the same permissions as the existing user. Next, the new password is encrypted, and both the new plaintext username and the ciphertext password are distributed to dependent services. At some future time, when the organization is certain that all dependent services are using the new credentials, the old user can be deleted from the backing service. This technique allows for a zero-downtime rolling update of the credentials in use without introducing much complexity to the procedure.
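
A sketch of that rolling rotation, with the database and distribution steps stubbed out as hypothetical helpers (db and distribute are not real APIs):

```python
import base64
import os

import boto3

kms = boto3.client("kms")

def rotate_db_credentials(db, distribute):
    """Zero-downtime rotation sketch. `db` and `distribute` are
    hypothetical helpers for your database and config pipeline."""
    # 1. Create a brand-new user with a random password and copy
    #    the old user's permissions onto it.
    new_user = "django_20160422"  # e.g. service name + date suffix
    new_password = base64.b64encode(os.urandom(24)).decode()
    db.create_user(new_user, new_password)
    db.copy_grants(src="django_20150101", dst=new_user)

    # 2. Encrypt the new password and push the plaintext username
    #    plus ciphertext password out to dependent services.
    blob = kms.encrypt(
        KeyId="alias/app-secrets",  # placeholder alias
        Plaintext=new_password.encode(),
        EncryptionContext={"setting": "PROD_DJANGO_DB_PASSWORD"},
    )["CiphertextBlob"]
    distribute(user=new_user, password_blob=blob)

    # 3. Later, once every service has rolled over and been
    #    verified, drop the old user.
    db.drop_user("django_20150101")
```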

Encrypted Data Storage Location

The final decision to be made when using a service like KMS is where to store the encrypted secrets themselves. Now that we don't need to bake any cryptographic keys or encrypted user data into our systems, we have the freedom to choose the most appropriate data store for our use case.

An existing configuration management system

An obvious choice for storing ciphertext blobs is a preexisting configuration management system such as Ansible, Chef, or Puppet. Using KMS to decrypt the ciphertext on the server gets us around the limitations of these systems' built-in encryption mechanisms mentioned above. Similarly, encrypted data could be stored in configuration systems such as Apache ZooKeeper, Consul, or etcd. The decision to use one of these solutions would revolve mostly around whether they are already being used by the team.

S3

Another reasonable place to store ciphertext is in an Amazon S3 bucket. Object keys would specify the setting name and the object content would be the ciphertext. Sneaker is a tool written in Go which implements this functionality. S3's permissions model allows one to lock down access to the bucket to further secure things. The main downside of S3 is that any tool which manages these variables might need to walk the entire contents of the bucket to list them; S3 does not have a notion of an index. For small installations this would not be an issue, but eventually S3 might prove too slow for the task.
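
A minimal sketch of the S3 pattern with boto3; the bucket name, key alias, and context key are placeholders:

```python
import boto3

s3 = boto3.client("s3")
kms = boto3.client("kms")

BUCKET = "example-app-secrets"  # placeholder bucket name

def put_secret(name, plaintext):
    # Encrypt under the (placeholder) master key, bound to the
    # setting name, then store the blob at s3://<bucket>/<name>.
    blob = kms.encrypt(
        KeyId="alias/app-secrets",
        Plaintext=plaintext.encode(),
        EncryptionContext={"setting": name},
    )["CiphertextBlob"]
    s3.put_object(Bucket=BUCKET, Key=name, Body=blob)

def get_secret(name):
    blob = s3.get_object(Bucket=BUCKET, Key=name)["Body"].read()
    return kms.decrypt(
        CiphertextBlob=blob,
        EncryptionContext={"setting": name},
    )["Plaintext"].decode()
```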

DynamoDB

To get around this limitation of S3, one could implement a DynamoDB-backed solution. This is the approach taken by Credstash, a Python implementation of a KMS-based credential storage system. DynamoDB has the same sort of permissions functionality as S3 but would not suffer the scalability issue that S3 might. DynamoDB does, however, have a different drawback: throughput must be provisioned in advance and comes in units of reads and writes per second. I worry that at server startup time, an application may need to load many settings from DynamoDB, which would quickly saturate the provisioned throughput. We would prefer a system optimized for bursty rather than consistent load.
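
Rolling a simplified version by hand takes only a few lines; the table name and schema below are assumptions, and Credstash's real schema is richer (it adds versioning and HMAC integrity checks):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
kms = boto3.client("kms")

# Assumed table layout: partition key "name" (string) and a
# binary attribute "ciphertext" holding the KMS-encrypted blob.
table = dynamodb.Table("app-secrets")

def get_secret(name):
    item = table.get_item(Key={"name": name})["Item"]
    return kms.decrypt(
        CiphertextBlob=item["ciphertext"].value,
        EncryptionContext={"setting": name},
    )["Plaintext"].decode()
```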

In the code (GitHub)

By far, the simplest place to store credentials is in the codebase. This is also one of the worst things you can do...usually. As discussed above, storing plaintext credentials even in a private repository is a terrible idea: one thing goes wrong and all of your passwords are out on the internet. Things change when we're talking about encrypted credentials. I would argue that from a security point of view, storing ciphertext in the codebase is an acceptable practice (as long as the keys are only usable by the appropriate set of people and instances). It certainly means you don't need to maintain any external system. One drawback (besides the snickering you'll get from your friends) is that you'll need to use some sort of settings overloading to change credentials between different environments.
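
A hedged sketch of what that overloading might look like in a Django-style settings module; the blobs, key alias, and environment variable are all placeholders:

```python
import base64
import os

import boto3

kms = boto3.client("kms")

# Base64-encoded ciphertext blobs committed to the repo, one per
# environment. These values are placeholders, not real ciphertexts.
_DB_PASSWORD_BLOBS = {
    "PROD": "AQICAHh...",
    "STAGING": "AQICAHj...",
}

ENV = os.environ.get("APP_ENV", "STAGING")

DB_PASSWORD = kms.decrypt(
    CiphertextBlob=base64.b64decode(_DB_PASSWORD_BLOBS[ENV]),
    EncryptionContext={"setting": "%s_DJANGO_DB_PASSWORD" % ENV},
)["Plaintext"].decode()
```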

DNS

I'm saving my favorite, and possibly most controversial, credential store for last: DNS TXT records. Encrypted records would exist alongside standard DNS CNAME and A records as well as plaintext DNS TXT records which specify other dynamic settings in the environment. Keep in mind that the DNS system is essentially a highly available, distributed database heavily optimized for read performance. It comes with integrated caching with a configurable per-record TTL, and lookup mechanisms are built into every modern operating system. By design, DNS domains cannot be iteratively crawled (one needs to know a record's name to query it), and CNAME records can be used to point to the current record while expiring records are kept around for an interim period. This meshes nicely with the credential rotation scheme I outlined above.

As long as the DNS domain is kept relatively private, this solution has several other advantages. First of all, the configuration information which specifies a backing service's location and its credentials is saved in the same place. For example, my-db.priv could be an A record pointing to the IP address of a database, and my-db-user.priv / my-db-pass.priv could be corresponding plaintext and encrypted DNS TXT records which, taken together, allow access to that database (assuming the appropriate KMS decrypt permissions are granted). Different environments can be associated with different DNS domains, which means that a given backing resource can be looked up by the same name in every environment, letting DNS handle the difference between development and production. Conveniently, AWS' DNS service, Route53, supports private domains which can be associated with particular VPCs. Using this setup would help ensure that systems don't connect to inappropriate backing services, whether intentionally or accidentally configured to do so.
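
A sketch of the lookup side using dnspython (an assumed dependency) against the hypothetical records above:

```python
import base64

import boto3
import dns.resolver  # dnspython, an assumed dependency

kms = boto3.client("kms")

def txt(name):
    # Return the concatenated strings of the first TXT answer.
    answer = dns.resolver.query(name, "TXT")[0]
    return b"".join(answer.strings).decode()

db_host = "my-db.priv"            # A record: the database address
db_user = txt("my-db-user.priv")  # plaintext TXT record
db_pass = kms.decrypt(            # encrypted TXT record
    CiphertextBlob=base64.b64decode(txt("my-db-pass.priv")),
    EncryptionContext={"setting": "MY_DB_PASS"},  # assumed context
)["Plaintext"].decode()
```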

Alternatives

There are a number of centralized alternatives to KMS for managing credentials, such as HashiCorp's Vault, Chef-Vault, Consul, and Apache ZooKeeper. The main drawback of these systems is that they need to be properly configured, secured, and maintained, and if they are unavailable for some reason, your entire infrastructure is impacted. For highly dynamic or complex infrastructures, one of these systems will likely be needed. That said, I don't think the management overhead of running one of these systems is worth it just for managing credentials. One of the main reasons we rely on AWS is so we can focus on building business solutions rather than becoming specialists at operating infrastructure.

Thank you

At SinglePlatform, we are still in the discovery phase of managing credentials using KMS. I would greatly appreciate any feedback on the solution outlined above as well as any stories from teams who have tried to do something similar.
