A lightweight auditable config system
Most complex, flexible software relies on configs to customize its behavior. In this article we look at the different types of config and when to use each. We also touch on a lightweight, auditable runtime configuration system that has been in production for several years, configuring millions of web requests per day.
The simplest and most predictable configs are the ones that are statically baked into the code. These are usually easy to unit test because they don’t change after the software is deployed.
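As an illustration, here is a minimal sketch of a static config in Python. The setting names (`max_retries`, `request_timeout_s`) and the backoff helper are hypothetical and only there to show the idea: the values live in the code, so a unit test can pin the resulting behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticConfig:
    """Values baked into the code; they change only with a new commit."""
    max_retries: int = 3            # hypothetical setting
    request_timeout_s: float = 2.5  # hypothetical setting


CONFIG = StaticConfig()


def retry_delays(config: StaticConfig) -> list[float]:
    """Exponential backoff delays (in seconds) derived from the static config."""
    return [0.1 * 2 ** attempt for attempt in range(config.max_retries)]
```

Because the dataclass is frozen, nothing can mutate `CONFIG` after import, which is what makes the behavior easy to assert in a test.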
Static configs provide great predictability. For example, in the case of an incident, one can deploy the last known good commit with the guarantee that the last known good configs will be used as well. This reduces the mean time to resolution (MTTR).
Whenever possible, we recommend this type of config, but it has its limitations.
Deployment configs work best for variables that are tightly coupled to the concept of a deployment. For example:
- Upstream URLs, which might differ depending on which stage the application is in (e.g. staging, production, etc.)
- Database credentials which might be different from region to region
- Sidecar information about how to use a decoupled adjacent service, for example the metrics agent
The Twelve-Factor App recommends using environment variables for everything that is likely to vary between deployments.
Deployment configs provide great reliability. For example, one can point the software to a different database that is used for testing purposes without having to rebuild the software. Testing the service against a sample load can increase the mean time between failures (MTBF).
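A minimal sketch of reading deployment configs from environment variables might look like the following. The variable names (`UPSTREAM_URL`, `DB_HOST`, `METRICS_AGENT_ADDR`) and the default agent address are assumptions made up for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DeploymentConfig:
    upstream_url: str
    db_host: str
    metrics_agent_addr: str


def load_deployment_config(env: Mapping[str, str] = os.environ) -> DeploymentConfig:
    """Read deployment configs once at startup and fail fast if one is missing."""
    missing = [name for name in ("UPSTREAM_URL", "DB_HOST") if name not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return DeploymentConfig(
        upstream_url=env["UPSTREAM_URL"],
        db_host=env["DB_HOST"],
        # The sidecar address is optional here and defaults to a local agent.
        metrics_agent_addr=env.get("METRICS_AGENT_ADDR", "localhost:8125"),
    )
```

Passing the environment in as a mapping keeps the loader testable: a test can hand it a plain dict instead of touching the real process environment.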
The business logic may need to change on the fly without having to rebuild and redeploy the software. This can include:
- Feature flags: to enable/disable or adjust a certain feature of the software
- Settings: required configs for the internal components of the software
There are many ways to configure the software at runtime. For example:
- Some services rely on a specially formatted header that is injected into every request at the edge layer
- Others rely on purpose-built, off-the-shelf software like Unleash, or SaaS solutions like Configit or Hosted Unleash
- Some companies build their own solution
Runtime configs provide great flexibility at the expense of predictability and reliability. It is best to minimize their usage and resort to static or deployment configs whenever possible.
For example, if a feature flag is set for all requests, it’s better to convert it to a static config. Some tools make it easy to spot such redundant configs, while others require manually digging through the code.
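To make the idea of a redundant flag concrete, here is a rough sketch. It assumes a hypothetical flag format in which every flag carries a `rollout_percent` between 0 and 100; this is not the format of any particular tool. A flag at 0 or 100 behaves the same for every request and is a candidate for becoming a static config.

```python
import hashlib


def is_enabled(flags: dict, name: str, request_id: str) -> bool:
    """Return True if the flag is on for this request.

    Each flag has a hypothetical "rollout_percent" between 0 and 100.
    Unknown flags are treated as disabled.
    """
    percent = flags.get(name, {}).get("rollout_percent", 0)
    # Hash the flag name and request id into a stable bucket in [0, 100).
    digest = hashlib.sha256(f"{name}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def redundant_flags(flags: dict) -> list[str]:
    """Flags that are on or off for every request and could become static."""
    return sorted(
        name for name, flag in flags.items()
        if flag.get("rollout_percent", 0) in (0, 100)
    )
```

Running something like `redundant_flags` as part of a periodic report is one way to keep the number of runtime configs from growing unnoticed.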
A lightweight implementation
Although configurability can increase software complexity, the config system is not a very complex piece of software.
Let’s say we have a few microservices all relying on a common config. They need to fetch the config at runtime with the following requirements:
- Reliability: since the config is external to the software, we need elaborate verification and constraints in place to make sure that a change does not break the services in production.
- Availability: the config should be available across regions when needed. New instances may fetch it while the old ones may poll it for any changes.
- Auditability: it should be easy to see who changed what config and when
JSON is a common format that works across architectures. It is safe to assume that our microservices will be able to fetch and parse configs in this format.
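The verification mentioned under reliability can start out very simple. The sketch below assumes a made-up config shape (a `timeout_ms` integer and a `features` object of booleans) purely for illustration; real constraints depend on what the consumers expect.

```python
import json


def validate_config(raw: str) -> list[str]:
    """Return a list of problems found in a JSON config; empty means valid."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    errors = []
    # Hypothetical constraint: a timeout in milliseconds within a sane range.
    timeout = config.get("timeout_ms")
    if not isinstance(timeout, int) or isinstance(timeout, bool):
        errors.append("timeout_ms must be an integer")
    elif not 1 <= timeout <= 30_000:
        errors.append("timeout_ms must be between 1 and 30000")
    # Hypothetical constraint: feature flags must be booleans.
    features = config.get("features", {})
    if not isinstance(features, dict):
        errors.append("features must be an object")
    else:
        for name, value in features.items():
            if not isinstance(value, bool):
                errors.append(f"features.{name} must be true or false")
    return errors
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything that is wrong with a change in a single test run.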
Git is the most common version control system, which is familiar to most developers who build services and set those configs. If the configs are a bunch of JSON files, it is easy to keep track of who changed what and when. The config can be put in a git repo and all changes can come in the form of a PR.
AWS is one of the most popular cloud platforms, and Simple Storage Service (S3) is one of its oldest and most reliable services. It is reachable from every region and can be configured to serve JSON files as a simple static website. ETags are supported out of the box, which allows the clients (those microservices) to save resources by fetching and parsing the config only when it has actually changed. With WAF one can control which services have access to the configs.
At a high level the architecture looks like this:
The workflow for changing a config is:
- Clone or update the repo on a local machine
- Change the necessary configs and run the verification test suite
- Preferably test an instance of the consumers (microservices) against the local config and verify the behavior
- Make a PR and get it merged
- Upon merge to master, the config is pushed to S3 where it is readily available for microservices to fetch
- Microservices that already have the config check the ETag of their last stored config against what’s available on S3 and fetch it if the ETags don’t match
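The last step of the workflow can be sketched with a conditional GET. The code below is an illustration under assumptions, not the production client: the `ConfigPoller` class and its injectable `fetch` function are hypothetical names, and only Python's standard `urllib` is used. The client sends its stored ETag in an `If-None-Match` header and re-parses the config only when the server returns a new body.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple

# (status, etag, body) as returned by a conditional GET.
Response = Tuple[int, Optional[str], bytes]


def http_fetch(url: str, etag: Optional[str]) -> Response:
    """Conditional GET: send If-None-Match so an unchanged config returns 304."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status, response.headers.get("ETag"), response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 Not Modified as an HTTPError
            return 304, etag, b""
        raise


class ConfigPoller:
    """Keeps the last parsed config and only re-parses when the ETag changes."""

    def __init__(self, url: str,
                 fetch: Callable[[str, Optional[str]], Response] = http_fetch):
        self.url = url
        self.fetch = fetch
        self.etag: Optional[str] = None
        self.config: dict = {}

    def poll(self) -> bool:
        """Return True if a new config was loaded, False if nothing changed."""
        status, etag, body = self.fetch(self.url, self.etag)
        if status == 304 or (etag is not None and etag == self.etag):
            return False
        self.config = json.loads(body)
        self.etag = etag
        return True
```

A service would call `poll()` on a timer and keep serving the previous `config` if a poll fails, so a temporary outage of the config source does not take the service down.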
In our experience, it takes less than 1 minute from the moment a config is merged to master until it is available in production. If the microservices poll for an ETag change every 3 minutes, the maximum time it takes for a config to be “live” is 4 minutes. The poll interval can of course be reduced, or an SQS queue can be used together with SNS to notify the consumers of the change, which dramatically reduces the time it takes for the config to be “live”.
- In our setup, only a tiny fraction of the configs needed to be adjusted, so we created a GUI on top of the git repo that allowed non-technical people to edit those configs. Apart from the cost of protecting and running that GUI, we lost the audit feature of Git, which by that point could easily have been replaced by something like MongoDB.
- Travis has built-in support for uploading to an S3 bucket.
- As our config grew, we broke it into a directory structure. An open source project would combine them into one JSON ready for testing and deployment.
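The combining step from the last note can be approximated in a few lines. This is a hypothetical sketch, not the open source project mentioned above: it assumes that directory names and file stems become keys in the combined JSON, which may differ from how that project maps files to keys.

```python
import json
from pathlib import Path


def combine_configs(root: Path) -> dict:
    """Merge every *.json file under root into one nested dict.

    Directories and file stems become keys, so features/search.json
    ends up under combined["features"]["search"].
    """
    combined: dict = {}
    for path in sorted(root.rglob("*.json")):
        node = combined
        parts = path.relative_to(root).with_suffix("").parts
        # Walk (and create) one nested dict per directory level.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = json.loads(path.read_text(encoding="utf-8"))
    return combined
```

The sketch does not handle a file and a directory sharing the same name (for example `features.json` next to `features/`); a real tool would need to reject or resolve that case.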
In general it is best to reduce the runtime configuration because it is hard (if not impossible) to guarantee correctness for all permutations of config.
If the config is growing wild, it is usually a symptom of organizational issues. One can hardly solve organizational issues with technical solutions. An experienced PM who is good at stakeholder management should be able to shield the team from unnecessary flexibility. It is the PM’s job to settle conflicting requirements and distill the implementation requirements. Keep the configs limited to what moves the business metric needle, and justify each one against the cost of flexibility.