Reliability engineering for some of the top 10 sites in Scandinavia
I work in the Core team behind some of the top 10 most visited sites in Scandinavia (Sweden & Norway). The product itself is a white label platform to build news sites. Its implementation is governed as an inner source project with a community of 150+ employees and consultants (I have previously written about how this community came to be).
In less than 3 years we went from 1 site to 20+. This growth had deep implications from the way we collaborate to the way we ensure scalability, security and reliability.
Before digging into the processes, one fact needs to be emphasized:
The Core team is not a dedicated DevOps team, but rather generalist software engineers with hands deep into the code and the responsibility to run the sites in production.
Many of these ideas did not start as “best practices” copied from some big American tech giant but organically grew in a cultivating environment:
- Company culture: we enjoy tremendous autonomy that enables “local innovation” at the team level, and those innovations spread thanks to effective collaboration, which avoids duplication of effort (more on that later in this article).
- Risk Appetite: the risk assessment has a fundamental impact on every aspect of a product, from architecture to tooling and processes. Media & entertainment products are relatively lean and quick to adopt new technologies and to experiment. This is in contrast to, say, finance (FinTech) or medicine (MedTech), where money and lives are at stake and heavy regulatory and compliance requirements are in place.
Given the crucial role of the Core team for the company’s business, some of the most senior and talented developers are recruited to this team both internally and externally. Initially my team was responsible for architecting, implementing and running the platform in production. Now that the platform is battle-tested in production, most contributions come from the inner source community. The Core team implements code governance, improves developer experience (DX), builds satellite services supporting the main services and generally takes care of observability, performance, reliability, scalability, security and architecture.
Let’s have a look at some of the measures we have taken to ensure the well-oiled machinery behind some of the top sites in Scandinavia. This list is by no means exhaustive, but to give it a bit of structure, the ideas are broadly grouped into two main categories:
- Mindset: the mentality & practices that are built around governing the community, innovation, security and reliability
- Tooling: the software either produced by us or 3rd party open source or SaaS services to implement the above mindset
Tight feedback accelerates innovation
One of the most effective things the Core team does is to stay in close contact with the developer community, which is extra important given the distributed and multinational nature of that community. We have an active Slack support channel, a good level of documentation and two-way communication channels like unconferences and scoped workshops. When the pandemic started, we even experimented with office hours over video calls (VC), where at least one person from the Core team was available for face-to-face interaction.
The tight feedback loop enables us to quickly react to issues and have a lower barrier for good ideas to be adopted. Moreover, the Core team is an actual user of the platform which exposes us to the same pain points and enables us to have deep technical discussions and implement ideas. This is in stark contrast with DevOps teams who live in a completely different world using different software tools and having different concerns.
Although operations and developers are theoretically supposed to work closely together to make DevOps work, in practice the two professions attract different types of people, equipped with different tools, having different priorities and speaking different languages, which makes it very hard to collaborate effectively. I’m not saying it’s impossible, but in my experience it’s more of an exception than the norm.
Some of the most useful feedback we received for improving the platform was from our Unconference sessions.
Unconference is an effective tool to get feedback. In a nutshell, you allocate a space (physical or virtual) and invite all the relevant people. Everyone can bring their topic of discussion and pitch it. An agenda is created organically and discussions are held in breakout sessions. At the end, a summary is presented from each session. It’s like a conference but without a pre-decided agenda and 1:n information flow. It allows spontaneity while having a consensus element to it.
During the pandemic it is a bit more complex to plan and carry out a digital unconference, but it’s much cheaper and leaner for distributed teams because it doesn’t need everyone to be at the same location at the same time after clearing their calendar for a day or two. Plus, by preventing tens or hundreds of trips, it reduces the carbon footprint.
Improving DX improves UX
When I was new to the Core team, I already had over a decade of software engineering experience but it still took me half a year of self-onboarding to be up and running. This was inefficient to say the least and would particularly hurt new recruits. I have written more about my process here but in a nutshell, I made it my mission to improve the developer experience (DX).
Back then whenever I used the acronym DX, I had to expand it for people who never heard of “Developer Experience”. Today it makes me so happy to see that my colleagues throw the term here and there as the reason to improve the platform. It even made its way to the OKR!
So how does improving DX improve product quality? Bugs shelter in complexity and multiply in brainless copy/paste. Having a good understanding of the code enables the developers to create cleaner code with higher quality and fewer vulnerabilities. The developers spend less time fighting the tools and more time learning the domain.
There are many ways to put it but to be brutally honest:
Confused developers create confusing products. Frustrated developers create frustrating products. Productive developers create products that are a pleasure to use.
Example: as our repos grew larger and larger, one developer flagged the fact that the linter (running on pre-push hook) practically took 3–5 minutes on his machine effectively slowing down his development speed. Although his machine was old, the ultimate solution was smarter configuration.
Dogfooding feeds the product
The Core team works on the same repo as the rest of the community with the same tools and is exposed to the same pain points. Admittedly there are some discrepancies, like the on-call responsibility and the access to some sensitive systems that the on-call rotation requires, but otherwise we’re in this together.
Eating your own dog food or dogfooding is the practice of an organization using its own product. — wikipedia
The Core team even has a mock site that puts us in the shoes of the site developers building on top of our platform. Not only does it expose us to some of the pain points of the platform’s users, it also forces us to understand in practice the use cases of the sites built on top of the platform, because our dogfooding site puts us in the position of a brand developer.
This is where we diverge from a traditional DevOps team which is mainly focused on running the code with the ambition to keep in touch with the developers.
We ARE the developers! If there’s ever an “us vs them” we have failed our role.
As the Core team is a stakeholder in the platform, many code improvements originate here:
- quick feedback loops when running the code locally
- good observability tools when running the code in the cloud
- good documentation that nails the “why” of a problem/solution
- clean code which makes the “how” of a solution accessible to the average developer
- tooling that shields the team from bikeshedding
Local innovation and shared effort
The holy grail of a white label product like ours is:
- to support local innovation: let the brands modify the system behaviour or add new features on top of it to experiment solving new problems
- to reduce duplication of effort: let the brands share those innovations and unify on common solutions
This is of course easier said than done, but after several years of working with tens of brands, the secret sauce seems to be to have a layered architecture:
- The Core team is responsible for overseeing the general architecture and maintaining the common parts of the code
- Each brand has their own “space” where they can override defaults or create totally new behavior. In practice that “space” is usually a folder tree.
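To make the layering more concrete, here is a minimal sketch of how a brand override could be resolved before falling back to the common code. The folder names (`core/`, `brands/<brand>/`), the function `resolveBrandFile` and the demo file list are all hypothetical and only illustrate the idea; the real platform may resolve overrides differently.

```typescript
// Hypothetical layout (not the real repo structure):
//   core/<path>            default implementation in the common layer
//   brands/<brand>/<path>  optional override owned by a brand

// Returns the path that should be loaded for a brand, preferring its override.
function resolveBrandFile(
  brand: string,
  relativePath: string,
  fileExists: (path: string) => boolean
): string | undefined {
  const candidates = [`brands/${brand}/${relativePath}`, `core/${relativePath}`];
  for (const candidate of candidates) {
    if (fileExists(candidate)) {
      return candidate;
    }
  }
  return undefined; // neither layer provides this file
}

// A tiny in-memory "file system" standing in for the real folder tree.
const demoFiles = ["core/header.tsx", "core/footer.tsx", "brands/daily/header.tsx"];
const demoFileExists = (path: string): boolean => demoFiles.indexOf(path) !== -1;

console.log(resolveBrandFile("daily", "header.tsx", demoFileExists)); // brands/daily/header.tsx
console.log(resolveBrandFile("daily", "footer.tsx", demoFileExists)); // core/footer.tsx
```

The brand only pays for what it changes: anything it does not override keeps following the common implementation.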
Frequent deploy and quick Rollback
I’ll just quickly mention these two without going into technical details:
- Frequent Deploy: allows us to see the effect of code changes in smaller batches. A smaller number of commits going to prod on each deploy reduces the risk of things going terribly wrong and also makes it easier to detect which commit may be the cause of an issue.
- Quick Rollback: reduces the cost of making mistakes, which makes us more willing to experiment, make mistakes and learn from them. For the terminology geek: we use rollforward in some systems and rollback in some others. The key point here is to be able to quickly go to a known good state.
We’ve put a lot of time into optimizing our deploy pipeline. We define lead time as the time it takes for a change from being merged to master till it’s live in production. 3 years ago, it was about an hour. Today it is often shorter than 10 minutes.
Currently we don’t automatically deploy to production. But we’ve made the manual judgement step as painless and straightforward as possible. To cut the learning curve and simplify the deployment, we brought the deployment workflow to the developers. We created a simple chatbot which informs the developers in relevant channels about the availability of a new version of the code, with links to the staging endpoints. Promoting to production is just a click away. If things break, one can promote an earlier version. It’s all in the chat channel after all!
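The idea of promoting a version and quickly going back to a known good state can be sketched as follows. This is not our chatbot or pipeline; `ReleaseTrack`, its methods and the version labels are hypothetical names used only to illustrate the bookkeeping.

```typescript
// Hypothetical release record; field names are illustrative.
interface Release {
  version: string;
  knownGood: boolean; // set once the version has proven itself in production
}

class ReleaseTrack {
  private promoted: Release[] = [];

  // Promote a staged version to production.
  promote(version: string): void {
    this.promoted.push({ version, knownGood: false });
  }

  // Mark the version currently in production as healthy.
  markCurrentAsGood(): void {
    const last = this.promoted[this.promoted.length - 1];
    if (last) last.knownGood = true;
  }

  // The version currently in production, if any.
  current(): string | undefined {
    const last = this.promoted[this.promoted.length - 1];
    return last ? last.version : undefined;
  }

  // Re-promote the most recent known good version that precedes the current one.
  rollback(): string | undefined {
    for (let i = this.promoted.length - 2; i >= 0; i--) {
      const candidate = this.promoted[i];
      if (candidate && candidate.knownGood) {
        this.promoted.push({ version: candidate.version, knownGood: true });
        return candidate.version;
      }
    }
    return undefined; // nothing safe to go back to
  }
}

const track = new ReleaseTrack();
track.promote("v41");
track.markCurrentAsGood();
track.promote("v42"); // turns out to be broken
console.log(track.rollback()); // v41
console.log(track.current()); // v41
```

Whether the system technically rolls back or rolls forward matters less than always knowing which version is safe to return to.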
Good software engineers are smart, great ones are also lazy! We have put a lot of time into automating the repetitive, error-prone tasks known as toil.
One example is our developer credential self service. For security and auditability reasons each developer requires unique credentials to access our internal endpoints. Traditionally these credentials were created manually by a member of the Core team. This process was error-prone and slow, and it distracted at least one person from the Core team. It also made it expensive to rotate credentials and complex to disable deprecated ones (when someone left the company).
So we made a self service credential management portal which is protected behind the company SSO solution. This allows anyone to create, delete or rotate their credentials.
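The operations such a portal needs are few. The sketch below is an assumption about their shape, not our actual implementation: `CredentialPortal`, `DevCredential` and the injected key generator are hypothetical, and the SSO check is assumed to have happened before any of these methods is called.

```typescript
// Hypothetical credential record.
interface DevCredential {
  owner: string; // identity taken from the SSO session (assumed)
  key: string;
  active: boolean;
}

class CredentialPortal {
  private credentials = new Map<string, DevCredential>();
  private generateKey: () => string;

  // The key generator is injected so the sketch stays free of crypto dependencies.
  constructor(generateKey: () => string) {
    this.generateKey = generateKey;
  }

  // Create (or re-create) the credential of a developer.
  create(owner: string): DevCredential {
    const credential: DevCredential = { owner, key: this.generateKey(), active: true };
    this.credentials.set(owner, credential);
    return credential;
  }

  // Rotation keeps the identity but invalidates the old key.
  rotate(owner: string): DevCredential | undefined {
    const existing = this.credentials.get(owner);
    if (!existing || !existing.active) return undefined;
    existing.key = this.generateKey();
    return existing;
  }

  // Disable a credential, for example when someone leaves the company.
  disable(owner: string): void {
    const existing = this.credentials.get(owner);
    if (existing) existing.active = false;
  }

  isValid(owner: string, key: string): boolean {
    const existing = this.credentials.get(owner);
    return existing !== undefined && existing.active && existing.key === key;
  }
}

// Deterministic stand-in for a real random key generator, only for this demo.
let demoCounter = 0;
const portal = new CredentialPortal(() => `key-${++demoCounter}`);
const firstKey = portal.create("alice").key;
portal.rotate("alice");
console.log(portal.isValid("alice", firstKey)); // false: the old key no longer works
```

A real portal would generate keys with a cryptographically secure source and persist them; the point here is only that rotation and disabling become cheap once they are ordinary operations instead of manual chores.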
Now let’s see some of the more concrete ways we use to improve security, reliability and scalability.
Unified observability platform
A good observability platform is the most important tool for guaranteeing site reliability and performance.
Trying to improve reliability without good observability tooling is like running in the dark. Our unified observability platform has a flexible monitoring and alert system which reduces our incident Mean Time To Acknowledge (MTTA).
Our tool supports all 3 pillars of observability (logs, metrics and traces) and correlates them to facilitate triage and troubleshooting across systems. Its wide range of integrations enables us to gather data points from many different systems. It uses machine learning to detect data seasonality and metric correlations, which is a huge help for cutting our Mean Time To Resolve (MTTR).
There are many providers in this area and most of them cover most use cases. You can even go the DIY route and create something with free open source software like Prometheus, InfluxDB and Grafana. We use Datadog because, quite frankly, it’s much more expensive for us to build something like that and guarantee its security and reliability, even at our scale. We have dedicated notebooks and dashboards for handling incidents and even display some key dashboards at the office.
When choosing or building an observability platform it is important to keep its total cost of ownership (TCO) in mind, because with a bit of misconfiguration, a misunderstanding of the price model or bad luck (e.g. DDoS), the costs can quickly outgrow the value of the tool.
Many cloud tools have a usage-based price model and, to mitigate billing surprises, they impose a usage or price cap that has to be lifted manually. Examples are AWS or New Relic. Datadog, however, argues that limiting the capabilities of their system by cost may render it useless exactly when it is needed the most (e.g. under unusually heavy load). Fortunately we had their representatives come over several times to help our community understand their pricing model and to give tips and tricks for keeping the costs under control. Besides, they can look into billing surprises on a case-by-case basis.
Automated Code Analysis
We use many tools to statically analyze the code. Some of them are:
- Prettier, especially when integrated with the IDE, can reduce linting errors, improve code quality and save developer time by automatically formatting files.
- Lint-staged is a smart and useful tool that reduces the number of linted files and accelerates the pre-commit hook, hence improving DX.
- Commit Lint is a way to ensure the commit messages comply with a format that is easy to digest for humans and tools.
- Husky is a tool for running Node.js scripts upon various git hooks. This is what runs Commit Lint or Lint-staged, for example.
- Lighthouse is a tool for improving the quality of web pages by auditing performance, accessibility and SEO metrics. We benchmark every PR using Lighthouse and show the changes compared to its parent.
- We extensively use GitHub code owners to automatically notify the right people for each PR based on the files it modifies. At our scale (both code and community), we quickly grew to use tools like codeowners-generator to generate the final code owner file. To feed its initial config, we created a script which extracts good candidates for each folder based on git history.
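As an illustration of the last point, candidate owners can be derived by counting who commits most often to each folder. The sketch below is not our actual script: `CommitRecord` and `suggestOwners` are hypothetical, the commit list is assumed to be parsed already (for example from `git log --name-only`), and for brevity it only looks at top-level folders.

```typescript
// One entry per commit; parsing the git history is not shown here.
interface CommitRecord {
  author: string;
  files: string[];
}

// Rank authors per top-level folder by the number of commits touching it.
function suggestOwners(commits: CommitRecord[], maxOwners: number): Map<string, string[]> {
  // folder -> author -> number of commits touching that folder
  const counts = new Map<string, Map<string, number>>();
  for (const commit of commits) {
    // A commit touching several files in one folder counts once for that folder.
    const folders = new Set<string>();
    for (const file of commit.files) {
      const slash = file.indexOf("/");
      folders.add(slash === -1 ? "." : file.slice(0, slash));
    }
    for (const folder of Array.from(folders)) {
      const perAuthor = counts.get(folder) ?? new Map<string, number>();
      perAuthor.set(commit.author, (perAuthor.get(commit.author) ?? 0) + 1);
      counts.set(folder, perAuthor);
    }
  }
  const owners = new Map<string, string[]>();
  counts.forEach((perAuthor, folder) => {
    const ranked = Array.from(perAuthor.entries())
      .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0])) // most commits first, ties by name
      .slice(0, maxOwners)
      .map((entry) => entry[0]);
    owners.set(folder, ranked);
  });
  return owners;
}

const demoCommits: CommitRecord[] = [
  { author: "alice", files: ["core/a.ts", "core/b.ts"] },
  { author: "bob", files: ["core/a.ts", "brands/x/index.ts"] },
  { author: "alice", files: ["core/c.ts"] },
  { author: "carol", files: ["brands/x/theme.ts"] },
  { author: "bob", files: ["brands/y/index.ts"] },
];
const suggested = suggestOwners(demoCommits, 1);
console.log(suggested.get("core")); // [ 'alice' ]
console.log(suggested.get("brands")); // [ 'bob' ]
```

The output is only a starting point: commit counts say who touched a folder, not who should be responsible for it, so the suggestions still need a human review.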
Security is hard, but with a bit of creativity one can make it accessible to a larger group of people. One of the most common issues (OWASP Top 10) is using components with known vulnerabilities:
- NPM Audit is an essential step to ensure that code with known vulnerabilities does not go to production. We use it in our CI/CD pipeline, but of course this tool is only useful if it produces clear, actionable errors. One way to keep random dev dependencies from breaking CI/CD is to only audit production dependencies above a certain risk level (recent npm versions prefer --omit=dev over --production):
npm audit --production --audit-level=moderate
It’s also a good idea to automatically run npm audit fix if package.json is changed.
- Renovate is a tool that regularly scans the repositories for outdated dependencies and, if something needs to be updated, makes a PR that is ready to merge to master. Dependabot is another such tool. It is worth noting that these tools work best when the code has good test coverage, which can flag any unexpected behaviour that may result from an automated dependency update.
- Vulcan is a meta tool which regularly scans services across the organization for security vulnerabilities and sends brief, actionable reports to stakeholders.
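The idea of only failing the build above a certain risk level, as mentioned for NPM Audit above, boils down to a threshold check over severity counts. The sketch below illustrates that logic only; `SeverityCounts` and `shouldFailBuild` are hypothetical names and the shape of the counts is an assumption, not a description of npm's internals or output format.

```typescript
// Severity levels ordered from least to most severe.
const severityOrder = ["info", "low", "moderate", "high", "critical"] as const;
type Severity = (typeof severityOrder)[number];

// Hypothetical summary: how many findings were reported per severity.
type SeverityCounts = Record<Severity, number>;

// Fail only if there is at least one finding at or above the threshold.
function shouldFailBuild(found: SeverityCounts, threshold: Severity): boolean {
  const start = severityOrder.indexOf(threshold);
  return severityOrder.slice(start).some((level) => found[level] > 0);
}

const onlyLow: SeverityCounts = { info: 2, low: 3, moderate: 0, high: 0, critical: 0 };
const oneHigh: SeverityCounts = { info: 0, low: 0, moderate: 0, high: 1, critical: 0 };

console.log(shouldFailBuild(onlyLow, "moderate")); // false
console.log(shouldFailBuild(oneHigh, "moderate")); // true
```

Picking the threshold is a trade-off: too low and developers learn to ignore a permanently red pipeline, too high and real issues slip through.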
Use managed services with higher SLA
When a service response changes less often than it is requested, it’s probably a good candidate for caching. Throughout the platform we use various types of caches from in-memory to Redis to CloudFront and Fastly.
But sometimes, you can skip the service altogether. This is an idea that was popularized with Jamstack which serves statically generated HTML files instead of running a service that generates those files.
For us, without going too much into details, this translates to using solid services like S3 as the primary endpoint for some data that doesn’t change that often.
- Reduce server load: varnish is an extremely efficient cache server which reduces the running cost of the full blown servers
- Reduce perceived down time: one of the neat features of Varnish is to serve stale traffic when a backend goes down
- Isolate access control to the edge layer: Varnish handles access control without requiring any special logic from the content rendering backend
- GDPR-ready by design: the end user personally identifiable information (PII) doesn’t reach our backend because Varnish strips it off.
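The “serve stale content when a backend goes down” behaviour mentioned above can be sketched in a few lines. This is neither how Varnish is implemented nor how it is configured; `StaleOnErrorCache` is a hypothetical, synchronous toy that only shows the decision logic, with the clock passed in explicitly to keep it deterministic.

```typescript
interface CachedEntry {
  body: string;
  storedAt: number; // timestamp in milliseconds
}

class StaleOnErrorCache {
  private entries = new Map<string, CachedEntry>();
  private ttlMs: number;
  private fetchFromBackend: (key: string) => string; // expected to throw when the backend is down

  constructor(ttlMs: number, fetchFromBackend: (key: string) => string) {
    this.ttlMs = ttlMs;
    this.fetchFromBackend = fetchFromBackend;
  }

  get(key: string, now: number): string {
    const cached = this.entries.get(key);
    if (cached && now - cached.storedAt < this.ttlMs) {
      return cached.body; // fresh hit: the backend is not contacted at all
    }
    try {
      const body = this.fetchFromBackend(key);
      this.entries.set(key, { body, storedAt: now });
      return body;
    } catch (error) {
      if (cached) return cached.body; // backend failed: serve the stale copy
      throw error; // nothing cached, so the failure is visible
    }
  }
}

let backendUp = true;
const edge = new StaleOnErrorCache(1000, (key) => {
  if (!backendUp) throw new Error("backend down");
  return `page for ${key}`;
});

console.log(edge.get("/front", 0)); // page for /front (fetched from the backend)
backendUp = false;
console.log(edge.get("/front", 5000)); // page for /front (stale copy, backend is down)
```

The fresh-hit branch is what reduces server load, and the fallback in the `catch` branch is what reduces perceived downtime.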
Another popular open source product that started at our company is Unleash. It allows gradually releasing new features or doing A/B testing in production. This provides greater flexibility and more control over exposing new features and can help us make data driven decisions and experimentation. The implementation details are outside the scope of this article.
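To give a feel for what gradual release means, here is a minimal sketch of a percentage-based rollout. It is explicitly not Unleash’s implementation or API: `bucketOf` and `isFeatureEnabled` are hypothetical helpers, and the hash is a simple FNV-1a style function over the characters of the string.

```typescript
// Map a string to a stable bucket in the range 0..99 (FNV-1a style, 32-bit).
function bucketOf(id: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

// A user sees the feature if their bucket falls below the rollout percentage.
// Including the feature name keeps different features from always picking the same users.
function isFeatureEnabled(featureName: string, userId: string, rolloutPercentage: number): boolean {
  return bucketOf(`${featureName}:${userId}`) < rolloutPercentage;
}

// The same user gets a stable answer, and raising the percentage only adds users.
console.log(isFeatureEnabled("new-paywall", "user-123", 0)); // false
console.log(isFeatureEnabled("new-paywall", "user-123", 100)); // true
```

Because the bucket is derived from the user rather than drawn at random on each request, a user who sees the feature at 10% keeps seeing it at 50%, which makes experiments and A/B comparisons consistent.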
In the spirit of transparency, this article does not claim that we have found the holy grail of site reliability engineering. But it elaborates how we got to a good level.
- A generalist team of software engineers that is responsible for both implementing the platform and running it has some disadvantages too: the wider scope is more demanding both in terms of support and knowledge level, which can be quite exhausting. In practice, some of us end up doing mostly DevOps, while others mostly do regular feature development or bug fixes in the repo. Nevertheless, we don’t get as deep into either of the two as we need to.
- Putting a lot of effort into improving DX and reliability also means that we are too invested in the status quo to change it. Code rots over time, and even more so when the repo is larger. The fact that we did not let these huge repos implode under their own load allowed some of the ugliest, most crooked hacks to survive over the years. Of course one can refactor the tech debt, but it won’t be prioritized when the risk is lower due to site reliability engineering.