Reliability engineering for some of the top 10 sites in Scandinavia

Setup

Traditionally, developer teams and DevOps teams live in slightly different worlds with different concerns.

The Core team is not a dedicated DevOps team, but rather a group of generalist software engineers with their hands deep in the code and the responsibility to run the sites in production.

Many of these ideas did not start as “best practices” copied from some big American tech giant but grew organically in a nurturing environment:

  • Risk Appetite: the risk assessment has a fundamental impact on every aspect of a product, from architecture to tooling and processes. Media & entertainment products are relatively lean and quick to adopt new technologies and experiment. This is in contrast to, say, finance (FinTech) or medicine (MedTech), where money and lives are at stake and heavy regulatory and compliance requirements are in place.
  • Tooling: the software, whether produced by us or third-party open source or SaaS services, that implements the above mindset

Mindset

Tight feedback accelerates innovation

One of the most effective things the Core team does is stay in close contact with the developer community, which is extra important due to the distributed and multinational nature of that community. We have an active Slack support channel, a good level of documentation and two-way communication channels like unconferences and scoped workshops. When the pandemic started, we even experimented with office hours over video call, where at least one person from the Core team was ready for face-to-face interaction.

Improving DX improves UX

When I was new to the Core team, I already had over a decade of software engineering experience, but it still took me half a year of self-onboarding to be up and running. This was inefficient to say the least and would particularly hurt new recruits. I have written more about my process here, but in a nutshell, I made it my mission to improve the developer experience (DX).

Lift your developers and they will lift your users

Confused developers create confusing products. Frustrated developers create frustrating products. Productive developers create products that are a pleasure to use.

Example: as our repos grew larger and larger, one developer flagged that the linter (running on a pre-push hook) took 3–5 minutes on his machine, effectively slowing down his development. Although his machine was old, the ultimate solution was smarter configuration (sketched below).
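As a minimal sketch of what “smarter configuration” can look like in a Node.js repo (the script and branch names are illustrative, not our actual setup): lint only the files that changed relative to the main branch and let ESLint’s cache skip the rest.

```js
// lint-changed.js — run from a pre-push hook instead of linting the whole repo.
const { execSync } = require('node:child_process');

// Only look at files added/changed since the branch diverged from master.
const changed = execSync(
  'git diff --name-only --diff-filter=ACMR origin/master...HEAD',
  { encoding: 'utf8' },
)
  .split('\n')
  .filter((file) => /\.(js|jsx|ts|tsx)$/.test(file));

if (changed.length > 0) {
  // --cache makes repeated runs cheap even when the changed list is long.
  execSync(`npx eslint --cache ${changed.join(' ')}`, { stdio: 'inherit' });
}
```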

Dogfooding feeds the product

The Core team works in the same repo as the rest of the community, with the same tools, and is exposed to the same pain points. Admittedly there are some discrepancies, like the on-call responsibility and access to some sensitive systems for the on-call rotation, but we are in this together.

Eating your own dog food, or dogfooding, is the practice of an organization using its own product. — Wikipedia

The Core team even has a mock site that puts us in the shoes of the site developers building on top of our platform. Not only does it expose us to some of the pain points of the platform’s users, it also forces us to understand the use cases of the sites built on top of the platform, because our dogfooding site effectively puts us in the position of a brand developer.

The Core team also maintains a dogfooding site

We ARE the developers! If there’s ever an “us vs. them”, we have failed in our role.

As the Core team is a stakeholder in the platform, many code improvements originate here:

  • good observability tools when running the code in the cloud
  • good documentation that nails the “why” of a problem/solution
  • clean code which makes the “how” of a solution accessible to the average developer
  • tooling that shields the team from bikeshedding

Local innovation and shared effort

The holy grail of a white-label product like ours is:

  • To reduce duplication of effort: let the brands share their innovations and unify on common solutions.
  • To allow local innovation: each brand has its own “space” where it can override defaults or create totally new behavior. In practice that “space” is usually a folder tree (see the sketch after this list).
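As a hypothetical illustration of that folder-based override idea (the folder names and helper below are made up, not our actual layout), resolving a module could look like this:

```js
// resolve-module.js — pick a brand-specific file if it exists, otherwise the shared default.
const { existsSync } = require('node:fs');
const path = require('node:path');

function resolveModule(brand, moduleName) {
  const brandPath = path.join('brands', brand, `${moduleName}.js`);
  const defaultPath = path.join('core', `${moduleName}.js`);
  // A brand that ships its own implementation overrides the default.
  return existsSync(brandPath) ? brandPath : defaultPath;
}

// A brand with its own header component gets it; everyone else falls back to the shared one.
// resolveModule('brand-a', 'components/header');
```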

Frequent deploys and quick rollbacks

I’ll just quickly mention these two without going into technical details:

  • Frequent deploys: the harder it is to deploy the code, the less often it will happen.
  • Quick rollbacks: reduce the cost of making mistakes, which makes it safer to make mistakes and learn from them. For the terminology geeks: we use roll-forward in some systems and rollback in others. The key point here is to be able to quickly get back to a known good state.

Reduce toil

Good software engineers are smart; great ones are also lazy! We have put a lot of time into automating the repetitive, error-prone tasks known as toil.

Tooling

Now let’s look at some of the more concrete ways we improve security, reliability and scalability.

Unified observability platform

A good observability platform is the most important tool for guaranteeing site reliability and performance.

A good observability tool is bigger than the sum of its pillars
Cost limits help keep the value larger than the cost

Automated Code Analysis

We use many tools to statically analyze the code, including:

  • ESLint is a popular tool for avoiding common JavaScript pitfalls and promoting good coding practices, all while reducing bikeshedding discussions like tabs vs. spaces and semicolons.
  • Prettier, especially when integrated with the IDE, can reduce linting errors, improve code quality and save developer time by automatically formatting files.
  • lint-staged is a smart and useful tool that reduces the number of linted files and accelerates the pre-commit hook, hence improving DX.
  • commitlint is a way to ensure that commit messages comply with a format that is easy to digest for both humans and tools.
  • Husky is a tool for running Node.js scripts on various Git hooks. It is what runs commitlint or lint-staged, for example.
  • Lighthouse is a tool for improving the quality of web pages by auditing performance, accessibility and SEO metrics. We benchmark every PR using Lighthouse and show the changes relative to its parent.
  • We extensively use GitHub code owners to automatically notify the right people for each PR based on the files it modifies. At our scale (both code and community), we quickly grew to use tools like codeowners-generator to generate the final CODEOWNERS file. To feed its initial config, we created a script which extracts good candidates for each folder based on Git history (a sketch of that idea follows below).
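That extraction script is internal, so what follows is only a minimal, hypothetical sketch of the idea: for each top-level folder, count commits per author and keep the most frequent committers as code owner candidates.

```js
// codeowner-candidates.js — hypothetical sketch: suggest code owner candidates
// per top-level folder based on how often each author touched it.
const { execSync } = require('node:child_process');
const { readdirSync } = require('node:fs');

const folders = readdirSync('.', { withFileTypes: true })
  .filter((entry) => entry.isDirectory() && entry.name !== '.git')
  .map((entry) => entry.name);

for (const folder of folders) {
  // `git shortlog -sne` prints "<commit count> <author>" per contributor,
  // already sorted by commit count.
  const output = execSync(`git shortlog -sne HEAD -- "${folder}"`, { encoding: 'utf8' });
  const candidates = output
    .trim()
    .split('\n')
    .filter(Boolean)
    .map((line) => {
      const [, commits, author] = line.match(/^\s*(\d+)\s+(.+)$/);
      return { author, commits: Number(commits) };
    })
    .slice(0, 3); // keep the top three committers as candidates

  console.log(folder, candidates);
}
```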

Accessible Security

Security is hard, but with a bit of creativity one can make it accessible to a larger group of people. One of the most common issues (OWASP Top 10) is using components with known vulnerabilities.

  • Renovate is a tool that regularly scans the repositories for outdated dependencies and, if something needs to be updated, opens a PR ready to merge to master. Dependabot is another such tool. It is worth noting that these tools work best when the code has good test coverage, which can flag any unexpected behaviour resulting from an automated dependency update.
  • Vulcan is a meta tool which regularly scans services across the organization for security vulnerabilities and sends brief, actionable reports to stakeholders.

Use managed services with higher SLAs

When a service response changes less often than it is requested, it’s probably a good candidate for caching. Throughout the platform we use various types of caches from in-memory to Redis to CloudFront and Fastly.
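As an illustration of the in-memory end of that spectrum, here is a minimal sketch (not our actual implementation) of caching a response for a fixed time-to-live so that repeated requests skip the backend:

```js
// A tiny in-memory cache: keep a value until its TTL expires,
// then fetch it again from the backend.
const cache = new Map();

async function cachedFetch(key, ttlMs, fetcher) {
  const entry = cache.get(key);
  if (entry && Date.now() < entry.expiresAt) {
    return entry.value; // still fresh, skip the backend call
  }
  const value = await fetcher();
  cache.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}

// Usage: a front-page config that rarely changes can be cached for a minute.
// const config = await cachedFetch('front-page', 60_000, fetchConfigFromBackend);
```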

Varnish

Varnish is a widely used HTTP accelerator that originated at VG. It is also the base of Fastly. The key benefits of Varnish are:

  • Reduce perceived downtime: one of the neat features of Varnish is serving stale content when a backend goes down (sketched conceptually below).
  • Isolate access control to the edge layer: Varnish handles access control without requiring any special logic from the content-rendering backend.
  • GDPR-ready by design: the end user’s personally identifiable information (PII) doesn’t reach our backend because Varnish strips it off.

A reverse proxy like Varnish can be configured to reduce the blast radius of a service outage
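Varnish itself expresses this in its VCL configuration language; purely as a conceptual sketch in JavaScript (not actual Varnish configuration), the “serve stale on error” idea boils down to something like this:

```js
// Keep the last good response per URL and fall back to it when the backend fails.
// Uses the global fetch available in Node 18+ and browsers.
const lastGood = new Map();

async function fetchWithStaleFallback(url) {
  try {
    const response = await fetch(url);
    if (!response.ok) throw new Error(`Backend returned ${response.status}`);
    const body = await response.text();
    lastGood.set(url, body); // remember the last known good state
    return body;
  } catch (error) {
    if (lastGood.has(url)) {
      return lastGood.get(url); // stale content beats an error page
    }
    throw error; // nothing cached yet, surface the failure
  }
}
```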

Feature flags

Another popular open source product that started at our company is Unleash. It allows us to gradually release new features or do A/B testing in production. This provides greater flexibility and more control over exposing new features and helps us make data-driven decisions and run experiments. The deeper implementation details are outside the scope of this article.
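Purely as a rough illustration (assuming the official unleash-client Node SDK; the URL and flag name below are made up), a feature flag check looks something like this:

```js
// Check a feature flag with the Unleash Node SDK.
const { initialize } = require('unleash-client');

const unleash = initialize({
  url: 'https://unleash.example.com/api/', // hypothetical Unleash server
  appName: 'my-site',
});

unleash.on('ready', () => {
  if (unleash.isEnabled('new-article-layout')) {
    // Render the experimental layout for users in the rollout group.
  }
});
```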

Conclusion

In the spirit of transparency, this article does not claim that we have found the holy grail of site reliability engineering, but it does describe how we got to a good level.

That said, this setup has its downsides:

  • Handling both software engineering and operations in one team can be very taxing for the team.
  • Putting a lot of effort into improving DX and reliability also means that we are too invested in the status quo to change it. Code rots over time, even more so in a large repo. The fact that we did not let these huge repos implode under their own weight has allowed some of the ugliest hacks to survive over the years. Of course one can refactor the tech debt, but it is less likely to be prioritized when site reliability engineering has already lowered the risk.
