Reliability engineering for some of the top 10 sites in Scandinavia

Setup

Traditionally, developer teams and DevOps teams live in slightly different worlds with different concerns.

The Core team is not a dedicated DevOps team, but rather generalist software engineers with hands deep into the code and the responsibility to run the sites in production.

  • Company culture: we enjoy tremendous autonomy that enables “local innovation” at the team level, and those innovations spread thanks to effective collaboration that avoids duplication of effort (more on that later in this article).
  • Risk appetite: risk assessment has a fundamental impact on every aspect of a product, from architecture to tooling and processes. Media & entertainment products are relatively lean and quick to adopt new technologies and to experiment. This is in contrast to, say, finance (FinTech) or medicine (MedTech), where money and lives are at stake and heavy regulatory and compliance requirements are in place.
  • Mindset: the mentality & practices that are built around governing the community, innovation, security and reliability
  • Tooling: the software, either produced by us or by third parties (open source or SaaS services), that implements the above mindset

Mindset

Tight feedback accelerates innovation

Improving DX improves UX

Lift your developers and they will lift your users

Confused developers create confusing products. Frustrated developers create frustrating products. Productive developers create products that are a pleasure to use.

Dogfooding feeds the product

Eating your own dog food or dogfooding is the practice of an organization using its own product. — wikipedia

The core team also maintains a dogfooding site.

We ARE the developers! If there’s ever an “us vs them” we have failed our role.

  • quick feedback loops when running the code locally
  • good observability tools when running the code in the cloud
  • good documentation that nails the “why” of a problem/solution
  • clean code which makes the “how” of a solution accessible to the average developer
  • tooling that shields the team from bikeshedding

Local innovation and shared effort

  • To support local innovation: let the brands modify the system behaviour or add new features on top of it to experiment with solving new problems
  • To reduce duplication of effort: let the brands share those innovations and unify on common solutions
  • The core team is responsible for overseeing the general architecture and maintaining the common parts of the code
  • Each brand has their own “space” where they can override defaults or create totally new behavior. In practice that “space” is usually a folder tree.
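The folder-tree idea can be illustrated with a small lookup: check the brand's folder first and fall back to the shared code. This is only a sketch; the `brands/` and `core/` folder names, the brand name `acme` and the `resolveModule` helper are hypothetical and not taken from our repository.

```typescript
// Hypothetical layout: core/<file> holds the defaults maintained by the core team,
// brands/<brand>/<file> holds a brand's overrides or brand-only features.
type FileExists = (path: string) => boolean;

function resolveModule(brand: string, relativePath: string, exists: FileExists): string {
  const brandPath = `brands/${brand}/${relativePath}`;
  const corePath = `core/${relativePath}`;
  if (exists(brandPath)) return brandPath; // the brand override wins
  if (exists(corePath)) return corePath; // otherwise use the shared default
  throw new Error(`No implementation found for ${relativePath}`);
}

// Usage with an in-memory file list standing in for the real file system.
const files = new Set([
  "core/components/Header.tsx",
  "core/components/Footer.tsx",
  "brands/acme/components/Header.tsx",
]);
const fileExists: FileExists = (path) => files.has(path);

console.log(resolveModule("acme", "components/Header.tsx", fileExists)); // brands/acme/components/Header.tsx
console.log(resolveModule("acme", "components/Footer.tsx", fileExists)); // core/components/Footer.tsx
```

Keeping the lookup order in one place means a brand can experiment without touching the common code, and a successful experiment can later be moved into the shared folder.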

Frequent deploy and quick Rollback

  • Frequent deploy: allows us to see the effect of code changes in smaller batches. Fewer commits going to prod on each deploy reduces the risk of things going terribly wrong and also makes it easier to detect which commit may be the cause of an issue.
  • Quick rollback: reduces the cost of making mistakes, which makes us more willing to make mistakes and learn from them. For the terminology geek: we use rollforward in some systems and rollback in others. The key point is being able to quickly get to a known good state.
The harder it is to deploy the code, the less it will happen
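Whether a system rolls back or rolls forward, the first step is the same: find the known good state. Here is a minimal sketch, assuming a deploy history ordered from oldest to newest where each release carries a health flag. The `Release` shape and the `lastKnownGood` helper are hypothetical, not part of our tooling.

```typescript
interface Release {
  version: string;
  healthy: boolean; // assumed to be set by monitoring after the deploy
}

// Returns the newest healthy release that precedes the current (last) one.
function lastKnownGood(history: Release[]): Release | undefined {
  for (let i = history.length - 2; i >= 0; i--) {
    const candidate = history[i];
    if (candidate && candidate.healthy) return candidate;
  }
  return undefined; // nothing safe to go back to
}

const deployHistory: Release[] = [
  { version: "v1", healthy: true },
  { version: "v2", healthy: true },
  { version: "v3", healthy: false },
  { version: "v4", healthy: false }, // current release, misbehaving
];

console.log(lastKnownGood(deployHistory)?.version); // v2
```

The smaller each deploy, the closer the known good release is to the current one, so less work is thrown away when going back.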

Reduce toil

Tooling

Unified observability platform

A good observability tool is bigger than the sum of its pillars
Cost limits help keep the value larger than the cost

Automated Code Analysis

  • TypeScript brings a type system to JavaScript which makes it possible to tame large code bases where multiple people work on different parts of the system simultaneously. I’ve written more about the merits of TS in another post.
  • ESLint is a popular tool for avoiding common JavaScript pitfalls and promoting good coding practices, all while reducing bikeshedding discussions like tabs vs spaces and semicolons.
  • Prettier, especially when integrated with the IDE, can reduce linting errors, improve code quality and save developer time by automatically formatting files.
  • Lint-staged is a smart and useful tool that can reduce the number of linted files and speed up the pre-commit hook, hence improving DX.
  • Commit Lint is a way to ensure the commit messages comply with a format that is easy to digest for humans and tools.
  • Husky is a tool for running Node.js scripts upon various git hooks. This is what runs Commit Lint or Lint-staged, for example.
  • Lighthouse is a tool for improving the quality of web pages by auditing performance, accessibility and SEO metrics. We benchmark every PR using Lighthouse and show the changes relative to its parent.
  • We extensively use GitHub code owners to automatically notify the right people for each PR based on the files it modifies. At our scale (both code and community), we quickly grew to use tools like codeowners-generator to generate the final code owner file. To feed its initial config, we created a script which extracts good candidates for each folder based on git history.
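The idea of extracting code owner candidates from git history can be sketched as counting, per folder, who commits there most often. This is not our actual script: the `Commit` shape, the sample authors and the `ownerCandidates` helper are hypothetical, and it assumes the commit list has already been parsed out of the git log.

```typescript
interface Commit {
  author: string;
  files: string[]; // paths touched by the commit
}

function topLevelFolder(file: string): string {
  const slash = file.indexOf("/");
  return slash === -1 ? "." : file.slice(0, slash);
}

// Ranks authors per top-level folder by the number of commits touching it.
function ownerCandidates(commits: Commit[], maxOwners = 2): Map<string, string[]> {
  const counts = new Map<string, Map<string, number>>();
  commits.forEach((commit) => {
    // Count a folder once per commit, even if several of its files changed.
    const folders = new Set(commit.files.map(topLevelFolder));
    folders.forEach((folder) => {
      const perAuthor = counts.get(folder) ?? new Map<string, number>();
      perAuthor.set(commit.author, (perAuthor.get(commit.author) ?? 0) + 1);
      counts.set(folder, perAuthor);
    });
  });

  const result = new Map<string, string[]>();
  counts.forEach((perAuthor, folder) => {
    const ranked = Array.from(perAuthor.entries())
      // Most commits first; ties are broken alphabetically for stable output.
      .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
      .slice(0, maxOwners)
      .map((entry) => entry[0]);
    result.set(folder, ranked);
  });
  return result;
}

const sampleCommits: Commit[] = [
  { author: "alice", files: ["src/a.ts", "src/b.ts"] },
  { author: "bob", files: ["src/a.ts"] },
  { author: "bob", files: ["src/c.ts", "docs/x.md"] },
  { author: "carol", files: ["src/d.ts"] },
  { author: "alice", files: ["docs/readme.md"] },
];

const candidates = ownerCandidates(sampleCommits);
console.log(candidates.get("src")); // [ 'bob', 'alice' ]
console.log(candidates.get("docs")); // [ 'alice', 'bob' ]
```

The output is only a starting point: commit counts say who touched a folder, not who wants to own it, so the generated candidates still deserve a human review.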

Accessible Security

  • NPM Audit is an essential step to ensure that code with known vulnerabilities does not go to production. We use it in our CI/CD pipeline, but of course this tool is only useful if it creates clear, actionable errors. One way to keep random dev dependencies from breaking CI/CD is to only audit production dependencies above a certain risk level: npm audit --production --audit-level=moderate. It’s also a good idea to automatically run npm audit fix if the package.json is changed.
  • Renovate is a tool that regularly scans the repositories for outdated dependencies and, if something needs to be updated, makes a PR ready to merge to master. Dependabot is another such tool. It is worth noting that these tools work best when the code has good test coverage, which can flag any unexpected behaviour that may result from an automated dependency update.
  • Vulcan is a meta tool which regularly scans services across the organization for security vulnerabilities and sends brief, actionable reports to stakeholders.
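The effect of an audit level can be pictured as a simple threshold over severity counts: findings below the chosen level are reported but do not fail the build. The sketch below illustrates that idea only and is not how npm implements it; the `SeverityCounts` shape, the sample numbers and the `shouldFailBuild` helper are assumptions made for the example.

```typescript
// Severity names ordered from least to most severe.
const SEVERITIES = ["info", "low", "moderate", "high", "critical"] as const;
type Severity = (typeof SEVERITIES)[number];
type SeverityCounts = Record<Severity, number>;

// True when at least one finding is at or above the chosen audit level.
function shouldFailBuild(counts: SeverityCounts, auditLevel: Severity): boolean {
  const threshold = SEVERITIES.indexOf(auditLevel);
  return SEVERITIES.some((severity, index) => index >= threshold && counts[severity] > 0);
}

// Hypothetical audit summary: only low-severity findings.
const summary: SeverityCounts = { info: 0, low: 3, moderate: 0, high: 0, critical: 0 };

console.log(shouldFailBuild(summary, "moderate")); // false
console.log(shouldFailBuild(summary, "low")); // true
```

Picking the level is a risk appetite decision: a lower threshold catches more but breaks the pipeline more often, which quickly trains people to ignore it.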

Use managed services with higher SLA

Varnish

  • Reduce server load: Varnish is an extremely efficient cache server which reduces the running cost of the full-blown servers
  • Reduce perceived downtime: one of the neat features of Varnish is serving stale content when a backend goes down
  • Isolate access control to the edge layer: Varnish handles access control without requiring any special logic from the content rendering backend
  • GDPR-ready by design: the end user personally identifiable information (PII) doesn’t reach our backend because Varnish strips it off.
A reverse proxy like Varnish can be configured to reduce the blast radius of a service outage
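Serving stale content when a backend fails can be illustrated with a tiny in-memory cache. This shows the general idea only and says nothing about how Varnish implements or configures it; the `StaleOnErrorCache` class, its TTL and the fake backend are hypothetical, and the backend call is synchronous to keep the example short.

```typescript
interface CacheEntry {
  body: string;
  storedAt: number; // timestamp in milliseconds
}

class StaleOnErrorCache {
  private entries = new Map<string, CacheEntry>();

  constructor(
    private ttlMs: number,
    private now: () => number = () => Date.now(),
  ) {}

  get(key: string, fetchFromBackend: (key: string) => string): string {
    const cached = this.entries.get(key);
    if (cached && this.now() - cached.storedAt < this.ttlMs) {
      return cached.body; // fresh hit: the backend is not contacted
    }
    try {
      const body = fetchFromBackend(key);
      this.entries.set(key, { body, storedAt: this.now() });
      return body;
    } catch (error) {
      if (cached) return cached.body; // backend is down: serve the stale copy
      throw error; // nothing cached, so the failure reaches the caller
    }
  }
}

// Usage with a fake clock and a backend that can be switched off.
let clock = 0;
let backendUp = true;
const backend = (key: string): string => {
  if (!backendUp) throw new Error("backend down");
  return `page for ${key}`;
};

const cache = new StaleOnErrorCache(1000, () => clock);
console.log(cache.get("/news", backend)); // page for /news

clock = 5000; // the entry has expired
backendUp = false; // and the backend is now failing
console.log(cache.get("/news", backend)); // page for /news (stale copy)
```

From the visitor's point of view the outage is limited to pages that were never cached, which is what reducing the blast radius looks like in practice.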

Feature flags

Conclusion

Handling both software engineering and operations in one team can be very taxing for the team.
  • A generalist team of software engineers that is responsible for both implementing the platform and running it has some disadvantages too: the wider scope is more demanding in terms of both support and knowledge, which can be quite exhausting. In practice, some of us end up doing mostly DevOps, while others mostly do regular feature development or bug fixes in the repo. Nevertheless, we don’t get as deep into either of the two as we need to.
  • Putting a lot of effort into improving DX and reliability also means that we are too invested in the status quo to change it. Code rots over time, and even more so when the repo is larger. The fact that we did not let these huge repos implode under their own weight allowed some of the ugliest, most crooked hacks to survive over the years. Of course one can refactor the tech debt, but it won’t be prioritized when the risk is lower thanks to site reliability engineering.

Sr. Staff Engineer, Knowledge Worker, MSc Systems Engineering, Tech Lead, Web Developer

Alex Ewerlöf