The ownership trio
The success of “you build it, you run it” hinges on ownership. Ownership has three aspects that go hand in hand: knowledge, mandate, and responsibility. If one or more of them are missing, the ownership trio is broken, and ultimately both the team and the business suffer.
Let’s see what those aspects mean:
Knowledge
- Have a good grasp of the domain and the use cases that the product supports
- Know the technology the product is built on, including the runtime and infrastructure that serve it, in order to reason about the system’s behavior
- Understand the big-picture architecture and how the different pieces of the solution interact with each other (this knowledge comes in handy during triage or root cause analysis)
- Be able to interpret the observability data (logs, metrics, traces) and know how to use it both to get the pulse of the system and for incident root cause analysis
- Understand the docs (e.g. API docs, runbooks, system diagrams, user docs)
- Have access to customer feedback (e.g. App Store reviews, UX research results)
- Understand the value of experimentation and how to carry out experiments to validate hypotheses
- Be able to analyze the results of A/B tests, user research, etc.
Mandate
- Have control over defining the observability requirements (metrics, logs, traces) as well as access to the data
- Have control over defining the service level objectives (SLOs) and service level agreements (SLAs)
- Have control over evolving the architecture as the requirements grow or change
- Have control over changing the configs, code, and infrastructure as code (IaC) without having to go through other teams
- Have adequate access to the relevant systems in accordance with the principle of least privilege
- Have control over deciding the tradeoffs based on their impact and consequences (this mandate comes in handy in the heat of the battle, for example when dealing with an incident)
- Decide how best to react to customer reviews
- Have the power to approve or disapprove initiatives, including hypotheses and tests (e.g. A/B tests)
Responsibility
- Be responsible for instrumenting the observability tooling and keeping it up to date and functioning
- Be responsible for on-call duty as well as addressing incidents
- Be responsible when the error budget is burned and take proper action (e.g. blocking deploys until more budget is available)
- Have access to the right dashboards to diagnose issues, and be able to update them as needed
- Have access to the runbooks and any automation in place for troubleshooting, and be able to update them as needed
- Be the point of contact for supporting the service
- Be held accountable for customer reviews
- Be responsible for reacting to the data that comes out of testing hypotheses (e.g. usability tests, user research, A/B tests)
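To make the error budget item concrete, here is a minimal sketch of how a team could compute how much budget is left and decide whether to block deploys. It assumes a request-based SLO measured over a single window. The names `ErrorBudget` and `deploys_allowed`, the 99.9% target, and the "block once fully burned" policy are illustrative assumptions, not something prescribed by this article.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for a request-based SLO over one measurement window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that violated the SLO in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means an untouched budget; 0.0 or below means it is burned.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / self.allowed_failures


def deploys_allowed(budget: ErrorBudget) -> bool:
    """Hypothetical policy: block deploys once the budget is fully burned."""
    return budget.remaining_fraction > 0.0


if __name__ == "__main__":
    # Assumed numbers: a 99.9% SLO and one million requests in the window,
    # which allows roughly 1,000 failed requests.
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    burned = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
    for budget in (healthy, burned):
        print(f"remaining={budget.remaining_fraction:.2f} deploy={deploys_allowed(budget)}")
```

The team that holds the mandate to set the SLO target is also the one that has to live with a blocked deploy pipeline, which is the kind of coupling between mandate and responsibility the lists above describe.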
Broken ownership trio
Now let’s look at the scenarios where the ownership trio is broken. Note that each scenario focuses on one side of the story, which implies that the other sides are suffering too. For example, if the “mandate” sits in one team, the “knowledge” and “responsibility” may sit in one or two other teams. Each of these scenarios is discussed below:
Only mandate
This is a monkey with a gun scenario, where someone has the mandate to make decisions without having to understand the system or being held responsible for the consequences. Top-down hierarchical organizational cultures are susceptible to this type of broken ownership. It hurts the productivity of the teams who have knowledge of the system and/or are responsible when things break.
Only knowledge
This is a coma scenario, where the knowledge doesn’t connect to any control or responsibility.
Ideally, demonstrating knowledge leads to earning trust and accepting responsibilities, which may grow into having the mandate. But there is a catch-22: without any mandate, it’s hard to demonstrate the knowledge.
Only responsibility
This is a baby parent scenario. Being responsible for something without a good understanding of it or any control over it leads to a passive attitude at best and to burnout at worst. For example, a less experienced product team may hire consultants or an external team to ship their code and cover their on-call needs. No matter how smart the responsible party is, without a good grasp of the product and the mandate to evolve it, their efforts are a temporary hack at best and futile at worst.
Mandate + Knowledge but no Responsibility
This is a typical teenager scenario! Here, the ones with the knowledge can make decisions without being held accountable for the consequences. The lack of responsibility allows the team to make riskier decisions, which can hurt the business.
At best it leads to a culture of finger pointing; at worst the business collapses because it cannot fulfill its obligations to its customers.
Over time it also hurts the knowledge aspect, because the team fails to learn from the failures, which are only exposed to the party that carries the responsibility (see “Only responsibility” above).
Mandate + Responsibility but no Knowledge
This is a gambler scenario, where those with the mandate and the responsibility don’t have enough knowledge to make the right decisions or to understand their responsibility. Unfortunately, this is more common for operations or infrastructure teams. It hurts productivity by slowing down and frustrating the team that has the knowledge (see “Only knowledge” above).
Knowledge + Responsibility but no Mandate
This is a foot soldier scenario, where the ones with the knowledge have the responsibility but no mandate. They are not able to act on their knowledge or react to what they learn from being responsible. This is hard to change, because those who hold the mandate enjoy their power without having to pay for the consequences and may develop an illusion of having enough knowledge (see “Only mandate” above).
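The six broken scenarios above can be read as a lookup from the set of aspects a team actually holds to a label. The sketch below is only an illustration of that taxonomy. The `Aspect` enum and the `diagnose` function are hypothetical names, and the "full ownership" and "no ownership" labels are assumptions added for completeness. The other six labels come from the scenarios described above.

```python
from enum import Enum
from typing import Iterable


class Aspect(Enum):
    KNOWLEDGE = "knowledge"
    MANDATE = "mandate"
    RESPONSIBILITY = "responsibility"


K, M, R = Aspect.KNOWLEDGE, Aspect.MANDATE, Aspect.RESPONSIBILITY

# Each key is the exact set of aspects a team holds.
SCENARIOS = {
    frozenset({M}): "monkey with a gun",
    frozenset({K}): "coma",
    frozenset({R}): "baby parent",
    frozenset({M, K}): "teenager",
    frozenset({M, R}): "gambler",
    frozenset({K, R}): "foot soldier",
    frozenset({K, M, R}): "full ownership",  # assumed label for the intact trio
}


def diagnose(held: Iterable[Aspect]) -> str:
    """Return the scenario label for the aspects a team holds."""
    # An empty set is not covered by the scenarios; the fallback is an assumption.
    return SCENARIOS.get(frozenset(held), "no ownership")


if __name__ == "__main__":
    print(diagnose({M, K}))     # teenager
    print(diagnose({K, R}))     # foot soldier
    print(diagnose({K, M, R}))  # full ownership
```

Running a quick check like this per team and per service is one way to spot which aspect is missing before it shows up as finger pointing or burnout.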
To sum it up, knowledge enables an informed mandate and the mandate should be followed with responsibility. The responsibility in turn leads to learnings which improve the knowledge.
It’s a loop! The ownership trio can be summed up as “You build it, you own it”.
[Figure: the ownership trio, with all the funny labels]