A tale of evolution

When software is successful, it usually grows in size and complexity, but the brains of the humans behind it don't!

Adding more humans may slightly improve the situation, but it has its limits. A better solution is to divide the software into logical pieces small enough for a few humans to cope with, and to conquer the complexity by integrating all those pieces together.

This division is usually guided by two main tenets:

  • Separation of concerns: group relevant pieces together
  • Loose coupling: minimize the interaction surface between those parts to allow internal change with minimal external consequences

There is a lot written about how to structure big software. Let’s switch gears and focus on one particular type of software that I have been dealing with.

Web as a runtime

A little over 10 years ago I switched from desktop application development to the web. Those were the dinosaur days of the web, when jQuery was a hot skill, CSS was written “by hand”, tests were manual and most user interactions would reload the page!

Admittedly I switched a bit late, but a revolution had already started: Ajax was proving the browser to be a viable runtime environment for ever more complex applications.

The potential was huge: cross-platform, installation-free applications with a flexible GUI, accessible from anywhere!

With UX becoming a focus of the industry and the advent of V8, JavaScript gained ground as a popular option for implementing complex Single Page Applications (SPAs), relieving the backend of part of its traditional workload and even storage. APIs became dominant, NoSQL and serverless became popular, and a new market emerged for the likes of Firebase, Amplify and Backendless.

Electron, Cordova, Node.js, Lambda and Node-RED brought web technology to desktop, mobile, backend, serverless and IoT applications respectively.

“Any application that can be written in JavaScript, will eventually be written in JavaScript.” — Jeff Atwood

Our massive app

Fast forward to a little over 3 years ago, when I joined a multinational media company. The product was a white label media site framework used by 25+ of the company's brands to build their user-facing sites.

Over the years, the frontend and backend each grew to 180+ KSLOC (thousand source lines of code, ignoring comments and empty lines), maintained by a community of 150+ developers. Thousands of unit, integration, system, audit and accessibility tests ensure the quality of multiple daily deploys to production. Some of these sites rank among the top 10 most popular in the country they operate in, serving millions of requests per day.

This article is about how this massive machinery came to be, why it is structured the way it is, and how its complexity is tamed.

The domain

We didn't get there overnight. It all started with one site a little over 5 years ago. The company had a CMS (content management system) to create and manage content (in this case, news articles). The CMS was headless, meaning it was merely concerned with populating the content database and exposing it via an API. The task of actually rendering this content for the end user was left to the clients (remember separation of concerns?). Ghost is another popular headless CMS.

One problem, multiple solutions

Since all the resources were focused on building the CMS, the client rendering was left to the brands to figure out. In a culture of autonomy and tight deadlines, each brand came up with its own solution for building a site on top of the content API. These solutions were drastically different: some used PHP, some used React, while others used ESI to build what was internally code-named Farticles (for Fast Articles).

Over time, this autonomy turned out to be counterproductive and directly at odds with the agility that the company so desperately needed to innovate and compete for user attention:

  • Duplication of work: if one brand had a feature that another one needed, the other brand had to reimplement it from scratch due to incompatible underlying technologies.
  • Resource sharing: due to technology fragmentation, it was hard to share engineers and experience across brands.
  • High cost: every brand had their own way of fixing bugs, building, testing and deploying the product which meant the company ended up paying for multiple solutions to the same problem.

Take 1: The SDK

The first stab at the unification problem was to share the implementation of some key features that every site needed, like the video player and the image gallery.

The stakes were high. Some of the most experienced and skilled employees and consultants were put into the new Core team to solve that problem, and they came up with an SDK: a monorepo of packages that could be adopted by any site to solve a specific problem. It was very much aligned with industry best practices at the time. Think Material UI, but focused on the company's use cases.

Advantages of the SDK

  • Code sharing: an SDK is a feasible way to share code with multiple consumers.
  • Abstraction: building an SDK is a good exercise in loose coupling because it forces thinking about abstraction layers and separation of concerns.

For a while things looked promising until they didn’t.

Disadvantages of the SDK

  • Ownership: The Core team was solely responsible for the SDK. This meant that, for the most part, every feature request or bug fix had to go through them. The brand developers had to negotiate every use case and fight for prioritization in an ever-growing backlog. The Core team became a bottleneck.
  • Mandate: As long as adopting the SDK was optional, the brands didn't see a need to prioritize the migration. The Core team didn't have any mandate to enforce adoption either.
  • Innovation: if the backend introduced feature A that was incompatible with SDK version 1.1, but supported in SDK version 2.0, they could not roll it out until every site had upgraded their SDK to version 2.0. This hindered the pace of innovation.
  • Incomplete SDK QA: creating an SDK is hard, especially because the code is supposed to run in an unknown context. An SDK is inherently impossible to test without a context, so it's often tested against a mock context, which doesn't really guarantee quality in its final, real habitat.
  • Time-consuming site QA: whenever a new version of the SDK was released, the sites had to go through the daunting task of ensuring that everything worked as expected, even if the resulting PR was just a version bump! Due to the loosely coupled nature of the SDK, integration testing was the first chance for proper quality assurance.
  • Feasibility: Some sites were technologically incompatible with React, which the SDK used. As far as they were concerned, the problems that this initiative set out to solve were untouched. The alternative was a massive refactoring, which was not favoured by the developers who had created those solutions in the first place.
  • Cost: As the sites were allowed to keep their fragmented technologies, the company continued paying for all those fragmented implementations, plus the Core team on top of that! All while facing fierce competition from Google and Facebook.
  • Complexity: There's a whole lot of ceremony involved in packaging, publishing and consuming a package (or in this case, tens of packages): more build & deploy pipelines, a private NPM registry, tooling for automating updates (think Dependabot) and user documentation (because the code is “hidden” or hard to see).
  • Heavier artifacts: packaged as independently consumable code, the individual deliverables could not make any hard assumptions about the context they were used in. Therefore they bundled all of their dependencies. If two packages used lodash, it either had to be a peer dependency (which meant extra steps for consumers) or, in the case of TypeScript or Webpack, duplicated boilerplate that the transpilers inject into the output. This wasn't a major issue, but it was inefficient at best. Size still matters in the frontend world.
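To illustrate the peer dependency option mentioned above, a hypothetical package manifest (the package name and version ranges are made up) could declare lodash and react as peer dependencies instead of bundling them:

```json
{
  "name": "@company/image-gallery",
  "version": "1.4.0",
  "peerDependencies": {
    "lodash": "^4.17.0",
    "react": "^16.8.0"
  }
}
```

The consumer then has to install compatible versions themselves; those are exactly the extra steps that made peer dependencies a chore.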

We haven't even touched the whole issue of styling and theming components abstracted away from their context; that needs a separate article of its own. To be fair, the Core team did their absolute best to fight these issues, and they used the best tooling for the job (Lerna, Storybook, Semantic Release, etc.). But all those tools could not overcome a simple fact:

The SDK was not the best abstraction for the job!

The bottom line didn't justify the effort, but going back to the fragmented mayhem wasn't reasonable either. After all, these were just a bunch of media sites serving exactly the same purpose; there was no logical reason to have multiple implementations of the same solution.

Take 2: The Platform

The Core team was not directly responsible for any brand, but indirectly it was accountable for all of them. The company decided that instead of building a library of optional components, the Core team should pivot to building a platform for an actual media site serving actual users.

The Core team got to work. This was when I joined the team. To cut the delivery time, we recycled as much code from the SDK as possible and developed the rest of the platform. To ensure realistic results, the company decided to migrate one of its most successful brands to this platform! The stakes were high. The team delivered. Sadly, the original team behind that brand gradually left because they:

  • felt forced onto a new tech stack
  • lost the autonomy they were used to.

The company was determined to unify the tech stacks and this was just the beginning.

The platform’s frontend was entirely in one repo. We called it a shared repo because its ownership was shared between multiple teams.

At a high level it looked like this:

  • src/core: the platform and default components
  • src/sites/BRAND_NAME: site configs and any site-specific components. A site could also override the core components in its own folder
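To make the override mechanism concrete, here is a minimal sketch of how such a lookup can work. The brand and component names and the resolveComponent helper are hypothetical; the platform's actual wiring was more involved.

```typescript
// Hypothetical registry of which components each site overrides.
// Brand and component names are made up for illustration.
const siteOverrides: Record<string, Set<string>> = {
  brandA: new Set(["Paywall", "VideoPlayer"]),
  brandB: new Set(["Paywall"]),
};

// Resolve a component to a site-specific override when one exists,
// falling back to the shared implementation in src/core otherwise.
function resolveComponent(site: string, name: string): string {
  return siteOverrides[site]?.has(name)
    ? `src/sites/${site}/${name}`
    : `src/core/${name}`;
}

console.log(resolveComponent("brandA", "VideoPlayer")); // site-specific path
console.log(resolveComponent("brandB", "VideoPlayer")); // falls back to core
```

The flexibility comes from the fallback: every site gets the shared behaviour for free and pays only for what it customizes.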

The birth of a community

Fortunately this new model started to pay off. As we migrated brand after brand to the white label platform, we worked more closely together and turned into a community. It wasn’t uncommon for engineers from different brands to help each other or collaborate in architecting and implementing new features. The Core team’s role was reduced to site reliability engineering (SRE) and participating in architectural decisions to ensure that the solution fits into the grand scheme of things.

But isn’t it just a monorepo?

It's important to distinguish between a shared repo and a monorepo: a monorepo is one repository that produces multiple different artifacts (like the aforementioned SDK). The shared repo had one artifact: the white label news site, configured per brand to have a unique look and activate a slightly different feature set. A key aspect of a shared repo is its shared ownership. We'll get to that in a bit.

But isn’t it just a monolith?

I'm going to be careful about using the word monolith, because that word has gotten a bad rap, especially in the backend world since microservices became all the rage. One of the core benefits of microservices is distributing the load across multiple machines.

For SPAs, for all practical purposes, all the code runs in the context of one browser window. So a frontend monolith is not such a crazy idea.

But yes, if calling it a monolith simplifies the world, the shared repo is a monolith with a complex ownership model. It is a good pattern for organizing white label platforms, for example. We will dig deeper into how we reduced the downsides of a monolith, like complexity, bloat and tight coupling.

Why not micro frontends?

Micro frontends allow multiple teams to own loosely coupled functionalities of the frontend and deploy them at different paces. Moreover, they allow full stack teams, where the frontend functionality can be tightly coupled with the backend while the team takes full ownership of features from end to end. There's nothing inherently bad about that, but one needs to be aware of Conway's law:

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.

This improves on the SDK versioning issue mentioned earlier. But, being inspired by microservices, micro frontends naturally bring some of the same issues into the frontend world as well:

  • Integration: the sum of all these micro frontends integrates in the browser. This is very similar to the issue the SDK had: the code is not developed in its final context. In fact, micro frontends are even worse, because they actually integrate in the end user's browser, while SDK users deal with that headache at development time.
  • Overhead: each micro frontend usually runs in its own browser context (e.g. an iframe or custom element), isolated from the others. Therefore each comes bundled with its own dependencies. There's also a communication overhead between those parts that is more expensive than a simple function call. The end user's browser downloads and executes more code, which hurts the data usage and battery life of mobile users the most. It's not the most efficient architecture, but it has its place.
  • Fragmentation: one of the possibilities that micro frontends unlock is allowing each part to use its own tech stack. While this level of autonomy can potentially lead to using the right tool for the job, it can also lower the bar for divergence, ending up with a fragmented tech stack that hurts resource sharing. Guaranteeing a consistent UX demands rigorous work to ensure that, when those parts use different technologies, that implementation detail is hidden from the end user.

There's more, but the micro frontend architecture can be an excellent option for web apps that are inherently composed of loosely coupled parts, where the UX does not require seamless integration between those parts.

Advantages of the shared repo

  • Quality: Testing frontend SDKs in isolation is much harder than testing everything together in a real setup, as the end user interacts with it. You're continuously in integration mode. Also, there are simply more eyes on the same code base: “given enough eyeballs, all bugs are shallow” (Linus's law).
  • Consistency: There's a fair bit of boilerplate that is duplicated (and needs to be kept in sync) across individual repositories, for example: linting rules, .editorconfig, documentation, debugging profiles, test framework setup, and build & deploy pipelines. A shared repo needs only one copy of each.
  • High cohesion: many features or bug fixes touch code that would otherwise be scattered across different repos and suffer from asynchronous deployments. By contrast, in a shared repo these changes come in one cohesive PR, which makes the code much easier to reason about.
  • Effectiveness: when working on an SDK, it is easy to lose sight of the big picture, because by definition the focus is inside the abstraction. With a foggy holistic picture, developers are not as effective, because the consequences of their choices are not immediately tangible. With a shared repo, on the other hand, they are constantly in integration mode, where the consequences surface just as the final product is experienced by end users.

Disadvantages of the shared repo

  • Time: since there is more code, everything takes longer: linting, building, testing, etc.
  • Review fatigue: The commit rate is high, so a clear ownership model is needed, along with tooling to support it (see how the Chromium team uses code owners).
  • Gatekeeping: People become specialized in parts of the code, which is not a bad thing on its own, but for quality work they may need to collaborate tightly. A single architect or tech lead may have difficulty staying on top of the latest state of the repo, but that's the model I recommend for avoiding anarchy.
  • Code sharing: If some code has external consumers, developers may end up copying it, because the SDK workflows and versioning are gone at that point. The alternative is deploying it separately as a package, which turns the repo into a monorepo (with multiple deployables).
  • Overrides: the structure allowed any site to override the shared functionality as it saw fit. Despite the great flexibility, this created complexity: reading the code, one could not easily deduce what code was going to execute, because the config in production could change that. It was hard to reason about.

Tip: our web platform used JavaScript, but that language is very hard to use in massive repos. A type system can dramatically help to spot errors at development time instead of at runtime. Therefore we set out to refactor the code to TypeScript. If you want to know more about when TypeScript can be the right tool, I've written about it in another post.

Take 2.1: Ownership

Despite promoting collaboration, the shared repo introduced a key ownership problem, perhaps a remnant of the way the teams traditionally used to work.

The ownership of the src/core folder was unclear:

  • All brands used the common code that lived there, but the folder was named after the Core team. So they assumed that any change in it should be checked with the Core team, when in reality, if they broke something, they would most probably break another brand.
  • The Core team acknowledged the importance of that folder, but as the brand developers did most of the work for their respective sites and evolved that code, the Core team lost touch with the actual brand problems, and it was hard to contribute in meaningful ways when asked.
  • As this folder grew, it got harder for the Core team to keep tabs on its latest evolution. We felt that we could not own it properly and at the same time spend time on our SRE and DevOps responsibilities.

Fortunately, we were not the first team to face this issue, and inspiration came from how people handle large repositories for other products. In this case, it came from the Chrome team, specifically their OWNERS files. We already used GitHub code owners, but our setup basically reflected the problem: the owner and reviewer of the entire src/core was, guess who? The Core team!

So we wrote a script to check the contributors to every single subdirectory of the repo, to identify who was most suitable to review PRs for that part of the code. Binding an individual to a directory would not be optimal because:

  • People change, and the code might end up without an owner
  • Functionality may be spread across directories owned by different people, which makes PR review harder
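The ranking logic of such a script can be sketched as follows, assuming it is fed the output of `git shortlog -sn -- <dir>` for each subdirectory. The topContributors helper is hypothetical; the real script did more than this.

```typescript
interface Contributor {
  name: string;
  commits: number;
}

// Parse the output of `git shortlog -sn -- <dir>` (lines of the form
// "   42\tAlice Smith") and return the top contributors by commit count.
function topContributors(shortlog: string, limit = 3): Contributor[] {
  return shortlog
    .split("\n")
    .map((line) => line.trim().match(/^(\d+)\s+(.+)$/))
    .filter((m): m is RegExpMatchArray => m !== null)
    .map((m) => ({ name: m[2], commits: Number(m[1]) }))
    .sort((a, b) => b.commits - a.commits)
    .slice(0, limit);
}
```

Running this per directory suggested review candidates; the final team assignment was still a human decision.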

So we manually identified all the key features of the product, created sub-teams named after those features, and assigned the relevant people to them. For example, one team would review everything that touched the Paywall functionality, and another would review everything that touched the video player. A key tenet was to create teams for features instead of mapping teams from the organizational chart. The goal was to put the most relevant people across brands in charge of the code they cared about and understood well enough to review properly.
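Such a feature-to-team mapping can be encoded in GitHub's CODEOWNERS file so that reviews are requested automatically. A hypothetical fragment (the directory and team names below are made up for illustration):

```
# Feature directories are owned by cross-brand feature teams,
# not by the Core team (names are illustrative only).
src/core/paywall/       @company/paywall-team
src/core/video-player/  @company/video-player-team

# Each brand folder is owned by that brand's developers.
src/sites/brand-a/      @company/brand-a-devs
```

With this in place, a PR touching the paywall automatically requests a review from the people who know that feature best, regardless of which brand they work for.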

Conclusion

In this article, we shared our experience using a shared repo to build platform products and compared it with the relevant alternatives.

Let me know if you have any comments or questions.

Knowledge Worker, MSc Systems Engineering, Tech Lead, Web Developer