Because it worked somewhere else doesn’t mean it’s right for you
Recently, I worked for a company that had organised itself around the Spotify model of squads and tribes. The reasoning was that this structure had made Spotify successful, therefore it had to be good. It turned out that even Spotify couldn’t make their model work. It led to silos which prioritised local autonomy over collaboration, which is crippling to a company past a particular size. As someone who spent a fair chunk of time at Google, I’ve also seen Google’s name taken in vain. “If Google are doing it then it must be good” is a common refrain in our industry. The same goes for Amazon, Apple or Facebook.
Google has SREs to run their production services? Great, let’s hire a bunch of SREs. Great, what do SREs do? No idea. Let’s hire some people who run production services and all our stuff will work magically. Turns out this isn’t a great idea. SRE at Google has a very specific mandate to solve a very specific set of problems. The company structure is set up to provide them with a high degree of autonomy over what they are prepared to support. In principle, having a dedicated group of people to run your services might be a good thing.
But, wait, Amazon say that “you build it you run it”. Now we’re saying that teams should be on call for their own services because they need to own their production quality. How can both of these things be true? At what point do you need a dedicated team?
Turns out this isn’t obvious, even if you’ve come from one or both environments. People from Google tend to prefer the SRE model because it feels familiar. People from Amazon tend to prefer “you build it you run it”. The reality is that neither company works purely that way. At Google, most engineering teams are on-call for their own service or product. SRE is the group of people who run the things which really matter: the core infrastructure and critical services. Amazon also has core infrastructure which, guess what, requires dedicated teams.
If you were to distil the key lessons from both, they would be that:
- Teams need to be accountable for how their stuff runs in production
- There are things so critical you need dedicated people to run them
- Running systems at planetary scale takes skills beyond those of a typical software engineering team, and you need those skills to underpin your systems
It doesn’t end there. Google builds everything from HEAD; there are almost no branches there. Google uses Kubernetes. I could go on and on, and list the same kinds of things for how other companies work.
The reality is that a soundbite doesn’t give you enough information to make an informed choice unless you understand the problems it is tackling and what the nuances of the problem space are. A given technology which works for a large company may be wholly inappropriate for you (as companies trying to run Kubernetes at scale tend to find out the hard way) or limited in ways you hadn’t anticipated. Spoiler alert: Google’s container infrastructure works so spectacularly because the service mesh is damn near magical, and service meshes in the outside world are catching up to that. Also, the reason Kubernetes (or Borg) came about was that the company realised a large chunk of its hundreds of thousands of machines was idle at any point in time. Without virtualisation Google becomes prohibitively expensive to operate.
Fred Brooks’ essay “No Silver Bullet” is as true today as it was when it was published in 1986. In the end, we should solve problems and understand how the solution we’ve chosen maps to those problems. We should accept that each new solution brings its own new problems beyond our experience, and account for this known unknown in our adoption plan. Most of all, let’s stop blindly aping big tech, because it doesn’t always help us. If we can’t say why we’re doing something then we buy into a whole world of trouble. I say this as someone who has started way too many sentences with “At Google we used to xyz”.