Splitting engineering teams into defense and offense

103 points, posted 14 hours ago
by dakshgupta

70 Comments

solatic

4 hours ago

This pattern has a smell. If you're shipping continuously, then your on-call engineer is going to be fixing the issues the other engineers are shipping, instead of those engineers following up on their own deployments and fixing the issues those changes caused. If you're not shipping continuously, then customer issues can't be fixed continuously anyway, and your list of bugs can be prioritized by management alongside the rest of the work to be done. The author quotes maker vs. manager schedules, but one of the conclusions of following that is that engineers don't talk directly to customers, because “talking to customers” is another kind of meeting, a “manager schedule” kind of thing rather than a “maker schedule” kind of thing.

There's simply no substitute for Kanban processes and for proactive communication from engineers. In a small team without dedicated customer support, a manager takes the customer call, decides whether it's legitimately a bug, creates a ticket to track it and prioritizes it in the Kanban queue. An engineer takes the ticket, fixes it, ships it, communicates that they shipped something to the rest of their team, is responsible for monitoring it in production afterwards, and only takes a new ticket from the queue when they're satisfied that the change is working. But the proactive communication is key: other engineers on the team are also shipping, and everyone needs to understand what production looks like. Management is responsible for balancing support and feature tasks by balancing the priority of tasks in the Kanban queue.

dakshgupta

4 hours ago

This is a real shortcoming: the engineers who ship feature X will not be responsible for the immediate aftermath. So far we haven’t seen this hurt in practice, probably because we are very small and in-person, but you might be correct, and it would then likely be the first thing that breaks about this as our team grows.

safety1st

2 hours ago

I commented a while back on another post about a company I worked at which actually made developers spend a few days a year taking tech support calls. This takes their responsibility for, and awareness of, the aftermath of their work to a whole new level, and from my perspective it was very effective. Could be an alternate route to address the same problem.

thih9

3 hours ago

> on-call engineer is going to be fixing the issues the other engineers are shipping, instead of those engineers following up on their deployments and fixing issues caused by those changes

Solution: don’t. If a bug was introduced by the long-running work currently in progress, forward it back. That’s not distracting; it’s very much on topic.

And if a bug is discovered after the cycle ends, the teams swap anyway, so the person who introduced the issue can still work on the fix.

dakiol

13 hours ago

I once worked for a company that required each engineer on the team to do what they called “firefighting” during working hours (so not exactly on-call). So for one week, I was triaging bug tickets and trying to resolve them. These bugs belonged to the area my team was part of, so they affected the same product but a vast number of microservices, most of which I didn’t know much about (besides how to use their APIs). It didn’t make much sense to me. So you have Joe punching out code like there’s no tomorrow and introducing bugs because features must go live asap, and then I’m the one fixing them. So unproductive. I always advocated for a slower pace of feature delivery (so more testing and fewer bugs in production), but everyone was like “are you from the 80s or something? We gotta move fast man!”

onion2k

4 hours ago

This sort of thing is introduced when the number of bugs in production, especially bugs that aren't user-facing or a danger to data (eg 'just' an unhandled exception or a weird log entry), gets to a peak and someone decides it's important enough to actually do something about it. Those things are always such a low priority that they're rarely dealt with any other way.

In my experience whenever that happens someone always finds an "oh @#$&" case where a bug is actually far more serious than everyone thought.

It is an approach that's less productive than slowing down and delivering quality, but it's also completely inevitable once a team/company grows to a sufficient size.

resonious

2 hours ago

I honestly don't really like the "let's slow down" approach. It's hard for me to buy into the idea that simply slowing down will increase product quality. But I think your comment already contains the key to better quality: close the feedback loop so that engineers are responsible for their own bugs. If I have the option of throwing crap over the wall, I will gravitate towards it. If I have to face all of the consequences of my code, I might behave otherwise.

isametry

25 minutes ago

Slow is smooth, smooth is fast?

dakshgupta

13 hours ago

This is interesting because it’s what I imagine would happen if we scaled this system to a larger team - offense engineers would get sloppy, defensive engineers would get overwhelmed, even with the rotation cycles.

Small, in-person, high-trust teams have the advantage of not falling into bad offense habits.

Additionally, a slower shipping pace simply isn’t an option, seeing as the only advantage we have over our giant competitors is speed.

jedberg

12 hours ago

> offense engineers would get sloppy

Wouldn't they be incentivized to maintain discipline because they will be the defensive engineers next week when their own code breaks?

dakshgupta

12 hours ago

I suspect that as the company gets larger the time between defensive sprints will get longer, but yes, for smaller teams this is what keeps quality high: you’ll have to clean up your own mess next week.

DJBunnies

13 hours ago

I think we’ve worked for the same org

smugglerFlynn

2 hours ago

Constantly working in what OP describes as defence might also negatively affect the perception of cause and effect of one’s own actions:

   Specifically, we show that individuals following clock-time [where tasks are organized based on a clock**] rather than event-time [where tasks are organized based on their order of completion] discriminate less between causally related and causally unrelated events, which in turn increases their belief that the world is controlled by chance or fate. In contrast, individuals following event-time (vs. clock-time) appear to believe that things happen more as a result of their own actions.[0]
** - in my experience, clock-based organisation seems to be very characteristic of what OP describes as defensive, when you become driven by incoming priorities and meetings

Broader article about impact of schedules at [1] is also highly relevant and worth the read.

   [0] - https://psycnet.apa.org/record/2014-44347-001    
   [1] - https://hbr.org/2021/06/my-fixation-on-time-management-almost-broke-me

Towaway69

2 hours ago

What's wrong with collaboratively working together? Why is there a need to create an artificial competition between an "offence" team and a "defence" team?

And why would team members suddenly become collaborative within their own team? E.g. why would the "offence" team members start helping each other if that isn't happening generally?

This sounds a lot like JDD - Jock Driven Development.

Perhaps the underlying problems of "don't touch it because we don't understand it" should be solved before engaging in fake competition to increase the stress levels.

megunderstood

2 hours ago

Sounds like you didn't read the article.

The idea has nothing to do with creating artificial competition and it is actually designed as a form of collaboration.

Some work requires concentration and the defensive team is there to maintain the conditions for this concentration, i.e. prevent the offensive team from getting interrupted.

Towaway69

2 hours ago

Ok, that might well be the case! Many apologies for my mistaken assumptions.

Then perhaps the terminology - for me - has a different meaning.

ndndjdjdn

an hour ago

This is probably devops. A single team taking full responsibility and swapping oncall-type shifts. These guys know their dogfood.

You want the defensive team to work on automating away stuff that pays for itself in the 1-4 week timeframe. If they get any slack to do so!

fryz

14 hours ago

Neat article - I know the author mentioned this in the post, but I only see this working as long as a few assumptions hold:

* avg tenure / skill level of team is relatively uniform

* team is small with high-touch comms (eg: same/near timezone)

* most importantly - everyone feels accountable and has agency for work others do (eg: codebase is small, relatively simple, etc)

Where I would expect this to fall apart is when these assumptions drift and holding people accountable becomes harder. When folks start to specialize, something becomes complex, or work quality is sacrificed for short-term deliverables, the folks who feel the pain are the defense folks, and they don't have the agency to drive the improvements.

The incentives for folks on defense are completely different than folks on offense, which can make conversations about what to prioritize difficult in the long term.

dakshgupta

13 hours ago

These assumptions are most likely important, and they hold in our case: we work out of the same room (in fact we all live together) and 3 of the 4 of us are equally skilled (I am not as technical).

october8140

2 hours ago

My first job had a huge QA team. It was my job to work quickly and it was their job to find the issues. This actually set me up really poorly, because I got in the habit of not doing proper QA; there were at least 10 people doing it for me. When I left, it took a while for me to learn what properly QAing my own work looked like.

eschneider

14 hours ago

If the event-driven 'fixing problems' part of development gets separated from the long-term 'feature development', you're building a disaster for yourself. Nothing more soul-sucking than fixing other people's bugs while they happily go along and make more of them.

dakshgupta

13 hours ago

There is certainly some razor applied to whether a request is unique to one user or is widely requested/likely to improve the experience for many users.

stopachka

12 hours ago

> While this is flattering, the truth is that our product is covered in warts, and our “lean” team is more a product of our inability to identify and hire great engineers, rather than an insistence on superhuman efficiency.

> The result is that our product breaks more often than we’d like. The core functionality may remain largely intact but the periphery is often buggy, something we expect will improve only as our engineering headcount catches up to our product scope.

I really resonate with this problem. It was fun to read. We've tried different methods to balance customers and long-term projects too.

Some more ideas that can be useful:

* Make quality projects an explicit monthly goal.

For example, when we noticed the edges of our surface area got too buggy, we started a 'Make X great' goal for the month. This way you don't only react to users reporting bugs, but can be proactive.

* Reduce Scope

Sometimes it can help to reduce scope; for example, before adding a new 'nice to have' feature, focus on making the core experience really great. We also considered pausing larger enterprise contracts, mainly because they would take away from the core experience.

---

All this to say, I like your approach; I would also consider a few others (make quality projects a goal, and cut scope)

d4nt

3 hours ago

I think they’re on to something, but the solution needs more work. Sometimes it’s not just individual engineers who are playing defence, it’s whole departments or whole companies that are set up around “don’t change anything, you might break it”. Then the company creates special “labs” teams to innovate.

To borrow a football term, sometimes company structure seems like it’s playing the “long ball” game. Everyone sitting back in defence, then the occasional hail mary long pass up to the opposite end. I would love to see a more well developed understanding within companies that certain teams, and the processes that they have are defensive, others are attacking, and others are “mid field”, i.e. they’re responsible for developing the foundations on which an attacking team can operate (e.g. longer term refactors, API design, filling in gaps in features that were built to a deadline). To win a game you need a good proportion of defence, mid field and attack, and a good interface between those three groups.

jedberg

14 hours ago

> this is also a very specific and usually ephemeral situation - a small team running a disproportionately fast growing product in a hyper-competitive and fast-evolving space.

This is basically how we ran things for the reliability team at Netflix. One person was on call for a week at a time. They had to deal with tickets and issues. Everyone else was on backup and only called for a big issue.

The week after you were on call was spent following up on incidents and remediation. But the remaining weeks were for deep work, building new reliability tools.

The tools that allowed us to be resilient enough that being on call for one week straight didn't kill you. :)

dakshgupta

13 hours ago

I am surprised and impressed that a company at that scale functions like this. We often discuss internally whether we can still do this when we're 7-8 engineers.

jedberg

12 hours ago

I think you're looking at it backwards. We were only able to do it because we had so many engineers that we had time to write tools to make the system reliable enough.

On call for a week at a time only really works if you only get paged at night once a week max. If you get paged every night, you will die from sleep deprivation.

dmoy

5 hours ago

Moving from 24/7 oncall to 12 hour shifts trading off with another continent is really nice

cgearhart

13 hours ago

This is often harder at large companies because you very rarely make career progress playing defense, so it becomes very tricky to do it fairly. It can work wonders if you have the right teammates, but it’s almost a prisoners dilemma game that falls apart as soon as one person opts out.

dakshgupta

13 hours ago

Good point. We will usually only rotate when the long-running task is done, but eventually we'll arrive at some feature that takes more than a few weeks to build, so we will need to restructure our methods then.

shalmanese

13 hours ago

To the people pooh poohing this, do y’all really work with such terrible coworkers that you can’t imagine an effective version of this?

You need trust in your team to make this work but you also need trust in your team to make any high velocity system work. Personally, I find the ideas here extremely compelling and optimizing for distraction minimization sounds like a really interesting framework to view engineering from.

johnnyanmac

an hour ago

We work with terrible management that can't imagine an effective version of this.

jph

13 hours ago

Small teams shouldn't split like this IMHO. It's better/smarter/faster IMHO to do "all hands on deck" to get things done.

For prioritization, use a triage queue because it aims the whole team at the most valuable work. This needs to be the mission-critical MVP & PMF work, rather than what the article describes as "event driven" customer requests i.e. interruptions.

dakshgupta

13 hours ago

A triage queue makes a lot of sense, only downside being the challenge of getting a lot done without interruption.

bvirb

13 hours ago

In a similar boat (small team, have to balance new stuff, maintenance, customer requests, bugs, etc).

We ended up with a system where we break work up into things that take about a day. If someone thinks something is going to take a long time then we try to break it down until some part of it can be done in about a day. So we kinda side-step the problem of having people able to focus on something for weeks by not letting anything take weeks. The same person will probably end up working on the smaller tasks, but they can more easily jump between things as priorities change, and pretty often after doing a few of the smaller tasks either more of us can jump in or we realize we don't actually need to do the rest of it.

It also helps keep PRs reasonably sized (if you do PRs).

philipwhiuk

10 hours ago

We have a person who is 'Batman' to triage production issues. Generally they'll pick up smaller sprint tasks. It rotates every week. It's still stuff from the team so they aren't doing stuff unknown (or if they are, it's likely they'll work on it soon).

The aim is generally not to provide a perfect fix but an MVP fix and raise tickets in the queue for regular planning.

It rotates round every week or so.

My company's not very devops so it's not on-call, but it's 'point of contact'.

jwrallie

11 hours ago

I think interruptions damage productivity overall, not only for engineers. Maybe some people are unaware of it, and others simply don't care. They don't want to sacrifice their own productivity by waiting on someone busy, so they interrupt, and after getting the information they want, they feel good. From their perspective, productivity increased, not decreased.

Some engineers are more likely to avoid interrupting others because they can sympathize.

svilen_dobrev

12 hours ago

IMO the split, although good (the pattern is "sacrifice one person", as per Coplien/Harrison's Organisational Patterns book [0]), is too drastic. It shouldn't be defense vs offense 100% with a wall in between; rather, for each and every issue (defense) and/or feature (offense), someone has to pick it up and become responsible for it (which may or may not mean doing it entirely themselves). Fixing a bug for an hour or two has sometimes been exactly the break I needed in order to keep digging at some big feature when I felt stuck.

And the team should check the balances once in a while, and maybe rethink the strategy, to avoid overworking someone and underworking someone else, thus creating bottlenecks and vacuums.

At least this is the way I have worked and organised such teams: 2-5 people covering everything. Frankly, we never had many customers :/ but even one is enough to generate plenty of "noise", which sometimes is just noise, but with a good customer will be mostly real defects and generally under-tended parts. Also, good customers accept a NO as an answer. So do say more NOs.. there is some psychological phenomenon in software engineering of saying yes and promising moonshots when one knows it cannot happen NOW, but it looks good..

have fun!

[0] https://svilendobrev.com/rabota/orgpat/OrgPatterns-patlets.h...

stronglikedan

13 hours ago

Everyone on every team should have something to "own" and feel proud of. You don't "own" anything if you're always on team defense. Following this advice is a sure fire way to have a high churn rate.

FireBeyond

13 hours ago

Yup, last place I was at I had engineers begging me (PM) to advocate against this, because leadership was all "We're going to form a SEAL team to blaze out [exciting, interesting, new, fun idea/s]. Another team will be on bug fixes."

My team had a bunch of stability work and bug fixes (and there were a lot of bugs and a lot of tech debt, and very little organizational enthusiasm to fix the latter).

Guess where their morale was, compared to some of the other teams?

000ooo000

an hour ago

Splitting a team by interesting/uninteresting work is a comically bad idea. It's puzzling that it ever gets pitched, let alone adopted.

Edit: I mean an ongoing split, not a rotation

LatticeAnimal

13 hours ago

From the post:

> At the end of the cycle, we swap.

They swap teams every 2-4 weeks so nobody will always be on team defense.

ninininino

13 hours ago

You didn't read the article did you, they swap every 2 weeks between being on offense and defense.

madeofpalk

12 hours ago

Somewhat random side note - I find it so fascinating that developers invented this myth that they’re the only people who have ‘concentration’, when this is so obviously wrong. Ask any ‘knowledge worker’ or hell, even a physical labourer, and I’m sure they’ll tell you about the productivity of being "in the zone" and the lack of interruptions. Back in the early 2010s they called it ‘flow’.

000ooo000

44 minutes ago

The article doesn't say or suggest that. It says it applies to engineers.

dakshgupta

10 hours ago

My theory is that to outsiders software development looks closer to other generic computer based desk jobs than to the job of a writer or physical builder, so to them it’s less obvious that programming needs “flow” too.

bradarner

14 hours ago

Don't do this to yourself.

There are 2 fundamental aspects of software engineering:

Get it right

Keep it right

You have only 4 engineers on your team. That is a tiny team. The entire team SHOULD be playing "offense" and "defense" because you are all responsible for getting it right and keeping it right. Part of the challenge sounds like poor engineering practices and shipping junk into production. That is NOT fixed by splitting your small team's cognitive load. If you have warts in your product, then all 4 of you should be aware of it, bothered by it and working to fix it.

Or, if it isn't slowing growth and core metrics, just ignore it.

You've got to be comfortable with painful imperfections early in a product's life.

Product scope is a prioritization activity, not a team organization question. In fact, splitting up your efforts will negatively impact your product scope, because you are dividing your time and creating more slack than if you moved as a small unit in sync.

You've got to get comfortable telling users: "that thing that annoys you, isn't valuable right now for the broader user base. We've got 3 other things that will create WAY MORE value for you and everyone else. So we're going to work on that first."

MattPalmer1086

12 hours ago

I have worked in a small team that did exactly this, and it works well.

It's just a support rota at the end of the day. Everyone does it, but not all the time, freeing you up to focus on more challenging things for a period without interruption.

This was an established business (although small), with some big customers, and responsive support was necessary. There was no way we could just say "that thing that annoys you, tough, we are working on something way more exciting." Maybe that works for startups.

bradarner

8 hours ago

Yes, very good point. I would argue that what I’m suggesting is particularly well suited to startups. It may be relevant to larger companies as well but I think the politics and risk profile of larger companies makes this nearly impossible to implement.

dakshgupta

13 hours ago

All of these are great points. I do want to add that we rotate offense and defense every 2-3 weeks, and the act of doing defense, which is usually customer facing, gives that half of the team a ton of data to base the next move on.

bradarner

13 hours ago

The challenge is that you actually want your entire team to benefit from the feedback. The 4 of you are going to benefit IMMENSELY from directly experiencing every single pain point - together.

As developers we like to focus. But there is a vast difference between "manager time" and "builder time" and what you are experiencing.

You are creating immense value with every single customer interaction!

CUSTOMER FACING FIXES ARE NOT 'MANAGER TIME'!!!!!!

They are builder time!!!!

The only reason I'm insisting is because I've lived through it before and made every mistake in the book...it was painful scaling an engineering and product team to >200 people the first time I did it. I made so many mistakes. But at 4 people you are NOT yet facing any real scaling pain. You don't have the team size where you should be solving things with organizational techniques.

I would advise that you have a few columns in a kanban board: Now, Next, Later, Done & Rejected. And communicate it to customers. Pull up the board and say: "here is what we are working on." When you lay out the priorities to customers you'd be surprised how supportive they are, and if they aren't... tough luck.

Plus, 2-3 weeks feels like an eternity when you are on defense. You start to dread defense.

And it also divorces the core business value into 2 separate outcomes rather than a single outcome. If a bug fix helps advance your customers toward their outcome, then it isn't "defense", it is "offense". If it doesn't advance your customer, why are you doing it? If you succeed, all of your ugly, monkey-patched code will be thrown away or phased out within a couple of years anyway.

FridgeSeal

11 hours ago

Whilst I very much agree with you, actually doing this properly and pulling it off requires PMs and/or Account Managers who are willing and capable of _actually managing_ customers.

Many, many people I’ve dealt with in these roles don’t or can’t, and seem to think their sole task is to mainline customer needs into dev teams. The PMs I’ve had who _actually_ do manage back properly had happier dev teams, and ultimately happier clients. It’s not a mystery, but for some reason it’s a rare skill.

bradarner

8 hours ago

Yes completely agree. This is hard for a PM to do.

I’m assuming that the OP is a founder and can actually make these calls.

dijksterhuis

3 hours ago

the reasons PM stuff is ‘hard’, in my admittedly limited experience, often seem to come down to

- saying No, and sticking to it when it matters — what you’ve mentioned.

- knowing how the product gets built — knowing *the why behind the no*.

PMs don’t usually have the technical understanding to do the second one. so the first one falls flat because why would someone stick to their guns when they do not understand why they need to say No, and keep saying No.

there are cases where talking to customer highlights a mistaken understanding in the *why we’re saying No*. those moments are gold because they’re challenging crucial assumptions. i love those moments. they’re basically higher level debugging.

but, again, without the technical understanding a PM can’t notice those moments.

they end up just filling up a massive backlog of everything because they don’t know how to filter wants vs. needs and stuff.

— also i agree with a lot of what you’ve said in this chain of discussion.

get it right first time, then keep it right is so on point these days. especially for smaller teams. 90% of teams are not the next uber and don’t need to worry about massive growth spurts. most users don’t want the frontend changing every single day. they want stability.

worry about getting it right first. be like uber/google if you need to, when you need to.

johnrob

11 hours ago

I thought you made the rotation aspect quite clear. Everyone plays both roles, and I’m sure when a bigger issue arises everyone becomes aware regardless. Personally, I like this because as a dev I can set expectations accordingly: either I plan for minimal disruption and get it, or I take the on-call side, which I’m fine with so long as I’m not asked to do anything else (frustration is when you’re expected to build features while getting “stuck” fixing prod issues).

ramesh31

14 hours ago

To add to this, ego is always a thing among developers. Your defensive players will inevitably end up resenting the offense for 1. leaving so many loose ends to pick up and 2. not getting the opportunity for greenfield themselves. You could try to "fix" that by rotating, but then you're losing context and headed down the road toward man-monthing.

CooCooCaCha

13 hours ago

Interesting that you describe it as ego. I don’t think a team shoveling shit onto your plate and disliking it is ego.

I feel similar things about the product and business side, it often feels like people are trying to pass their job off to you and if you push back then you’re the asshole. For example, sending us unfinished designs and requirements that haven’t been fully thought through.

I imagine this is exactly how splitting teams into offense and defense will go.

FridgeSeal

13 hours ago

> For example, sending us unfinished designs and requirements that haven’t been fully thought through

Oh man. Once had a founder who did this to the dev team: blurry, pixelated screenshots with 2 or 3 arrows and vague “do something like <massively under specified statement>”.

The team _requested_ that we have a bit more detail and clarity in the designs, because it was causing us significant slowdown and we were told “be quiet, stop complaining, it’s a ‘team effort’ so you’re just as at fault too”.

Unsurprisingly, morale was low and all the good people left quickly.

dakshgupta

13 hours ago

To add - I personally enjoy defense more because the quick dopamine hit of the loop (user requests a fix -> fix issue -> tell user -> user is delighted) is pretty addictive. It does get old after a few weeks.

joshhart

13 hours ago

TLDR: The author basically re-invented oncall rotations.

dakshgupta

13 hours ago

This makes me want to delete the post.

Xeamek

12 hours ago

Please don't.

I personally found the idea inspiring, and the article itself explains it succinctly. Even if it's not completely revolutionary, it's a small, self-contained concept that's actionable.

Lowkey surprised that there are so many harsh voices in this thread, but the article definitely has merit, even if it won't be useful/possible to implement for everyone.

thesandlord

12 hours ago

Don't do that! This was a great post with a lot to learn from.

The fact you came to a very similar solution from first principles is very interesting (assuming you didn't know about this before!)

stopachka

12 hours ago

I resonated with your post Daksh. Keep up the good work

candiddevmike

13 hours ago

Or the idea of an "interrupt handler". OP may find other SRE concepts insightful, like error budgets.

wombatpm

4 hours ago

Error budget or recovery cost tracking goes a long way towards defeating the "We never have time or money to do it right, but we'll find time and money to fix it later" mindset.

dakshgupta

4 hours ago

I’m generally a strong believer in “if it’s not measured, it’s not managed,” so this seems like it would be useful to explore. I suspect it’s tricky to assign a cost to a bug, though.