The Challenges and Traps of Architecting Sociotechnical Systems

--

Loosely coupled teams, enabled by a loosely coupled software architecture, are one of the strongest predictors of continuous delivery performance and organizational scaling.

Nicole Forsgren and Jez Humble report this finding in their recently published book Accelerate, drawing on their extensive empirical research.

“If we achieve a loosely-coupled, well-encapsulated architecture with an organizational structure to match we can achieve better delivery performance… and substantially grow the size of the engineering organization and increase productivity linearly”

— Nicole Forsgren and Jez Humble in Accelerate

From personal experience, I’m sure we’ve all learned that getting the boundaries right in sociotechnical systems is extremely important yet monstrously difficult. And there’s more research to back it up.

In another study, Thoughtworks found that, on average, when a piece of work leaves a team (i.e. it requires multiple teams to work on it), it takes an order of magnitude longer to complete.

“There is a high cost associated with work that leaves your team… team boundaries and software boundaries should be isomorphic”

— James Lewis, Thoughtworks

I’ve written and spoken a lot about architecting sociotechnical systems and how to find boundaries. In this post, I’m going to catalogue the most common challenges and anti-patterns to articulate why it’s so hard to get the boundaries right… and why we must focus on more than just boundaries.

Decoupling Streams of Work

Our goal when designing systems is to maximise the speed of delivery and value of the work we deliver. Accordingly, we desperately want to avoid ‘work leaving the team’ because it will take an order of magnitude longer to complete.

We need to define boundaries that empower teams to complete work without depending on other teams.

In order to find such boundaries, we need to encapsulate things that change together. Whenever a team starts on a piece of work they should own all of the code and infrastructure that needs to change in order to deliver the work.

Equally, it’s about the quality of the work and not just the speed. Where we place boundaries impacts our ability to innovate, the quality of our UX, and the execution of our strategy. Diligently crafted boundaries give us strategic flexibility.

Monolithic architectures do not allow us to invest in high ROI parts of the system without spreading our investments across low value parts of the system, because everything is tightly coupled. With loosely-coupled architectures we can invest more granularly where payback is greatest.

Isolate high-value and low-value capabilities and you can easily invest all your money where ROI is greatest and kill features that cost more to maintain than they are worth

So how do we create these boundaries that give us super high-speed continuous delivery, extreme strategic flexibility, and immeasurable team autonomy? We don’t, because such perfect boundaries don’t exist.

The best we can do is to be aware of the fundamental challenges we will face and avoid the common mistakes. I’ve kindly catalogued them for you in the remainder of this article.

Arbitrary Boundaries

One of the most foolish and costly architectural choices you can make is to define arbitrary boundaries. I’ve seen a number of systems where the thinking was “our monoliths are so big we’re going to put all new code in a new codebase”.

Whenever I’ve seen this pattern, there has been a high rate of co-change between the old and new codebases, resulting in all kinds of fun like constantly having to make the same change in two codebases, or worse… forgetting to.

On one system I witnessed, the team forgot to update the price calculation logic in the baby monolith, so customers saw one price when they added items to their basket and another price when they checked out.

Even if code lives in a separate codebase, high-coupling and co-change will exist if boundaries are not cohesive
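One rough way to spot this kind of coupling is to measure how often two codebases change as part of the same piece of work. The snippet below is a minimal sketch, not a definitive tool: the `co_change_rates` function, the codebase names, and the `history` data are all hypothetical, and in practice you would derive the input from your own tickets or commit history.

```python
from collections import Counter
from itertools import combinations


def co_change_rates(changes):
    """Return, for each pair of codebases, the share of changes touching both.

    `changes` is a list of sets; each set names the codebases touched by one
    piece of work (hypothetical data, e.g. derived from tickets or commits).
    """
    pair_counts = Counter()
    for touched in changes:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(touched), 2):
            pair_counts[pair] += 1
    total = len(changes)
    return {pair: count / total for pair, count in pair_counts.items()}


# Hypothetical history: four pieces of work and the codebases each one touched.
history = [
    {"monolith", "baby-monolith"},
    {"monolith", "baby-monolith"},
    {"monolith"},
    {"baby-monolith", "monolith"},
]
rates = co_change_rates(history)
print(rates)
```

In this made-up history, three of the four changes touched both codebases, which suggests the boundary between them is not cohesive.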

Shared libraries, shared databases, APIs… there are a number of tactical patterns we can use in this situation, but when there is strong coupling between different codebases, the costs and risks are higher.

When these dependencies span multiple teams we are walking a tightrope. One team making non-backwards-compatible changes to a shared domain library or shared database schema that break another team’s services can result in a lack of trust between teams, leading to blame culture, us-vs-them, and a whole range of dysfunctions.
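To make "non-backwards-compatible" concrete, here is a minimal sketch of a check one team could run before changing a shared schema. It assumes a deliberately simple rule: adding a field is safe, while removing a field or changing its type is breaking. The `breaking_changes` function and the field names are hypothetical, and real schemas usually need more nuanced rules.

```python
def breaking_changes(old_fields, new_fields):
    """List changes that would break a consumer built against `old_fields`.

    Both arguments map field name -> type name. Under the simplifying
    assumption used here, added fields are safe; removed fields and
    changed types are breaking.
    """
    problems = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            problems.append(f"removed field: {name}")
        elif new_fields[name] != old_type:
            problems.append(
                f"type changed: {name} ({old_type} -> {new_fields[name]})"
            )
    return problems


# Hypothetical shared schema and two proposed revisions.
current = {"sku": "str", "price": "decimal"}
additive = {"sku": "str", "price": "decimal", "currency": "str"}
renamed = {"sku": "str", "price_pence": "int"}

print(breaking_changes(current, additive))
print(breaking_changes(current, renamed))
```

The additive revision reports no problems, while the rename is flagged because consumers that still read `price` would break.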

Slicing Up The User Interface

Deciding how to slice up the user interface and distribute ownership among teams is mental discomfort on another level. We’re faced with two broad choices, and both have big drawbacks.

Dedicated Frontend Teams vs Vertical Slices

We can have a dedicated UI team who own the entire frontend (or internal UI). Doing so, however, requires introducing a handover between frontend and backend teams for a large majority of work. Now we have a lot of work ‘leaving the team’.

Teams will come to resent the handover because it slows them down. They will put logic where it doesn’t belong to avoid handovers. Cohesive logic will be scattered and duplicated among different codebases and systems will become brittle.

Alternatively, we can slice the UI up so that different parts of the UI belong to different capabilities, ensuring each part of the UI is owned by a single team. The result is ‘full-stack’, vertically-sliced teams which can implement an entire feature without any dependency on other teams… in theory.

In practice, it’s rare to find many web pages that rely on data and functionality from only a single service. But it can happen, as I learned while working on services in the UK government. We had a three-step business process, each web page neatly belonged to one of the business process steps, and each business process step was a team boundary.

The teams and their services were loosely coupled, stitched together using HTML hyperlinks.

There are still challenges with this approach: it’s trivially easy to lumber users with an inconsistent and annoying UX due to each team using different UX styles and patterns. One team might put labels above a text box while another team puts them on the same line, or different teams might use different date pickers (we can minimise these risks by using design systems and similar).

Vertical slices lead to greater end-to-end autonomy, but they introduce bottlenecks at the UI layer and facilitate disjointed user experiences.

In my experience, a clean UI separation is rarely achievable. Most of the time, a web page will require data from multiple capabilities owned by different teams.

We can try asynchronously replicating all of the data needed for a page into a single service so that one service has all the data to own a page. But some pages have data from 10–20 different services and the data is highly volatile. That’s an extreme amount of technical effort, and it brings a lot of risk.
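As a rough illustration of what that replication involves, the sketch below shows a page-owning service keeping a local copy of data by handling events from other services. Everything here is an assumption for illustration: the `ProductPageReadModel` class, the event shape, and the service names are hypothetical, and the sketch ignores event ordering, retries, and staleness, which is where much of the effort and risk lies.

```python
class ProductPageReadModel:
    """Local copy of the data one page needs, fed by events from other services."""

    def __init__(self):
        self.pages = {}

    def handle(self, event):
        # Each hypothetical event carries a product id plus the fields that
        # its source service owns; we merge them into the local copy.
        page = self.pages.setdefault(event["product_id"], {})
        page.update(event["fields"])

    def render_data(self, product_id):
        # The page is served entirely from the local copy, so no other
        # service is called at request time.
        return self.pages.get(product_id, {})


read_model = ProductPageReadModel()
read_model.handle({"source": "pricing", "product_id": "p1", "fields": {"price": "9.99"}})
read_model.handle({"source": "inventory", "product_id": "p1", "fields": {"in_stock": True}})
print(read_model.render_data("p1"))
```

With two source services this looks manageable; with 10–20 sources and fast-changing data, every one of those event streams has to be kept correct and up to date.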

So what else can we do if we want teams to own vertical slices but each web page cannot be owned by a single capability?

Micro-frontends

We can use micro-frontends to break down individual web pages into UI ‘components’ or fragments provided by different capabilities. I have to admit, though, opinion is extremely divided on UI composition approaches like micro-frontends.

The smart folks at Open Table and Zalando have been using them for years, yet a lot of people I talk to have had bad experiences with them because success with UI compositional approaches depends on the type of application you’re building.

Even if micro-frontends were a panacea, the challenge becomes significantly more difficult when we have multiple devices to support. Is it possible for one team to own a full slice that cuts through web, Android, iOS, and other applications and work on them all in parallel? It’s asking an awful lot.

BFFs (Backends for Frontends)

Typically in this scenario I see the BFF pattern. Each experience is owned by a dedicated team which creates their own API layer for aggregating data from business capability services.

However, now we’re back to the handovers between frontend and backend teams, and there is an even greater risk of logic being leaked and duplicated all over the place because nobody wants to be delayed by handovers.

With BFFs the costs can be even more frustrating for users. When we want to update a business rule that is duplicated in all of the BFFs, we will find users complaining about consistency problems like “I just added a bookmark on my phone, but it’s not showing up on the web UI”.

BFFs allow teams to focus on providing an exceptional user experience for a specific device, but they incur many handovers.

Dependencies Between ‘Autonomous’ Teams

Even when boundaries have been designed with due care and diligence and we feel our teams are autonomous, there are still important design decisions that we may have got wrong, and actually our teams aren’t very autonomous at all.

If we design our capabilities naively and too fine-grained, global complexity will be greater. The likelihood of any change requiring modifications to multiple services operated by multiple teams will be higher. Michael Nygard captures the essence of this problem well with his definition of the entity service anti-pattern.

Conversely, when capabilities are too big, the amount of work that can occur in parallel in those areas is reduced unless we have bigger teams. That then has the knock-on effect of increased coordination costs and frequent context-switching, leading to loss of focus and disinterested team members.

A couple of years ago I gave a talk about digital transformation in London. One of the topics I touched on was the importance of finding boundaries that confer greater autonomy to teams.

After the talk, a smartly-dressed man introduced himself to me as the head of development for a local company and explained that his teams had tried three times to find the correct boundaries, and after letting them spend 6 months breaking up the monolithic systems into smaller services, work was taking longer now than when they had the monolith due to high levels of co-change across team boundaries.

At the opposite end of the scale it’s just as easy to create capabilities that are too big. At one client, the business had a huge backlog and uncompromising delivery expectations. “No problem, we’ll just have a team of 12 engineers” they thought, and of course that meant twelve pieces of work running concurrently (actually more, because work gets blocked and people pick up new things).

People in the team can’t keep up. They’re being pressured to get their work done, they’re expected to review other people’s pull requests, and they’re constantly blocked or facing merge conflicts. At the same time they are bombarded with notifications and distractions about all of the other work going on in the team.
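One simple way to see why a team of 12 feels so noisy is to count the possible one-to-one communication paths between its members. This is only a back-of-the-envelope sketch of pair counting, not a measurement from that client, and `pairwise_channels` is a name made up for this example.

```python
def pairwise_channels(team_size):
    """Number of distinct pairs of people in a team of `team_size`."""
    return team_size * (team_size - 1) // 2


print(pairwise_channels(12))      # one team of 12
print(2 * pairwise_channels(6))   # two teams of 6, ignoring cross-team paths
```

A team of 12 has 66 possible pairs, while two teams of 6 have 15 each. The catch is that the second figure ignores the paths between the two teams, and those are exactly the ones a poorly placed boundary makes expensive.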

So the answer is to split the large team into multiple smaller teams? Now we are back to the original problem of having to find the right boundaries and coordinate multiple teams.

So how do we find the right boundaries? Check out my free ebook Designing Autonomous Teams and Services (O’Reilly), some of my talks, or previous articles.

Coupling and Cohesion All The Way Up

No matter how we slice the boundaries and marvel at how autonomous our teams are, we are going to have different services operated by different teams that frequently change together. Information must flow between capabilities, and that means coupling.

Despite all the talk of autonomy, some coupling is a healthy sign. We want different capabilities to combine and provide rich services for our users. However, we can’t just have a flat organisation where every team collaborates with every other team. No work would ever get done.

Alignment vs Autonomy

So there is a balance to find between alignment and autonomy. The more aligned teams are, the more they work towards the same goals, but at the cost of the output they deliver.

So we have to find a balance that leads to optimal outcomes. Typically we do this by grouping our teams into layer 2 capabilities (or ‘tribes’ in Spotify jargon).

The concept of aligned autonomy argues that by keeping teams more aligned they can work more autonomously. The idea is good, but we have to acknowledge that it costs time and effort to align teams. Aligned autonomy, while useful, is not a panacea.

This leads to similar design challenges at a higher level of the system - which teams should be grouped together?

Teams in the same group will be organised to communicate more frequently and collaboratively make strategic technical decisions, because we identify that they are working towards similar goals and in related parts of the system.

For example, teams in the same tribe may have joint retrospectives, demo sessions, kick-off meetings, joint quarterly updates, planning sessions together, and so on.

Unscripted collaboration is also something that we should design organisations for — unplanned collaboration that occurs between teams who don’t typically work closely together, opening up the possibility for key revelations and disruptive innovations.

People talk about this beautifully fractal organisation structure, where we have teams, teams of teams, teams of teams of teams etc, all neatly partitioned.

We think organisation structures reflect how teams collaborate

Organisations are Networks

Organisation structure does not accurately communicate coupling and dependencies in the system

Just as layer 1 capabilities (teams) are never fully autonomous, nor are layer 2 or 3 capabilities. There is always some cross-over. And the costs of collaboration across a boundary are usually higher. So getting the boundaries right at this level is equally, if not more, impactful.

These decisions aren’t easy either. We have the top down business view of teams working towards the same goals. But we have the bottom-up technical view of parts of the system that have a tight technical coupling. Often, one of those factors will dominate system design depending on who makes the decision, rather than a balanced decision being made by consolidating the top-down and bottom-up perspectives of coupling.

When I worked for a large cloud company, one team were building advertising products that allowed businesses to run ad campaigns on social media networks and feed that data back into a CRM system. This is a great example of combining capabilities that an individual team could not produce alone.

Combining those capabilities gave users a rich experience that other companies could not match.

However, these different capabilities lived some distance away from each other in the org-chart and collaborating was much more costly than with teams in the same layer 2 capability (L2 aka ‘tribe’).

The CRM capability was separated at layer 4 or 5 (a completely different business unit located on the other side of the planet with different management). Collaborating with these teams was orders of magnitude less efficient (geography was a big part, but so were technical, cultural, and organisational aspects).

I’ve noticed that, typically, the greater the distance in the org chart between teams, the higher the costs of collaboration (although interestingly, the bigger the potential for explosive innovation).
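If you want to reason about this distance rather than eyeball it, you can treat the org chart as a tree and count the hops between two teams via their nearest common ancestor. The sketch below assumes a hypothetical org chart loosely modelled on the example above; the `org_distance` function and all of the team and unit names are made up for illustration.

```python
def org_distance(parents, a, b):
    """Count the reporting-line hops between two teams in an org chart.

    `parents` maps each node to its parent; the root has no entry.
    The distance goes up from `a` to the nearest common ancestor and
    back down to `b`.
    """
    def path_to_root(node):
        path = [node]
        while node in parents:
            node = parents[node]
            path.append(node)
        return path

    hops_from_b = {node: hops for hops, node in enumerate(path_to_root(b))}
    for hops_from_a, node in enumerate(path_to_root(a)):
        if node in hops_from_b:
            return hops_from_a + hops_from_b[node]
    raise ValueError("teams do not share a common ancestor")


# Hypothetical org chart: child -> parent.
org = {
    "ads-team": "campaigns-l2",
    "creatives-team": "campaigns-l2",
    "campaigns-l2": "marketing-unit",
    "marketing-unit": "company",
    "crm-team": "crm-l2",
    "crm-l2": "crm-unit",
    "crm-unit": "company",
}
print(org_distance(org, "ads-team", "creatives-team"))
print(org_distance(org, "ads-team", "crm-team"))
```

In this made-up chart the two teams in the same L2 are 2 hops apart, while the ads and CRM teams are 6 hops apart. Hop count is only a crude proxy: it says nothing about geography or culture, which were also big factors in the example above.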

There are many ways these systems could have been sliced up, but regardless there would always be this inherent cohesion between capabilities that have some overlap.

Autonomy vs Economies of Scale

There is also the constant debate around autonomy vs economies of scale.

At the big cloud company, the campaigns L2 had some social media functionality. But there was also another L2 dedicated to social media products. The campaigns L2 also had a creatives capability, yet there was another creatives L1 residing in a different L2.

Having these sub-capabilities within campaigns enabled campaigns to have greater autonomy. Yet there was also a duplication of efforts in the organisation. Being highly autonomous is not an advantage if your time is spent duplicating features that already exist within the organisation.

There were so many possibilities for slicing and dicing these teams, each trading off autonomy for efficiency. Improving flow for some types of change and reducing it for others. The costs of a bad decision are non-trivial.

In truth, the organisation was an amalgamation of startups that had been acquired. The system had grown organically into shape rather than being consciously designed for optimal system outcomes.

Shared Services

We also see shared services that cut right across organisations. For example, there was a key initiative at the big cloud company to provide a single identity service for the hundreds of teams scattered all around the company. In isolation, that’s not a good or bad thing. It’s another case of trying to find balance between autonomy and economies of scale.

If the org has one big identity service, they cut down on perhaps 50–100 similar services scattered throughout the enterprise. That’s a lot of time, money, and opportunity costs. However, teams now have less autonomy to evolve the service to their needs and have to compromise or wait for features they need to be implemented.

The octopus context: a capability that reaches out and depends on many others from across the organisation

Battle For Ownership

Have you ever sat in a meeting where teams argue over which one of them should own a service? I’ve been in many such meetings. In some cases, a team will be desperate to own the service, because they then have more autonomy to make changes without being blocked by another team.

On the other hand, nobody wants to own some old legacy ‘piece of junk’ that has a high cost of maintenance and takes their time away from working on high value roadmap items.

Nobody Wants to Own Legacy Balls of Mud

This scenario is especially common with utility capabilities. We have a bunch of user-facing product teams, but one of them has to build and maintain a shared service. For example, a company I recently worked with had to build a new GDPR service. Which team should build and own that?

There were 3 or 4 teams roughly aligned to business capabilities, and GDPR didn’t cohesively belong to any of them. So the responsibility of GDPR was given to a team who had available capacity.

That team now has to live with the costs of maintaining the utility service which detracts from their main feature work. When other teams are shipping exciting new features and getting praise, this team will be slowed down by maintaining the utility service.

Everybody Wants to Control Their Destiny

Working on a project with tight deadlines, I became aware of a political battle that occurs on many IT projects: a fight to take control of the middle layer of the system. Each team wants to own this piece in the middle because they get to have the autonomy to evolve it to their needs and time-scales.

On that project, there was a customer-facing and internal-user-facing divide. Both teams desperately wanted to own the APIs in the middle so that they could iterate faster without a dependency and they wouldn’t be the ones who missed the project’s deadline.

The battle lines ran right through the organisation. The head of each department wanted to own the middle so that they would have more power and control in the organisation. They were head to head for promotion to the C-suite, and greater status would give them a better chance of being promoted.

The Selfish Silo

When you are building services for end-users you feel their pain and have a sense of motivation to solve their problems and delight them. Direct user feedback inspires your sense of purpose and drives your priorities.

But not every team has this luxury. Some teams provide generic capabilities or APIs that are used internally. These teams lack direct user feedback and usually aren’t passionate about satisfying user needs.

Product or frontend teams come to view internal-facing teams as selfish silos. Product teams will ask internal teams to help them implement a new feature but internal teams will not always oblige, especially when product teams pop up with new requests or ask for work to be turned around quickly.

The product teams are doing what they think is best for users. They want to quickly implement a new feature to solve a genuine problem, but they depend on backend changes by the internal teams.

The internal teams will grow impatient at product teams constantly firing work at them, changing their mind and asking for short lead times. The internal teams will assume the product teams are impatient and selfish because the internal teams are too far disconnected from the end user.

The more disconnected teams are from user feedback the less likely those teams are to optimise for user needs

One of the worst cases of this was when my team wanted to add a text box to a web page. Users couldn’t enter their full address so they couldn’t register to use the service. We begged the backend teams to update their APIs and schema to add the new field, but they were too busy. “Maybe sometime in the next 12 months,” they said.

We were powerless to solve our users’ problems and the internal teams showed little compassion. They thought we were impatient, but we just wanted to solve user problems.

Counter-productive Incentives

This misalignment is amplified when we reward those internal teams for velocity. Why should they stop what they are doing to help one team put a text box on a web page when they will be judged on priorities they committed to at the start of the quarter? By helping another team to solve user problems they will be punished!

We can try to avoid these situations by shaping our boundaries so that each team gets direct user feedback, but that is unlikely to be possible everywhere. We need internal systems and shared services. We also have to solve this problem with better alignment and organisational resilience.

Living With Cohesion and Connascence In Sociotechnical Systems

As system designers, we are studiously trying to put boundaries around things that change together, in software and in our organisation. By encapsulating cohesion, work is completed an order of magnitude faster and our chances of achieving high-performance continuous delivery are statistically greater.

Unfortunately, systems can rarely, if ever, be neatly organised into fractal cohesive groupings of fully autonomous teams. There are many fundamental challenges and no perfect solutions for finding boundaries.

Wherever we place the boundaries there will be consequences relating to:

  • UX consistency
  • Improved autonomy for some teams and reduced for others
  • Speed of delivery
  • Ability of work to occur in parallel
  • How overloaded and motivated teams are
  • Tensions and politics between groups
  • Technical complexity
  • Defect rates

And finding the boundaries is only a part of the solution. High-alignment and resilience are key for organisations that need to continuously adapt.

We need multi-skilled, sociotechnical architects who can look at problems from diverse perspectives and coach others to collaboratively design and evolve sociotechnical systems. Since you’ve read this far, you have the passion to be one of those people. Good luck on your journey.

If you learn best through hands-on workshops, I’ll be teaming up with flow addict and talented product owner Zsofia Herendi to run the following workshops:

- Strategic Autonomous Design Applied (Denver)
- Strategic Autonomous Design Applied (Berlin)

Contact me for more information or for on-site consulting and workshops. You can watch some of my talks if you want to learn more.

--

Nick Tune
Strategy, Architecture, Continuous Delivery, and DDD

Principal Consultant @ Empathy Software and author of Architecture Modernization (Manning)