This post is part three (and the last) of reflections on a conversation I had with Chelsea Troy about our testing and development heuristics.
I asked Chelsea, how do you get people to be less clingy about holding on to tests that don’t add value?
She speculates that this will require getting people to unpack the psychological baggage they hold around the value of their tests or code. Once people have written tests or code, they don’t want to get rid of them because that is the primary visible evidence that they’ve done something. Developers rarely get recognized for removing unused code or throwing away brittle or inconsequential tests.
So how can you turn this attitude around? Let’s be blunt. Developers do lots of things to improve their ability to write and maintain code. They shouldn’t be penalized for these activities. Professional software development isn’t just cranking out code.
A Heuristic for Switching Things Up
I learned this guiding heuristic from a colleague, John Schwartz: If something you are doing isn’t buying you any new or useful information, stop doing it. Instead, move on to something that will. And, if that doesn’t work, try something else. Don’t settle.
John applied this heuristic to every aspect of software development and management. Applied to testing: If tests always pass (or frequently fail because of nitpicky things that don’t matter), throw them out. When you fix brittle UI tests, only to find that they fail with the next CSS style change, recognize that you aren’t making forward progress or learning anything new. Repeatedly fixing those brittle tests is simply busywork.
You are better off acknowledging that you are testing at the wrong (too low) level and that, to buy information, you should be testing differently.
So, what happens when you throw out those tests?
Arguably, you will have less clutter and useless information to wade through when something breaks your build. And if you do miss some of those tests, you can always bring them back. Or decide to run them periodically.
I have seen clients hold on to coding and testing practices that slow them down. At one client, developers frequently left unused code as comments so they wouldn’t forget about it just in case it might be needed. When asked “Why not just use version control?” they worried that they would forget about the code. But how long had that code been frozen in comments? When’s the last time you thawed some commented code and used it as is?
In another situation, I once reviewed some code for a function which had a parameter that was never used. At one time it was, but now that code was obsolete. Every new team member struggled to understand the reason for that parameter and what the code that handled it did. I tried to convince them to remove this useless code. It would take less than 20 minutes. I’m not sure whether they did. But at least I got them to recognize that this dead code was an impediment. Before I pointed this out, they had just considered the time it took to understand that code to be an annoying rite of passage for new developers.
Documenting Your Actions
Chelsea is experimenting with ways to simplify and minimize code and tests. The more stuff you have to wade through, the harder it is to maintain and debug your code. Reducing extraneous stuff, whether it is overly complex code, unused code, or awkward tests, makes life easier. She is also trying to get her team to write commit messages to document these efforts. She wants to leave a documentation trail. Her heuristic: When you pare down code and tests, document both what has changed and why.
Using commit records as documentation won’t work well unless everyone follows conventions. But is it realistic to expect everyone to be as diligent as Chelsea at documenting their work? Developing software is a team sport. Unless we agree on how to work together, and then follow through on our agreements, results will be inconsistent. Even so, people still bend or disregard working agreements. I suspect there are several reasons for this: maybe they didn’t fully buy into the practice, or maybe they didn’t understand or appreciate the reasons behind what we agreed to, or…
When this happens, whatever the reasons, something needs to change. And we won’t know exactly what’s going on unless we talk to each other.
Chelsea suggests that we could benefit greatly from uncovering each other’s assumptions instead of simply letting things slide. How can you do this? Be direct: Ask people to think really hard and to openly share their thoughts and values. Then decide what actions to take.
Getting Maximal Value out of Documentation with Minimal Effort
Chelsea sees a similar problem with software documentation. No one wants to write it, but they do so because they think they should. Then they don’t bother updating existing docs. So, documentation gets out of date and they end up with lots of inaccurate versions. I’ve seen this problem many times throughout my consulting career. Even though I’m one of those rare individuals who actually likes to write documentation, I know that you can’t fix this problem by telling developers to just try harder. (Again, rarely do developers get rewarded for writing documentation.)
One useful heuristic for mitigating your out-of-date documentation problem is to create “living” documentation (instead of static documentation that is written once and never updated). Cyrille Martraire has written a great book, Living Documentation, that contains many heuristics (written as patterns) and examples demonstrating code to create it. Living documentation is generated by executing scripts that extract facts from your codebase, running system, or repositories. Then, using that information, those scripts dynamically generate or update your documentation. A central value underlying Cyrille’s heuristics is to make documentation integral to software development, rather than a separate activity. If you connect your documentation directly to your code, it will always be up-to-date with the latest code. He suggests starting small, then growing tooling and scripts as you find the need. The guiding heuristic behind all of Cyrille’s heuristics is: Don’t let software documentation become stale; proactively generate documentation with information that can be automatically refreshed.
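As a minimal sketch of this idea (not an example taken from Cyrille’s book), the following Python script extracts facts from source code, here the public functions and their docstrings, and regenerates a short summary from them. The sample source, the function names in it, and the output layout are all hypothetical.

```python
import ast

# Hypothetical source standing in for a real module in your codebase.
SAMPLE_SOURCE = '''
def calculate_charge(item):
    """Return the charge for a single invoice item."""
    return item.price * item.quantity

def _round(amount):
    return round(amount, 2)

def apply_discount(total, rate):
    """Reduce the total by the given discount rate."""
    return total * (1 - rate)
'''


def extract_public_functions(source):
    """Return (name, args, first docstring line) for each public top-level function."""
    facts = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_"):
            args = ", ".join(arg.arg for arg in node.args.args)
            doc = ast.get_docstring(node) or "(undocumented)"
            facts.append((node.name, args, doc.splitlines()[0]))
    return facts


def render_markdown(facts):
    """Turn the extracted facts into a Markdown bullet list."""
    lines = ["# Public functions", ""]
    for name, args, summary in facts:
        lines.append(f"- `{name}({args})`: {summary}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_markdown(extract_public_functions(SAMPLE_SOURCE)))
```

Because the summary is rebuilt from the code each time the script runs, it cannot drift from the code the way a hand-written list of functions can. In a real project you would read the source from files and run the script as part of the build.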
Efforts to create living documentation, streamline software development, or make software codebases more sustainable are often undervalued. They shouldn’t be. My guiding heuristic: Cut out the crap so you can focus your development attention on the important stuff.
This post is part two of some reflections on a conversation I had with Chelsea Troy about our testing heuristics. You may also want to read part one and Chelsea’s writeup.
I shared with Chelsea how my Smalltalk development background contributed to my testing and design heuristics. I was involved in the early days of Smalltalk at Tektronix as a principal engineer in the AI Machines group. After a yearlong stint managing the software group through product introduction, I switched back to full-time engineering. Among other things, I added features to Smalltalk including color graphics, fonts, and support for low-level OS calls. All our code was visible to our users, and we had a strong engineering culture.
I learned how to work effectively in the Smalltalk environment by studying existing code, figuring out what it did, and understanding its coding and design style. I also observed more experienced Smalltalk programmers. Kent Beck, Ward Cunningham, and other Tek Labs folks were some of the very earliest Smalltalk application programmers. Ward and Kent worked together, developing prototypes and exploring what Smalltalk was good at. Many ideas about Extreme Programming (and TDD) and object design can be traced back to these programming experiences.
The Smalltalk image was always running. It contained the entire development environment and had a browser where you could look at existing code and add your own. Much of my time was spent experimenting with and reading existing code, then trying to fit my new code in. The code I wrote was a mix of new classes as well as extensions or modifications to existing ones. To show someone else how to use your code, you’d create a workspace (a scratchpad window) and put snippets of commented code in it for them to read, edit, and evaluate. By convention, methods were categorized (see the third pane of the System Browser below, which shows the categories for the abstract Collection class). Other classes also had a testing category, but it was not used for what you might think! The testing category for the class Collection included methods for querying (i.e. testing) its contents.
So how did programmers test Smalltalk code? I didn’t have any conventions to follow for organizing my tests (and not inconsequentially, leaving test code around would clutter up the Smalltalk image). Since I could highlight code anywhere and execute it, I tested code as I wrote it. I could step through code with a debugger, change it on the fly, and run it again. I tested my code into existence, but didn’t leave around any tests.
In an article I wrote about Color Smalltalk here’s how I described this experience: “… the workspace, lets programmers experiment with code without actually incorporating the experimental code into the valid, running environment. A programmer can write, execute and debug code in a workspace, then pull it into the Smalltalk application when the new code is tested and operational.”
While this statement is mostly true, it is also misleading. Anything I did as a programmer would add more objects to and change the state of the running Smalltalk image. Code you executed in a workspace changed the image (sometimes with catastrophic results, especially if you were tinkering with basic low-level system functionality as I was). But the Smalltalk environment and tools made it so easy to back up a step or two, revise your code, and try again, even with code that mucked with low-level stuff.
Kent’s Smalltalk experience heavily influenced how he thought about incremental development. But when it came to testing, I suspect he tried to boil down his Smalltalk experiences into practices that would be more “failsafe” for programmers who didn’t work in such a dynamic and forgiving development environment. Kent’s thinking about testing has evolved since he wrote his books. In an interview with Andrew Binstock in 2019, Kent and Andrew chat about this evolution:
Binstock: Do you still work on strictly a test-first basis?
Beck: No. Sometimes, yes.
Binstock: OK. Tell me how your thoughts have evolved on that. When I look at your book Extreme Programming Explained, there seems to be very little wiggle room in terms of that. Has your view changed?
Beck: Sure. So there’s a variable that I didn’t know existed at that time, which is really important for the trade-off about when automated testing is valuable. It is the half-life of the line of code. If you’re in exploration mode and you’re just trying to figure out what a program might do and most of your experiments are going to be failures and be deleted in a matter of hours or perhaps days, then most of the benefits of TDD don’t kick in, and it slows down the experimentation—a latency between “I wonder” and “I see.” You want that time to be as short as possible. If tests help you make that time shorter, fine, but often, they make the latency longer, and if the latency matters and the half-life of the line of code is short, then you shouldn’t write tests.
Binstock: Indeed, when exploring, if I run into errors, I may backtrack and write some tests just to get the code going where I think it’s supposed to go.
Beck: I learned there are lots of forms of feedback. Tests are just one form of feedback, and there are some really good things about them, but depending on the situation you’re in, there can also be some very substantial costs. Then you have to decide, is this one of these cases where the trade-off tips one way or the other? People want the rule, the one absolute rule, but that’s just sloppy thinking as far as I’m concerned.
Practicing TDD ensures developers write tests. The underlying value heuristic is, “any tests are better than no tests.” But if we take Kent’s more recent thoughts to heart, we shouldn’t test without thinking through some consequences. Kent’s more recent guiding heuristic: Test when it matters and when you need a safety net. Think through both the benefits and costs of testing. If you are exploring, don’t let testing slow you down.
There is no single “definitive” answer to the question, “when should I test?”
Develop Test Strategies Based on System Context
Chelsea asked, “So, how do you determine what kinds of tests to write?”
I don’t have a definitive answer to this question, either. So, I shared a few stories. I’ve worked with clients unschooled in TDD. They write code, test it a little, and then throw these initial tests away. They build successful products. The tests they tend to keep are regression tests, tests that demonstrate a quirky bug that has been fixed (and to ensure that it stays that way). It’s always a bet.
If code is stable, and the tests always pass, running tests all the time isn’t buying any new information. Even worse, passing tests can give you a false sense of security about your code’s quality. So why do we write tests?
I like to focus on writing tests that check that stable (relatively unchanging) system expectations still hold, and that demonstrate ways new capabilities can be safely added. I also try to write tests that capture expectations I have around my system’s behavior.
For complex systems, though, this can be difficult. Unforeseen side effects can pop up in strange places (changing code in one place unexpectedly causing other code to break in a distant part of the system). It’s impossible to test for every possible edge case and you don’t know all the dependencies.
I remember Kent Beck telling this oddball story of writing his first TDD code when he went to work at Facebook. His code, which passed all his tests, suddenly caused other tests for other parts of the system to fail. Rather than revert his code, those familiar with the system decided to throw out those failing tests. Seems weird, but they knew those tests were brittle, and making some wrong assumptions. When you find problems with tests, think carefully about whether it is appropriate to add additional tests to ensure that things don’t break, whether your existing tests are brittle, or whether your assumptions are wrong.
Data Scientists Have Different Testing Values
When you need to process massive amounts of data, and the code for processing of that data is predictable, there is little value in repeatedly running functional tests that always pass.
I worked for a number of years for a client doing healthcare analytics on patient medical data. Sometimes, their heuristic for verifying an algorithm would be: test that the new code works by comparing its results against code written in an entirely different system/programming language. They would take a massive cut of the data and run it through and compare the results.
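This comparison heuristic can be sketched in a few lines of Python. The two functions below are hypothetical stand-ins: one for the new code and one for an independently written implementation. The sample data and the tolerance are also made up for illustration.

```python
import math
import statistics


def new_mean_length_of_stay(days):
    """Hypothetical new implementation under test."""
    return sum(days) / len(days)


def reference_mean_length_of_stay(days):
    """Stand-in for results from an independently written system."""
    return statistics.fmean(days)


def find_discrepancies(batches, new_fn, reference_fn, rel_tol=1e-9):
    """Run both implementations on each batch and report where they disagree."""
    discrepancies = []
    for index, batch in enumerate(batches):
        new_result = new_fn(batch)
        reference_result = reference_fn(batch)
        if not math.isclose(new_result, reference_result, rel_tol=rel_tol):
            discrepancies.append((index, new_result, reference_result))
    return discrepancies


if __name__ == "__main__":
    batches = [[1, 2, 3], [4, 4, 10], [7]]
    print(find_discrepancies(batches, new_mean_length_of_stay, reference_mean_length_of_stay))
```

An empty list of discrepancies only says that the two implementations agree on this cut of data. A non-empty list says where to start looking, not which side is wrong.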
Another heuristic they sometimes used to test new algorithms and capabilities was to run their code and compare their results against those reported in published research papers. Where the results differed, they needed to reason about those differences (sometimes it was a problem with their code; at other times, it was that their code was more accurate at choosing cohorts or their statistical algorithms were better). Some person needed to critically analyze the results, reason why the discrepancies were there, and determine what, if anything, to do about them. This process couldn’t be automated.
Chelsea works with data scientists at Mozilla on sanitizing personal data for searches. The rules for this are complicated, language-specific, and sometimes people enter search terms in more than one language. She finds data scientists don’t share the same testing values as many software developers do.
Data scientists make informed assumptions about aggregated data. If those assumptions don’t hold, they reassess the data processing rules and revisit their assumptions. To them, testing is insufficient to ensure system quality. Monitoring actual system behavior against expected data characteristics, however, is critical. When the data characteristics being monitored fall outside of expected tolerances, this triggers developers to look into the situation. Developers then run some automated tests to determine if something is wrong with their code. If those automated tests pass, they then call on a data scientist to analyze a sample of the data and decide what to do. Something has changed and there likely needs to be some change to either the assertions about the data’s characteristics or the rules for handling it.
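Here is a minimal sketch of this kind of monitoring in Python. The characteristics being tracked (the fraction of empty search terms and the mean term length) and their tolerances are hypothetical, chosen only to illustrate comparing observed data against expectations.

```python
from dataclasses import dataclass


@dataclass
class Tolerance:
    """Expected range for one monitored data characteristic."""
    name: str
    low: float
    high: float


def measure(search_terms):
    """Compute the (hypothetical) monitored characteristics for a non-empty sample."""
    total = len(search_terms)
    empty = sum(1 for term in search_terms if not term.strip())
    return {
        "empty_fraction": empty / total,
        "mean_length": sum(len(term) for term in search_terms) / total,
    }


def out_of_tolerance(measurements, tolerances):
    """Return the names of characteristics that fall outside expected ranges."""
    return [
        t.name
        for t in tolerances
        if not t.low <= measurements[t.name] <= t.high
    ]


# Assumed tolerances; in practice these would come from the data scientists.
TOLERANCES = [
    Tolerance("empty_fraction", 0.0, 0.25),
    Tolerance("mean_length", 3.0, 40.0),
]

if __name__ == "__main__":
    sample = ["weather", "", "  ", "news"]
    print(out_of_tolerance(measure(sample), TOLERANCES))
```

A non-empty result is a trigger for people to investigate. It does not say whether the code, the rules, or the assumptions about the data need to change.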
Trialing new Heuristics
Chelsea and I appreciate what we can learn from people with different backgrounds: data scientists, QA folks, testers, and new colleagues. There are many different ways to test and design software. And if I don’t hold onto my preferred heuristics too tightly, I might learn something.
But how do I decide when to try out some new heuristics or to stick with what I know?
If things are going well, I’m not as motivated to try out new ideas. I need a small nudge. I’ll try some new-to-me heuristics if I feel I have some wiggle room. Let me experiment, practice, and think through the consequences. Give me a bit of time to let new values and practices soak in.
When I start to work on a new system or with folks from different backgrounds, that, too, is an opportunity to try out new ways of working.
But under pressure, I find myself narrowing my focus and sticking with what I know best (even if it is a poor option). So, if I can, I catch myself and take a small step back from problem solving. I pause, take a breath, and ask: If my heuristics aren’t currently working for me, what are some options?
If I want to introduce a test-first TDDer to my testing approach, I might suggest a modest experiment: “Let’s work together on some design and coding problems and compare our two approaches. Let’s find out what tests we come up with following my test-driven development approach. Let’s try your test-first TDD on a similar problem and see what tests we come up with. Let’s see what we learn.”
At the very minimum, I hope we’d learn of our shared value: we both value tested code. We might learn from each other more about the kinds of tests we like to write. Or how many tests we think are needed. Or how we rework existing tests. We might share some heuristics for deciding what next test to write or what isn’t worth testing. Through experimentation and reflection, we can grow and learn from each other.
Recently Chelsea Troy and I chatted over Zoom about software testing heuristics. I met Chelsea last year at DDD Europe. In this and a couple of snack-sized posts, I will reflect on some highlights of our conversation. Chelsea has also written about our conversation.
A Leading Question Leads to Some Heuristics
I started by asking, “What is important about testing that people should get but don’t?”
Chelsea answered that while Test Driven Development (TDD) is useful, it doesn’t solve all testing needs. If developers are oversold on the benefits of TDD, they can become jaded on testing in general. They shouldn’t. TDD doesn’t include specific practices that address resilience or reliability. But it is useful for developing and testing deterministic code.
Chelsea shared the experience of learning first-hand how TDD didn’t have all the answers to testing. She worked on a team of TDD enthusiasts developing a mobile app for a client. Although the team thought they knew how to develop quality software, their initial prototype developed following TDD didn’t address these challenging requirements: being usable under extreme weather conditions, having a simple UX, and functioning when only intermittently connected to the internet and their backend software. They needed to add more design and testing techniques to their toolbox, along with their TDD testing. Chelsea also said that she learned a lot about testing for these kinds of requirements from their client’s QA team.
Some heuristics we’ve touched on:
Use TDD to develop and test functionality of deterministic software.
Use other strategies to design and test for software system qualities such as usability, performance, reliability, or resilience.
Match your testing strategies and tactics to your application’s development and execution context.
A Brief Introduction to Heuristics
I have been intrigued by software development heuristics, ever since I read Billy Vaughn Koen’s Discussion of the Method: Conducting the Engineer’s approach to problem solving. Koen defines a heuristic as, “anything that provides a plausible aid or direction in the solution of a problem but is in the final analysis unjustified, incapable of justification, and potentially fallible.” Heuristics are never guaranteed. When a heuristic fails, you back up and try another one.
I enjoy hunting for heuristics while designing and coding with others. Open-ended conversations where we swap stories and reflect on our heuristics are another great opportunity. Generally, I look for three kinds of heuristics:
Action heuristics. Things we do to solve our immediate problem. There are many action heuristics. Design patterns are one well-known form of action heuristic. We know these heuristics by name because authors took the time to write them up as named software patterns. But there are many testing and development techniques both smaller and larger than patterns. For example, in Test-Driven Development (TDD), the practice of “write a test, then write code to pass the test” is a heuristic for incrementally designing and implementing tested code.
Value heuristics. Values motivate our actions. Underlying TDD is the value: Testing should be an integral part of design and coding.
Our values determine what actions seem appropriate. Because I value understandable code, I take several actions to make my code more comprehensible: I give methods, functions, and variables meaningful names; keep code in methods short; and write code at the same level of detail in a method, factoring out lower-level details into helper methods.
Values depend on context. As the context shifts, so do our values. This doesn’t mean we are fickle; just pragmatic. Most of the time we aren’t conscious of making these shifts. When cutting and pasting code from Stack Overflow, I don’t value code understandability so much as I do the ability to quickly determine whether that code addresses my current problem. If it does, then I rewrite that code to make it clearer and to fit with the style in my existing codebase. In production code, I do value understandability.
Guiding heuristics. Heuristics that lead to related actions. For example, Chelsea shared one guiding heuristic: Don’t treat test code the same as production code, instead, make each test understandable in isolation. This leads her to write self-contained test methods. She doesn’t like a test where she has to read the code that it calls on before she can understand the test. She also isn’t a fan of applying the DRY (Don’t Repeat Yourself) heuristic to test code.
Comparing competing heuristics
Chelsea mentioned that understandable tests can also serve as valuable design documentation and discovery tools. It’s easier to modify test code that is self-contained, rerun it, and explore how the software responds.
I asked Chelsea whether she would put aside her heuristic of keeping tests self-contained if there were compelling reasons. What if set up conditions for tests took a long time (for example, doing a cut of a database in order to build an in-memory cache of test data)? What if there was complex code that was repeated in similar tests but was slightly altered? Did someone make cut-and-paste-modify-and-reuse errors, or were there valid reasons for these differences?
Factoring common initialization code out of tests into common setup code provides a “standard” execution context for a suite of tests. It also makes it easier to vary that context and rerun the test suite. Factoring out code common to several tests and clearly labeling what it does eliminates having to second-guess reasons for slight variations in test code.
Depending on your situation and personal preferences, you may choose the heuristic, “Keep code in tests so you can understand and easily manipulate it,” or the other, “Factor out expensive or error prone code into common code shared by tests.” These heuristics compete with each other. Neither is better. They are simply alternative ways to structure your test code.
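To make the contrast concrete, here is a small Python sketch using unittest. The ShoppingCart class is hypothetical. The first test case keeps everything needed to understand each test inside the test itself; the second factors the shared context into setUp.

```python
import unittest


class ShoppingCart:
    """Hypothetical class under test."""

    def __init__(self):
        self.items = []

    def add(self, name, price):
        self.items.append((name, price))

    def total(self):
        return sum(price for _, price in self.items)


class SelfContainedCartTests(unittest.TestCase):
    """Each test reads on its own, at the cost of some repetition."""

    def test_total_of_two_items(self):
        cart = ShoppingCart()
        cart.add("pen", 2)
        cart.add("notebook", 5)
        self.assertEqual(cart.total(), 7)

    def test_total_after_adding_third_item(self):
        cart = ShoppingCart()
        cart.add("pen", 2)
        cart.add("notebook", 5)
        cart.add("eraser", 1)
        self.assertEqual(cart.total(), 8)


class SharedSetupCartTests(unittest.TestCase):
    """Common context lives in setUp; tests are shorter but depend on it."""

    def setUp(self):
        self.cart = ShoppingCart()
        self.cart.add("pen", 2)
        self.cart.add("notebook", 5)

    def test_total_of_two_items(self):
        self.assertEqual(self.cart.total(), 7)

    def test_total_after_adding_third_item(self):
        self.cart.add("eraser", 1)
        self.assertEqual(self.cart.total(), 8)


if __name__ == "__main__":
    unittest.main()
```

With a setup this cheap, the self-contained version costs little. The trade-off shifts toward shared setup when building the context is slow or easy to get subtly wrong.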
The Value of Knowing your Values
If people don’t know your values (and how they differ from their values), they may not understand why you prefer to work the way you do. For example, while I value testing, I don’t practice test-first development.
If you understand TDD to mean strictly write tests before writing any code, your TDD heuristic is: begin by writing a small test, then write code that proves that the test fails, then rewrite your code to pass that test. Don’t add any more code than necessary to make the test pass. Do this repeatedly until you’ve fully implemented your code.
At the end of a TDD cycle, you have a bunch of tests and fully functioning code that passes those tests. Working this way, you typically implement a single class at a time. You test and implement lower-level functionality, then repeat the process to develop the code that uses that functionality. Your software tends to grow from the “bottom” up.
I value testing, but typically design and implement several classes that work together at the same time. Once I prove to myself that my overall design hangs together (through some sort of simulation), I implement it. When finished, I check in code for several classes along with tests that demonstrate their behavior. My code is tested, but I don’t leave around lots of low-level tests.
For example, I may use a strategy pattern to calculate charges for different items on an invoice. I would initially implement each individual strategy class and check that it worked as I expected. But I’d remove most if not all tests for those individual strategies once I proved to myself that they worked. Their code is simple enough to read at a glance. Once I get low level classes working (especially if they don’t retain any state), I don’t need to keep tests around to ensure that they work. Once implemented, they rarely change. If I do need to revise them, at that point I might reconsider my testing heuristics (and add some tests that reflect these changes). The valuable tests I tend to preserve are those that determine which strategy to use, how to add new kinds of strategies, and different ways to apply discounts and special pricing.
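A sketch of what such a design might look like in Python follows. The strategy classes, the item kinds, and the rates are all invented for illustration; they are not from an actual invoicing system. The one test shown is of the kind described above as worth keeping: it checks which strategy gets chosen, not the arithmetic inside each strategy.

```python
class FlatRate:
    """Charge a fixed amount per unit."""

    def __init__(self, unit_price):
        self.unit_price = unit_price

    def charge(self, quantity):
        return self.unit_price * quantity


class TieredRate:
    """Charge full price up to a threshold, then a reduced price."""

    def __init__(self, unit_price, threshold, reduced_price):
        self.unit_price = unit_price
        self.threshold = threshold
        self.reduced_price = reduced_price

    def charge(self, quantity):
        full = min(quantity, self.threshold)
        reduced = max(quantity - self.threshold, 0)
        return full * self.unit_price + reduced * self.reduced_price


# Hypothetical mapping from item kind to charging strategy.
STRATEGIES = {
    "widget": FlatRate(unit_price=10),
    "bulk_paper": TieredRate(unit_price=4, threshold=100, reduced_price=3),
}


def strategy_for(item_kind):
    """Choose the charging strategy for an item kind."""
    try:
        return STRATEGIES[item_kind]
    except KeyError:
        raise ValueError(f"no charging strategy for {item_kind!r}") from None


def invoice_total(lines):
    """Sum the charges for (item_kind, quantity) invoice lines."""
    return sum(strategy_for(kind).charge(qty) for kind, qty in lines)


def test_strategy_selection():
    assert isinstance(strategy_for("widget"), FlatRate)
    assert isinstance(strategy_for("bulk_paper"), TieredRate)


if __name__ == "__main__":
    test_strategy_selection()
    print(invoice_total([("widget", 3), ("bulk_paper", 150)]))
```

Each strategy is small and stateless enough to read at a glance, which is the argument for dropping its individual tests once it works.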
Let’s contrast my testing heuristics with those of test-first TDDers.
We both share this value heuristic: Value code that has tests over code (even if it works) that doesn’t have tests.
Test-first TDDers apply this heuristic: Write tests as you incrementally design code. Interleave testing and coding, repeatedly. Start with the simplest test and the simplest implementation. Only implement enough functionality so that your latest test passes. Build functionality and tests in small increments; each increment moving you closer to your final tested design.
They also have this guiding heuristic: You produce a cleaner design if you write tests first before writing any code.
I don’t share that heuristic.
My heuristic for developing designed, tested code is: Consider the design of one or more classes working together to achieve some functionality. Model your design using some lightweight technique, such as CRC cards (Class-Responsibility-Collaborators) or whiteboard sketches. Once you know what each class’ responsibilities are and how they interact, then implement them. Write simple tests and debug as you implement, but remove them if they are low level (and other code that has tests exercises their functionality). Keep only a lean set of illustrative tests that demonstrate how the classes work together and ensure that your design will continue to function properly.
At the end of my design/development cycle, I may write additional tests, revise existing ones, or remove insignificant tests. I use this grooming and cleanup step, before committing my code, as one way to double check my work.
Chelsea summarized my TDD heuristic as: Put tests in at the right level of abstraction once you know what your design is about.
Chelsea cautions, however, that if you don’t know what the right level of abstraction is and you follow test-first TDD heuristics by rote, you end up with tests at too low a level. Also, if you don’t have heuristics for pruning them, you end up with too many.
I view most testing I do while I implement my design as temporary scaffolding. Since I’ve already sketched out design ideas before coding, tests are not my primary tool for design. I test to verify my design. If I need to adjust my design as I implement it (and I expect to), that’s OK. I keep tweaking it and my code, and continue testing.
I suspect the biggest difference in our two approaches is that test-first TDDers don’t view their tests as temporary scaffolding, and I don’t view the cycle of test-first TDD as the only (or best) way to understand what a design should be. We both value tested, well-designed code.
Bringing to light the different values that underlie competing heuristics can be illuminating. But how can we get others to appreciate and try out our heuristics? How can we approach new-to-us heuristics with an open mind? I’ll touch on these topics and more in my next post.
Our worldviews are grown from other people’s models. How do we control what models we let in?
You might be familiar with The Five Stages of Grief aka the Kübler-Ross model for processing grief (denial, anger, bargaining, depression, and acceptance). Independent of what the authors intended, this model gives us a framing right off the bat:
The word “Stages” suggests that they are discrete phases.
It also suggests that these stages come in a fixed order.
The word “The” suggests that you always go through these five stages when grieving.
And it is our impression that people interpret this model as prescriptive: in order to process grief, you must go through these stages in order.
We’re not interested here in whether these statements are accurate or not. Perhaps the authors’ wording was accidental, and it could just as well have been named “Various Feelings of Grief.” We’re interested in how the language suggests ordered stages, and that’s how most people will perceive it.
When a model puts us in a certain framing, how can we tell? How do we understand what framing a model imposes on us? How do we engage with any model? How can we evaluate it? Should we break out of that framing? Do we accept the model as is? Or do we use it as scaffolding for finding better models, that are better suited to our situation?
Models Are Worldviews
Models, whether for a software system, a development process, diseases, political systems, or otherwise, are a way to look at (a part of) the world. They explain how something behaves and how to modify that behavior. Every model makes a choice about what is important, what categories we classify things in, what we see, what’s invisible, what’s valued, or even what’s valid. Models are reductionist; that is, they only show a selection of the subject they’re describing, and lose something in the process. And models are biased: they implicitly reflect the assumptions, constraints, and values of the model’s author.
Most of the time, when you adopt a model created by someone else, you assimilate it into your worldview without much thought. You acquire a new way of seeing something and accept it. It’s how learning works. However, when you do that, you may not understand the model’s limitations.
Models mess with you. They impose a distinct perspective, and a set of rules for operating within them. They frame how you look at your problem. And you’re usually not aware of this perspective and framing.
But you can choose to look at a model more intentionally. If you’re looking at a model for the first time, you can use your fresh perspective to see what it includes and what it leaves out. You can critically assess whether that model fits your view and your needs. Models are a powerful lens for perceiving a subject, and you should be deliberate when wielding them.
We’d like to share some tools for critically evaluating any models that come your way. We’ll do this by example. We have picked a number of organizational models to illustrate how to examine them and see whether they offer what you need to solve your problems. First, we’ll discuss the hierarchical, social network, and the value creation models. Then, we’ll examine three organizational models that are specific to software development: the Spotify Model, the Agile Fluency Model, and Team Topologies.
Our goal is not to voice an opinion one way or the other about these different models, or to suggest which ones to use for your organization. Our interpretation of these models is by necessity brief. If you are familiar with any of them, you might feel we do them a great injustice by our summarizations. We ask that you look past the specifics of the organizational models we examine, and instead focus on the methods we demonstrate for evaluating any model, in any context.
Here we use well-known published models to illustrate approaches for understanding and comparing them. But these techniques apply equally well to various models that you may encounter or create, such as domain models, working agreements, business processes, or politics.
Hierarchies and Networks
A traditional way of looking at organizations is the hierarchical model. In this model, power is concentrated at the “top” and authority is structured into levels. People talk about “going up the chain of command” in order to reach the “right person” in the hierarchy who has the authority to make a decision. People have often mistaken this model for the whole: “an organization is a hierarchy”. In reality, the model only describes one aspect of an organization.
It’s often by looking at alternative models that we see the limitations of this hierarchical model. For example, a competing model says: “an organization consists of a hierarchical control structure and a social network”. The social network is who you talk to, and how you informally acquire information at the coffee machine or over lunch. This model is an improvement over the hierarchical model: it shows us that human interactions can affect the functions of the organization. A manager might want to encourage their organization’s social network, or change the dynamics to improve opportunities for informal interactions. They’ll try to improve the functions of the organization, and make sure the social network doesn’t hinder them.
In our role of model evaluators, this teaches us that an organizational model can have two networks operating simultaneously, each representing a different function of the organization. That’s significant, because if our organization has two networks, it is possible to have more.
Heuristic: If there are two ways of looking at something, look for a third.
Another model augments these two organizational networks with a third network: the value creation network. This is the network you turn to to produce meaningful outcomes. It’s knowing which people to ask when you need something done, who the experts are, who can execute it. By modeling our organization now as a hierarchy, a social network, and a value creation network, we are challenging the assumptions made by the previous two models. The new model says: “value creation doesn’t happen through the same channels as control or gossip.”
Before we adopted this third organizational model, there were two possibilities: either value creation was not on our radar at all, or value creation was a responsibility of one of the other two networks (most likely the hierarchy).
When we accept this third model, we believe that value creation:
is important enough to be recognized as an organizational responsibility;
doesn’t happen in a hierarchy or social network;
happens in its own network.
This new model doesn’t simply add a third network. The new model supplants the prior belief that value is created in the hierarchy, with the belief that it is created through a value creation network. Through the addition of the third network, the other two networks in the model have acquired a new purpose. The third network adjusts your perception of the other networks.
Say a manager has been mandating that all communications and decision-making happen along the lines of the hierarchy. Now, as they accept the new model, they change how they operate, and instead allow and encourage social and value creation networks to thrive. The manager will no longer demand that all communications take a certain shape, and will choose not to exert control over all aspects of communication.
None of these three models capture reality. Instead, they capture beliefs about reality that, in this case, drive business decision-making. Each model alters the way we perceive reality, and how we take action based on that perception. By accepting any model, we accept the belief system that comes along with it, and we change our behavior accordingly.
From our short analysis of three organizational models, we can derive several heuristics for helping to evaluate any organizational model:
Heuristic: Compare different models to figure out what one model adds or omits, emphasizes, or downplays.
That is, use one model to discover if there are any gaps in the other. Look for ways one model might give a lot of attention to one aspect and not to another.
Heuristic: Understand the underlying belief system that comes with the model.
If I accept this model, how does it change my beliefs about the part of the world that this model addresses?
Heuristic: Determine whether the model confirms my existing belief system and values, or conflicts with them.
Am I choosing models based on how well they conform to something I already believe?
Heuristic: Understand whether the model addresses or solves a problem that I’m interested in solving.
It may be an appealing model, but do I really need it?
Heuristic: Ascertain whether the model points me to problems I might have but haven’t yet considered.
We find these heuristics are useful for critically evaluating any model, not just organizational models. In this text, however, we’ll stick to organizational models.
As we have shown, a model creates a lens, a way of looking at the problem. Additionally, models often come with a richer set of guidelines or instruction sets that tell you how to use them.
The organizational hierarchy model, for example, comes with advice on how to achieve a more desirable organizational structure, such as “Don’t have more than 6 levels,” or “Have between 5 and 30 direct reports per manager,” or “When your organization gets too big, split it along these functions.” These guidelines are congruent with the building blocks of the hierarchical model and with the belief system that underlies it.
Guidelines are useful for implementing this model. Whenever you embrace a model, its instructions lead you to considerations that it views as important. It focuses you on the changes that it wants you to make. It’s saying “this is an important thing to worry about”. The things it finds important have names: the levels, the functions, the reports, … Later, we’ll look at models that don’t include names for levels and reports, but do include names for teams, value streams, fluency, or evolution. The choices of things a model names are what puts us in a frame.
But by discussing “How many levels do we need?” we’re distracted from asking the deeper question of “Do we need levels at all?”. Because the hierarchy model comes with a name for levels, it imposes the need for levels.
The methods and tools that come with any model are insidious in that sense. The presuppositions that a model makes are:
The model is adequate (an organization is indeed a hierarchy);
The tools and techniques are adequate (choosing the right levels is the right solution for organizational problems);
If you fail, you haven’t applied them well (choosing not enough or too many levels).
Models rarely include a technique for determining whether the model fits your environment.
If we look at the flat organizational model, we see that it also makes some presuppositions (which conflict with the hierarchy model):
The model is adequate (an organization should indeed be flatter);
The tools and techniques are adequate (removing middle management is the right solution to improve your organization);
If you fail, you haven’t applied them well (you’ve allowed too much of the old hierarchy to survive).
But, again, if we only discuss techniques for removing middle management, we can forget to ask if any functions of middle management are not being handled with a flat organization model. The flat model doesn’t tell us to look for valid functions of middle management, and it doesn’t provide a pattern language for understanding these functions.
Models help you ask questions, but not necessarily the right ones. If we compare different models, however, we can take the questions from one model and ask them in the context of another. By looking at their different answers, we can see where they diverge. And in the case of the hierarchy and flat organizational models, we can see they have diametrically opposed worldviews. Let’s look at some organizational models specific to software development next.
The Spotify Model
A popular model for structuring organizations is the Spotify Model.2 Again, this model introduces elements not found in previous models we examined. Roughly, the model consists of Squads (cross-functional autonomous teams that are responsible for a discrete part of the product, led by a triumvirate of a Product Owner, Tribe Lead, and Agile Coach), Tribes (for coordinating multiple Squads), a Tribe lead that coordinates with other Tribes, and Alliances (for collaboration across Tribes). Then you have Chapters (to deal with shared interests, such as standardization and preferred practices in a field). Finally, a Guild is an ad hoc grouping of people interested in learning a specific topic.
Although the language is different from our previous models, there’s still a clear sense of hierarchy in this model, but it’s fairly shallow. This probably explains why it’s a popular model: it appeals to managers who believe some form of hierarchy is critical to delivering anything, but it doesn’t emphasize hierarchy at all.
The interesting thing about this model is that it introduces two styles of learning into organizational models: one for broad institutional learning (Chapters grow the skills of the organization as a whole and capture that in standards and common practices), and another for individual upskilling (Guilds). Both are knowledge creation networks. All organizations have a need for learning, of course. But by explicitly naming Chapters and Guilds, the Spotify Model is saying that learning is important enough that you should deliberately structure your organization to support it. It says that creating the knowledge network should be on people’s radar as a distinct responsibility, not just an accidental by-product. Underlying this model is the belief that knowledge work requires structural support.
By adopting the Spotify Model, you’re accepting that you need some hierarchy, but that it should be limited; and that besides value creation, individual and institutional learning are critical problems to organize for.
When you first learn about the Spotify Model, you accept implicitly that learning is an important concern. But if instead of merely accepting this model, you actively compare it to other models, you can ask: “Why is it that value creation and knowledge are considered important? Which other aspects does this model exclude, and why?” Comparing models makes it easier to see what each model brings to the table.
How do other models solve for growing organizational knowledge? For example, some organizations introduce Centers of Excellence. These are standalone teams with dedicated skilled members. This is different from Chapters in the Spotify Model, where the members of a Chapter are also members of different Tribes. Because we are aware now that we can organize for knowledge, we can actively seek out other models, and compare how they address it.
Let’s take a quick look at another model which is focused on learning. Agile Fluency is an organizational model that represents teams and organizations as being in different stages of their progress in becoming agile. These stages are defined by the activities they do to improve their capabilities and value to the business.
Focusing. Benefit: Greater visibility into teams’ work; ability to redirect. Investment: Team development and work process design. Learn from: Scrum, Kanban, non-technical XP.
Delivering. Benefit: Low defects and high productivity. Investment: Lowered productivity during technical skill development. Learn from: Extreme Programming, DevOps movement.
Optimizing. Benefit: Higher-value deliveries and better product decisions. Investment: Social capital expended on moving business decisions and expertise into team.
Strengthening. Benefit: Cross-team learning and better organizational decisions. Investment: Time and risk in developing new approaches to managing the organization. Learn from: Organization design and complexity theories.
Image and table: Diana Larsen and James Shore, agilefluency.org
The picture alone doesn’t do Agile Fluency justice of course. The underlying assumption of Agile Fluency is that when your organization isn’t producing enough value, you can solve this problem by first improving individual teams’ capabilities and ways of working, and then restructuring the organization itself.
Like the Spotify Model, this model addresses learning as an explicit organizational need (as opposed to something you tack on). Unlike the Spotify Model, however, Agile Fluency is not a model for structuring your organization. It does recommend that you address your organizational structure, but leaves that to other models. The scope of the Agile Fluency Model is about agile culture and skills.
What this model does introduce is the idea that teams, and therefore the organization, progress over time. This notion of progression is not present (or in any case it is not very obvious) in the previous models we have discussed. It’s not that those models force you into stasis; you can always change things around. But the previous models don’t have the building blocks for representing progressive change. You’re just supposed to change when needed, with no support from the model, and no language for explicitly reasoning about change.
In Agile Fluency, you can impact progress by choosing the appropriate interventions (focusing, delivering, optimizing, strengthening) at the right time, in the right order. The model’s worldview is that doing these activities out of order is not going to yield the results you’re looking for, because capabilities depend on previously acquired skills.
Now that we understand how the building blocks of Agile Fluency represent progress (instead of structure), we can use this view to compare it to other models.
The Spotify Model allows for evolution (you can split up teams and reorganize them). It also allows for learning, and encourages it by introducing structure for learning. But it stops short of explaining how to use that learning to progress. What should progress or evolution look like in the hierarchy model, in the Spotify Model, or in any other model? The answer you come up with leads to the next question: do I care about progress in my context? Maybe your situation doesn’t call for progression. If it does, you’ve learned something about the gaps and assumptions of the model you use.
Heuristic: Compare different models to understand what you actually care about.
Heuristic: If a model doesn’t explicitly address something you value highly, try fitting in something from another model and see if the model still hangs together.
Another interesting aspect of the Agile Fluency model is that it has an aspirational aspect, and admits this openly. True fluency playing a musical instrument is achieved over a very long time, and after a lifetime you can still learn more. Similarly, with the Agile Fluency model there is no achievable end state, but you should still strive to get in the strengthening zone. It is worth increasing your skills and capabilities, even if you never obtain all its benefits.
Image: Henny Portman
Team Topologies is a software organizational model that focuses on fast flow and value creation. It advocates composing your engineering organization from four types of teams and three types of team interactions. You should align teams on a common purpose, and reduce their cognitive load, so they can be efficient and focused. This model explicitly values reducing unnecessary cognitive load. The patterns (or topologies as they call them) are Stream-Aligned Teams, Enabling Teams, Complicated Subsystem Teams, and Platform Teams. These teams interact using three communication “modes”: Collaboration, Facilitating, and X-as-a-Service.
When you have teams that don’t map directly into these patterns, they “should either be dissolved, with the work going into stream-aligned teams or converted into another team type … [and] the team should adopt the corresponding team behaviors and interaction modes.”3 Of course there’s more to Team Topologies than our summary, but team types and communication styles are the building blocks of the model.
With Team Topologies, there isn’t a static structure with fixed hierarchies as in the Spotify Model. Instead, you combine the team types and interactions to suit your business goals in order to create value. Another difference is that the Spotify Model values team autonomy so much that it doesn’t directly address cross-team collaboration. Team Topologies advocates autonomy as well, but includes collaboration building blocks. However, it advises keeping interactions between teams tightly constrained.
What’s particularly interesting about Team Topologies, in our opinion, is that it defines a set of composable patterns. As an implementer of Team Topologies, you are expected to combine these patterns in ways that optimize for flow in your context. Not only that, but you should expect this structure to evolve. When you sense that your context has evolved, you are expected to reshape your organization’s team structure by reapplying these patterns. If this sounds like the design pattern catalogs that we have in software design, it’s because it is very much inspired by that idea.
Image: Henny Portman
Like software design patterns, Team Topologies even comes with some anti-patterns, such as team structures and communication styles to avoid. Unlike pattern catalogs, which typically are open-ended (or at least admit that they are incomplete), Team Topologies doesn’t encourage us to extend the model with new team types and interaction modes.
In our context of evaluating organizational models, using patterns and a catalog of patterns that can be assembled to structure an organization is a powerful idea worth exploring. Perhaps the other models would be better served with such a pattern language. Future models could take this approach even further and create pattern languages for other kinds of organizations that aren’t doing software engineering.
Team Topologies is rooted in the DevOps philosophy. Although it seems to claim to be suitable for all software organizations, it only addresses product-centric organizations. We don’t feel it is suitable for systems like life-critical medical devices or railway infrastructure, because it doesn’t have building blocks that address reliability or safety. It is biased towards this premise: create value and enable flow in product-oriented software organizations. It proposes a small, specific set of building blocks as the means to achieve that. If we wanted to, say, organize for human safety over fast flow, we’d need to come up with a different set of patterns for doing so. If we organized for software quality, correctness, worker wellbeing, learning, or domain understanding, our choices would be different again. Optimizing for flow may ignore other important aspects of systems that you may need to consider.
Heuristic: All models are created within a specific context. Does the original context match yours? Can it be adapted?
You could fit a need for considering human safety into Team Topologies, by moving the responsibility for overall safety into a stream-aligned team, having an enabling team to support them, and a platform team to build the necessary infrastructure. But that is not the same as the model being explicitly designed to enable safety. None of Team Topologies’ defined building blocks suggest that a specific team type or interaction mode serves to enable safety. Likewise, nothing in Team Topologies explicitly addresses learning: you can fit learning in there (the book suggests that some enabling teams are really like chapters and guilds from the Spotify Model), but nothing in the model explicitly supports learning. Team Topologies doesn’t have a rich model and language for how teams learn, as exists in the Agile Fluency model.
These omissions are not a problem with Team Topologies per se, but rather a consequence of its focus. Team Topologies has value creation and flow as its scope. By presenting only building blocks for managing flow, it frames flow as the central concern worth organizing for. Consequently, it does not include explicit building blocks for learning or safety or programmer happiness. These omissions should be no surprise.
A problem arises, however, if we adopt the Team Topologies model, with its building blocks and patterns and language—thinking it will solve our organizational problem—without first asking whether improving flow is indeed our most pressing need, and whether we have other needs. If we ask such questions, we can avoid the Golden Hammer Syndrome. Furthermore, by comparing models we can figure out what we need from a model and what its benefits and limitations are.
Explicit Building Blocks
Building blocks are explicit, named concepts in a model. They’re not the focus point of the model (like “control” in the hierarchy), but they’re the elements the model uses to create that focus (like the “levels” in the hierarchy).
A building block is a lightning rod. It predictably focuses attention and energy. The things you find most important should be first-class building blocks in the model you’re introducing.
Heuristic: If you care about something, have a building block for it.
When you have a building block for something important, it draws people’s attention. Simply by existing, that building block tells people that this is something that requires their consideration. If, on the other hand, there is no building block for it, people might still give it attention, but there is no home for it. Any attention given could be spotty, brittle, temporary or even deemed unwelcome.
Most of the time, when you introduce a new model, you’ve spent the time to understand it and you buy into all it has to offer. But others will not understand it as deeply as you do, nor will they apply it as rigorously as you do. You may understand where the model can be extended or adapted, but others will apply the model as is. So even if you can make the model work for your context by combining its building blocks with your adaptations, it doesn’t mean others will. They are more likely to do it by the book (or by following your instructions). Adapting the models and their building blocks to your needs gives you influence over people’s attention, and therefore over the outcomes they produce. If you don’t adapt these models, bringing focus to what you value will be much harder.
If, say, you’re reorganizing and you want to improve how institutional learning happens in your organization, you can either pick a model that explicitly addresses it with its building blocks (for example, Chapters and Guilds in the Spotify Model), or you can adapt another model to suit your needs. It’s not that this learning won’t happen in, say, a hierarchy or Team Topologies. But without the attention-grabbing power of a good, explicit, and well-explained set of building blocks, there’s not much that tells others what to do and how to do it.
In all knowledge organizations, individual and institutional learning is important. If you’re looking to apply an organizational model, you can ask how it deals with learning, and whether that’s important to you. If learning is happening smoothly, you don’t need to touch it. But if you feel it requires change, you need to have learning explicit in the model.
If instead of improving learning or flow, you want to improve the quality of delivery, or the safety of life-critical systems, your organizational building blocks would need to express that, whether they’re blocks for types of teams, interactions, activities, meetings, or reports.
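For readers who think in code, the heuristic "if you care about something, have a building block for it" can be sketched as a small check. This is only an illustration: the concern tags below are our own loose reading of the summaries in this text, not something the models define, and the names MODELS and concerns_without_building_block are hypothetical.

```python
# Hypothetical tagging of each model's building blocks with the concerns
# they draw attention to. The tags are an assumption made for illustration.
MODELS = {
    "Hierarchy": {
        "Levels": {"control"},
        "Functions": {"control"},
        "Reports": {"control"},
    },
    "Spotify Model": {
        "Squads": {"value creation"},
        "Tribes": {"coordination"},
        "Chapters": {"institutional learning"},
        "Guilds": {"individual learning"},
    },
    "Team Topologies": {
        "Stream-Aligned Teams": {"flow", "value creation"},
        "Enabling Teams": {"flow"},
        "Complicated Subsystem Teams": {"cognitive load"},
        "Platform Teams": {"flow", "cognitive load"},
    },
}


def concerns_without_building_block(model, concerns):
    """Return, sorted, the concerns that no building block of `model` is tagged with."""
    covered = set().union(*MODELS[model].values())
    return sorted(set(concerns) - covered)


if __name__ == "__main__":
    wanted = {"flow", "institutional learning", "safety"}
    for name in MODELS:
        print(name, "has no home for:", concerns_without_building_block(name, wanted))
```

An empty result does not mean the model handles a concern well, and a non-empty one does not mean the concern cannot be fitted in. It only shows where attention has no named home yet.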
When the Team Gets Too Big: a Case Study
Models guide our choices for appropriate actions to take. As an example, let’s ask what we should do when a team gets too big. We’ll compare what different models suggest we do, to see what they teach us about their approach, and their underlying value systems.
Heuristic: When you’re evaluating models, find questions that are relevant in your context, and use them to compare the models.
In a traditional hierarchy, when a branch has too many people, you add more subbranches and a corresponding layer of management to manage the people who report to them. At some point, when a hierarchy gets too deep, you make it wider: you split it apart by adding departments or business units with their own management, and allow these to each have their own hierarchies. The value system behind hierarchy is control. It assumes that to scale control, you need to delegate control and control the controllers; so the idea of growing deeper or wider is embedded in the model.
From Jason Yip, a Senior Agile Coach at Spotify, we get some clues on how Spotify deals with splitting squads or tribes:4
“You wait for seams to appear: clunkiness in communication flow and interaction patterns. (…) Then you nudge things apart. Maybe a subgroup of people has a slightly different rhythm and events. Nothing formal, just nudging things apart. Over time you notice, hey, I’m not interacting with the other people in the team anyway. So you just formalize that [split]. If done correctly, this is mostly an acknowledgment and a non-event.”
Jason calls this “organic organizational design.” In that model, structure should always follow strategy.
In Team Topologies, when the flow of a Stream-Aligned Team slows, or the team is experiencing increased cognitive load, then it’s grown too big. To fix this you would look for multiple value streams hiding in there, and split the team according to those lines. You could also look for people working on a highly specialized aspect of the value stream, and move them to a separate Complicated Subsystem Team.
When a Platform Team gets too big, Team Topologies suggests that the topologies are fractal. The Platform Team acquires internal topologies, consisting of internal Stream-Aligned Teams, Complicated Subsystem Teams, and Enabling Teams.
The Spotify Model makes no mention of Platform Teams, but how would it solve the problem of a Platform Team getting too big? We think that you would create a Chapter that oversees and supports multiple Platform Teams, helping them with standardization and tooling, but you wouldn’t have them impose too much. This solution is different from a Team Topologies Enabling Team, because the Chapter is composed of people from across the different Platform Teams.
Contrasting Spotify and Team Topologies
Interestingly, Team Topologies has no building blocks for any kind of group that spans people from different teams. Of course, a Team Topologies organization could have such groups, but as a model it doesn’t tell us to have them, doesn’t tell us how to organize them, or even to recognize that such groups exist.
In the same sense, Tribes are another key difference between the model at Spotify and Team Topologies. Tribes are larger structures than Squads. A Tribe manages strategy and goal-setting at a higher level. According to Jason Yip, Squads are less important to preserve than Tribes. Tribes can evolve fast and react organically to change. Team Topologies puts teams first, suggesting that they be long-lived and seeing teams as the central unit of organization. Teams are the only tool for looking at organizations, so there’s no natural place for organizing in terms of business strategy. You organize for flow.
Another fundamental difference between these two models is that the Spotify Model doesn’t mention cognitive load. Instead, it talks about clunkiness, friction, long meetings, … The term “cognitive load” doesn’t show up in its vocabulary. By naming cognitive load, Team Topologies calls attention to it, telling us explicitly to look for and reduce it.
Team Topologies and the Spotify Model also perceive interaction between teams differently. The Spotify Model assumes interactions happen organically inside Squads, between Squads in Tribes, and cross-cutting through Chapters and Guilds.
Team Topologies sees too much interaction as an inhibitor to fast flow. Consequently, interactions need to be designed explicitly. Teams are supposed to be limited to as few interactions and interaction styles as possible. Ideally, these interactions are expressed in a “team API”: a set of well-defined places where communication happens, such as pull requests, issue trackers, or chat channels.
Finally, when an interaction doesn’t work well in the Spotify Model, an Agile Coach will intervene. That Agile Coach is an explicit building block in the Spotify Model. In Team Topologies, there are no building blocks for coaches, managers, or other roles that can intervene. (Possibly, Team Topologies wants teams to self-organize, but there’s also no mention of that option.) Again, lack of an explicit building block doesn’t preclude you from having manager roles in Team Topologies, but you won’t get any support for them from the model. Team Topologies doesn’t tell you to focus on such a role or the value it might bring to the organization.
Comparing Models Yourself
We encourage you to compare models to evaluate them. As we have shown, one way to compare models is to ask questions about how those different models might handle a particular problem you are interested in solving. To get started, pick situations (real or imagined) relevant to your environment and test them against the models. Then, try to imagine how each model would handle them.
This is speculative: during this evaluation you’re taking a theoretical view of how the models work and how they stack up against each other. That will never beat the insights you can get from practical experience applying the models. But it’s much cheaper to test-drive a model through realistic scenarios, than it is to put it into practice only to uncover big surprises a year down the road. Besides the measurable costs, reorganizations lead to “reorg fatigue.” Likewise, models that address aspects other than organizations might have varying costs. Running thought experiments with scenarios this way teaches you something about the value system of each model early on. It also exposes whether the model cares about the same things you find important.
As you run various scenarios, you’ll notice that:
some models may give very detailed advice (e.g. “split a team when it exceeds 10 members”);
others offer only vague guidance (e.g. “teams shouldn’t be too big”, without quantifying what is too big or what should be done);
some models have nothing to say about that specific situation.
None of these are bad things. All models include and omit different things, and give more practical or more broad advice. But when applying a model you need to understand who you’re getting married to.
Now that you have some ideas on how to compare models, we’d like to give you some more clues. First, we’ll discuss how some models (or their authors) are authoritative. They provide one way, and aren’t open to being extended. Then, we’ll discuss how our own cognitive biases can influence our perception of models. Finally, we’ll look into how models are intended to be used: as observational tools, or as prescriptions of how to act, or even as aspirational guidelines for a future state.
You might be familiar with SOLID. It’s a set of five object-oriented software design principles. The first letter of each principle forms the word SOLID. For example the Single Responsibility Principle is the “S” in SOLID. This model gives us a framing right off the bat:
There are only five design principles worth caring about.
It’s these five that matter, and not any others.
They apply to all software design, independent of context.
You’re not just supposed to apply these principles in an ad hoc fashion, but you’re supposed to apply them at all times.
The name implies that by applying them, your design will in fact be solid, which is a good thing.
As before, we want to break out of the framing of this model, so that we can understand it better. Here are some questions we ask: Why five? Why these five and not others? What if a sixth principle is discovered? In what context were these principles formulated? Do they apply to the design of all kinds of software in all contexts? Are they as true now as they were 25 years ago? Has software design changed since then?
As evidenced by their framing, it’s clear that the authors intended for this model to be authoritative. As an adopter of this model, you’re not supposed to extend it, or challenge it. (And many of its adopters seem to buy into this framing.)
In reality, there are dozens (if not hundreds) of design principles that have been discovered, long before and long after SOLID. And yet, the SOLID model didn’t adapt. Even if you wanted to evolve it, SOLID doesn’t offer any hooks for extensibility.
The Hierarchical organization model is another example of a definitive model. The building blocks are the hierarchy, where decisions made at the correct level are pushed down, and the reporting goes up the hierarchy chain. There’s no room in the scope of the model to extend it. Extending the hierarchy differently, altering decision-making, or changing the reporting structure doesn’t fit in the belief system that underlies this model.
Some models are authoritative and definitive, while others are open to being extended. But models rarely make this distinction explicit. They don’t tell you whether to extend them and in what ways. It’s hidden in the language they use, the names and the structures they propose, and in the communications of the authors about the model.
Often, all you need is a straightforward model to tell you how to do something. In this case, the model’s simple building blocks are an enabling constraint: they help you to see something clearly, and make progress faster. In other cases, an authoritative model that seems suitable at first sight, can become a limiting constraint over time. In this case, on top of dealing with your original problem, you’re now also faced with problems caused by the limitations imposed by the model. You start force-fitting things into the building blocks, instead of using the building blocks to achieve your goals. Force-fitting causes you to lose essential components of your context because they don’t fit the model. Inversely, it also causes you to overemphasize unimportant components of your context because the model gives them focus. The metaphors that the model uses, can distort your perception of your context.
Finding alternative models, again, can help us out of this problem. In the case of the Hierarchical model, it’s the Social Network and the Value Creation Network models that break us out of constraints. They tell us to stop accepting the framing of the Hierarchy.
Heuristic: Are you fitting the model to your context, or are you force-fitting your context to the model?
Getting Too Cosy with the Model
When we are faced with something new, like a model, we are initially inclined to question it. But we want to believe. We like certainty. So as soon as we’re comfortable with the model, we stop questioning. Being in a state of cognitive ease makes us more accepting of new information or ideas.
Successful models tend to have some characteristics that contribute to our cognitive ease. They have a small number of building blocks clearly presented (perhaps even simplistically so), usually 7 or less. There’s symmetry, balance, and the structures are often repeating. Strangely, these characteristics make us believe the model is more sound, more rigorous, better thought out. It may well be, however, that the model’s structure has been simplified to be symmetrical, balanced, repetitive, and slim. We associate feeling at ease with goodness. Visualizations that fit our sense of “good” structure also help sell a model: quadrants, pyramids, funnels, radars, Venn diagrams with three neatly intersecting sets.5
Anecdotes are another tool for putting us at ease. If something is explained to us through stories, we’re more likely to accept it. We trust the story that a familiar face tells us more than when someone gives us critical facts. When a model comes to us with anecdotes of how it contributed to a successful outcome, we don’t question the story.
When we’re at cognitive ease, we trust our intuitions. We narrowly frame our problems to fit the models we’re comfortable with. We become overconfident.
The remedy to cognitive ease is a healthy dose of skepticism, aimed not just at new models, but at our existing models as well. Skeptics examine things, even if this makes them uncomfortable. They let go of certainty in order to see more of what the problem really is and what the model has to offer. We use “getting out of your comfort zone” for social situations or challenging tasks, but it applies to our inner world as well.
Heuristic: Be more skeptical of nice models, be more open minded to jarring models.
Be on the lookout for things that don’t fit, that are slightly awkward, that break the nice structure of the model you’re using. Allow yourself to entertain multiple competing models at the same time, and to context-switch between them liberally. This gives you a certain freedom: no longer boxed in by a single model, you can come to see the models for what they are: potentially useful worldviews, not “truths.”
Heuristic: Models are potentially useful worldviews, not truths.
Prescriptive, Descriptive, Aspirational
In the 16th century, the word network referred to wires or threads arranged in a fishing-net-like construct. In the 18th century, the term was first adopted to represent railroads and canals; by the 1940s it meant radio broadcasting, then linked computers in the 1970s, and finally, in the 1980s, it acquired the meaning of building human relationships. Our modern use of the word is a metaphor, with extra steps.
The model that introduced the social network inside organizations is descriptive. Someone observed their environment, and noticed a pattern of informal communications. They then mapped this pattern to the metaphor of the network. Using the network metaphor, they found a shortcut to describe these real-world patterns. Perhaps you have seen these informal communications as well, but you had never truly noticed them. But then, when someone points out to you that it is a social network, a lightbulb goes on. You now have a language for it.
A descriptive model doesn’t have an intention, it only means to clarify, and help you see something in a new light. This change of perspective does have consequences: you might change your behavior, for example, by creating opportunities for people in your organization to mingle socially.
A prescriptive model, on the other hand, deliberately aims to change your behavior. The Spotify Model clearly has this intention. It’s telling us that our situation will improve if we restructure our organization according to the prescribed building blocks. The Spotify Model originated as a descriptive model of how people organized, before it was turned into a prescriptive model for organizations by others later.6
Heuristic: Does this model help me see something clearly, or does it intend to create a change in behavior?
The distinction between descriptive and prescriptive is not as clear-cut as we might like. Only in science do we find purely descriptive models. While the social network model intends to be descriptive, by introducing the term into your organization, you make a value choice. You’re saying that organizations should care about the social network. Usually this is followed by a prescription of how to do that. You may want to enable this network, leave it to evolve organically, or prohibit it. When you adopt a model that you believe to be descriptive, but that actually comes with an underlying value system that you unconsciously adopt, then the model controls you. You are inadvertently allowing it to affect your behavior. By critically evaluating whether a model is mostly descriptive or prescriptive, you can choose to adopt its values and use its prescriptions more intentionally.
Heuristic: Find the prescriptions hidden in innocent looking models, and use them to your advantage.
Some models go even further: they are aspirational. They describe an ideal situation that has not been observed before. The author has created a model of how they believe the world could work, but has not actually applied it, or only partially. Aspirational models serve as a call to action. They speculate that, if we were to adopt all their prescriptions, we could get close to that ideal state, but perhaps never reach it. If we understand that a model is aspirational, then our goal is not to apply it precisely to the letter, but to get as close to it as is useful and practical. Interestingly, some models start as aspirational, but as people start getting better at explaining and implementing them, they observe that the model does indeed give them its stated benefits. If used wisely, aspirational models can drive positive change by getting people to see a new possibility.
Heuristic: Consider that some models may not actually have been tried by the authors.
Putting the Model to Work
With many of the models we acquire throughout our lives, we only engage superficially, and that’s fine. We absorb and assimilate a lot of things without ever deeply engaging with them. This happens when, for example, you’re being onboarded at a new employer, and you simply learn their operating models for doing things in certain ways. The same goes when you’re given some design specifications and implement them as stated. And as a student, you’re usually expected to learn and adopt the models you are taught.
But when we’re trying to achieve something important, we need to take a more critical view. After learning of a new model, we need to decide whether we want to put it to work to help solve our problem. We need to engage more deeply with that model before we decide to commit to it. What if we could make this engagement an intentional process?
What we’ve been trying here, is to identify competing models and to compare them to understand their benefits and limitations, their framing, their values, and the contexts in which they apply. This comparison breaks us out of thinking there is only one option to solve our problem.
We’re making an upfront analysis of a model, before we invest in the model too heavily. This comes with risks: Because we haven’t lived with this model yet, we might be overanalyzing things. We may discount some of the benefits because they only surface after spending some effort implementing the model. Still, if we do this analysis upfront, it saves us effort that we would otherwise waste on implementing an unsuitable model. As we’re not invested deeply yet, it’s much cheaper to identify faults in a model and how to address them. To avoid the risk of “analysis paralysis,” we remain conscious that no model is perfect, but may still be useful.
Heuristic: There’s always another possible model.
Then, when we find a new model that has a reasonable representation of the problem we’re trying to solve, we can adopt it. We apply its building blocks, and execute the activities it prescribes. As we do, we need to be observant: there is a thin line between learning to adopt a model, and overfitting a model to our situation. This friction is valuable data. Maybe the model has problems, lacks detail, has too much detail, focuses on the wrong aspects, has unintended consequences, or doesn’t directly tackle the problem we’re actually trying to solve. Maybe the model doesn’t fit that well. Or perhaps we don’t want to use the model in the way it’s prescribed. Perhaps we have different values than those supported by the model. If we collect this feedback early, we can use it to reexamine the model.
Heuristic: Following a model as prescribed is not the same as achieving success.
In other words, no matter how beautiful your diagram of hierarchies, Squads and Tribes, or Stream-Aligned and Enabling Teams is, the proof is in the pudding.
As we gain more experience with a model, and learn from feedback, we’ll feel more comfortable reshaping the model, bringing in elements from other models we previously dismissed, adjusting it to our context, and making it truly our own.
In summary, these are the five steps for engaging with a model:
Intentionally study models
Analyse and compare them critically
Adopt a model in your context
Gather feedback about the impact
Reshape it to your needs
Did we just present to you a 5-step model for evaluating models? Are there other models for evaluating models you can compare this one to? Should you adopt our model or create your own? Does our model put you in a certain framing that isn’t relevant to your context? We leave all these questions as an exercise for the reader.
Our worldviews and behaviors are mostly the result of other people’s models that we’ve assimilated, along with their biases. Models give us a framing, they highlight some aspects of the problem and obscure others. This is useful, assuming they focus on aspects that are important to us.
When we introduce an unsuitable model into our environment, it distracts us. It makes us ask questions in the framing of the model, and we don’t see that we should be asking different questions. It does this through its building blocks. These building blocks act like lightning rods. They attract the attention and energy of the people who are engaging with them.
Understanding a model’s framing is hard, precisely because we’ve been put into that framing. In that sense, we’re like the fish who can’t understand the world outside of the fishbowl, because nothing in their environment offers any building blocks to explain and structure it.
Acknowledging that you have limited agency is liberating. Comparing a model to other models gives us an edge: we can see more clearly which aspects we could highlight or obscure. By comparing models, we can break out of their constraints, and engage critically with the models.
Here’s a quick recap of blog posts I wrote in 2021.
Agile Experience Reports
Juggling Multiple Scrum Teams I introduce Iuri Ilatanski’s experience report about life as a multi-tasking Scrum Master. Juggling involves meeting each team’s specific needs. I was Iuri’s “shepherd”—his sounding board and advocate—as he wrote this report presented at Agile 2021. Thank you, Iuri, for being so open to discussion, reflection, and the hard work of revising your writing.
Design and Reality We shouldn’t assume domain experts have all the language they need to describe their problem (and all that you need to do as a software designer is to “capture” that language and make those real-world concepts evident in your code).
Models and Metaphors Listening to the language people use in modeling discussions can lead to new insights. Sometimes we find metaphors, that when pushed on, lead to a clearer understanding of the problem and clarity in our design.
Noisy Decisions After reading Noise: A Flaw in Human Judgment by Daniel Kahneman, Olivier Sibony, and Cass Sunstein, I wrote about noisy decisions in the context of software design and architecture. These authors define noise as undesirable variability in human judgment. Often, we want to reduce noise and there are ways we can do so, even in the context of software.
Is it Noise or Euphony? At other times, however, we desire variability in judgments. In these situations variability isn’t noise, but instead an opportunity for euphony. And if you leverage that variability, you just might turn up some unexpected, positive results.
Too Much Salt? We build a more powerful heuristic toolkit when we learn the reasons why (and when) particular heuristics work the way they do. I now think it is equally important to seek the why behind the what you are doing as you cultivate your personal heuristics.
When a complex technical domain isn’t easily captured in a model, look for metaphors that bring clarity.
One of us (Mathias) consulted for a client that acted as a broker for paying copyright holders for the use of their content. To do this, they figured out who the copyright holders of a work were. Then they tracked usage claims, calculated the amounts owed, collected the money, and made the payments. Understanding who owned what was one of the trickier parts of their business.
-“It’s just a technical problem.”
-“But nobody really understands how it works!”
-“Some of us understand most of it. It just happens to be a complicated problem.”
-“Let’s do a little bit of modeling anyway.”
Determining ownership was a complicated data matching process which pulled data from a number of data sources:
Research done by the company itself
Offshore data cleaning
Publicly available data from a wiki-style source
Publicly available, curated data
Private sources, for which the company paid a licence fee
Direct submissions from individuals
Agencies representing copyright holders
The company had a data quality problem. Because of the variety of data sources, there wasn’t a single source of truth for any claim. The data was often incomplete and inconsistent. On top of that, there was a possibility for fraud: bad actors claimed ownership of authors’ work. Most people acted in good faith. Even then, the data was always going to be messy, and it took considerable effort to sort things out. The data was in constant flux: even though the ownership of a work rarely changes, the data did.
The engineers were always improving the “data matching”. That’s what they called the process of reconciling the inconsistencies, and providing a clear view on who owned what and who had to pay whom. They used EventSourcing, and they could easily replay new matching algorithms on historic data. The data matching algorithms matched similar claims on the same works in the different data sources. When multiple data sources concurred, the match succeeded.
Initially, when most sources concurred on a claim, the algorithm ignored a lone exception. When there was more contention about a claim, it was less obvious what to do. The code reflected this lack of clarity. Later the team realised that a conflicting claim could tell them more: It was an indicator of the messiness of the data. If they used their records of noise in the data, they could learn about how often different data sources, parties, and individuals agreed on successful claims, and improve their algorithm.
For example, say a match was poor: 50% of sources point to one owner and 50% point to another owner. Based on that information alone, it’s impossible to decide who the owner is. But by using historical data, the algorithm could figure out which sources had been part of successful matches more often. They could give more weight to these sources, and tip the scales in one direction or the other. This way, even if 50% of sources claim A as the owner and 50% claim B, an answer can be found.
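The tie-breaking idea can be sketched in a few lines of Python. This is only an illustration, not the team's algorithm: the `resolve_owner` function, the source names, the track-record numbers, and the default weight are all made up for the example. Each source's vote is weighted by how often its past claims ended up in successful matches.

```python
from collections import defaultdict

# Hypothetical weight for a source we have no history for.
DEFAULT_WEIGHT = 0.5


def resolve_owner(claims, track_record):
    """Pick the owner whose supporting sources carry the most weight.

    claims: dict mapping source name -> claimed owner.
    track_record: dict mapping source name -> share of that source's past
        claims that were part of successful matches (0.0 to 1.0).
    Returns the winning owner, or None when the weights do not decide it.
    """
    weight_per_owner = defaultdict(float)
    for source, owner in claims.items():
        weight_per_owner[owner] += track_record.get(source, DEFAULT_WEIGHT)

    ranked = sorted(weight_per_owner.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # still a tie: the history does not tip the scales
    return ranked[0][0]


# Two sources claim owner A and two claim owner B: a 50/50 split by head count.
claims = {"wiki": "A", "submission": "A", "agency": "B", "licensed_db": "B"}
# Invented track records: the sources backing B were right more often.
track_record = {"wiki": 0.4, "submission": 0.3, "agency": 0.8, "licensed_db": 0.9}

owner = resolve_owner(claims, track_record)
print(owner)  # B
```

Without any history, every source gets the same default weight and the same 50/50 split stays undecided, which is the situation the old algorithm was stuck in.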
The code mixed responsibilities: pulling data, filtering, reformatting, interpreting, and applying matching rules. All the cases and rules made the data matching very complicated. Only a few engineers knew how it worked. Mathias noticed that the engineers couldn’t explain how it worked very well. And the business people he talked to were unable to explain anything at all about how the system worked. They simply referred to it as the “data matching.” The team wasn’t concerned about this. In their eyes, the complexity was just something they had to deal with.
Mathias proposed a whiteboard modeling session. Initially, the engineers resisted. After all, they didn’t feel this was a business domain, just a purely technical problem. However, Mathias argued, the quality of the results determined who got paid what, and mistakes meant customers would eventually move to a competitor. So even if the data matching was technical, it performed an essential function in the Core Domain. The knowledge about it was sketchy, engineering couldn’t explain it, business didn’t understand it. Because of that, they rarely discussed it, and when they did, it was in purely technical terms. If communication is hard, if conversations are cumbersome, you lack a good shared model.
Through modeling, the matching process became less opaque to the engineers. We made clearer distinctions between the different steps: pulling data, processing it, identifying a match, and coming to a decision. The model included sources, claims, reconciliations, exceptions. We drew the matching rules on the whiteboard as well, making those rules explicit first class concepts in the model. As the matching process became clearer, the underlying ideas that led to the system design started surfacing. From the “what,” we moved to the “why.” This put us in a good position to start discovering abstractions.
Gradually, the assumptions that they built the algorithm on, surfaced in the conversations. We stated those assumptions, wrote them on stickies and put them on the whiteboard. One accepted assumption was that when a data source is frequently in agreement with other sources, it is less likely to be wrong in the future. If a source is more reliable, it should be trusted more, therefore claims from that source pulled more weight in the decision of who has a claim to what. When doing domain discovery and modeling, it’s good to be observant, and listen to subtleties in the language. Words like “reliable,” “trust,” “pull more weight,” and “decision” were being used informally in these conversations. What works in these situations, is to have a healthy obsession with language. Add this language to the whiteboard. Ask questions: what does this word mean, in what context do you use it?
Through these discussions, the concept of “trust” grew in importance. It became explicit in the whiteboard models. It was tangible: you could see it, point to it, move it around. You could start telling stories about trust. Why would one source be more trusted? What would damage that trust? What edge cases could we find that would affect trust in different ways?
Trust as an Object
During the next modelling session, we talked about trust a lot. From a random word that people threw into the conversation, it had morphed into a meaningful term. Mathias suggested a little thought experiment: What if Trust was an actual object in the code? What would that look like? Quickly, a simple model of Trust emerged. Trust is a Value Object, and its value represents the “amount” of trust we have in a data source, or the trust we have in a claim on a work or usage, or the trust we have in the person making the claim. Trust is measured on a scale of -5 to 5. That number determines whether a claim is granted or not, whether it needs additional sources to confirm it, or whether the company needs to do further research.
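To make the thought experiment concrete, here is a minimal sketch of what such a Value Object could look like in Python. Only the -5 to 5 scale and the three kinds of outcome come from the story; the `Decision` names and the cut-off values in `decision()` are assumptions made up for the illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    GRANT = "grant the claim"
    CONFIRM = "ask for additional sources"
    RESEARCH = "do further research"


@dataclass(frozen=True)
class Trust:
    """Value Object: immutable, compared by value, only valid from -5 to 5."""

    level: int

    def __post_init__(self):
        if not -5 <= self.level <= 5:
            raise ValueError(f"Trust must be between -5 and 5, got {self.level}")

    def decision(self):
        # Hypothetical cut-offs; where the real lines were drawn is not
        # described in the story.
        if self.level >= 4:
            return Decision.GRANT
        if self.level >= 0:
            return Decision.CONFIRM
        return Decision.RESEARCH


print(Trust(5).decision().value)   # grant the claim
print(Trust(3) == Trust(3))        # True: two Trusts with the same level are equal
```

Because the object is frozen and validates itself, a Trust outside the scale can never exist, and the same type works for sources, claims, and people.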
It was a major mindshift.
The old code dynamically computed similar values to determine “matches.” These computations were spread and duplicated across the code, hiding in many branches. The team didn’t see that all these values and computations were really aspects of the same underlying concept. They didn’t see that the computations could be shared, whether you’re matching sources, people, or claims. There was no shared abstraction.
But now, in the new code, those values are encapsulated in Trust objects, a first class concept. This is where the magic happens: we move from a whiteboard concept, to making Trust an essential element in the design. The team cleaned up the ad hoc logic spread across the data matching code and replaced it with a single Trust concept.
Trust entered the Ubiquitous Language. The idea that degrees of Trust are ranked on a scale from -5 to 5, also became part of the language. And it gave us a new way to think about our Core Domain: We pay owners based on who earns our Trust.
Trust as a Process
The team was designing an EventSourced system, so naturally, the conversation moved to what events could affect Trust. How does Trust evolve over time? What used to be matching claims in the old model, now became events that positively or negatively affected our Trust in a claim. Earning Trust (or losing it) was now thought of as a process. A new claim was an event in that process. Trust was now seen as a snapshot of the Trust earning process. If a claim was denied, but new evidence emerged, Trust increased and the claim was granted. Certain sources, like the private databases that the company bought a license for, were highly trusted and stable. For others, like the wiki-style sources where people could submit claims, Trust was more volatile.
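A rough sketch of "Trust as a snapshot of a process" might look as follows. The event descriptions and the size of each `delta` are invented for the illustration; the only things taken from the story are that events raise or lower Trust, that Trust stays on the -5 to 5 scale, and that the current value is what you get by replaying the events so far.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class TrustEvent:
    description: str
    delta: int  # hypothetical: how much this event raises or lowers Trust


def apply_event(level, event):
    """Apply one event to the current level, staying within -5..5."""
    return max(-5, min(5, level + event.delta))


def trust_snapshot(events, initial=0):
    """The Trust in a claim is the result of replaying its events in order."""
    return reduce(apply_event, events, initial)


history = [
    TrustEvent("claim submitted through a wiki-style source", 0),
    TrustEvent("contradicted by curated public data", -2),
    TrustEvent("corroborated by a licensed private database", 3),
    TrustEvent("corroborated by an agency", 2),
]

# Replaying part of the history shows Trust at an earlier point in time.
print(trust_snapshot(history[:2]))  # -2: the claim looks doubtful
print(trust_snapshot(history))      # 3: new evidence raised the Trust
```

Since the events are kept, a different rule for `apply_event` can be replayed over the same history, much as the team replayed new matching algorithms on historic data.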
During the discussions about the new Trust and Trust-building concepts, the team went back to the business regularly to make sure the concepts worked. They asked for their insights into how they should assign Trust, and what criteria they should use. We saw an interesting effect: people in the business became invested in these conversations and joined in modeling sessions. Data matching faded from the conversations, and Trust took over. There was a general excitement about being able to assign and evolve Trust. The engineers’ new model became a shared model within the business.
Trust as an Arithmetic
The copyright brokerage domain experts started throwing scenarios at the team: What if a Source A with a Trust of 0 made a claim that was corroborated by a Source B with a Trust of 5? The claim itself was now highly trusted, but what was the impact on Source A? One swallow doesn’t make Spring, so surely Source A shouldn’t be granted the same level of Trust as Source B. A repeated pattern of corroboration, on the other hand, should be reflected in higher Trust for Source A.
During these continued explorations, people from the business and engineering listed the rules for how different events impacted Trust, and coded them. By seeing the rules in code, a new idea emerged. Trust could have its own arithmetic: a set of rules that defined how Trust was accumulated. For example, a claim with a Trust of 3, that was corroborated by a claim with a Trust of 5, would now be assigned a new Trust of 4. The larger set of arithmetics addressed various permutations of claims corroborating claims, sources corroborating sources, and patterns of corroboration over time. The Trust object encapsulated this arithmetic, and managed the properties and behaviors for it.
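As an illustration of what one rule in such an arithmetic could look like, here is a small Python sketch. The rule itself is an assumption: the story only gives the single example of 3 corroborated by 5 becoming 4, and "move to the midpoint, rounded down, and never lose Trust through corroboration" is just one of many rules that reproduce it. The method name `corroborated_by` is likewise made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trust:
    level: int

    def __post_init__(self):
        if not -5 <= self.level <= 5:
            raise ValueError(f"Trust must be between -5 and 5, got {self.level}")

    def corroborated_by(self, other):
        """Return the new Trust after corroboration by a claim with `other` Trust.

        Assumed rule: corroboration by something less (or equally) trusted
        changes nothing; otherwise move to the midpoint, rounded down.
        """
        if other.level <= self.level:
            return self
        return Trust((self.level + other.level) // 2)


print(Trust(3).corroborated_by(Trust(5)))  # Trust(level=4)
print(Trust(0).corroborated_by(Trust(5)))  # Trust(level=2)
```

The second line echoes the domain experts' scenario: a single corroboration by a highly trusted party helps, but it does not hand over that party's full Trust.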
From an anemic Trust object, we had now arrived at a richer model of Trust that was responsible for all these operations. The team came up with polymorphic Strategy objects. These allowed them to swap out different mechanisms for assigning and evolving Trust. The old data matching code had mixed fetching and storing information with the sprawling logic. Now, the team found it easy to separate the plumbing into its own layer, apart from the clean Trust model.
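The Strategy idea can be sketched like this in Python. The two strategies shown, and the names `TrustStrategy`, `CautiousStrategy`, `OptimisticStrategy`, and `assess_claim`, are hypothetical; the story does not say which mechanisms the team actually implemented. The point is only that the calling code stays the same while the mechanism is swapped.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trust:
    level: int


class TrustStrategy(ABC):
    """One interchangeable mechanism for assigning Trust to a claim."""

    @abstractmethod
    def assign(self, source_levels: Sequence[int]) -> Trust:
        """Derive a claim's Trust from the Trust levels of its sources.

        Both sketches below assume at least one source.
        """


class CautiousStrategy(TrustStrategy):
    # Hypothetical rule: a claim is only as trusted as its weakest source.
    def assign(self, source_levels):
        return Trust(min(source_levels))


class OptimisticStrategy(TrustStrategy):
    # Hypothetical rule: one highly trusted source is enough.
    def assign(self, source_levels):
        return Trust(max(source_levels))


def assess_claim(source_levels, strategy):
    # The caller does not know or care which mechanism is plugged in.
    return strategy.assign(source_levels)


levels = [1, 4, 5]
print(assess_claim(levels, CautiousStrategy()))    # Trust(level=1)
print(assess_claim(levels, OptimisticStrategy()))  # Trust(level=5)
```

Each strategy can be tested on its own, without touching any code that fetches or stores data.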
The Evolution of the Model
In summary, this was the evolution:
Ad hoc code that computes values for matches.
Using Trust in conversations that explained how the current system worked.
Trust as a Value Object in the code.
Evolving Trust as a process, with events (such as finding a matching claim) that assigned new values of Trust.
Trust as a shared term between business and engineering, that replaced the old language of technical data matching.
Exploring how to assign Trust using more real-world scenarios.
Building an arithmetic that controls the computation of Trust.
Polymorphic Strategies for assigning Trust.
When you find a better, more meaningful abstraction, it becomes a catalyst: it enables other modeling constructs, allowing other ideas to form around that concept. It takes exploration, coding, conversations, trying scenarios, … There’s no golden recipe for making this happen. You need to be open to possibility, and take the time for it.
The engineers originally introduced the concept of “matching,” but that was an anemic description of the algorithm itself, not the purpose. “If this value equals that value, do this.” Data matching was devoid of meaning. That’s what Trust introduces: conceptual scaffolding for the meaning of the system. Trust is a magnet, an attractor for a way of thinking about and organizing the design.
Initially, the technical details of the problem were so complicated, and provided such interesting challenges to the engineers, that that was all they talked about with the business stakeholders. Those details got in the way of designing a useful Ubiquitous Language. The engineers had assumed that their code looked the way it needed to look. In their eyes, the code was complex because the problem of matching was complex. The code simply manifested that complexity. They didn’t see the complexity of that code as a problem in its own right. The belief that there wasn’t a better model to be found, obscured the Core Domain for both business and engineering.
The domain experts were indeed experts in the copyrights domain, and had crisp concepts for ownership, claims, intellectual property, the laws, and the industry practices. But that was not their Core Domain. The real Core was the efficient, automated business they were trying to build out of it. That was their new domain. That explains why knowledge of copyright concepts alone wasn’t sufficient to make a great model.
Before they developed an understanding of Trust, business stakeholders could tell you detailed stories about how the system should behave in specific situations. But they had lacked the language to talk about these stories in terms of the bigger idea that governs them. They were missing crisp concepts for them.
We moved from raw code, to a model based on the new concept of Trust. But what kind of thing is this Trust concept? Trust is a metaphor.1 Actual trust is a human emotion, and partly irrational. You trust someone instinctively, and for entirely subjective reasons that might change. Machines don’t have these emotions. We have an artificial metric in our system, with algorithms to manipulate it, and we named it Trust. It’s a proxy term.
This metaphor enables a more compact conversation, as evidenced by the fact that engineers and domain experts alike can discuss Trust without losing each other in technical details. A sentence like “The claims from this source were repeatedly confirmed by other sources,” was replaced by “This source has built up trust,” and all knew what that entailed.
The metaphor allows us to handle the same degree of complexity, but we can reason about determining Trust without having to understand every detail at the point where it’s used. For those of us without Einstein brains, it’s now a lot easier to work on the code, because the metaphor lowers the cognitive load.
A good metaphor in the right context, such as Trust, enables us to achieve things we couldn’t easily do before. The team reconsidered a feature that would allow them to swap out different strategies for matching claims. Originally they had dismissed the idea, because, in the old code, it would have been prohibitively expensive to build. It would have resulted in huge condition trees and sprawling dependencies on shared state. They’d have to be very careful, and it would be difficult to test that logic. With the new model, swapping out polymorphic Strategy objects is trivial. The new model allows testing low level units like the Trust object, higher level logic like the Trust-building process, and individual Claims Strategies, with each test remaining at a single level of abstraction.
Our Trust model not only organizes the details better, but it is also concise. We can go to a single point in the code and know how something is determined. A Trust object computes its own value, in a single place in the code. We don’t have to look at twenty different conditionals across the code to understand the behavior; instead we can look at a single strategy. It’s much easier to spot bugs, which in turn helps us make the code more correct.
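To make the shape of such a model concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the class names (`Trust`, `Claim`, `MostTrustedSource`, `RequireMinimumTrust`), the confirmation arithmetic, and the 0.5 threshold. The case study does not describe the team’s actual classes or scoring rules, so read this only as one way a Trust value and swappable strategies could fit together.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


@dataclass(frozen=True)
class Trust:
    """Hypothetical Trust value built from confirmations by other sources."""
    confirmations: int = 0
    contradictions: int = 0

    def confirmed(self) -> "Trust":
        return Trust(self.confirmations + 1, self.contradictions)

    def contradicted(self) -> "Trust":
        return Trust(self.confirmations, self.contradictions + 1)

    @property
    def value(self) -> float:
        # The single place where Trust is computed (an invented formula).
        total = self.confirmations + self.contradictions
        return self.confirmations / total if total else 0.0


@dataclass(frozen=True)
class Claim:
    source: str
    work: str
    owner: str


class ClaimsStrategy(Protocol):
    def resolve(self, claims: List[Claim],
                trust: Dict[str, Trust]) -> Optional[Claim]: ...


class MostTrustedSource:
    """Pick the claim whose source currently has the highest Trust."""

    def resolve(self, claims, trust):
        return max(claims, key=lambda c: trust[c.source].value, default=None)


class RequireMinimumTrust:
    """Same choice, but decline to decide below a Trust threshold."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def resolve(self, claims, trust):
        best = MostTrustedSource().resolve(claims, trust)
        if best is None or trust[best.source].value < self.threshold:
            return None
        return best


def resolve_claims(claims, trust, strategy: ClaimsStrategy):
    # Swapping behavior means passing a different strategy object.
    return strategy.resolve(claims, trust)
```

In a sketch like this, each piece can be tested at its own level: `Trust` on its own, each strategy with hand-built Trust values, and `resolve_claims` with a stub strategy.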
A good model helps you reason about the behavior of a system. A good metaphor helps you reason about the desired behavior of a system.
The Trust metaphor unlocked a path to tackle complexity. We discovered it by listening closely to the language used to describe the solution, using that language in examples, and trying thought experiments. We’re not matching data anymore, we’re determining Trust and using it to resolve claims. Instead of coding the rules, we’re now encoding them. We’re better copyright brokers because of this.
Be wary of bad, ill-fitting metaphors. Imagine the team had come up with Star Ratings as the metaphor. Sure, it also works as a quantification, but it’s based on popularity, and calculates the average. We could still have built all the same behavior of the Trust model, but with a lot of bizarre rules, like “Our own sources get 20 five-star ratings.” When you notice that you have to force-fit elements of your problem space into a metaphor, and there’s friction between what you want to say and what that metaphor allows you to say, you need to get rid of it. No metaphor will make a perfect fit, but a bad metaphor leads you into awkward conversations without buying you clarity.
To make things trickier, whenever you introduce a new metaphor, it can be awkward at first. In our case study, Trust didn’t instantly become a fully explored and accepted metaphor. There’s a delicate line between the early struggles of adopting a new good metaphor, and one that is simply bad. Keep trying, work on using your new metaphor, see if it buys you explanatory power, and don’t be afraid to drop it if it does not.
And sometimes, there simply isn’t any good metaphor, or even a simpler model to be found. In those cases, you just have to crunch it. There’s no simplification to be found. You just have to work out all the rules, list all the cases, and deal with the complexity as is.
To find good metaphors, put yourself in a position where you’ll notice them in conversation. Invite diverse roles into your design discussions. Have a healthy obsession with language: What does this mean? Is this the best way to say it? Be observant about this language, listen for terms that people say off the cuff. Capture any metaphors that people use. Reinforce them in conversations, but be ready to drop them if you feel you have to force-fit them. Is a metaphor bringing clarity? Does it help you express the problem better? Try scenarios and edge cases, even if they’re highly unlikely. They’ll teach you about the limits of your metaphor. Then distill the metaphor, agree on a precise meaning. Use it in your model, and then translate it to your code and tests. Metaphors are how language works, how our brains attach meaning, and we’re using that to our advantage.
Practiced speakers and writers know that good examples rarely tell the whole story. Instead they shape their narratives to make the big ideas stand out. Stories are bent ever so slightly, plot details are pared down, leaving space for emphasis and audience impact.
I wouldn’t go so far as to say we invent fiction, but rather that we simplify our stories to make them compelling. Too many details and our audience would tune us out. And when we repeatedly tell these stories, we come to believe we’ve pared down the narrative to its essence. We’ve nailed it!
But what happens when you encounter information that sheds new light on such a story? What if the story you’ve told no longer rings quite true?
The past few years I’ve explored Billy Vaughn Koen’s definition of heuristics as they relate to software design and architecture. I’ve written blog posts and essays, presented talks, keynotes, and workshops about heuristics (for a gentle introduction to different kinds of heuristics see Growing Your Personal Design Heuristics Toolkit).
Along the way I’ve encouraged people to discover, distill, and own their personal heuristics. I advise them to not just take every bit of advice they find about software design as being authoritative. Instead, they should question the validity of that advice’s applicability to their specific context. They should also bring their own heuristics they’ve accrued through experience to bear on the problem at hand.
I start most heuristics presentations with a story about my experience cooking my very first Blue Apron recipe for Za’atar Roasted Broccoli Salad (for details see Nothing Ever Goes Exactly by the Book). I jokingly point out all the places that the recipe suggests adding salt. I then postulate that if I just blindly followed Blue Apron instructions without applying any judgment, the dish would be way too salty.
Instead of following the recipe, I told how I used my past experiences to “modify” the instructions to fit with my understanding of what makes for a tasty dish. In short, I ignored lots of places where the recipe suggested adding salt.
My heuristic for this situation was to ignore advice on where to add salt if it seems excessive and only add salt to taste at the end. Following that heuristic, I most likely made a much blander dish that, while it looked great, undoubtedly lacked flavor.
But… achieving a tasty dish wasn’t the point of my original story!
Instead, it was to encourage using personal judgment and heuristics based on past experiences. I wanted to emphasize that we each have experiences and insights that we can and should draw on in many situations. Simply trusting and blindly following “experts” or “recipes” because they are published or credentialed can lead us astray—or to cooking inedible dishes. We should value and treasure our experiences and draw upon the heuristics we’ve accrued through those experiences.
Ta-da! Point made! Perhaps…
A week ago as I was waiting for surgery to repair my broken nose (that’s another story for another time) I started reading How to Taste, by Becky Selengut. At the time I was detached, slightly impatient, and resigned to just being there in the moment. The doctor was late and I had time to kill.
The introductory first chapter starts: “Telling you to ‘season to taste’ does nothing to teach you how to taste—and that is precisely the lofty goal of this book. Once you know the most common culprits when your dish is out of whack, you’ll save tons of time spinning your wheels grabbing for random solutions. You’ll start thinking like a chef. Some people are born knowing how to do this—they are few and far between and most likely have more Michelin stars than you or I; the rest of us need to be taught. I’ve got your back.”
Now that grabbed my attention!
Unless I am superhuman (I’m not), I can’t rely on my instincts to become a better cook, one who knows when to add seasoning or salt and how much.
My experiences cooking have certainly been ad hoc. And the heuristic I applied for salting that Blue Apron dish came from who knows where. I never learned why I was doing what I was doing when following a recipe or ignoring some parts of it. Instead, I learned a few shortcuts and substitutions, largely through combing the internet. And while my technique may have improved over time, I haven’t developed the ability to craft a dish with nuanced flavors, let alone improvise one.
Becky suggests reading her book “…start[ing] at the beginning, as I intend to build upon the concepts one puzzle piece at a time.” Each chapter presents fundamental facts, reinforces them with a recipe that highlights the important points of the chapter, and then suggests Experiment Time activities intended to develop a reader’s palate.
A good way to learn how to exercise judgment is to perform structured experiments after you’ve learned a bit of theory and why things—in this case, the chemistry of cooking—work the way they do.
I quickly read through the chapter on Salt and learned that salt is a flavorant: it brings out the flavor of other ingredients. Salting early and often can improve taste dramatically. For example, adding salt to onions as they sauté can speed up the cooking process and cause them to sweat out water. And when you only season a soup at the end, no matter how much salt you add, the flavors of unsalted ingredients (for example, potatoes) fall flat. You end up oversalting the soup stock and still having tasteless, bland potatoes. Salt needs to be added at the right time, often at several steps in the cooking process, to have the desired result. And to my surprise, different kinds of salt (iodized, kosher, flaky, fine-grained sea salt) each have their own flavoring properties and ratios in recipes.
This brought to mind a whole new way of thinking about my Blue Apron cooking experience. Blue Apron didn’t have bad recipes, but their recipes didn’t make me a better cook either. This is because most recipes focus on the how, not the why. Their pretty little pictures and step-by-step instructions did nothing to help me understand how to achieve tasty dishes.
And that’s a problem if I want to get better at cooking tasty dishes and not simply at following recipes.
I’m afraid way too much information we absorb—whether it is about cooking or agile practices or software development—is presented as step-by-step lists of instructions, without any explanation of why it makes sense to do so or the consequences of not doing a particular step specifically as instructed.
Consequently, we learn a bunch of procedures, or simply cut and paste them. We follow instructions because somebody says this is what we should do. Over time we may build up a playbook of those procedures but our understanding of why these procedures work isn’t very deep or rich or adaptable.
If we want to truly gain proficiency in cooking (or software design or programming or running or gardening or basket weaving), instruction that emphasizes the why along with the how is what we need.
Teach me some facts that ground what I’m about to do in a bit of knowledge. Spark my curiosity. Inspire me. And then give me tasks that let me tinker and practice applying that knowledge. Only then will my actions become integrated with that knowledge, allowing me to build up adaptable heuristics that I can use in novel situations.
In hindsight, I now believe that the story I told about applying my personal heuristics and knowledge to a problem was OK. It’s reasonable to be a healthy skeptic when someone says, “Just do as I say. Trust me,” when attempting a new task. Distilling your own heuristics from previous experiences and applying them in familiar situations is also good. And writing them down helps to bring them to your awareness.
But in addition, I now think it is equally important to seek the why behind what you are doing. And to loosen your grip on those simpler narratives you’ve held dear. They are not the whole story and they may be holding you back. Be open to new information that may reshape your stories and enhance your heuristic toolkit.
Perhaps one day, with enough knowledge and practice, I’ll be able to create a flavor profile for a dish instead of merely following the recipe.
There is a fallacy about how domain modelling works. The misconception is that we can design software by discovering all the relevant concepts in the domain, turn them into concepts in our design, add some behaviors, and voilà, we’ve solved our problem. It’s a simplistic perception of how design works: a linear path from A to B:
understand the problem,
end up with a solution.
That idea was so central to early Object-Oriented Design, that one of us (Rebecca) thought to refute it in her book:
“Early object design books including [my own] Designing Object-Oriented Software [Wirfs-Brock et al 1990], speak of finding objects by identifying things (noun phrases) written in a design specification. In hindsight, this approach seems naive. Today, we don’t advocate underlining nouns and simplistically modeling things in the real world. It’s much more complicated than that. Finding good objects means identifying abstractions that are part of your application’s domain and its execution machinery. Their correspondence to real-world things may be tenuous, at best. Even when modeling domain concepts, you need to look carefully at how those objects fit into your overall application design.”
The idea has persisted in many naive interpretations of Domain-Driven Design as well. Domain language and Ubiquitous Language are often conflated. They’re not the same.
Domain language is what is used by people working in the domain. It’s a natural language, and therefore messy. It’s organic: concepts are introduced out of necessity, without deliberation, without agreement, without precision. Terminology spreads across the organization or fades out. Meaning shifts. People adapt old terms into new meanings, or terms acquire multiple, ambiguous meanings. It exists because it works, at least well enough for human-to-human communication. A domain language (like all language) only works in the specific context it evolved in.
For us system designers, messy language is not good enough. We need precise language with well understood concepts, and explicit context. This is what a Ubiquitous Language is: a constructed, formalized language, agreed upon by stakeholders and designers, to serve the needs of our design. We need more control over this language than we have over the domain language. The Ubiquitous Language has to be deeply connected to the domain language, or there will be discord. The level of formality and precision in any Ubiquitous Language depends on its environment: a meme sharing app and an oil rig control system have different needs.
Talking of oil rigs:
Rebecca was invited to consult for a company that makes hardware and software for oil rigs. She was asked to help with object design and modelling, working on redesigning the control system that monitors and manages sensors and equipment on the oil rig. Drilling causes a lot of friction, and “drilling mud” (a proprietary chemical substance) is used as a lubricant. It’s also used as a carrier for the rocks and debris you get from drilling, lifting it all up and out of the hole. Equipment monitors the drilling mud pressure, and by changing the composition of the mud during drilling, you can control that pressure. Too much pressure is a really bad thing.
And then an oil rig in the gulf exploded.
As the news stories were coming out, the team found out that the rig was using a competitor’s equipment. Whew! The team started speculating about what could have happened, and were thinking about how something like that could happen with their own systems. Was it faulty equipment, sensors, the telemetry, communications between various components, the software?
When in doubt, look for examples. The team ran through scenarios. What happens when a catastrophic condition occurs? How do people react? When something fails, it’s a noisy environment for the oil rig engineers: sirens blaring, alarms going off, … We discovered that when a problem couldn’t be fixed immediately, the engineers, in order to concentrate, would turn off the alarms after a while. When a failure is easy to fix, the control system logs reflect that the alarm went on and was turned off a few minutes later.
But for more consequential failures, even though these problems take much longer to resolve, the logs still show them as being resolved within minutes. Then, when people study the logs, it looks like the failure was resolved quickly. But that’s totally inaccurate. This may look like a software bug, but it’s really a flaw in the model. And we should use it as an opportunity to improve that model.
The initial modeling assumption is that alarms are directly connected to the emergency conditions in the world. However, the system’s perception of the world is distorted: when the engineers turn off the alarm, the system believes the emergency is over. But it’s not: turning an alarm off doesn’t change the emergency condition in the world. The alarms are only indirectly connected to the emergency. If it’s indirectly connected, there’s something else in between that doesn’t exist in our model. The model is an incomplete representation of a fact of the world, and that could be catastrophic.
The team explored scenarios, specifically the weird ones, the awkward edge cases where nobody really knows how the system behaves, or even how it should behave. One such scenario is when two separate sensor measurements raise alarms at the same time. The alarm sounds, an engineer turns it off, but what happens to the second alarm? Should the alarm still sound or not? Should turning off one turn off the other? If it didn’t turn off, would the engineers think the off switch didn’t work and just push it again?
By working through these scenarios, the team figured out there was a distinction between the alarm sounding, and the state of alertness. Now, in this new model, when measurements from the sensors exceed certain thresholds or exhibit certain patterns, the system doesn’t sound the alarm directly anymore. Instead, it raises an alert condition, which is also logged. It’s this alert condition that is associated with the actual problem. The new alert concept is now responsible for sounding the alarm (or not). The alarm can still be turned off, but the alert condition remains. Two alert conditions with different causes can coexist without being confused by the single alarm. This model decouples the emergency from the sounding of the alarm.
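The distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team’s design: the names `AlertCondition` and `ControlSystem`, the log format, and the policy that a newly raised alert sounds the alarm again are all invented here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class AlertCondition:
    """The actual problem, tracked independently of the audible alarm."""
    cause: str
    raised_at: datetime
    resolved_at: Optional[datetime] = None

    @property
    def active(self) -> bool:
        return self.resolved_at is None


@dataclass
class ControlSystem:
    alerts: Dict[str, AlertCondition] = field(default_factory=dict)
    alarm_sounding: bool = False
    log: List[str] = field(default_factory=list)

    def raise_alert(self, cause: str, now: datetime) -> None:
        existing = self.alerts.get(cause)
        if existing is not None and existing.active:
            return  # this condition is already being tracked
        self.alerts[cause] = AlertCondition(cause, now)
        self.alarm_sounding = True  # assumed policy: a new alert sounds the alarm
        self.log.append(f"{now.isoformat()} alert raised: {cause}")

    def silence_alarm(self, now: datetime) -> None:
        # Silencing changes only the sound; every alert condition remains.
        self.alarm_sounding = False
        self.log.append(f"{now.isoformat()} alarm silenced")

    def resolve_alert(self, cause: str, now: datetime) -> None:
        self.alerts[cause].resolved_at = now
        self.log.append(f"{now.isoformat()} alert resolved: {cause}")
        if not self.active_alerts():
            self.alarm_sounding = False

    def active_alerts(self) -> List[AlertCondition]:
        return [a for a in self.alerts.values() if a.active]
```

With this separation, the log records when a condition was actually resolved rather than when someone silenced the siren, and two conditions with different causes remain distinct even though they share one alarm.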
The old model didn’t make that distinction, and therefore it couldn’t handle edge cases very well. When at last the team understood the need for separating alert conditions from the alarms, they couldn’t unsee it. It’s one of those aha-moments that seem obvious in retrospect. Such distinctions are not easily unearthed. It’s what Eric Evans calls a Breakthrough.
An Act of Creation
There was a missing concept, and at first the team didn’t know something was missing. It wasn’t obvious, because there wasn’t a name for “alert condition” in the domain language. The oil rig engineers’ job isn’t designing software or creating a precise language; they just want to be able to respond to alarms and fix problems in peace. Alert conditions didn’t turn up in a specification document, or in any communication between the oil rig engineers. The concept was not used implicitly by the engineers or the software; no, the whole concept did not exist.
Then where did the concept come from?
People in the domain experienced the problem, but without explicit terminology, they couldn’t express the problem to the system designers. So it’s us, the designers, who created it. It’s an act of creative modeling. The concept is invented. In our oil rig monitoring domain, it was a novel way to perceive reality.
Of course, in English, alert and alarm exist. They are almost synonymous. But in our Ubiquitous Language, we agreed to make them distinct. We designed our Ubiquitous Language to fit our purpose, and it’s different from the domain language. After we introduced “alert conditions”, the oil rig engineers incorporated it in their language. This change in the domain is driven by the design. This is a break with the linear, unidirectional understanding of moving from problem to solution through design. Instead, through design, we reframed the problem.
Is it a better model?
How do we know that this newly invented model is in fact better (specifically, more fit for purpose)? We find realistic scenarios and test them against the alert condition model, as well as other candidate models. In our case, with the new model, the logs will be more accurate, which was the original problem.
But in addition to helping with the original problem, a deeper model often opens new possibilities. This alert conditions model suggests several:
Different measurements can be associated with the same alert.
Alert conditions can be qualified.
We can define alarm behaviors for simultaneous alert conditions, for example by spacing the alarms, or picking different sound patterns.
Critical alerts could block less critical ones from hogging the alarm.
Alert conditions can be lowered as the situation improves, without resolving them.
These new options are relevant, and likely to bring value. Yet another sign we’d hit on a better model is that we had new conversations with the domain experts. A lot of failure scenarios became easier to detect and respond to. We started asking, what other alert conditions could exist? What risks aren’t we mitigating yet? How should we react?
Design Creates New Realities
In a world-centric view of design, only the sensors and the alarms existed in the real world, and the old software model reflected that accurately. Therefore it was an accurate model. The new model that includes alerts isn’t more “accurate” than the old one, it doesn’t come from the real world, it’s not more realistic, and it isn’t more “domain-ish”. But it is more useful. Sensors and alarms are objective, compared to alert conditions. Something is an alert condition because in this environment, we believe it should be an alert condition, and that’s subjective.
The model works for the domain and is connected to it, but it is not purely a model of the problem domain. It better addresses the problems in the contexts we envision. The solution clarified the problem. Having only a real world focus for modelling blinds us to better options and innovations.
These creative introductions of novel concepts into the model are rarely discussed in literature about modelling. Software design books talk about turning concepts into types and data structures, but what if the concept isn’t there yet? Forming distinctions, and not just abstractions, can help clarify a model. These distinctions create opportunities.
The model must be radically about its utility in solving the problem.
“Our measure of success lies in how clearly we invent a software reality that satisfies our application’s requirements—and not in how closely it resembles the real world.”
The book Noise: A Flaw in Human Judgment by Daniel Kahneman, Olivier Sibony, and Cass Sunstein has me thinking deeply about noisy decisions. In this context, noise is defined as undesirable variability in judgments. They explain two different kinds of noise—level noise (variability in the average level of judgments by different people) and pattern noise. Pattern noise is further broken down into the unique noise individuals bring into any decision and occasion noise—noise caused by the particular context surrounding particular decisions. Occasion noise can be influenced by our mood, the interactions with people we’re deciding with, what we ate for dinner last night, or even the weather.
So when is noise worth reducing? And what can we do to reduce that noise? How do we know our efforts at noise reduction have the desired effect?
Are there situations where variability might be desirable? I haven’t found a name in the literature for such desirable variation. Perhaps euphony—a harmonious succession of words having a pleasing sound—is one possibility. In these situations we’re favoring finding some euphony over conforming to a noise-free rigid standard for our judgments.
I’ll use the review of conference submissions of papers, talks, and workshops as an example of where both noise and euphony play a part in our decision-making, as it is one I am quite familiar with.
One major source of variability is when new reviewers join a review committee. Newcomers often look at submissions differently than experienced reviewers. But not all variability is noise. If some variability is welcomed, expected, and encouraged, the review process greatly benefits from fresh perspectives. This kind of variability adds spice.
And yet, there may be standards (whether formally written down or more loosely held) we’d like to uphold for what we consider a worthy submission. One way to reduce level noise in reviews is to ensure that reviewers understand these expectations. One way to convey this information is to hold a meeting where we discuss and present examples of submissions and exemplary reviews (reviews from prior years are a good source). Newcomers can learn what a reasonable proposal is and what is expected in a review. They also get to know their peers, ask questions and, in effect, “calibrate” their expectations for reviewing.
But this meeting is insufficient to remove another major source of noise—occasion noise caused by group interactions. Kahneman, Sibony, and Sunstein state: “Groups can go in all sorts of directions depending in part on factors that should be irrelevant. Who speaks first, who speaks last, who speaks with confidence, who is wearing black, who is seated next to whom, who smiles or frowns or gestures at the right moment—all these factors, and many more, affect outcomes.” Group dynamics introduce noise.
But there are several practical ways to further reduce the noise in group decisions. Oscar Nierstrasz wrote a set of patterns called Identify the Champion for reviewing academic papers. I encourage anyone running a conference to consider a review process along the lines of what Oscar introduces. I’ve adapted these patterns and process to non-academic conference reviewing with only a few minor tweaks.
The key ideas in these patterns are the roles of champion and detractor, and a structured process for discussing submissions. Champions are strong advocates for a submission who are prepared to discuss its merits; detractors disapprove of a submission and are prepared to discuss its weaknesses.
Submissions are discussed in groups according to their highest and lowest scores. Care is taken to identify proposals with both extreme high and low scores, and not to rank submissions numerically. If a submission has no champion, it isn’t discussed. It is rejected. Ranking and then discussing submissions one-by-one in order would only add level noise (actually I find we get numbed by reviewing and tend to reject “lower ranked” submissions without enough consideration).
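The grouping step described above can be sketched roughly as follows. This is a simplified illustration, not Oscar’s published process: the stance labels, the `Submission` type, and the group names are assumptions made for this example, and a real committee would work from whatever scoring scale it uses.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical reviewer stances; a real process maps review scores onto roles.
CHAMPION = "champion"
DETRACTOR = "detractor"
NEUTRAL = "neutral"


@dataclass
class Submission:
    title: str
    stances: List[str]  # one stance per reviewer


def triage(submissions: List[Submission]) -> Dict[str, List[str]]:
    """Group submissions for discussion without ranking them numerically."""
    groups: Dict[str, List[str]] = {
        "not_discussed": [],               # no champion: rejected
        "championed_unopposed": [],        # champions and no detractors
        "championed_with_detractors": [],  # both extremes: needs most discussion
    }
    for submission in submissions:
        has_champion = CHAMPION in submission.stances
        has_detractor = DETRACTOR in submission.stances
        if not has_champion:
            groups["not_discussed"].append(submission.title)
        elif has_detractor:
            groups["championed_with_detractors"].append(submission.title)
        else:
            groups["championed_unopposed"].append(submission.title)
    return groups
```

Note that nothing here sorts by an average score; the only questions asked are whether anyone will advocate for a submission and whether anyone will argue against it.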
The review committee is asked to suspend final judgments until all championed submissions are presented. The champion is first invited to introduce the submission and explain why it should be accepted. Then, detractors are invited to state their reasons. At the end of all presentations, discussion is opened for all and the committee tries to reach consensus.
In practice, following this discussion protocol, it is easy to accept outstanding submissions, since they typically have plenty of champions. This leaves the bulk of our time to dig into the strengths and weaknesses of those championed submissions that have mixed reviews.
The Identify the Champion process forces me to hit “pause” on my judgments and to not jump to premature conclusions. And the first things we hear about any submission are its positive aspects. When detractors speak, I get a richer understanding of that submission. Although I might have had some initial impressions, I find they can and do change.
Sometimes I warm up to a submission. At other times, detractors’ perspectives grab my attention and make me revisit whether the submission is as strong as I had initially thought.
The cumulative weight of all this discussion has an even more profound effect. I find I am much more accepting of the outcome: what will happen will happen. Yes, there is unpredictability in this decision-making process. But we’re all trying to make reasonable decisions as a group. I end up actively engaged in making the outcome the best it can be and supportive of our collective decisions.
Although the Identify the Champion review process still has noise (it is hard to eliminate noise caused by group dynamics entirely), I believe it to be less noisy than most other review processes I’ve participated in.
One downside, however, is that it can be exhausting. To avoid having some occasion noise creep back in, it’s good to ensure that reviewers get sufficient breaks to meet their personal needs, and not get too tired or cranky or hungry.
One place I’ve applied my adaptation of the Identify the Champion pattern is for Agile Alliance experience report submissions. Experience report submissions are “pitches” for written experience reports. Only after a submission has been accepted does the actual writing begin. So as reviewers, we’re not only judging the topic of the pitch but also whether the submitter will be able to write a compelling report. Champions of experience reports also commit to shepherding the writing of the reports. These shepherd-champions commit to reviewing and commenting on drafts of reports as they are written over a period of several weeks. Now that’s real commitment! Frequently we have more championed submissions than room in the conference program. So our judgments come down to some difficult choices.
Before we hold our review meeting, we ask reviewers to give us two lists: submissions they’d like to shepherd and an optional list of submissions they’d like to see on the program (but do not want to shepherd). At our meeting, we then have a lively discussion where champions forcefully advocate for their proposals and gain others’ support. Once again, I find we spend most of our time discussing those submissions that have mixed reviews. But we also spend a lot of time listening to champions and then as a group making tradeoffs between submissions (remember we have more good submissions than we have capacity to accept them). The message we convey to all reviewers is that if you really want to shepherd a submission, we as a group will support your decision to be a shepherd-champion. But let’s discuss first.
We can’t guarantee the quality of any final report. We base our judgments on both what the submitters have written (in many cases, there has been a back and forth conversation between submitters and reviewers that we can all see that has led submitters to reshape and refine their proposals) as well as the convincing arguments of champions.
Judging conference submissions is subjective. Our process acknowledges that. We accept the risk of selecting a less-than-stellar report proposal over missing an opportunity for a novel or insightful report.
Is it our goal to eliminate noise in our decision-making? Where we can, yes. But that isn’t our only goal. If we tried to eliminate it entirely we might end up establishing standards for experience report submissions that would inadvertently filter out newness or novelty. In our search for a bit of euphony we stretch out to accept a submission if there is a convincing champion. Consequently, we accept a little variability (and unpredictability) in our decision-making. However, at the end of our review process, reviewers are generally happy with the proposals we accept, happy with their shepherding assignments, and eager to begin working with their experience report authors. An important aspect of our process, which cannot be overstated, is that we also work hard to make good matches between each champion-shepherd and prospective authors. Not only do reviewers buy into the review process, they also commit to being ongoing champions.
Noise reduction is important in many situations, especially group decisions. Paying careful attention to how the group is informed, discusses, and then decides can reduce noise. Paying attention to the voices of champions is one way to turn up euphony. By tuning your decision-making processes you can achieve these goals.
“The world is noisy and messy. You need to deal with the noise and uncertainty.”–Daphne Koller
I have tinnitus. When there isn’t much sound in my environment, for me it still isn’t quiet. I hear a constant background hum. It is hard to describe what this noise sounds like. I’ve lived with it for too long. Remembering back to when I first noticed it, I thought there was some nearby electrical device humming. Was it my phone plugged into the wall outlet? Or??? I remember getting up from bed to hunt for the source of that noise.
I can’t forget that noise or ignore it. It doesn’t go away. But it doesn’t dominate my headspace. I’ve learned to slip between that noise and my desire to sleep or to just be in a quiet place, and not let it distract me. I’ve learned to deal with tinnitus.
The entire book is about the “noise” in human judgments and what we can do to lessen its effects. So what exactly is this noise? A simple definition: “noise” is undesirable variability in judgments. Call this system noise if you will.
Both recurring and onetime decisions are influenced by noise. The time of day, how well I slept last night, what others say, and even how we as a group decide how to decide all affect my judgment. This noise, in addition to any biases I have, affects all my judgments.
Kahneman, Sibony and Sunstein introduce two different types of system noise: level noise and pattern noise. Let’s consider each in turn.
Level noise is easiest to understand. It is the variability in the average level of judgments by different people. People judge on different scales. Consider rating a talk at a conference. Perhaps you never give a conference speaker the highest possible rating because you believe they could do better. Or, maybe if you are star struck, you always rank a presentation from a well-known speaker more highly. Personally, I know that I tend to not rate speakers either as very high or very low, because, well…I’m sort of middling with my ratings. On average, humans aren’t average in their judgments.
The other kind of noise, pattern noise, is often an even bigger factor in our judgments. It comprises two parts: occasion noise and our own personal idiosyncratic tics. Occasion noise is the variability in the judgment at different points in time. Depending on my mood, how stressful the situation is, how well I slept last night, or how the question is put to me, my judgment will vary. A simple example of occasion noise that software folks can relate to is estimating how long it will take to complete a task. My mind isn’t the same today as it was yesterday. Heck, from moment to moment, I might give a different answer simply because I am thinking about the task differently, or because I am hungry (and hence tend to come to a snap judgment), or I’m grumpy, or I’m happy.
The second source of pattern noise is our personal attitudes toward the particular judgment context. Consider, for example, this kind of noise when reviewing conference proposals for papers or talks. Some reviewers are harsher in their personal rating for some proposals and more lenient in others. This variability reflects a complex pattern in the individual attitudes of reviewers toward particular proposals. For example, one person may be relatively generous in their review of proposals on a particular topic. Another may be particularly keen on proposals that seem to break new ground but be a harsher judge of proposals on topics that are perceived to cover familiar territory.
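For readers who like to see the arithmetic, here is a rough sketch of how level noise and pattern noise could be pulled apart for a small table of review scores. The ratings are made up, and the function and variable names are my own, not something from the book. The sketch assumes every reviewer rates every proposal exactly once, so it cannot separate occasion noise from a reviewer's stable idiosyncrasies; both end up lumped together as pattern noise. As I understand the book's framing, the squares of level noise and pattern noise add up to the square of system noise, and the last line checks that for this data.

```python
from math import sqrt, isclose

# Hypothetical scores: each row is one reviewer, each column is one proposal.
ratings = [
    [4, 3, 5, 2],   # a fairly generous reviewer
    [2, 2, 3, 1],   # a consistently harsher reviewer
    [5, 3, 4, 4],   # generous, but with different favorites
]

def mean(values):
    return sum(values) / len(values)

def decompose_noise(ratings):
    """Return (level_noise, pattern_noise, system_noise) as standard deviations."""
    n_reviewers = len(ratings)
    n_proposals = len(ratings[0])
    grand_mean = mean([score for row in ratings for score in row])
    reviewer_means = [mean(row) for row in ratings]
    proposal_means = [mean([row[j] for row in ratings]) for j in range(n_proposals)]

    # Level noise: how much reviewers differ in their average score.
    level_var = mean([(m - grand_mean) ** 2 for m in reviewer_means])

    # Pattern noise: what is left after removing each reviewer's average
    # level and each proposal's average score.
    pattern_var = mean([
        (ratings[i][j] - reviewer_means[i] - proposal_means[j] + grand_mean) ** 2
        for i in range(n_reviewers)
        for j in range(n_proposals)
    ])

    # System noise: disagreement among reviewers about the same proposal,
    # averaged over all proposals.
    system_var = mean([
        mean([(ratings[i][j] - proposal_means[j]) ** 2 for i in range(n_reviewers)])
        for j in range(n_proposals)
    ])
    return sqrt(level_var), sqrt(pattern_var), sqrt(system_var)

level, pattern, system = decompose_noise(ratings)
print(f"level noise:   {level:.2f}")
print(f"pattern noise: {pattern:.2f}")
print(f"system noise:  {system:.2f}")
print("squares add up:", isclose(system ** 2, level ** 2 + pattern ** 2))
```

In this toy data the second reviewer scores everything lower than the others, which shows up as level noise. The first and third reviewers have similar averages but disagree about which proposals are best, which shows up as pattern noise.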
As Kahneman, Sibony, and Sunstein state: “Noise in individual judgment is bad enough. But group decision making adds another layer to the problem. Groups can go in all sorts of directions depending in part on factors that should be irrelevant. Who speaks first, who speaks last, who speaks with confidence, who is wearing black, who is seated next to whom, who smiles or frowns or gestures at the right moment—all these factors, and many more, affect outcomes.”
Having been part of many conference review teams, as well as on the receiving end of their reviews over the years, I find the dynamics of group decisions to be a particularly salient example of system noise.
The information about system noise in general and noise in group decision making can be rather depressing. If we humans are naturally wired to be imperfect and flawed in our judgments, how can we hope to make reasonable decisions? And, once we become aware of our judgment errors, and try to be better decision-makers, the actions required to lessen the effects of noise these authors suggest seem surprisingly difficult to carry out.
Awareness is a good first step. But it’s not enough. In contrast to tinnitus, of which I’m constantly aware, the noise in our judgments is at first difficult to perceive. But once you become aware of sources of system noise, you start noticing them everywhere. How and when (and even whether) it is appropriate or feasible to mitigate these sources of noise is a topic for another day.