This post was inspired by Gerald Weinberg's book "The Secrets of Consulting" and his observation that "nothing new ever works."
We’ve all been there: the old system has too many shortcomings and is too costly to maintain, so we decide to build a new, better replacement from the ground up.
Promises are made to justify the cost and investment: “this will solve X, Y, and Z,” “it will be more reliable,” and “we won’t spend as much time maintaining it.” And my favorite: “the last time didn’t go so great, but this time it will be different!”
Hope induces over-confidence and over-optimism. As a result, caution is thrown to the wind, which makes change even riskier than it already is.
Luckily, there are strategies to manage and reduce risk so that change becomes possible and less expensive.
There is wisdom in the saying “If it ain’t broke, don’t fix it.”
A rule of thumb to assess whether a particular change is necessary is simply asking “why?”
For change to be necessary, the reason must be a high-impact issue (system instability or rigidity, chronic failure, high staff turnover, competitive threat) rather than a marginal improvement or a trend (“everyone else is doing it”).
The specifics depend on each organization, but asking whether the change is necessary at all is a first step that is often overlooked.
Some changes fail, and this is inevitable. We can’t predict which will fail, but as long as we accept that failure is a possible outcome then we can take steps to reduce the occurrence of failure and mitigate its consequences.
Ripping everything out and replacing it wholesale is hard, but smaller changes are easier. Changing only one thing at a time is more manageable, and ensures changes stay small.
We can leverage this to our advantage by finding “seams” in the system we want to change.
These seams are the places where we could make a cut, add a small change, and stitch it all back together without having changed the whole system all at once.
Changes can be decomposed into discrete steps and implemented gradually, moving on to the next step only after establishing that the previous one didn’t destabilize the overall system.
For instance, if we want to upgrade Point of Sale equipment in a store, we will drastically reduce risk by replacing only one terminal at a time. If it fails, business can still go on because every other terminal is still working as usual.
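The one-terminal-at-a-time idea can be sketched in a few lines of Python. This is a minimal illustration, not a real deployment tool: `staged_rollout`, `upgrade`, `is_healthy`, and the terminal names are all hypothetical, and a real health check would likely watch each terminal for a while before moving on.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    targets: Iterable[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade targets one at a time and stop at the first unhealthy one.

    Returns the targets that were upgraded and verified healthy.
    """
    upgraded: List[str] = []
    for target in targets:
        upgrade(target)
        if not is_healthy(target):
            # Stop here: every remaining target is still on the old setup,
            # so the business keeps running while we investigate.
            break
        upgraded.append(target)
    return upgraded


if __name__ == "__main__":
    terminals = ["till-1", "till-2", "till-3", "till-4"]  # hypothetical names
    broken = {"till-3"}  # pretend the upgrade goes wrong on this terminal

    done = staged_rollout(
        terminals,
        upgrade=lambda t: print(f"upgrading {t}"),
        is_healthy=lambda t: t not in broken,
    )
    print("upgraded and healthy:", done)
```

In this sketch the rollout halts at the failing terminal and the last terminal is never touched, which is the whole point: the blast radius of a bad step is one terminal, not the store.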
When one of the discrete steps turns out to be problematic, it’s only a real problem if we aren’t able to roll the system back to its previous state, before the problem was introduced.
Building the new system alongside the existing one, and making sure both can run side by side for some time, means we aren’t closing any doors as we progress through the plan.
If something goes wrong, the ability to revert the change quickly and effortlessly is crucial to minimize the impact.
Organizations that have hard “points of no return” as part of rolling out mission-critical changes invest considerable resources to ensure the change is highly likely to work out. This is an option, albeit a very expensive one that very few organizations can afford. For all others without such a considerable budget, avoiding points of no return is much more realistic and cost-effective.
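In software, keeping the old and new systems side by side with a quick way back might look like the sketch below. It assumes both systems can answer the same request; the `ParallelRun` class and its method names are invented for illustration, and a real setup would also need to keep the two systems’ data in sync.

```python
from typing import Callable


class ParallelRun:
    """Keeps the old and new implementations side by side (hypothetical helper)."""

    def __init__(self, old: Callable[[str], str], new: Callable[[str], str]) -> None:
        self.old = old
        self.new = new
        self.use_new = False  # start on the old, known-good system

    def switch_to_new(self) -> None:
        self.use_new = True

    def roll_back(self) -> None:
        # Reverting is a single switch because the old system was never removed.
        self.use_new = False

    def handle(self, request: str) -> str:
        if not self.use_new:
            return self.old(request)
        try:
            return self.new(request)
        except Exception:
            # The new system failed: revert, then serve this request from the old one.
            self.roll_back()
            return self.old(request)
```

The design choice here is that rolling back costs nothing beyond flipping `use_new`, so there is no point of no return until the old system is deliberately retired.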
Data should be backed up as part of the rollback strategy, but backups could also mean having redundant equipment or resources on standby.
For instance, if your organization is switching over to new electric delivery vehicles to replace its (polluting but tried and true) internal combustion ones, it could make sense to hire extra drivers temporarily and have them on standby with the older delivery vehicles, ready to take over if the new ones fail.
Meeting the broken-down delivery vehicle and reloading its cargo onto the backup vehicle will take some time and delay deliveries, but only by a few hours rather than by days while the new vehicle is out of commission.
Once the major issues have been ironed out, it won’t be necessary to have so much redundancy anymore; the organization doesn’t commit to these costs beyond the initial phases.
Users fall on a spectrum from innovators to laggards, according to the technology adoption life cycle.
An organization can use this to its advantage by recruiting innovators and early adopters to try out the new change (with the appropriate warnings) while keeping the majority of users on the current version.
This gives a chance to try out the change, iron out the kinks, train internal people on the new version, and get feedback from users on the planned changes.
If anything goes wrong, it will only affect the small proportion of users who volunteered to try the change. With a rollback plan in place, it is no problem to revert these users to the current system.
Feature flags are an elegant way to implement this idea in software.
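As a rough sketch of what such a flag could look like, the Python below turns a feature on for volunteers first, then optionally for a growing percentage of everyone else, with a kill switch for rolling back. The `FeatureFlag` class, its fields, and the flag and user names are assumptions made for this example, not the API of any particular feature-flag product.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlag:
    name: str
    early_adopters: Set[str] = field(default_factory=set)  # volunteers
    rollout_percent: int = 0  # 0 = volunteers only, 100 = everyone
    enabled: bool = True      # kill switch: set to False to roll back

    def is_on_for(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        if user_id in self.early_adopters:
            return True
        # Hash the flag name and user id so each user lands in a stable
        # bucket from 0 to 99 and keeps the same experience between visits.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


if __name__ == "__main__":
    flag = FeatureFlag("new-checkout", early_adopters={"alice"})
    for user in ("alice", "bob"):
        version = "new" if flag.is_on_for(user) else "current"
        print(f"{user} sees the {version} checkout")
```

With `rollout_percent` left at 0, only the volunteers see the change; raising it gradually widens the audience, and setting `enabled` to `False` puts everyone back on the current version at once.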
Any organization needs slack, doubly so if it’s planning changes.
Having people 100% busy looks good on projections and budgets, but it means there is no spare capacity for handling unscheduled work. It also means that rushing becomes a necessity, which means cutting corners, which in turn makes failure both more likely and more consequential.
It is insidious because not having any slack appears to work when observed from a distance. Until it doesn’t anymore.
Reducing or eliminating slack might save money in the short term but it puts your organization several steps closer to failure and its associated costs (both reputational and financial). It can also increase stress on employees, increasing absenteeism and turnover.
A recent and ongoing example is the semiconductor crisis: supply chains had no slack, and as a result companies are struggling to produce enough to meet demand. Just-in-time production saved money while it worked, but the costs of its failure are very apparent now, compounded by the likely economic downturn we’re entering.
Every business has a quieter season or month when business is slower. Organizations should pick this time to introduce changes.
People within the organization will have more slack time to tackle any issues that come up, and they’ll be less likely to compound mistakes as there will be less stress.
Change is scary, costly, and failure-prone. Although failure can’t be avoided every time, organizations have some control and can drastically lower both the risk of failure and its magnitude. Using the strategies above, it is possible to implement change in a steadier, more confident, less stressful, and more deliberate way.