I met Benjamin Spector at one of the recent Agile Boston meetings. He told me a story that I liked a lot because it brought to life one of the key concepts I presented – the Transaction Cost vs Cost of Delay curve (from Principles of Product Development Flow by Reinertsen). I was able to persuade Benjamin to write up this story…
Waste Deep in Being Done – Or…Why it’s Shorter to Take Longer
“We should have finished a month ago.” That was the item with the most votes during the team’s last sprint retrospective. This last sprint completed the final production-ready feature the team had been working on. It was delivered just in time for the scheduled annual major product release. Everyone was decompressing from having faced the possibility of delivery failure. But even as we celebrated our success, there was a sense of disappointment that we had taken as long as we did.
I had been the team’s scrum master for about 6 months, starting up with them as this latest project began. The team was small, with 4 developers (3 full-time software engineers and 1 QA specialist) plus a product owner and a scrum master…me. When I started working with them, the team was practicing scrum at a beginner level. My initial focus was mainly on getting the team to perform basic agile practices more effectively for estimating and planning, as well as for daily standups, sprint reviews and retrospectives.
Over the course of the project the team was dealing with regression failures that came back to them 1-2 weeks or longer after their original code submissions. The problem was not terribly taxing on the team in the early stage of the project. We’d get 1 or 2 regression failure bug tickets and just include them in the next sprint backlog to fix them. Sometimes we didn’t get to fixing a regression failure until 2 sprints down the road. It didn’t seem like any harm to kick it to a future sprint. It was tolerable…or so it seemed.
The team’s practice was to submit production-ready code after running a suite of “smoke test” regression tests. The product itself was a complex CAD tool with over a million lines of code and up to 25,000 separate automated regression tests. Running the full suite of tests was an overnight process. Running a smaller subset of selected regression tests, focused mainly on the part of the code base the team worked in, was a common practice among all our scrum teams because it allowed for quicker turnaround. In general, it was felt that running the regression “smoke test” suite enabled everyone to deliver quicker at a relatively low risk to product quality. If a couple of regression failures slipped through the net, no one thought it was a big deal. In fact, this practice was explicitly called out as part of my team’s definition of done for user stories.
But the frequency of regression failures began to increase. As we got closer to the release deadline, there were more regression bugs to fix, and the time spent fixing them consumed a greater portion of the team’s capacity. As the scrum master, I had noticed the issue, and I wrestled with the question of when would be the best time to raise it with the team. We were within striking distance of our goal and the team was focused on finishing the project and complying with all the acceptance criteria for the project. Significantly, one of those criteria was delivery with zero regression failures.
About a week before we finished the project, I began reading Jeff Sutherland’s latest book “Scrum: The Art of Doing Twice the Work in Half the Time.” I came to the chapter called Waste is a Crime, and a section called Do It Right the First Time, and the words leapt off the page. Sutherland gives a powerful example with Palm Inc.’s research on the amount of time taken to fix bugs not found and fixed right away (page 99). The research showed that it took 24 times longer to fix a bug a week or more after code submission than if it was found and fixed while the developer was actively working on the code, regardless of the size or complexity of the defect. 24 times!
So there we were at the end of the project, with everyone experiencing the elation and relief of a successful delivery mixed with a sense of disappointment that we did not finish as quickly or as cleanly as we had expected. “We should have finished a month ago.” Why didn’t we?
It was at this moment that I jumped up and said, “Hang on just a second. I’ve got to get something at my desk that will be really interesting for everyone to hear.” I bolted out of the conference room, ran to my desk, grabbed the Sutherland book, and returned slightly breathless with the page opened at the passage describing the Palm Inc. research about bug fixing taking 24 times longer. The gist of the Palm Inc. story was about one and a half pages long, so I asked the team’s permission to read it aloud before we continued with our retrospective discussion. Everyone agreed with some amusement and curiosity about what I was up to. When I finished reading the passage, I could see the impact in the eyes of every team member. The team members began looking at one another, recognizing this shared insight. That’s the moment when I knew I had their attention.
I put the question to the team, “How many regression bugs have we fixed since we started this project?” The answer was 35. I had already sized the problem when I began monitoring it closely over the last 3 sprints. I quickly showed them my query in the Jira database, displaying the search criteria and the tally of regression bugs on the conference room overhead projector. Everyone agreed that it was valid.
Then I asked, “On average, how long does it take us to fix a regression bug?” We started opening up individual records so we could see our tasked-out estimates for each one. Examples typically ranged from 8 hours to 16 hours, including tasks for analysis, coding, code review, adjustment of some of the tests themselves to accommodate the new functionality, submission for regression testing and final validation. Some took a little more time. Some took a little less. After a few minutes of review, the team settled on the figure of 12 hours of work per regression bug. So, I did the simple arithmetic on the white board: 35 x 12 = 420 hours. Then I applied the “24 times” analysis: 420 / 24 = 17.5. I said, “If the rule holds true, then if we had fixed the regression bugs at the time they were created, in theory it would have taken only 17.5 hours to fix them, not 420 hours.” Then I doubled the number just to make everyone a little less skeptical. 35 hours seemed more reasonable to everyone. Nevertheless, it was still a jaw-dropping figure when compared with 420 hours. While I stood pen in hand at the white board, everyone on the team sat in stunned silence. While they were absorbing the impact of this new insight, I took to the white board again and wrote down 420 – 35 = 385 hours. Then I reminded them of our sprint planning assumptions. “Based on our team’s capacity planning assumptions, we plan for 5 hours per day per person for work time dedicated to our project. For the 4 of you that equals 100 hours per week of work capacity.” I completed the simple arithmetic on the white board showing 385 / 100 = 3.85 weeks, underlining and circling the 3.85 weeks. Then I pointed back to the retrospective item with the most votes and said, “There’s your lost month.”
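For readers who want to run the same white board arithmetic against their own team's numbers, here is a minimal Python sketch. It is only an illustration of the calculation described above, not a tool the team used: the function name, the parameter names, and the `RegressionCost` container are all made up for this example, and the 5-day work week is an assumption implied by the 100 hours per week figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionCost:
    """Hypothetical container for the white board figures."""
    actual_hours: float         # hours actually spent fixing regressions late
    immediate_fix_hours: float  # estimated hours if fixed right away
    hours_lost: float           # difference between the two
    weeks_lost: float           # hours lost expressed in team-weeks


def estimate_lost_time(
    bug_count: int,
    hours_per_bug: float,
    delay_multiplier: float,      # the "24 times" figure quoted from the book
    skepticism_factor: float,     # the story doubles the immediate-fix estimate
    team_size: int,
    hours_per_person_per_day: float,
    days_per_week: int = 5,       # assumption: a 5-day work week
) -> RegressionCost:
    actual_hours = bug_count * hours_per_bug
    # Estimated cost had each bug been fixed when it was introduced,
    # padded by the skepticism factor to stay conservative.
    immediate_fix_hours = actual_hours / delay_multiplier * skepticism_factor
    hours_lost = actual_hours - immediate_fix_hours
    weekly_capacity = team_size * hours_per_person_per_day * days_per_week
    return RegressionCost(
        actual_hours=actual_hours,
        immediate_fix_hours=immediate_fix_hours,
        hours_lost=hours_lost,
        weeks_lost=hours_lost / weekly_capacity,
    )


# The numbers from the retrospective: 35 bugs at 12 hours each,
# a team of 4 planning 5 project hours per person per day.
result = estimate_lost_time(
    bug_count=35,
    hours_per_bug=12,
    delay_multiplier=24,
    skepticism_factor=2,
    team_size=4,
    hours_per_person_per_day=5,
)
print(f"Spent: {result.actual_hours:.0f} h, "
      f"could have been: {result.immediate_fix_hours:.0f} h, "
      f"lost: {result.hours_lost:.0f} h ({result.weeks_lost:.2f} weeks)")
```

Keeping the multiplier and the skepticism factor as separate inputs makes it easy to test how sensitive the "lost month" is to the 24x figure, which is a quoted research result rather than something measured on this team.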
When our retrospective ended we left the meeting with a significant adjustment to our team’s definition of done. We replaced the “smoke test” regression testing requirement with the practice of always running the full regression test suite on the code submitted for a story and resolving all regression failures before considering the story done. This change was made with the enthusiastic and universal agreement of every team member. Everyone recognized that it would take longer to finish each story up front. But, they were happy to accommodate the extra time because now we all knew, without even the slightest doubt, that even though it would take longer to finish the story the first time, it was always going to take a lot less time than having to go back and really finish it later.
Benjamin Spector has worked as a software product development professional and project manager for over 20 years. For 4 of his last 9 years at Autodesk, Inc., he has worked as a scrum master for several teams and as a full-time agile coach introducing and supporting agile practices throughout the organization. Reach out to him on LinkedIn.