Migrating blind

We had to make substantial changes to the data structure of our legacy codebase. None of us had more than a few months of experience with working on this codebase, and the only thing we knew for sure was that there were millions of lines of dead code in there. And the changes were time-sensitive, too. How do you move fast in this kind of minefield?

We had no alternative to migrating a crucial set of data blindly: We did not know nor understand all the ramifications and side-effects this might entail. We could not take the time to thoroughly understand these things while at the same time we could not afford big bugs or lengthy downtimes. It was the core business logic, after all.

So instead, we spent much time discussing how to mitigate the risks of migrating blindly, and we picked this strategy:

First, we deployed the new data structure in parallel. We added all the code needed to handle data the new way and then modified the old code to self-correct the old data. Whenever the old models would be instantiated, they would transform their data and store them in the new structure and then proceed to self-delete and pass handling on to the new code. Most importantly, though, they would send a message to our error handling service, thus giving us a good overview of all the places in the code, where the old models were still in use.

That became our to-do list to clean up all the left-over pieces, and it's also where we had the most surprises, finding areas that we would never have thought of, and that might well have taken us down.

Slow is smooth, smooth is fast.

All in all, this approach went much smoother than expected, despite taking some more time initially. It did save us a lot of work, headaches, and angry stakeholders down the road.

Forced clean-up

We were worried, we might fall into that same trap with this, too, abandoning it before we had finished all the work. That company had a history of half-baked, half-done re-writes and migrations, and to address this one colleague came up with this great approach:

We picked a date and made sure all tests relating to self-correction would fail after that date. We also added a self-destruct mechanism to the legacy code that needed to be removed: a week after the test failure date, this code would error when being loaded, effectively killing our environment until we removed it.

This time bomb helped us make sure we cleaned up even the last deprecated code and database tables left.

By @cmw@dysfunctional.technology