What I Learned From Data Loss

Adam Savard
Mar 29, 2022 · 7 min read

I’ve been meaning to do a post-mortem on this ongoing situation for over a week now; there always seems to be something more important to deal with than talking about a stupid mistake in software development. But I’m sitting here waiting for a delivery, listening to Enya, petting my cat. I have nothing but time, so let’s tell a story about data loss!

Note: all of this happened due to a string of poor decisions, technical debt, and the reality of deadlines. No one person is responsible for it; any of us could have seen the issues and corrected them as they arose.

I work for a company that, up until this point, hasn’t really had the time to implement many best practices, because doing so would require more manpower than is available. If that sounds like a startup to you, you get a gold star! We’re just coming out of the startup phase, so now we get to think about many different things that we genuinely couldn’t before.

While security has always been on my mind, for example, the reality is that we don’t have massive botnets attacking our servers, trying to brute-force their way to API keys, scraping data, and so on. And while I think about performance in our application, especially on the API/DB side, the reality is that with fewer than 100 main documents, it doesn’t make much sense to worry about serving large chunks of data at lightning speed, especially when our average daily user count is quite literally 0. We’ve gotten some contracts, but nobody depends on our software. Not yet.

While this is freeing in many ways, the weight of “we can wait until later” has certainly kept piling up, and we’re starting to see that parts of the platform are cracking ever so slightly.

The Architecture

So, to understand the problem that another staff member and I discovered, I need to explain a little bit about how our architecture works.

Without going into excruciating detail: we run a MongoDB instance with about 10 distinct collections. While each of these collections is linked to the others in some way, the links aren’t enforced at the database level, but rather through the abstraction provided by Mongoose, which our API uses to interface directly with the database.

Our API builds a series of JSON objects that are largely 1:1 with the way they’re stored in the database. The structure is, in a nutshell:

[Figure: a flowchart representing the layout of the API/DB structure]

The important part to know here is that each of these collections, except for Campaign Scenarios, is independent, with no actual working relationship other than the ones defined in the API. The collections can also take a number of completely optional fields. Because those fields are optional, neither the database nor the API will throw an error when data the client software expects is missing or unexpected. You might be starting to see the issue.
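To make that concrete, here’s a minimal sketch of what one of these Mongoose models might look like. The model and field names are hypothetical, not our actual schemas; the point is that the “link” is just an unenforced reference, and the optional fields fail silently when they’re missing.

```js
// Hypothetical sketch — not our real schemas; model and field names are invented.
const mongoose = require('mongoose');

const campaignSchema = new mongoose.Schema({
  name: { type: String, required: true },
  // The "link" to a scenario framework is just an ObjectId reference.
  // MongoDB doesn't enforce it; only Mongoose (and our API) knows it exists.
  scenarioFramework: { type: mongoose.Schema.Types.ObjectId, ref: 'ScenarioFramework' },
  // Optional fields: if they're missing, neither the DB nor the API complains,
  // even though the client software may expect them to be present.
  defaultWeather: String,
  startingBudget: Number,
});

module.exports = mongoose.model('Campaign', campaignSchema);
```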

The Setup

Scenario Frameworks are very basic documents. They contain a small number of fields that are used to populate defaults for each campaign related to them; this allows us to quickly update `n` campaigns by changing a value in a scenario framework and issuing the proper API request, which automatically takes care of updating all affected campaigns.
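As a rough illustration of the idea (the real plugin differs, and the model and field names here are invented), the cascade boils down to: change the framework, then copy the changed defaults into every campaign that references it.

```js
// Hypothetical sketch of the cascade — not the actual plugin; names are invented.
const ScenarioFramework = require('./models/scenario-framework'); // hypothetical path
const Campaign = require('./models/campaign');                    // hypothetical path

async function cascadeFrameworkChange(frameworkId, changes) {
  // 1. Apply the change to the scenario framework itself.
  const framework = await ScenarioFramework.findByIdAndUpdate(
    frameworkId,
    { $set: changes },
    { new: true }
  );

  // 2. Copy the updated defaults into every campaign that references this framework.
  await Campaign.updateMany(
    { scenarioFramework: framework._id },
    { $set: changes }
  );
}
```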

When we decided to re-implement our internal CLI software that manipulates the API data, one of the things that was missed was a small plugin to update/cascade the Scenario Framework changes. We made a ticket for it and decided we’d implement it when the need arose.

Pesky reality stepped in again, and we needed the plugin. We needed it yesterday; we had about 30 different campaigns that all needed updates because of a project that was going to ship in a month. No problem, it’s a really simple plugin! We’ll rewrite it and run things ASAP.

So, one of us rewrites the plugin. It seems to work just fine; you change the data in one scenario framework, and it cascades the change to every affected campaign. Perfect! Let’s pull that into master and start using it.

A Small (but important) Note

Something to note about the way our development works: we have 3 different databases. Our local installs, the “master” default database (used for sharing data between devs), and the production database. Generally, our workflow is: make local changes ==> push to the master DB ==> pull changes into production.

So there’s a small UX problem with the plugin: it only operates on our local instance. That means all the campaigns on the master database would need to be updated one at a time and then pulled into production. That’s a lot to work through; not to mention, some of these campaigns only exist on production, so we’d need a solution for that too. We could do it manually, but that deadline is fast approaching; we simply don’t have time to dedicate 4 hours to making sure each campaign gets a small tweak to insert a new number into it.

Well, the plugin works on local, right? One small command-line option and you can run the same plugin on the master DB, and even the production DB. That solves our issue! Now we have time to figure out other things.
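To give a sense of how small that switch was, imagine something like this (the flag name and environment variables are invented, not our actual CLI):

```js
// Hypothetical sketch of the "one small command-line option" — invented names.
const mongoose = require('mongoose');

const targets = {
  local: 'mongodb://localhost:27017/app',
  master: process.env.MASTER_DB_URI,
  production: process.env.PRODUCTION_DB_URI,
};

// e.g. `node cascade.js --target=production`
const arg = process.argv.find(a => a.startsWith('--target=')) || '--target=local';
const target = arg.split('=')[1];

// One word on the command line is all that separates a harmless local run
// from rewriting every campaign in production.
mongoose.connect(targets[target]);
```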

The Discovery

We ship the final project, and the CTO goes on vacation. The rest of us start working on tickets and closing them as best we can. About a week and a half into this, another senior dev and I are investigating a strange bug that’s unrelated to a new PR I’m reviewing for them. An old map of ours is breaking in the Simulation, acting like it can’t see a property it expects.

This leads us down a rabbit hole of trying to figure out why the map is breaking. Is it the PR branch? No, it happens on dev too. Does it happen to both of us? As a matter of fact, it does. Okay, that could mean code changes sometime in the last week and a half are the culprit, since I’m pretty sure it was working a few weeks ago. So, as a sanity check, let me quickly look at production, since that build is about 3 weeks old -

Oh no. It’s broken on Production too. Actually, the error is related to a property that should be in the campaign; let’s check on my local branch, and production…

The data is gone. There’s no trace of it anywhere; it’s like it never existed. It’s at this point that I start to suspect just how bad this is. I mentally retrace the steps to get the data into the campaign, and arrive at the scenario framework for it.

The Scenario Framework is COMPLETELY empty.

There is no configuration data for the maps that use this common framework. Not every framework has this issue; many of them were largely blank anyway, mostly used in testing. This framework, though… it was our first. It’s used for almost every map we use to demo the software to clients. This isn’t just one map acting up; all the campaigns are now infected with a healthy dose of amnesia, because our Scenario Framework is just… gone.

When I realize this, my first thought is “okay, this can’t be that bad; there wasn’t that much data in there, we can just rewrite it. The object was only about a dozen properties anyway.” One slight issue with that, though… neither the other dev nor I can remember any of the data. After all, we haven’t looked at it in ages; it just worked. Why would we remember any of it?

Then out goes the “uh… anyone happen to have a backup copy of the data for these campaigns?” to the Slack channel. The CTO might have the data, but he’s still on vacation; can’t ask him. The guys who give demos have local installs; maybe they haven’t updated? Unlikely, since standard procedure is “update everything before a demo to get the latest changes.” Okay… well, that leaves the devs. And all of us have updated. We do it almost daily.

Thankfully, one of the junior devs comes in clutch. He has an old copy of the returned JSON for all the campaigns from 3 months ago. Why, I do not know, but I was incredibly grateful for it, because it allowed us to completely rebuild the lost data. We went from data loss that could have been a major setback to sweet, sweet reprieve.

Post-Mortem

How did this happen?

Good question. The data was lost at some point, and then the updater plugin was run, which made those changes permanent. Worse yet, the affected framework was pushed out some time later, and when the plugin was run against the master and production DBs, the infection spread. As best I can tell, that’s the series of events.

What could we have done to prevent this?

- Never, never, NEVER run code directly against the production and master DBs (see the guard sketch after this list)

- Isolate dev processes more from critical systems to allow for these things to be caught earlier

- Check that every map works after data manipulation, not just the ones you’re testing

- Better testing practices in general
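If we were writing that plugin again, a cheap guard along these lines would at least have forced a pause before touching the shared databases. It’s a hypothetical sketch with invented names, and no substitute for real process changes, but it captures the idea: make destructive runs against shared DBs opt-in, and make a dry run the easy default.

```js
// Hypothetical guard sketch — invented names, not our actual plugin.
const Campaign = require('./models/campaign'); // hypothetical path

async function guardedCascade(target, frameworkId, changes, opts = {}) {
  const shared = target === 'master' || target === 'production';
  if (shared && !opts.confirm) {
    throw new Error(`Refusing to modify ${target} without an explicit --confirm`);
  }

  // Report what the run is about to touch before touching anything.
  const affected = await Campaign.countDocuments({ scenarioFramework: frameworkId });
  console.log(`${affected} campaign(s) would be updated with`, changes);

  if (opts.dryRun) return; // dry run: report only, change nothing

  await Campaign.updateMany({ scenarioFramework: frameworkId }, { $set: changes });
}
```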

What did we learn?

So… backups. They’re important. We have backups of the data going back a month or so, but restoring one means rolling back multiple small changes, and that can be just as damaging as the data loss itself.

We’re thinking through a way to back up everything at once, and also to roll back small changes without becoming data hoarders.
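One idea we’re kicking around (a sketch, not a finished policy): snapshot the documents a bulk operation is about to touch right before it runs, so a single bad run can be undone without rolling the whole database back a month. Names below are invented.

```js
// Hypothetical sketch: dump the documents a bulk operation is about to touch
// into a timestamped JSON file before changing anything.
const fs = require('fs');
const Campaign = require('./models/campaign'); // hypothetical path

async function snapshotCampaigns(filter) {
  const docs = await Campaign.find(filter).lean();
  const file = `backup-campaigns-${Date.now()}.json`;
  fs.writeFileSync(file, JSON.stringify(docs, null, 2));
  console.log(`Wrote ${docs.length} campaign(s) to ${file}`);
  return file;
}
```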

Data manipulation is destructive and should never be taken lightly. We knew that, but what was head knowledge has now become deeply-rooted soul knowledge.


Adam Savard

A software developer living in Canada, with experience in JavaScript and the Pixi.JS framework