Spring Cleaning at OverOps: How (and Why) We Changed Our DB Cleaning Strategy

 ● 18th Jun 2019

8 min read

Posted by Aviv Danziger, one of the full stack developers on OverOps’ R&D team.

There comes a time in the life of any application that small things we let slide become unignorable issues.
For us, it happened when, after years of writing and executing code, our DB’s free disk space started to run out. Each passing day brought us closer to our eventual doom, and we finally had to allocate the time to fix the problem.

We didn’t want to go with a “quick and dirty” solution. Instead, we went for the adult approach of researching, testing and eventually implementing what was right for us. This is the story of this process, and how it made our DB so much happier.

Our Old Process

There’s no need to sugarcoat it, our old DB cleaning process was to delete records by using a WHERE clause with date ranges. Yup, that simple.

But, as you can imagine, this method had many issues. Some of these issues were so detrimental that we had to turn the whole process off. How did we get to that point? Let’s take an in-depth look into our old process, starting from the basics:

When creating a clustered index on a table in MySQL InnoDB engine, the whole table is indexed physically in the disk, which means that every time a record is deleted, all of the other records in the disk need to be reordered to keep the index’s integrity. And of course, the larger the table the slower the whole deleting process takes.

Eventually, we got to a point where it was taking us too long to delete records, and we had to turn this process off. Instead, we wanted to refactor this into an agile and easy process that’ll help us keep our DB clean without wasting too much time doing it.

When we first approached this, we wanted to buy some time to fix the process the right way. That led us to research whether a small change in the table structure would supply enough memory for our needs.

One thing that’s worth mentioning at this point, our row’s ID is a composite of a few or more columns. One of the common practices for this is having an identity table and a data table, which in return help in saving disk space.

This means that you can store 1 single ID column instead of the 3, 8 or 20 columns that this ID is composed of. In our case, our table had a composite key consisting of 4 columns: request_id, machine_hash, agent_hash and deployment_hash.

Enter the new approach we were aiming for: We created a new “ID” table holding unique records of the above tuple and a generated id for each row. We then replaced these 4 columns from the “DATA” table with the single generated id column.

To our surprise, merging these columns improved memory consumption by 27%. Pretty nice, right? But this method also increased our query time by up to 300%. Yikes. As you can guess, this is too detrimental to our app. So we went back to the drawing board.

And Now For Something Completely Different

Our second approach was a bit different – we decided to add partitions in all of our tables. Our hope was that partitions would simplify our cleanup process, as well as improve query performance.

When we create a table with partitions, the MySQL engine divides the table into smaller physical files, separate from one another. These files can even “sit” in different places on the hard drive, reducing table fragmentation.

Once we were talking about physical files, the whole deletion process became much easier. All we had to do was run DROP PARTITION {partition_name}. Deleting partitions from a table is a simple procedure since the MySQL engine can remove the physical file from the disk, no index updating shenanigans. Quick and easy.

And just like that, we’re done! Or, are we…?

Adding partitions to a table that didn’t have them up until this point means that the MySQL engine now needs to create a new table and move all of the data to that new table (since we changed the physical structure of the table in the disk).

And the bigger the table, the more memory we need to run. Memory consumption was our original obstacle, and again we were facing a memory issue with this new approach.

Help Us, RDS-Kenobi

Before running off to find a third approach to handle our database issue, we wanted to try to solve the issues with our latest solution. After some thought, we decided to write a migration mechanism that would transfer the data from the old RDS to a new one.

We started working on writing the code for this process, and several issues came up. The first one was batching. It was actually one of the more significant issues we had to face, since sending individual INSERT statements would be ineffective.

We tried using PreparedStatement’s batch function, but it didn’t group the different INSERT statements. It was then decided that we would write the batching mechanism ourselves, using simple string concatenation.

While testing the process, we found out that some tables were so big that we couldn’t load all of their data into the app’s memory. We decided we would calculate the average daily data size for each table and then used that to determine shorter date ranges for the big tables.

Luckily for us, we had an easy time testing because of RDS’s “create from snapshot” function, that creates a new RDS instance based on a copy of our production environment. Thanks to that feature we were able to test the process on our real data and see how it actually works.

It’s Time to Test This Out

As we tested this solution on real production data, we saw that the biggest change between running this in development vs. production was the time it took for the process to run from start to finish. It took about 24 hours (yep, a whole day).

After coming so far, we were so close that we didn’t want to give up at this point. Instead, this prompted us to rethink our migration strategy. We came up with a nice solution:

  • First we would move all the old data according to our SaaS data retention policy (3 months ago until today)
  • Then, a day later, we would run it again for the day that passed (the process runs for 24 hours, remember?)
  • Finally, we would flip the switch so that our backend server would run against the new RDS instance. We would then move the rest of the data that arrived since we started our second run.

Confused? Here’s a sketch:

Last but not least, we wanted to check that the migration worked properly without any glitches and that all of the data arrived safely. After some discussion, we decided to compare 2 things to make sure we have the right data: a simple row count and a random sampling of data.

This validation process had a lot of errors while we were testing it on our Dev environment, and we were scratching our heads trying to understand what we missed along the way. After excessive research we found out (more like remembered 😬) that we had a lot of garbage information in our Dev DB. This garbage consisted of some old services that needed to be deleted entirely, as well as some corrupt data from half-baked features.

Most of these problems and, well junk, weren’t relevant for our Production environment, but it was a good fool proofing for the whole process. It sure was a nice way to test our problem-solving techniques.

Full disclosure: we had another issue where we saw a difference in the sampled data between our summary tables. However, that discrepancy was OK since these table’s rows are updated quite frequently for some of our customers.

The Results

I think graphs speak louder than words, so here is the aftermath:

And our RDS graph 3 months after we implemented this process:

So, What Did We Learn?

As you can guess, this process was long and tiring, but had to be done sooner or later. We decided to stick with our second approach and work our way through it until we found a working solution that eventually paid off. Some things we learned along the way:

  • Clean up your shit – It’s always a good idea to prepare a clean up procedure to avoid situations where you’re scrambling for that extra GB of storage.
  • Always plan ahead – Create tables with partitions in advance! Besides the obvious performance gain, it’s easier to manage your tables when they’re partitioned.
  • Don’t just take our word for it – always perform research! It seems that every other day we discover a new DB engine, and with each new engine come new ways of doing things. While the basics should stay the same, it’s always a good idea to do your own research and reflect on how your architecture is different.

Last but not least, what did we gain from the process? We can now decide to change the amount of data we save with “the flip of a switch” and the DB will update accordingly.

Have you ever found yourself in a similar situation? How would you approach this problem? Let us know in the comments!

Aviv is a full stack developer on OverOps' R&D team. When he is not raiding dungeons or toppling governments under the alias Ibahen, you can probably find him at a music festival with a beer in his hand.

Troubleshooting Apache Spark Applications with OverOps OverOps’ ability to detect precisely why something broke and to see variable state is invaluable in a distributed compute environment.
Troubleshooting Apache Spark Applications with OverOps

Next Article

The Fastest Way to Why.

Eliminate the detective work of searching logs for the Cause of critical issues. Resolve issues in minutes.
Learn More