30 Minutes Or Less

One of the areas that I'm responsible for within Atlassian is shard migrations, which is the movement of workloads (typically data) from one service shard to another for both customer-facing and operational purposes.

At the start of the last financial year, one of my goals was to think about how the shard migrations experience could be better, and to then turn those thoughts into actual outcomes.

You know, because no-one pays you to just think about stuff.

What The Hell?

The mechanical process of moving customer workloads between shards was mostly solved before I started working at Atlassian and stuck my head into the area.

There is an orchestration service; it collects estimates from the other services about how much time it will take to move the customer, it aggregates all of those estimates together to set expectations, it schedules things for execution, it takes the customer offline (to ensure data consistency), and then it tracks the execution of each one of those downstream migrations, rolling everything back if any of them fail.

Despite what it might look like when summarised into a single paragraph, it is not a simple process, requiring a huge amount of effort that I can take no credit for.
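
To make the moving parts a little more concrete, here is a deliberately simplified sketch of that flow. The class, method, and customer names are my own illustrative placeholders, and the way the estimates are combined is an assumption on my part, so treat it as a sketch rather than the actual orchestration service.

```python
# A deliberately simplified, hypothetical sketch of the orchestration flow;
# names and the downtime model are illustrative assumptions.

class DownstreamService:
    def __init__(self, name: str, minutes: int, fails: bool = False):
        self.name, self.minutes, self.fails = name, minutes, fails

    def estimate(self) -> int:
        return self.minutes          # how long this service thinks it needs

    def migrate(self) -> bool:
        return not self.fails        # pretend to copy data; report the result

    def rollback(self) -> None:
        print(f"{self.name}: rolled back")


def orchestrate(customer: str, services: list[DownstreamService]) -> bool:
    # 1. Collect estimates and aggregate them to set expectations; taking the
    #    max assumes the downstream migrations run concurrently, which is my
    #    assumption rather than a documented detail.
    print(f"{customer}: expected downtime ~{max(s.estimate() for s in services)} min")

    # 2. Take the customer offline so nothing mutates data mid-copy.
    print(f"{customer}: offline")
    completed = []
    try:
        # 3. Track each downstream migration (shown sequentially here for
        #    simplicity), rolling everything back if any of them fail.
        for service in services:
            if not service.migrate():
                raise RuntimeError(f"{service.name}: migration failed")
            completed.append(service)
        return True
    except RuntimeError as failure:
        print(failure)
        for service in completed:
            service.rollback()
        return False
    finally:
        # 4. Bring the customer back online whether we succeeded or not.
        print(f"{customer}: back online")


orchestrate("customer-123", [DownstreamService("service-a", 45),
                             DownstreamService("service-b", 90)])
```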

The main problem with it is that it kind of assumes that the copying of the customer data can be done in a relatively short amount of time, because for that entire period, the customer is unable to use any products related to that data.

As the customer data footprint gets bigger, the customer-initiated migrations become more painful, but are still mostly workable, because the customer gets something out of the equation, usually some form of data sovereignty guarantee.

Notice that I said mostly workable, because it was actually a serious blocker for some customers, especially the ones with huge amounts of data that could require days of downtime.

For the operational migrations though, the increasing amount of downtime required was a massive issue, as the customer impact fundamentally decreases the palatability of using shard migrations to resolve operational issues.

Thus, for the last few years, ever since I started at Atlassian really, there has been a consistent drive for less downtime for shard migrations.

Bringing it all back to the intro, for the financial year that just finished, we had a specific goal to halve the rolling 60-day P99 shard migration downtime for paying customers.

Is That Thing Real?

To leave you hanging, I'm not going to talk about whether or not we achieved our goal just yet, because it's important to understand the plan of attack for how we were going to achieve it.

You know, so you can appreciate the outcome even more.

The first part of the plan was to offer a new platform capability for fulfilling a shard migration that doesn't just take the customer offline and then copy the data from the source to the destination during that downtime.

The keen-eyed among you may remember a blog post from last year that opined on this very topic, but to save you the effort of reading that entire thing, the general idea was to:

  • establish replication of data between source and destination shard
  • wait for that replication to reach a stable state
  • take the customer offline
  • deal with any final differences between the two data sets
  • officially cutover the source of truth
  • bring the customer back online

The system is still copying the same data from the source to the destination, possibly over the same amount of time, but the benefit is that replication completely breaks the relationship between downtime and the size of the data.
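
To illustrate, here is a hypothetical sketch of that replication-then-cutover sequence. The function names, the lag threshold, and the polling interval are placeholders of my own invention, not the real platform capability.

```python
# A hypothetical sketch of replication-based fulfilment; all names, thresholds
# and timings are illustrative assumptions.
import time


def replication_lag_seconds(customer: str) -> float:
    return 0.5                       # stand-in for a real lag measurement


def replicate_then_cutover(customer: str, source: str, destination: str) -> None:
    # 1. Establish replication between the source and destination shards,
    #    then wait for it to reach a stable state (here: lag under a second).
    print(f"replicating {customer}: {source} -> {destination}")
    while replication_lag_seconds(customer) > 1.0:
        time.sleep(30)

    # 2. Only now take the customer offline; the remaining delta is tiny, so
    #    the downtime no longer scales with the total size of the data.
    print(f"{customer}: offline")
    try:
        # 3. Deal with any final differences, then officially cut over the
        #    source of truth to the destination shard.
        print(f"{customer}: applying final delta")
        print(f"{customer}: cutting over source of truth to {destination}")
    finally:
        # 4. Bring the customer back online.
        print(f"{customer}: back online")


replicate_then_cutover("customer-123", "shard-a", "shard-b")
```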

The second part of the plan was to optimise the actual shard migration orchestration, which, being quite a few years old at this point, had built up a decent amount of cruft and special flower logic that was slowing things down unnecessarily.

For example, one of the biggest services within Atlassian, when it was originally added into the shard migrations process, required a meaningful amount of wait time after being informed a migration was happening, in order to ensure that it correctly shut all the things down and stopped mutating data.

You know, like a Thread.Sleep writ large.

Now, there are obviously better ways to deal with that sort of requirement, but it's also entirely possible that the requirement isn't even a requirement anymore, because a lot can happen over a few years.
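
For example, assuming that the requirement even still exists, one of those better ways would be to poll the service for explicit confirmation that it has stopped mutating data, and keep the fixed wait only as a worst-case timeout. The sketch below is purely illustrative; the function names and timings are my assumptions.

```python
# Hypothetical comparison: a fixed worst-case wait versus polling for explicit
# confirmation that the service has quiesced. Names and timings are assumptions.
import time


def notify_service_of_migration() -> None:
    print("migration starting, please stop mutating data")


def service_confirms_quiesced() -> bool:
    return True                      # stand-in for a real status check


def fixed_wait_approach() -> None:
    notify_service_of_migration()
    time.sleep(600)                  # the Thread.Sleep writ large: always ten minutes


def confirmation_approach(timeout_seconds: int = 600) -> None:
    notify_service_of_migration()
    deadline = time.monotonic() + timeout_seconds
    # Finish as soon as the service confirms it has quiesced, keeping the
    # original fixed period only as a worst-case timeout.
    while time.monotonic() < deadline:
        if service_confirms_quiesced():
            return
        time.sleep(5)
    raise TimeoutError("service never confirmed it had quiesced")


confirmation_approach()
```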

Suffice to say, there was plenty of room for good old-fashioned optimisation.

Yes, It's Real, And It's Going To Blow!

I swear I'm not trying to draw this out or anything, but I'm still not going to tell you whether or not we achieved our goal. Instead, let's add some seasoning to the overall dish by talking about challenges.

The first of these was that even if you create a brand-new, fundamentally more awesome shard migration fulfilment pattern, that doesn't mean every service is immediately going to adopt it.

This is pretty normal in the scheme of things, as you can go to the effort of building a lake and then lead all the horses to it, but if they are not appropriately motivated from a business perspective, they aren't going to take a sip.

A weirder metaphor than I originally intended, but you get my drift.

In this case it was particularly frustrating though, as the nature of shard migrations means that even a single service that is not following the new pattern can stop you from achieving your goal, because it anchors the downtime for every single migration it is involved in.
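
To put some toy numbers on that anchoring effect (and assuming, as before, that the downstream migrations run concurrently during the offline window), the overall downtime is roughly whatever the slowest service needs:

```python
# Toy illustration of one non-adopting service anchoring the downtime; the
# numbers are made up and the "downtime equals the slowest service" model is
# an assumption on my part.
per_service_downtime_minutes = {
    "adopted-service-a": 5,
    "adopted-service-b": 8,
    "legacy-service": 360,   # still copies everything while the customer is offline
}
print(max(per_service_downtime_minutes.values()))  # 360: the laggard sets the floor
```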

As always with this sort of thing, the mitigation is clear communication, proactive prioritisation, acquiring support from leadership and all of those other things that I've talked about a hundred times on this blog already.

Also, sometimes you just have to offer to do the work for them, because you're more interested in the outcome than they are.

The second challenge was, somewhat ironically, the nature of the metric itself.

A rolling 60-day P99 for any data point is a great way to show long-term trends, because the rolling period smooths out spikes and the P99 gives you a single value that 99% of the data points sit at or below.

It's also pretty hard to tell if you're making a difference though, especially when you do ten smaller things and have to wait for two months to see whether or not those things actually did help in aggregate.

We mitigated this by augmenting the metric with some light simulation calculations, hypothesising what the whole picture would look like if the raw data was changed this way or that, but it was pretty high effort for not a huge amount of return towards meaningful decision making.
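
For reference, a metric like that could be computed along these lines; the data shape, the inclusive window boundaries, and the use of Python's statistics.quantiles are all assumptions about a hypothetical implementation, not how our reporting actually works.

```python
# Hypothetical sketch of a rolling 60-day P99 over migration downtime samples.
from datetime import datetime, timedelta
from statistics import quantiles


def rolling_p99(downtimes: list[tuple[datetime, float]], as_of: datetime,
                window_days: int = 60) -> float:
    """P99 of downtime minutes for migrations completed in the trailing window."""
    cutoff = as_of - timedelta(days=window_days)
    in_window = [minutes for completed, minutes in downtimes
                 if cutoff <= completed <= as_of]
    # quantiles(..., n=100) returns the 1st..99th percentile cut points;
    # the last one is the P99 (needs at least two samples in the window).
    return quantiles(in_window, n=100)[-1]


# Made-up sample data purely to show the shape of the calculation.
now = datetime(2025, 7, 1)
sample = [(now - timedelta(days=d), 30.0 + d) for d in range(0, 90, 3)]
print(f"rolling 60-day P99: {rolling_p99(sample, as_of=now):.1f} minutes")
```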

And Your First Thought Was To Bring It Here?

Okay, the big moment has arrived.

We achieved our goal! 🎉

The rolling 60-day P99 shard migration downtime for paying customers is half what it was when we started, which is a pretty awesome result all things considered.

The best part? We're not even close to being done yet, because we achieved that mostly through the optimisation of the migration orchestration process, as we didn't get a lot of adoption on the replication-based migration fulfilment.

But we will.

Because that adoption is going to be needed to help Atlassian achieve some pretty big things over the next couple of years, which means everyone now has a lot of business motivation to do the thing that we want them to do.

And it's nice to be on the other side of that equation for once.