
The Bad Batch

I've never watched it, but I hear it's good. All credit to Disney I suppose?

One of the most important capabilities for any service sharding platform is the ability to move workloads from shard to shard.

Moving a stateless workload between shards is pretty straightforward and not really worth discussing. All you need to do is change the relevant routing information.

Moving a stateful workload is an entirely different beast though, because you need to move data, and moving data can be difficult and time consuming depending on a variety of factors.

And it gets even worse when you need to move a bunch of stuff together.

I Did What I Thought Was Right

The act of moving a stateful workload from shard to shard is called a shard migration.

At least that's what I call it anyway.

In order to successfully execute a shard migration for a single workload, you need to be able to:

  • schedule the movement of the workload, setting appropriate expectations with anything that depends on the workload so they can react accordingly
  • provide ways to ensure data consistency is maintained, like removing or restricting write access to any underlying data stores, or rolling partial changes back if an error occurs later in the flow
  • actually execute the movement of the workload, calling into necessary services or infrastructure components to move things from shard to shard
  • communicate clearly the current state of the process to any interested stakeholders, regardless of whether they are processes or people
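As a rough illustration of those capabilities (the state model and names here are entirely my own invention, not any real platform's API), a single-workload migration can be sketched as a small state machine that freezes writes, runs the move, records every transition for interested stakeholders, and rolls back on failure:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class MigrationState(Enum):
    SCHEDULED = auto()
    FROZEN = auto()       # writes restricted to keep data consistent
    MOVING = auto()
    COMPLETED = auto()
    ROLLED_BACK = auto()


@dataclass
class ShardMigration:
    workload_id: str
    source_shard: str
    target_shard: str
    state: MigrationState = MigrationState.SCHEDULED
    # transition history, visible to any interested stakeholder
    history: list = field(default_factory=list)

    def _transition(self, new_state: MigrationState) -> None:
        self.history.append((self.state, new_state))
        self.state = new_state

    def execute(self, mover) -> bool:
        """Freeze writes, run the (hypothetical) mover, roll back on failure."""
        self._transition(MigrationState.FROZEN)
        self._transition(MigrationState.MOVING)
        try:
            mover(self.workload_id, self.source_shard, self.target_shard)
        except Exception:
            self._transition(MigrationState.ROLLED_BACK)
            return False
        self._transition(MigrationState.COMPLETED)
        return True
```

The `history` list is the cheap stand-in for the last bullet: communicating state is just as much a part of the capability as moving the bytes.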

It's rare that you only need to move a single workload though.

Instead, it is far more likely that you're going to have to move multiple workloads at the same time in order to achieve some larger conceptual goal, like moving a user from one geographical region to another.

A solution to this situation is to draw a boundary around a static set of workloads and identify it as a consistent unit of operation, offering all of the capabilities that I mentioned above at that level, rather than at the level of individual workloads.

As is typical, this works great until someone wants to do something slightly more complicated.

Like arbitrarily bundle units together.

You Never Could See The Bigger Picture

Whenever you choose a grouping construct to abstract the details of an underlying system, there are always going to be people and teams within the business who need a different grouping construct to achieve their own goals.

Just bundling things together isn't really enough though, because as a user I want the same sort of functionality and guarantees on my bundle as I would get if I used the system-defined units of operation.

That means:

  • bundles need to be scheduled, with appropriate expectations set for any system or people that care about the bundle
  • data consistency across all of the units in the bundle needs to be maintained, likely manifesting in some sort of transactional guarantee (i.e. either the whole bundle happens or none of it does)
  • bundles must be executed as the aggregate of their constituent parts, with the units (which are broken down into workloads) doing what they need to do to create a successful shard migration
  • the status of bundles must be communicated, clearly indicating whether they are progressing, when they will be done or if they have failed (either partially or in totality)
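The transactional requirement in particular can be sketched in a few lines. This is a toy, with hypothetical `move` and `rollback` hooks standing in for the real migration machinery: either every workload in every unit moves, or partial progress is undone:

```python
def migrate_bundle(units, move, rollback):
    """All-or-nothing migration of a bundle of units.

    `units` maps unit id -> list of workload ids. `move` and
    `rollback` are callables over a single workload (hypothetical
    hooks; a real system would be calling services, not functions).
    """
    moved = []
    try:
        for unit_id, workloads in units.items():
            for w in workloads:
                move(w)
                moved.append(w)
    except Exception:
        # undo partial progress in reverse order
        for w in reversed(moved):
            rollback(w)
        return False
    return True
```

Even this toy shows the shape of the problem: the bundle's guarantee is only as good as the rollback path of every workload inside it.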

Which is where things get complicated.

Now, Surrender

At first glance, bundling multiple units together doesn't seem like it is fundamentally more complex than bundling workloads together to create those units in the first place.

Decompose the units into their constituent workloads, execute that set of workloads in a transaction and the job is done, right?

Nope.

The moment that your units were birthed into existence they gained an identity of their own. I mean, that was pretty much the entire point, to have a higher-level construct that could be reasoned about more easily.

This means that decomposing them isn't always possible, because there might be things (constraints, hints, assumptions) that exist only at the level of the unit which need to be taken into account for the operation on the unit to be successful.

That's challenge number one, and the potential solution is to ensure that when the units are bundled together, not only are you dealing with the workloads in those units, but you're also continuing to deal with whatever special things existed at the unit level as well.
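One way to picture that solution (names illustrative, not a real schema): when decomposing units for a bundled operation, carry the unit-level constraints along with each workload rather than discarding them:

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    unit_id: str
    workloads: list
    # constraints, hints, assumptions that only exist at the unit level
    constraints: dict = field(default_factory=dict)


def flatten_bundle(units):
    """Decompose units into workloads without losing unit-level context."""
    plan = []
    for u in units:
        for w in u.workloads:
            # each workload step keeps a reference to its parent
            # unit's constraints, so nothing is lost in decomposition
            plan.append({"workload": w, "unit": u.unit_id,
                         "constraints": u.constraints})
    return plan
```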

Entirely reasonable from an engineering standpoint, but wait, there's more!

Challenge number two is that as the size of the bundled operation increases, so too does the complexity of executing it, even when you're decomposing the units back into their constituent workloads.

For example, scheduling becomes more difficult because you have to make sure that everything is aligned. If you don't do that, you can't set expectations about the total duration of the bundled operation, which causes all sorts of headaches when communicating with users, both internal and external.

If the migration capabilities also have concurrency or throughput limitations (i.e. only X migrations for workload type Y can be executing at the same time), the scheduling problem becomes even more difficult, because you now need to solve for limited availability of resources.
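To make that resource contention concrete, here's a toy greedy scheduler (entirely hypothetical, assuming discrete time slots) that assigns each workload the earliest slot that respects a per-type concurrency limit:

```python
from collections import defaultdict


def schedule(workloads, limits):
    """Assign each (workload_id, type) pair to the earliest slot that
    keeps concurrent migrations of that type under its limit."""
    slots = defaultdict(lambda: defaultdict(int))  # slot -> type -> count
    assignment = {}
    for workload_id, wtype in workloads:
        slot = 0
        # walk forward until a slot has capacity for this type
        while slots[slot][wtype] >= limits.get(wtype, 1):
            slot += 1
        slots[slot][wtype] += 1
        assignment[workload_id] = slot
    return assignment
```

Even in this tiny form you can see how throughput limits stretch the total duration of the bundle, which is exactly the expectation-setting problem described above.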

Lastly, the ability to offer transactional guarantees around the bundled operation (i.e. either all workloads move to their destination or none do) further complicates the entire scheduling and execution process, because you can't move individual workloads without making sure that you're moving them all, and you can't complete a workload (and reopen it for mutation) until everything is ready to complete.
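That completion barrier can be sketched with hypothetical `move` and `complete` hooks: no workload reopens for mutation until every workload in the bundle has landed:

```python
def migrate_with_barrier(workloads, move, complete):
    """Two-phase bundled migration: move everything first, then, and
    only then, complete (reopen for writes) everything."""
    for w in workloads:
        move(w)       # phase 1: data moves, writes stay frozen
    for w in workloads:
        complete(w)   # phase 2: barrier passed, reopen everywhere
```

The cost of the barrier is visible here too: every workload stays frozen for as long as the slowest workload in the bundle takes to move.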

And don't even get me started on the increased chances of an individual workload failing, causing a cascading failure and rollback of the entire bundled operation.

A way of dealing with this is to limit the scale of the bundled operations that the system is capable of performing. This reduces the flexibility of the bundling capability, but makes it easier to deal with the inherent engineering problems that bundling can cause.

Feels a bit weird to do that though.

I Guess I'm Disobeying That Order Too

I'm living and breathing this particular problem space every single day, so my apologies if the content above isn't particularly coherent. I suspect that some of the problems I've mentioned are very specific to the service sharding platform within Atlassian and the shard migrations capability in particular.

The intent is to highlight that any transactional operation like a shard migration increases in complexity rapidly as the scope of the transactions that the system must enforce increases. This, in turn, is further compounded if you offer flexible bundling that is controlled by the user, which also must be transactionally enforced.

So, maybe just don't offer bundling at all?

Which is a perfect cliffhanger, because that's exactly what I'll be writing about next week :)