In The End

The ability to move workloads around within a set of shards is critical to maintaining the operational health of any software system that makes use of service sharding patterns.
Just moving a single workload from one shard to another can be a complicated task, depending on the underlying technologies involved and whether or not you need to mitigate interruption to customer service.
Moving a bunch of workloads together adds another layer of complexity to that problem space, especially if the business thinks it needs to arbitrarily compose these workload sets and requires transactional guarantees around them.
That doesn't mean you have to give the business what it wants though...
One Thing, I Don't Know Why
In case you haven't already twigged onto it, this blog post is a continuation of the one from last week, but I don't really expect you to go and read that before continuing on here.
I mean, I'd prefer if you did, but I know that we're all short on time and attention, so I'm just going to give a nice TL;DR here and move on with my life
- a generic name for either persistent data storage or resource consumption that is bound to a single entity, like a user, which needs to be moved from shard to shard, is a workload
- the act of moving a workload is called a shard migration
- to be able to effectively perform a shard migration, the system needs to be able to
- schedule the migration for execution
- maintain data consistency during the migration
- execute the migration reliably
- communicate the status of the migration
- system defined grouping of workloads together into units allows for abstractions that help with reasoning about shard migrations and their relationship with actual customer experiences
- units also allow for transactional guarantees, because the system can be designed to take any necessary constraints into account, because it understands what is in the unit
- allowing for the arbitrary composition of either workloads or units into a bundle creates all sorts of problems, especially if transactional guarantees need to be maintained over the entire bundle and if the bundles are unbounded
Whew, that was a long TL;DR.
I mean, it was long enough that it almost needs a TL;DR of its own, which I suppose would be:
allowing arbitrary bundling of workloads while also requiring transactional guarantees over those bundles creates extremely difficult to solve engineering problems
The business problem is real by the way, so you can't just ignore it.
The need to compose a set of workloads into some greater abstraction is important to supporting various customer facing features (i.e. data residency) or to ensuring that the quality of the customer experience is maintained.
But bundling isn't the only way to achieve this.
It Doesn't Even Matter How Hard You Try
Instead of allowing for the composition of arbitrary bundles of the system defined units, you could instead allow for the definition of an intent, a representation of the end goal that needs to be achieved.
That's a bit vague, so to bring it home to a concrete example, one such intent could be something like:
all data belonging to customer X should be located in a datacentre that is geographically aligned with Germany
The first time a customer supplies this intent, there is likely some delta between the current state and the desired state. Some set of workloads that are not located in Germany and thus need to be migrated.
Rather than trying to execute this delta all at once (i.e. bundling the necessary migrations together) we can instead work towards it over time, moving workloads one by one, until the misalignment has been resolved.
In other words, we make the system eventually consistent.
From an engineering point of view, this means that something needs to exist which is able to analyse intents, determine the delta between the current state and the desired state and then create some set of actions to move closer to the desired state.
After some period of time has passed, it would do it again, until the delta between the current and desired state is zero and the intent has been achieved.
I don't know if there is an official name for something like this, but internally at Atlassian we call that thing a reconciler.
You know, because it reconciles.
By splitting the acknowledgement of the intent from the operation intended to enforce that intent you get a lot of flexibility and a distinct lack of having to enforce transactional guarantees around an unbounded set of shard migrations.
It does come with its own challenges though.
To Keep That In Mind, I Designed This Rhyme
The first challenge with an eventually consistent model for shard migrations is that it's really hard to set expectations around operation fulfilment.
This is most important when the end-user (i.e. the actual customer) is the one who is initiating the operation, like in the example around data residency above, but it's also important for internal use cases as well (like when a shard migration forms a core part of some incident remediation process).
The people who are using the shard migration system want to know when the thing they are asking for will be done, because either they care about the timeliness of the outcome (i.e. compliance) or, more likely, they care about when their service will be interrupted and need to plan for that (i.e. downtime).
Solving this problem in a bundled migrations world is already pretty challenging.
You know you need to migrate some set of units, you probably know how long those units will take to migrate and you hopefully know how much parallelism the system supports, but combining all of those things together along with transactional guarantees can be algorithmically complex.
It's at least conceptually straightforward though.
Not so with an eventually consistent model.
You still know the same things of course, but without the transactional guarantee binding everything together, it becomes more difficult to reason about expectation setting, especially in regard to when service will be interrupted to the end-user.
The first way to deal with this is to track how long it takes to fulfil similarly scoped migrations and then use that information during the initial expectation setting to give an approximate sense of when the operation will be fulfilled.
This is obviously not a guarantee and needs to be communicated as such, but it's at least based on past data, which is often the best indication of future performance.
The second way is to focus on minimising any interruption to service caused by the shard migration of a unit, probably by implementing some sort of replication based migration process (i.e. instead of a flat data copy).
This removes quite a lot of the need to set expectations in the first place, because once service interruption is out of the picture, everything gets a lot more flexible.
Of course, setting and meeting expectations isn't the only challenge.
To Explain In Due Time
The other challenge with an eventually consistent shard migrations model is that the system needs to be engineered for robustness in the face of fragmentation.
Bundled migrations with transactional guarantees are pretty binary; they are either in one state or the other, never halfway.
Not so with eventual consistency. It takes time to do its thing and while that time is ticking, the system is in a transitional state that needs to be handled properly.
In concrete terms, that means that depending on the level of transactional guarantee you do offer, things are going to be fragmented along some lines. Assuming that you're operating in terms of units as described above, this means that some units might be in one place (like a geographical region) while others are in a different place (i.e. a different geographical region).
So, whatever customer experience is dependent on those units and the workloads within them needs to be able to handle things like increased latency due to cross-region traffic or simple inability to access expected resources using local channels.
Complicating things is the fact that eventual consistency migrations generally spread out the interruption to service over a longer period, which means that the customer experiences not only have to handle fragmentation, but they also have to be able to handle the underlying workloads not being available at all because a migration is happening.
If your system consistently adheres to good engineering practices and never makes any assumptions about either the responsiveness or availability of your dependencies, then everything is likely to be fine.
It's pretty unlikely that that is the case though.
Most systems, especially large ones, are cobbled together over time and are barely holding it together on a good day, let alone on the bad ones that can happen during the sometimes extended eventually consistent transition from one state to another.
I have no particular solution to this challenge, other than to be aware of it and to constantly reiterate it to everyone involved so that they understand they need to engineer things to a higher standard to avoid causing customer pain.
That Time Is A Valuable Thing
I've been thinking about bundled migrations vs eventually consistent migrations in some form or another for at least a year now, always mulling it over in the back of my mind and trying to figure out if one approach is better than the other.
I still don't know.
Each one has positives and negatives, and I think the only reason I'm leaning towards eventual consistency is because it expects less from the platform and thus seems like a more solvable problem, because that's where I live.
Ultimately that's probably not really the best way to make a decision, but sometimes you have to make a decision in order to avoid getting stuck and just not doing anything.
Even if it's not perfect.
Member discussion