Don't Fuck The Customer.
It's one of the Atlassian values, and while some people stumble a bit over the second word there, personally, I like it just the way it is.
It's probably because I'm Australian.
It speaks to an underlying desire to really take the customer's point of view into account in everything and anything that you do, regardless of where you sit in the organisation.
I've had a few customer focused experiences recently, so I think there is value in reflecting on them here.
Maybe you will too.
Why Are You Nodding?
The first experience was at the micro level.
A large global customer wanted to relocate their data to Europe for compliance purposes. This capability, to choose where the data for your cloud software resides, is known as Data Residency.
My team is responsible for Shard Migrations (aka moving customer data from Point A to Point B inside our massive pile of AWS infrastructure), so we're a key player in enabling Data Residency (DaRe), though we're not the only key player.
Anyway, this customer had a huge amount of data stored in our products, mostly in the form of attachments (i.e. documents, images, videos, etc), so it wasn't an easy operation to move it all to Europe.
In fact, it was difficult enough that they had already failed twice before I got involved :(
Now, failures to move data for incredibly large customers are not common, but they aren't incredibly rare either. When you push up against the constraints of a system, things tend to break down and you run into all sorts of edge cases that you didn't expect.
In this case, the core issue was straightforward, though it wasn't clear during those first two failures. It was a timeout inside the transfer process of a secondary system.
What the issue was isn't all that important though.
What was important was that the customer was having a terrible experience.
Every time they tried the migration, they had to shut their global business down for a few days, so every time it failed, they incurred a huge cost and interruption for no benefit.
Understandably they were somewhat disappointed, so after their case was brought to the attention of the appropriate group within Atlassian, we got to work. My team was a member of that group, but it spanned quite a few teams across Atlassian.
Together we identified and resolved the root timeout issue, improved the performance of the data migration itself to reduce their downtime, wrote a detailed document explaining and owning our failures, had an open and honest conversation with the customer and finally, organised another migration.
The third time was the charm, and everything worked as expected. The customer's data now lives in Europe and they are now able to start progressing on their own organisational goals.
I'll be honest, there were a few tense moments during that last migration though.
Because You're On To Something
The second experience was at the macro level.
In addition to Shard Migrations, the other thing my team is responsible for is Shard Capacity Management (aka making sure there is enough infrastructure to support new and existing Jira & Confluence Cloud customers).
Because a good chunk of that infrastructure is PostgreSQL (PG) databases, we need to ensure that we're running the best version of PG to support our needs, taking into account which versions AWS supports.
Speaking of which, AWS regularly deprecates older versions of PG, automatically upgrading them to a supported version once the deprecation deadline is reached.
You'd think that this would make our job much easier, but it doesn't, because an AWS mandated automatic upgrade isn't something that we have a lot of say over. It just happens.
Which means we lose control over the customer experience.
And that's unacceptable.
This was exactly the situation that we found ourselves in recently, with the additional challenge of not having picked up on the deprecation notice until it was only a month or so from happening.
The process itself is pretty straightforward, though we do have to apply it at scale across thousands of AWS resources:
- Test the new version for regressions or problems
- Roll the new version out to staging
- Do a partial production rollout (aka canary testing)
- Do the rest of the production rollout, one AWS region at a time
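The staged rollout above can be sketched as a simple orchestrator: each phase must succeed before the next, larger phase begins, and any failure halts everything. To be clear, this is a minimal illustration under my own assumptions, not Atlassian's actual tooling; the `upgrade` callable, the region names and the canary fraction are all made up for the example.

```python
def staged_rollout(regions, upgrade, canary_fraction=0.05):
    """Upgrade databases phase by phase, halting on the first failure.

    `regions` maps a region name to a list of database identifiers.
    `upgrade` is a callable(db_id) -> bool, True on success.
    Returns the list of database ids upgraded so far.
    """
    done = []

    def run(dbs):
        for db in dbs:
            if not upgrade(db):
                # Stop the whole rollout so a bad version can't spread.
                raise RuntimeError(f"upgrade failed for {db}; halting rollout")
            done.append(db)

    # Canary phase: a small slice of the first region's fleet.
    first_region = next(iter(regions))
    fleet = regions[first_region]
    canary_count = max(1, int(len(fleet) * canary_fraction))
    run(fleet[:canary_count])

    # Then the rest of that region, then the remaining regions one at a time.
    run(fleet[canary_count:])
    for region, dbs in regions.items():
        if region != first_region:
            run(dbs)

    return done
```

The point of the shape is that the blast radius grows slowly: a regression caught at the canary stage affects a handful of databases, not thousands.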
A PG version upgrade causes a short interruption to service for all of the customers who share the resource in question, somewhere between one and three minutes while the AWS services work their magic.
Historically we have not notified customers when we do one of these upgrades, the reasoning being that the interruption is so short, it's unlikely to make a difference to our users.
However, we recently re-evaluated our internal policies around downtime, so this time we held ourselves to a higher standard.
That meant that we had to notify millions of customers.
Which complicated things.
Long story short, we notified all of our customers about the upcoming downtime, and then did the upgrade itself and all was well.
Sounds simple, right?
It's A Play On Words
Now, the two experiences above don't exactly contain any earth-shattering revelations, but it's still useful to reflect on them all the same.
In the first experience, we could have done better. The customer shouldn't have had to fail twice before we realised we needed to pay more attention. In a large company like Atlassian, it's often very difficult to get hold of the right group of people who can make a difference though, and this was one of those cases.
It does make me wonder how many other customers are having a similar experience though, and just aren't lucky or noisy enough to have their problem escalated to the right people.
That thought makes me sad, but I also don't know what to do about it. When operating at scale, you sometimes miss the details.
In the second experience, there was a decent amount of internal consternation around whether or not we should do notifications at all. Historical precedent is a powerful force, and it was hard to reason about whether the notifications would do more harm than good.
Speaking of which, they definitely did generate a bit of noise, because the notifications themselves were not as clear as they should have been. We could have been more specific about which site was affected and more precise about the time, though we were working with tight timeframes and system limitations that made things harder than we wanted them to be.
We'll do better next time.
One thing we won't be able to change is that the upgrade notification is just an FYI. We have to do the thing and we can't consider individual preferences because of the shared nature of the underlying resources.
Sometimes that can be a bitter pill to swallow.
No Games. Just Consideration
I think there is probably only one thing that customers want.
To be listened to, to have their needs acknowledged and their problems validated, perhaps even solved. To have the people they rely on thinking about them pro-actively, without having to scream and shout just to be heard.
Customers are people first and foremost, and that is enough for them to be important.
Sometimes it's easy to lose sight of that, especially when they are often reduced to an abstract and disconnected concept, like numbers in a chart.
We would do well to remember that more often.