How to handle SaaS product outages like a boss

Like many millions around the world yesterday, I awoke to Google experiencing a rare outage. First, my ‘Hey Google’ requests went unanswered. Then I tried checking my email. I tried submitting an online form to tell my child’s school that they would be absent that day. I needed to check how this would affect my day, so I tried to open my calendar… ugh, fail.

Google’s outage lasted about an hour, with many of its most popular apps — Gmail, Docs, Classroom, and others — crashing. “The root cause was an issue in our automated quota management system, which reduced capacity for Google’s central identity management system, causing it to return errors globally,” the company reported. “As a result, we couldn’t verify that user requests were authenticated and served errors to our users.” 

As Product Managers know, the promise of 100% uptime is a non-starter. But for a demanding customer, that may come as a surprise. So it’s essential to understand how downtime will impact customers and how your relationship will be affected in the long term.

In most organizations, it falls upon the IT or DevOps team to maintain your app’s uptime, along with more granular website monitoring (think errors in a shopping cart experience). However, the Product Manager is responsible for setting expectations, and by extension pricing, with customers. A system’s reliability has increasingly become a deciding factor for SaaS buyers, so a compelling guarantee can be a strong selling point. That holds across the range of customers, from those who have minimal daily interaction with your product to those who are wholly dependent on your offering’s nuts and bolts. Like a power cable struck by a falling tree, a downed API can be devastating for the end user.

The cost of uptime

The gold standard for uptime, frequently called the ‘5 Nines’, refers to the hallowed 99.999% uptime guarantee. But in reality, your web app doesn’t always have to meet this standard. Instead, uptime figures are typically instruments in a Service Level Agreement, virtually an insurance policy, that promises certain benefits (refunds, etc.) should these standards be unmet. That said, if your app repeatedly misses the promised Availability Level, somebody will notice.
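To make the insurance-policy idea concrete, here is a minimal sketch of how a month's downtime might translate into a refund. The credit tiers, the 30-day month, and the function names are all assumptions made up for illustration; a real SLA will define its own measurement window and credit schedule.

```python
# Hypothetical credit schedule: (minimum availability %, credit as % of monthly fee).
# These tiers are illustrative only and are not taken from any real SLA.
CREDIT_TIERS = [
    (99.9, 0),   # target met: no credit
    (99.0, 10),  # below 99.9% but at least 99.0%
    (95.0, 25),  # below 99.0% but at least 95.0%
    (0.0, 50),   # below 95.0%
]


def measured_availability(downtime_minutes, days_in_month=30):
    """Return the month's availability as a percentage."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


def service_credit(downtime_minutes, monthly_fee, days_in_month=30):
    """Return the credit owed under the hypothetical tiers above."""
    availability = measured_availability(downtime_minutes, days_in_month)
    for floor, credit_percent in CREDIT_TIERS:
        if availability >= floor:
            return monthly_fee * credit_percent / 100
    return 0.0
```

Under these made-up tiers, a single hour-long outage in a 30-day month works out to roughly 99.86% availability, which misses a 99.9% target and would trigger the 10% credit.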

Image credit: https://blog.paessler.com/why-uptime-does-not-mean-availability

‘5 Nines’ is undoubtedly a worthy goal, but it may not be necessary. For instance, a free or low-priced product with low downtime impact (think an RSS reader) may promise ‘3 Nines’ and be priced accordingly. A SaaS product powering health care records in a hospital, by contrast, may have to promise ‘5 Nines’ but will be in a position to charge a handsome premium. Many products now have a tiered pricing structure, where the promised Availability Level decreases the less you are willing to pay.
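It helps to see what each level of ‘Nines’ actually allows. The short sketch below converts an availability percentage into a yearly downtime budget; the function name is hypothetical, and it assumes a 365-day year with downtime measured in whole-service minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Return the minutes of downtime per year that an availability level permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("3 Nines", 99.9), ("4 Nines", 99.99), ("5 Nines", 99.999)]:
    minutes = allowed_downtime_minutes(pct)
    print(f"{label} ({pct}%): about {minutes:.1f} minutes of downtime per year")
```

‘3 Nines’ leaves room for close to nine hours of downtime a year, while ‘5 Nines’ leaves only a little over five minutes, which is why the price gap between tiers can be so large.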

For products built on cloud services such as AWS or Microsoft Azure, it’s worth noting that the uptimes you promise in your pricing and Service Level Agreements (SLAs) depend entirely on the uptime of those underlying services.
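A rough way to reason about this dependency is to multiply the availabilities of everything that must be up at the same time. The sketch below assumes the components fail independently and that your product needs all of them; the percentages are hypothetical and are not any provider's published figures.

```python
from math import prod


def composite_availability(*availabilities_percent):
    """Multiply the availabilities of components that must all be up at once."""
    return 100 * prod(a / 100 for a in availabilities_percent)


# Hypothetical figures: your own application layer at 99.99%, plus a cloud
# compute service and a cloud database service at 99.95% each.
best_case = composite_availability(99.99, 99.95, 99.95)
print(f"Best-case end-to-end availability: {best_case:.3f}%")
```

With these assumed numbers the combined figure lands just under 99.9%, lower than any single component, so promising customers more than your providers promise you means absorbing that risk yourself.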

Minimizing impact with your customers

Should the unspeakable happen (most likely as you tuck yourself into bed for a long night’s rest), the deluge of frantic calls will arrive. How you respond will significantly impact how your customers identify you in the future — as a forward-thinking organization that calmly and transparently communicates how an issue is progressing or a confused and frantic organization that hides behind unanswered calls and emails. 

You might think this is a DevOps, Customer Success, or Sales problem, but you’re wrong. If you are a true believer in modern Product Management like me, you’ll agree that we are the ones who know the customer best. As such, we are responsible for setting expectations (later communicated by marketing, sales, execs, etc.). If these expectations are shattered, or even if there is downtime while you are still meeting your Availability Level, you need to react appropriately. You can do that by openly liaising with your IT crew, assessing any damage done and communicating it to relevant stakeholders, speaking directly to customers when necessary, and ultimately settling any outstanding issues in the longer term.

The cost of downtime

Your product will likely have many interdependencies, particularly if it depends on feeding data to and from an API. If your product or your partner’s product goes down, the data stops flowing. Although a good engineering team will build a system prepared to deal with these eventualities, it’s incumbent on the Product Manager to discover the source of the problem and to work with partners on an action plan for when it happens again.

Assessing the cost of this kind of interruption will be complicated. You will need to consider many things, including engineering resources, the impact it will have on your product’s pricing and reputation, structural updates that need to be made, and, of course, your time. 

Beyond APIs, it will be nearly impossible to get an accurate measure of the damage your customers experience if your product simply fails to work. But you can still make the effort to speak with them and understand the key areas that were affected. This knowledge will empower you to improve your product and possibly develop other differentiators from your competitors’ offerings.

The takeaway

As the customer whisperer, the Product Manager is the one who sets expectations and, when they are not met, handles the fallout. The important thing to note is that this doesn’t have to be a frantic experience for either party. A watertight, transparent, and easy-to-digest SLA will prepare the customer for the inevitable, and a measured response will ease their predictably over-the-top reactions.


Paul McAvinchey
For over 15 years, Paul has been building and collaborating on digital products with fast-growing startups and global brands, including AOL and WMS Gaming. Currently, he's a co-founder of Product Collective, a worldwide community of product people. Members collaborate on Slack, meet at INDUSTRY: The Product Conference, listen to Rocketship.fm, learn at Product Interviews and get a weekly brief that includes best practices in product management. In recent years he led business development at DXY, a leading product design firm in the Midwest, and product innovation at MedCity Media, a publishing startup acquired by Breaking Media in 2015.