The Cloud: Understanding The Risks, Not Just The Benefits
It’s almost impossible to imagine technology delivery before the era of cloud computing. Yet it was only just over two decades ago that a typical organization’s technology infrastructure consisted of an in-house datacenter with servers storing all its data and software.
Fast forward to 2021, and over 90% of organizations have adopted some level of cloud-based software or service as part of their IT infrastructure, according to a recent story on G2, the world’s largest tech marketplace. Cloud computing has brought many benefits to organizations, and to technology more broadly, including processing power, flexibility, lower costs, and ease of software implementation and updates. However, the widespread integration of cloud services and the relative concentration of market share among cloud service providers bring significant supply chain risks and the potential for disaster. The largest cloud service providers now constitute critical infrastructure alongside communication, internet and energy supply. And outages can stem from many causes, including cyberattacks, system failure, human error, natural hazards, and interruption of other critical infrastructure.
Virtual Supply Chain Dependency
Cloud migration creates significant network and data supply chain dependencies for organizations, and it’s important to consider the risks associated with cloud usage to determine the extent to which they can be managed. Most interruptions are short-lived and have limited impact on users; however, each year sees a fair number of serious outages from a variety of root causes. For example, in 2020 Microsoft Azure experienced a six-hour outage in its East US region data center, IBM Cloud experienced a four-hour multi-zone outage affecting customers in several countries, and Amazon Web Services (AWS) suffered a five-hour outage in its US East-1 region, as reported by Toolbox.com in December 2020. If an interruption is prolonged and causes ripple effects, a cloud outage can also result in significant revenue loss, recovery costs and legal action against both the service provider and its customers.
Single Points of Failure
Due to the nature of their business model, cloud service providers are considered single points of failure (SPOF), meaning the failure of their services has the potential to cause significant interruption to many of their customers at the same time. That’s perhaps best illustrated by the concentration of market share within the IaaS public cloud segment, where the top three providers control approximately 70% of the market based on revenue, according to data released by Gartner and reported on zdnet.com in June 2021.
Of course, it’s worth considering that the largest cloud service providers aren’t just operating one single giant data center. Rather, they’re structured in multiple interconnected and highly redundant regions with as few SPOF as possible. Nevertheless, they still experience serious regional outages when failovers cut out, and smaller cloud service providers may not have similar redundant infrastructure.
Most Common Outage Root Cause: Man-Made Hazards
According to Uptime Institute’s “Annual Outage Analysis 2021” report, the most common causes of third-party service provider outages are software and configuration issues, closely followed by networking and connectivity issues. Fortunately, that also means most interruptions can be resolved relatively quickly, which is why outages rarely last more than minutes or a few hours.
Cyberattacks account for one fifth of all outages. And while the vast majority of these are short-lived Distributed Denial of Service (DDoS) attacks, ransomware has become an increasing threat to all organizations, including cloud service providers. Managed service providers (MSPs) in particular have been a favorite target of ransomware threat actors because of the opportunity to infect many of the MSP’s customers in one single attack.
From an overall impact perspective, the most significant man-made threat to cloud service providers is likely to come from a cyberattack. The 2018 Emerging Risk Report on technology “Cloud Down: Impacts on the US Economy” by Lloyd’s of London estimated that a three- to six-day cyber-related outage at a major US cloud service provider could result in a ground-up loss of between $6.9B and $14.7B.
Increasing Risks from Natural Hazards
Although they occur much less frequently than man-made risks, natural hazards are an ever-present threat to cloud operations, and several of these hazards are increasing in both frequency and severity due to climate change. These include wildfires, flooding in non-flood zones, tornadoes, and hurricanes. Additionally, severe heatwaves can indirectly affect cloud service operations by straining cooling systems and power supplies, which can lead to brownouts.
Another concerning driver of mass disruption risk in the cloud service sector is the geographic concentration of data centers in two US locations. Santa Clara Valley in California and “Data Center Alley” in Northern Virginia between them have the world’s largest cluster of data centers, many of which are cloud focused, according to an August 2020 story in the Data Center Frontier newsletter. Such concentration of facilities, particularly in disaster-prone locations (Santa Clara), increases the risk of widespread outages from weather events, power outages and earthquakes.
Cloud Risk Management
Both cloud service providers and their customers can take important steps to reduce the numerous risks that can lead to outages. The first step is to determine the resilience of the cloud service operation based on factors such as location(s), redundancy, and historic uptime performance. The assessment should include critical infrastructure dependency and the ability of the provider to withstand or recover from cyberattacks and natural disasters. Any cloud service provider that has sizeable customers completes such assessments annually and many providers also have tier certifications. From a customer perspective, multi-cloud strategies are gaining popularity as a way to maintain stronger resilience.
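The appeal of a multi-cloud strategy can be illustrated with a small calculation. The sketch below is not taken from any provider's documentation: the function name `combined_availability` and the example figures are hypothetical, and it assumes that provider failures are statistically independent. That assumption is optimistic, since shared dependencies such as power, networks or geographic concentration can take down several providers at once.

```python
from math import prod


def combined_availability(availabilities):
    """Probability that at least one provider is up.

    Assumes failures are independent (an optimistic simplification).
    Each availability is a fraction between 0 and 1, e.g. 0.999.
    """
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("each availability must be between 0 and 1")
    # All providers are down at once with probability prod(1 - a_i).
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    # Hypothetical figures: two providers, each available 99.9% of the time.
    single = 0.999
    dual = combined_availability([single, single])
    print(f"single provider: {single:.6f}")
    print(f"two providers:   {dual:.6f}")
```

Under the independence assumption, a second provider turns roughly one-in-a-thousand unavailability into roughly one-in-a-million. In practice the gain is smaller, because the workload also has to fail over cleanly and the providers may share upstream dependencies.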
Customers should also determine how critical the specific cloud-based service is to their day-to-day operation and what their tolerance for outages is on that basis. Many cloud service agreements include a service level provision with a performance guarantee (e.g. 99.97%), and shorter outages breaching such guarantees are often remedied through service credits. Prolonged outages and potential financial consequences are instead addressed through contractual (limitation of) liability provisions.
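To make a figure like 99.97% concrete, it helps to convert it into the amount of downtime it permits. The sketch below assumes the guarantee is an availability percentage measured over a fixed period; actual agreements define their own measurement windows and exclusions, and the function name `allowed_downtime_minutes` is hypothetical.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted in the period before the guarantee is breached.

    Assumes the guarantee is a simple availability percentage over the whole period.
    """
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    # Assumed measurement windows: a 30-day month and a 365-day year.
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        minutes = allowed_downtime_minutes(99.97, days)
        print(f"{label}: {minutes:.2f} minutes of permitted downtime")
```

On these assumptions, 99.97% allows about 13 minutes of downtime in a 30-day month, or about 158 minutes over a year. Each of the multi-hour outages mentioned earlier would exceed the monthly allowance several times over.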
Finally, it’s important to determine who is responsible for what when it comes to cloud security. That ultimately depends on the individual cloud delivery model and, most importantly, on what’s in the contract. Regardless, the main rule is that the provider is responsible for security of the cloud and the customer is responsible for security in the cloud.