The Tsunami Isn't The Problem. It's The Tidal Wave Of Unreliability That Follows.
We spend a ton of time hunting tiny errors that crash systems – a misconfigured setting or a rogue code push can create problems hidden for days. We tackle these operational errors with a retrospective or a root-cause analysis which, over time, helps us build layers of protection. We also at least tacitly acknowledge that those layers of protection can only address what we control.
Unfortunately, threats to your cloud reliability aren't always buried in the code, generated by human error, or solved by a root-cause analysis. Macro shifts in cloud economics are emerging, plain to see yet unavoidable, that could impact your uptime far more than any accidental code push will. Next Signal is tracking these demand-driven forces because history shows they'll have a massive influence on your cloud's resilience. Are you ready for the bigger picture?
Data Tsunami
Do you feel like the world around you is moving at an unprecedented rate? It is, and here’s some context. A popularly cited estimate puts daily data creation at 402.74 million terabytes (402 exabytes). This includes data that is newly generated, captured, copied, or consumed, and the number is expected to climb considerably over the coming years. For some perspective, a typical laptop might have 500 GB of storage available. Using that figure, the world creates about 805 million laptops' worth of data every day. A number of factors contribute to this: the ease and growing commonality of creating video content, the proliferation of social media, and our increasing drive to document every aspect of our personal and professional lives are among the biggest.
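As a quick sanity check, here's a minimal back-of-envelope sketch in Python using only the two figures above (the 402.74 million TB/day estimate and a 500 GB laptop):

```python
# Back-of-envelope check of the "laptops' worth of data" figure above.
# Inputs are the estimates cited in the text: 402.74 million TB created per day
# and 500 GB of storage on a typical laptop.

DAILY_DATA_TB = 402.74e6        # terabytes created, captured, copied, or consumed per day
LAPTOP_STORAGE_GB = 500         # storage on a typical laptop

daily_data_gb = DAILY_DATA_TB * 1_000           # 1 TB = 1,000 GB
laptops_per_day = daily_data_gb / LAPTOP_STORAGE_GB

print(f"~{laptops_per_day / 1e6:.0f} million laptops' worth of data per day")
# -> ~805 million
```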
What’s also changed is our willingness and ability to store massive amounts of data. If it weren’t for the mass migration of data from on-prem to the cloud, its consolidation of computing resources, and economies of scale, we would have hit impediments to this growth long ago. The cloud has been a cheap and relatively reliable home for this data explosion, but that data swell is building faster than we can pile bags of sand.
AI Tsunami
We save data because storage is relatively cheap, because it might be valuable down the road, and increasingly because we can monetize it. We’ve gotten a massive incentive injection with the proliferation of AI tools that can extract value from data. AI, which can analyze and summarize terabytes of data faster than human thought, is now also creating huge amounts of data for us. According to a Forbes report in March, AI generates 34 million images each day... that’s a lot of creative robots. While overall cloud spend is projected to grow by over 20% annually, AI spend is expected to grow at 1.5 to 2 times that rate in the coming years. This aligns with Satya Nadella's prediction that the underlying AI models will become a commodity. He believes the real competitive advantage will be found in how businesses use their own data and workflows to steer and fine-tune these models. The more data, the better.
Data Center Tsunami
We are generating, transmitting, and consuming more data than ever, and the speed at which we are doing so continues to accelerate. What are the ramifications? Look at the infrastructure growth plans of the largest technology companies in the world. All of them are investing billions of dollars into facilities that will accommodate the resource needs of exploding AI and data consumption. In evaluating some of the most ambitious data center projects under way, Christopher Helman of Forbes provides some astounding information about their scope. On Meta’s 2,250-acre “Sucre” build in Northeast Louisiana, Helman says, “To power racks lined with thousands of Nvidia H100 GPUs, Sucre will require 2.23 gigawatts of 24/7 electricity (enough to power more than 2 million homes), which utility company Entergy will provide via twin high-efficiency natural gas turbines at a capital cost of $3.2 billion.” For some perspective, the city of Los Angeles contains about 1.46 million households. We are talking about a single facility that will draw roughly 37% more power than all households in Los Angeles combined.
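A quick sketch of that comparison, using only the numbers quoted above (the "more than 2 million homes" figure and roughly 1.46 million Los Angeles households):

```python
# Quick check of the Los Angeles comparison above, using only the figures quoted
# in this section: 2.23 GW serving "more than 2 million homes" versus roughly
# 1.46 million households in the city of Los Angeles.

HOMES_POWERED_BY_SUCRE = 2_000_000   # conservative reading of "more than 2 million homes"
LA_HOUSEHOLDS = 1_460_000            # approximate households in the city of Los Angeles

ratio = HOMES_POWERED_BY_SUCRE / LA_HOUSEHOLDS
print(f"One facility ≈ {(ratio - 1) * 100:.0f}% more homes than Los Angeles has households")
# -> ~37% more
```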
Helman projects that America’s tsunami of advanced data centers will demand an additional 81 gigawatts of electricity by 2030, roughly the current power consumption of Texas. If this trend in data and AI power consumption continues, the utility companies we know today may be dwarfed by the tech-utility partnerships of tomorrow. Spurred by Microsoft’s commitment, Constellation Energy is investing $1.6B to restart Three Mile Island in Pennsylvania, and Microsoft plans to consume all 835 megawatts of its output. Amazon is putting data centers right next to the Susquehanna nuclear plant in eastern Pennsylvania and is working on a deal to draw up to 960 megawatts, roughly 40% of the plant’s output. If that weren’t enough, Stargate, the new venture between OpenAI, Oracle, and SoftBank, has committed to invest $500 billion in 10 gigawatts of AI infrastructure in the U.S. over the next four years.
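For a rough sense of scale, here's an illustrative tally of the commitments named in this section against Helman's 81-gigawatt projection. The figures measure somewhat different things (contracted nuclear output versus planned AI infrastructure), so treat it as an order-of-magnitude comparison, not a forecast:

```python
# Rough, purely illustrative tally of the commitments mentioned above against
# Helman's projected 81 GW of additional data center demand by 2030.

PROJECTED_ADDITIONAL_DEMAND_GW = 81

commitments_gw = {
    "Three Mile Island restart (Microsoft)": 0.835,   # 835 MW
    "Susquehanna deal (Amazon)": 0.960,               # up to 960 MW
    "Stargate (OpenAI/Oracle/SoftBank)": 10.0,        # 10 GW of AI infrastructure
}

total_gw = sum(commitments_gw.values())
print(f"Deals above total ~{total_gw:.1f} GW, "
      f"about {total_gw / PROJECTED_ADDITIONAL_DEMAND_GW:.0%} of the projected 81 GW")
# -> roughly 11.8 GW, about 15% of the projected additional demand
```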
Reliability in the Macro
The aggressive data center expansion isn’t likely to get us ahead of demand; these are plays to meet the currently projected power and infrastructure needs generated by the proliferation of data and AI. Today, US data centers account for roughly 4% of total US power consumption, but estimates suggest this share will rise drastically in the coming years, more than doubling to roughly 9% to 12% by 2028. Enormous projects that would normally play out over a decade are going to be compressed into smaller time frames to meet demand. When that happens, it’s likely that things will go wrong. We are looking at three macro factors that we expect to test reliability, two of which I’ve touched on above.
- There isn’t enough datacenter capacity to meet projected demand. Believe it or not, “the cloud” is actually a physical place, or more appropriately, a set of physical places where servers, switches, and fiber optic cable make the Internet function properly. We need more space for more computers and we can’t build fast enough.
- There isn’t enough power to meet projected demand. Per Chatsworth Products, on power distribution for artificial intelligence, a typical data center CPU draws about 300 watts while a GPU can draw around 700 watts, and AI queries require roughly 10x the electricity of traditional Google queries. Data centers have historically been huge power consumers; as they fill with GPUs, the power consumption per square foot goes up and to the right (see the rack-level sketch after this list). Time to “get creative”.
- We’ve been struggling with a chip shortage that pre-dates COVID. While demand for lower-end processors has abated in the last couple of years, demand for GPUs has skyrocketed. There isn’t enough specialized talent to make the chips, let alone the factory capacity to produce them. Additionally, most of the manufacturing is overseas and exposed to tariff volatility. Ever tried to get your hands on an Nvidia GPU? It’s like waiting for a Ferrari to be delivered. Assuming you’re even graced with the opportunity, you'll be waiting years to have it in hand.
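Here's the rack-level sketch referenced in the power bullet above. It uses the ~300 W CPU and ~700 W GPU figures cited there; the 42-chip rack is a hypothetical layout chosen purely to illustrate how the same footprint draws far more power once it's full of GPUs:

```python
# Rack-level illustration of the power point above, using the per-chip figures
# cited from Chatsworth Products (~300 W per data center CPU vs ~700 W per GPU).
# The 42-chip rack is a made-up layout for illustration, not a real configuration.

CPU_WATTS = 300
GPU_WATTS = 700
CHIPS_PER_RACK = 42              # hypothetical count, same footprint either way

cpu_rack_kw = CPU_WATTS * CHIPS_PER_RACK / 1_000
gpu_rack_kw = GPU_WATTS * CHIPS_PER_RACK / 1_000

print(f"CPU rack ~{cpu_rack_kw:.1f} kW vs GPU rack ~{gpu_rack_kw:.1f} kW "
      f"({gpu_rack_kw / cpu_rack_kw:.1f}x the draw per square foot)")
# -> 12.6 kW vs 29.4 kW, about 2.3x
```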
Kinda ironically, I’m giving the cloud providers a little grace here. These problems are all solvable over time, and I do believe the set of problems we are staring down will be solved, but getting there over the next several years is going to be rocky, and capacity shortages will put pressure on availability. What we don’t expect to change are uptime commitments.
Your cloud provider already isn’t meeting their SLA, and their ability to do so will be stretched by these macro forces. Next Signal keeps track of the SLAs in place for hundreds of services. At the end of each month, Next Signal helps customers understand whether their SLAs have been met and, in cases when they haven’t, helps customers make credit claims. If your organization wants more governance around SLA events and wants to be empowered to make claims, reach out to Next Signal and schedule a conversation.