Or why a reliable Tier I data centre may be just fine for AI

Data centres for AI workloads are all the rage nowadays. No matter where you look, you will find someone deploying or planning to deploy a data centre to host GPUs. They will use them to run Artificial Intelligence workloads, for model training or inference. And one of the questions they need to answer is how to design these data centres and, as part of that, how much redundancy to include.

Data centre design practices evolved from a “wizard’s apprentice” task to a well-established discipline over the last 20-25 years. Many of the concepts that established the practice originate from the Uptime Institute’s Tier Standard, a simple yet powerful definition of four Tier Classifications that describe how data centres can achieve high availability. Other organizations and institutions have proposed alternate, yet usually similar, standards such as ANSI/BICSI 002, ANSI/TIA-942 and EN 50600. Tier III is probably the most popular data centre class, and it’s finding its way into data centres developed for AI workloads. But is Tier III a good fit for AI workloads?

Tier III in a nutshell

The Tier Standard is so established in the data centre community that it’s frequently used (and often abused) when characterizing data centre performance. Data centre practitioners will agree that the minimum “acceptable” data centre is, usually, a Tier III compliant data centre. There are very good reasons for this:

  • It includes redundant components so, when something fails, something else can take over.
  • Operators can isolate and maintain any component, which allows timely maintenance, without downtime.
  • The redundant components make it possible to survive certain failures – although for full fault tolerance you really need the next class, Tier IV.

The attributes above make perfect sense when your IT infrastructure can never go down. It is what banks, serving millions of customers all over the world every minute, need. It is what a simple online shop, selling goods to thousands of customers per hour, must have. To understand why, we need to look at how transactional software applications work.

Transactional applications and Tier III data centres

A typical software application is a program that runs on a computer, just like the programs running on the computer you use to read this post (or the apps on your phone, which is also a computer nowadays). The difference is that hundreds, maybe even thousands of users are using that same computer (which we call a server) simultaneously. If this computer suddenly stops, all these users will lose what they were doing. They may get angry and never come back. Nobody wants that reputational and financial loss.

Not only that, but the program running on that computer, used by multiple users, cannot easily move to another computer, because of how it is built and installed, and because it depends on other programs. Therefore, it’s better to keep that program running. Always. Even when it has nothing to do. Because someone may decide to use it, and if it’s not there they’ll be disappointed and angry. It must be “available” all the time; in other words, “highly available”.

To keep it running, we need to feed it with power and to cool its environment. Always. Non-stop. And to accomplish this, we need a data centre that is always on. Enter the Tier III empire. A faulty cooler? No problem. You can switch it off, repair it, bring it back on. A leaky valve? No problem, you can isolate it, repair it or change it, then restore the service. A faulty breaker? No problem. You can even survive certain failures, but you are better off with a Tier IV for that.

Another aspect of equipment running transactional applications is that it usually consumes almost the same power whether it is doing something useful or not. This makes it easier to provide spare capacity: you use most of your capacity constantly over time, so you can add something extra for redundancy.

The “cost” of the high Tiers

They say there is no such thing as a free lunch, and the high Tiers are no different.

To have a Tier III (or even a Tier II) you must have redundant components. Redundant means they must be there, but you don’t need them to get the job done: the system works fine without them. You only need them when one of their counterparts must be switched off. So, essentially, 99% of the time you can live without these components. And if you have them already, you could be using them, right? Wrong: when running a Tier III (or Tier II), you are not supposed to. If you do, you lose the redundancy that makes it a Tier III. In the case of AI data centres, this can amount to megawatts of unused capacity.

Not only this, but your design must also enable you to isolate every single component without downtime. Even those parts unlikely to fail. To achieve that, you must add components, like valves, which fail more often than pipes. You increase cost, and with it complexity. With complexity you also increase the frequency of failures. That’s not a problem for a Tier III, because you can fix them without downtime. But you need more labour and more eyes to check for failures, and with complexity the chances for mistakes also increase. And human mistakes are the most frequent cause of downtime, as multiple studies show.

And then there is space. With densities ever increasing, the area required for power and cooling equipment is a lot more than the space GPU racks are consuming, even without any redundant components. Add redundancy and the balance shifts even more.

The nature of AI (and High Performance Computing) workloads

AI workloads are very different from transactional ones.

Unlike transactional workloads, an AI workload means that a single user, unknowingly, starts many programs (processes) on multiple computers, sometimes for just a few seconds. These processes are ephemeral: they are destroyed when they finish, and new ones start when needed. They don’t care on which physical system they run; another program (the job scheduler) starts them where needed. When they finish, they write or send their results and disappear. If a job fails, the job scheduler notices and sends the same job somewhere else.
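That resubmission behaviour is what makes individual node failures survivable. A minimal sketch of the idea, with hypothetical node names and a made-up failure, might look like this:

```python
# Toy sketch of a job scheduler's retry logic: try nodes until one
# completes the job. Node names and the failed node are assumptions
# for illustration, not any real scheduler's API.
DOWN_NODES = {"gpu-02"}  # pretend this node has failed

def run_job(job, node):
    """Pretend to run a job on a node; raise if the node is down."""
    if node in DOWN_NODES:
        raise RuntimeError(f"node {node} unavailable")
    return f"{job} done on {node}"

def schedule(job, nodes):
    """Submit the job, resubmitting elsewhere when a node fails."""
    for node in nodes:
        try:
            return run_job(job, node)
        except RuntimeError:
            continue  # job lost on this node; send it somewhere else
    raise RuntimeError("no nodes available")

print(schedule("train-step-42", ["gpu-01", "gpu-02", "gpu-03"]))
```

The user never sees the failed node; at worst, the result arrives a little later.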

The power consumption of machines running AI workloads is anything but stable. It can triple or quadruple between idle and utilized states. This means that you need extra power and cooling capacity just to cope with that power increase, when it happens. Which means that your power and cooling equipment is often underutilized even if you didn’t allow for redundancy. Add redundancy to the mix, and your utilization drops even more.
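To put rough numbers on this (all figures below are illustrative assumptions, not measurements), consider what the idle-to-peak swing does to utilization once you provision for simultaneous peaks:

```python
# Illustrative assumptions: a GPU node idling at 1 kW, peaking at 4 kW,
# spending 60% of its time near peak. None of these are measured values.
idle_kw, peak_kw = 1.0, 4.0
nodes = 100
duty_cycle = 0.6  # assumed fraction of time near peak

avg_load = nodes * (idle_kw + duty_cycle * (peak_kw - idle_kw))
capacity_needed = nodes * peak_kw  # must cover everyone peaking at once

print(f"average load: {avg_load:.0f} kW of {capacity_needed:.0f} kW provisioned")
print(f"utilization without redundancy: {avg_load / capacity_needed:.0%}")

# Add, say, 25% extra capacity for redundancy and utilization drops further:
redundant_capacity = capacity_needed * 1.25
print(f"utilization with redundancy:    {avg_load / redundant_capacity:.0%}")
```

Even before any redundancy, the plant runs well below its rated capacity most of the time; redundancy pushes utilization down further still.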

Reliability vs Availability

It’s time to introduce the difference between two often misunderstood and confused terms: availability and reliability.

Availability has to do with the chances of having access to something whenever you choose to. No schedule, just anytime. A machine running a transactional workload must be highly available, because users, anywhere in the world, may decide to use it any time. They won’t make a booking first (but they may use it to make a booking). Just like you flip the light switch and you expect the light to turn on – no matter when.

Reliability is about something working without interruption for a given amount of time. In our everyday life, we have machines that are very reliable but not necessarily highly available. Aircraft are a good example. They are very reliable when in flight (thankfully), yet they spend extended amounts of time in hangars, out of service, for maintenance. It doesn’t really matter, because other aircraft get the job done while they are being maintained.

Availability and reliability are not always compatible. For example, having many valves increases the availability of a water distribution system, because you can isolate any section and repair it without cutting off much, or any, of the supply. But valves fail more often than pipes, so the chances of failure increase with every valve you add. In such systems, reliability decreases as availability increases.
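The valve argument follows from basic series-system reliability: a chain fails if any link fails, so the system’s reliability is the product of its components’ reliabilities. A quick illustration with assumed (not industry) figures:

```python
# Series-system reliability: the system survives only if every
# component survives, so reliabilities multiply. The component
# figures below are assumptions for illustration only.
def series_reliability(*components):
    r = 1.0
    for c in components:
        r *= c  # each added component can only lower the product
    return r

pipe = 0.999   # assumed probability a pipe survives the period
valve = 0.99   # valves fail more often than pipes (assumed)

print(f"pipe only:         {series_reliability(pipe):.4f}")
print(f"pipe + two valves: {series_reliability(pipe, valve, valve):.4f}")
```

Every valve added for maintainability multiplies in another factor below 1, so the more isolation points, the lower the chance the whole path survives unscathed.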

Achieving high availability usually increases complexity, which also makes unplanned downtime more likely to occur either due to failure or due to error.

So what do AI workloads really need?

AI workloads can benefit a lot from high reliability, even if it’s not essential to them. If you suddenly stop some GPUs, a few users will have to wait a little longer for their results. Not thousands or millions of users, and not for too long: the job scheduler will send these jobs to another machine and get them done. Sure, you will lose some computational time, but you can recover. And as the training of Llama 3 revealed, large GPU clusters fail mostly for reasons inherent to themselves, not because of data centre failures. They still recover. But unplanned downtime is always a nuisance, so reliability is important.

High availability is of little benefit to AI workloads. Sure, you don’t want your entire system to be unavailable. But if a segment of it is down, you can still run jobs on the remaining system. Your service is still there, just running at reduced capacity.

It’s a term very familiar to the High Performance Computing community, but usually alien to data centre folks: partial availability. Partial availability in a data centre normally means that some of the IT services are unavailable. And that can be catastrophic for transactional workloads with many interdependencies. But for a GPU cluster, having several GPU nodes unavailable does not affect the service. Users can still submit jobs and retrieve their data. The jobs will run on other GPU nodes. The service is available. The throughput is reduced but the service is there.
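The contrast can be made concrete with a toy calculation (node counts assumed for illustration): a transactional service with a hard dependency is either up or down, while a GPU cluster degrades gracefully.

```python
# Illustrative contrast between all-or-nothing availability and
# partial availability. The node counts are assumptions.
total_nodes = 512
down_nodes = 16

# Transactional view: a critical dependency down => service down.
transactional_available = down_nodes == 0

# Partial-availability view: the service is up as long as any
# nodes remain; only throughput is reduced.
cluster_available = down_nodes < total_nodes
throughput = (total_nodes - down_nodes) / total_nodes

print(f"transactional service up: {transactional_available}")
print(f"GPU cluster service up:   {cluster_available}, "
      f"throughput {throughput:.1%}")
```

Losing 16 of 512 nodes costs about 3% of throughput; it does not make the service unavailable.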

There is still a portion of the AI infrastructure which is very much like a transactional load: the service and storage nodes. But these are a tiny fraction of the total load, so ensuring high availability for them is a much easier task.

The humble Tier I

Tier I is the pariah of data centres. Nobody will consider certifying anything less than a Tier III. Some will consider a Tier II if it’s enough for their needs, but rarely certify it because there is no market for it. But a Tier I?

Yet, the humble Tier I has a few things going for it.

It can be very reliable: data centre machinery is inherently reliable, and the simplicity of a Tier I design makes it less prone to errors, both in design and operation. This doesn’t mean that any Tier I is reliable. But a well-designed Tier I, one that went through peer review and quality control, can be.

It is more efficient: equipment tends to perform better when operating under a higher load (but not too high). And the power consumption variance of AI workloads means that for a good portion of the time the system will be operating with spare capacity anyway.

But, one will ask, what happens when you need to do maintenance? The answer is quite simple: you turn some GPU nodes off. And guess what: had you built a Tier II or Tier III with the exact same power and cooling components, you would not have had those GPUs running in the first place. You would be keeping that extra capacity sitting there, unused, only so you could shut it down without shutting down GPU nodes.

But, one will argue, the cost of the GPU nodes is a lot more than the cost of having that extra capacity available. True, but the value of GPUs is not what they cost to acquire; it’s the revenue they bring in when used. And yes, you may lose 1% of that revenue while they are down for maintenance, but you lose 100% of it if the GPUs are not there at all.
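A back-of-the-envelope comparison makes the point; every figure here is an assumption chosen only to illustrate the shape of the trade-off:

```python
# Illustrative comparison: Tier I with all nodes installed vs a
# higher Tier that reserves capacity for redundancy. All numbers
# (node count, revenue, percentages) are assumptions.
gpu_nodes = 1000
revenue_per_node_year = 50_000  # assumed annual revenue per node

# Tier I: every node installed; assume maintenance costs 1% of uptime.
tier1_revenue = gpu_nodes * revenue_per_node_year * 0.99

# Higher Tier on the same power/cooling budget: reserve, say, 20%
# of capacity for redundancy, so 20% fewer nodes installed.
tier3_revenue = gpu_nodes * 0.8 * revenue_per_node_year

print(f"Tier I   (1% maintenance downtime): ${tier1_revenue:,.0f}")
print(f"Tier III (20% capacity reserved):   ${tier3_revenue:,.0f}")
```

Under these assumptions, the occasional planned shutdown costs far less revenue than permanently forgoing the nodes that the reserved capacity could have powered.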

Reliability is what counts for AI data centres

By all means, when deploying your GPUs I strongly encourage you to protect them, provide them with good power conditioning and adequate cooling. Give them a reliable and properly sized system, they need it and deserve it. Don’t go cheap. Most GPU failures, industry data shows, are due to power and cooling issues.

But high availability? Many nines? Keep it for the service and storage nodes. For the majority of the load, you can pass. Buy some more GPU nodes and use all of your installed power and cooling capacity. You can always shut down some GPUs if you need to do maintenance.

Maybe, though, to ensure reliability we should start certifying some Tier I (or Class F1, or Rated 1) data centres for a change, just to make sure they go through that valuable peer review.

