Schneider Electric Warns That Existing Datacenters Aren't Buff Enough For AI (theregister.com)

The infrastructure behind popular AI workloads is so demanding that Schneider Electric has suggested it may be time to reevaluate the way we build datacenters. The Register reports: In a recent white paper [PDF], the French multinational broke down several of the factors that make accommodating AI workloads so challenging and offered its guidance for how future datacenters could be optimized for them. The bad news is some of the recommendations may not make sense for existing facilities. The problem boils down to the fact that AI workloads often require low-latency, high-bandwidth networking to operate efficiently, which forces densification of racks, and ultimately puts pressure on existing datacenters' power delivery and thermal management systems.

Today it's not uncommon for GPUs to consume upwards of 700W and servers to exceed 10kW. Hundreds of these systems may be required to train a large language model in a reasonable timescale. According to Schneider, this is already at odds with what most datacenters can manage at 10-20kW per rack. This problem is exacerbated by the fact that training workloads benefit heavily from maximizing the number of systems per rack as it reduces network latency and costs associated with optics. In other words, spreading the systems out can reduce the load on each rack, but if doing so requires using slower optics, bottlenecks can be introduced that negatively affect cluster performance.
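
As a rough back-of-envelope illustration of why these numbers collide (the per-server overhead figure below is an assumption for the sketch, not from the white paper):

```python
# Illustrative rack-power math using the figures quoted above.
gpus_per_server = 8            # accelerators per server (assumed)
gpu_watts = 700                # per-GPU draw cited above
server_overhead_watts = 3000   # CPUs, RAM, NICs, fans (assumed)

server_watts = gpus_per_server * gpu_watts + server_overhead_watts
print(f"Per-server draw: {server_watts / 1000:.1f} kW")                # ~8.6 kW

rack_budget_watts = 20_000     # upper end of what most facilities manage
print(f"Servers per 20 kW rack: {rack_budget_watts // server_watts}")  # 2
```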

The situation isn't nearly as dire for inferencing -- the act of putting trained models to work generating text, images, or analyzing mountains of unstructured data -- as fewer AI accelerators per task are required compared to training. Then how do you safely and reliably deliver adequate power to these dense 20-plus kilowatt racks and how do you efficiently reject the heat generated in the process? "These challenges are not insurmountable but operators should proceed with a full understanding of the requirements, not only with respect to IT, but to physical infrastructure, especially existing datacenter facilities," the report's authors write. The whitepaper highlights several changes to datacenter power, cooling, rack configuration, and software management that operators can implement to mitigate the demands of widespread AI adoption.

  • Oh they do, do they? (Score:5, Interesting)

    by drinkypoo ( 153816 ) <drink@hyperlogos.org> on Tuesday September 19, 2023 @10:09PM (#63862026) Homepage Journal

    We need to continue to make it possible to do more processing with less power, whether that means continuing refinement on silicon or something else, but right now we don't need to be adding more always-on load. How's about we just get fast residential internet in all the cold places and do that distributed-computing space heater thing? You could use the waste heat to drive ammonia absorption refrigeration as well.

    • by MacMann ( 7518492 ) on Wednesday September 20, 2023 @12:33AM (#63862172)

      Processing information can't have its power requirements optimized down to zero; at some point we hit physical limits on moving things around to store, transmit, and process information. That quantity of energy might be what it takes to move an electron over an angstrom, or something else equally tiny, but it's not going to be zero.

      Given the money that can be made by lowering the energy cost of processing data, there's likely already considerable incentive to do better. But, again, there are limits. One example is solar PV. I don't recall the specifics, but someone came out with a new "highly efficient" PV cell tested at something like 20% efficiency. The problem was that, at its much higher cost, everyone kept buying the 15% efficient (or whatever) cells because they gave the best return on investment. Maybe it wasn't 20% and 15%, but the point is that even if we find something that delivers more processing with less power, it still needs to come at a cost that is worth the energy savings.

      People are already moving data centers to Canada for the colder outdoor temperatures (which make for more efficient heat rejection) and for the cheap, reliable electricity from nuclear and hydro. Canada is also apparently investing big in onshore wind, like its neighbor to the south, since it provides cheap electricity, but without hydro and/or nuclear as backup there's still a reliability issue. I've heard nice things about geothermal power lately, and that might help with the power supply, especially in colder climates where the larger temperature gradient helps with efficiency.

      • by Vancorps ( 746090 ) on Wednesday September 20, 2023 @12:40PM (#63863348)

        People are also building datacenters in the Phoenix metro area. So far cooling isn't the problem. Modern datacenters can cool 20kW in a rack without an issue; legacy ones have the issues Schneider is discussing (think DCs with raised floors). Our DC uses regular mini-split-style units to cool the room or cold aisle. The magic is what they do with the heat: a cooling unit on top of each row that is more or less a radiator they push water through. They typically run 3 units per row to ensure proper cooling redundancy.

        So as long as you orient your gear right, all is good in the world of modern DCs. I wouldn't think that would be a problem, but I run into racks all the time where the switches are oriented to make it easier to connect the servers, which means the switches are taking in air from the hot aisle and exhausting into the cold aisle. It's frustrating to see, because recabling to fix it is no fun, especially when at that point it's a live patient.

        The biggest problem for me personally is that I now have to use hearing protection when I go into the datacenter, even though my racks aren't doing AI. Verbal communication while working in the DC is basically a non-starter unless you're using bone-conduction mics. The AI gear is obviously very loud, since cooling 700W GPUs takes a lot of internal fan airflow.

        • Re switches venting the wrong way - every major manufacturer has pairs of models that differ only in the direction the fans push air for this exact reason (front vs rear cabling ports).

          And if it does come down to it, you can also just... open the switch & the power modules up and flip the fans around.
          • While a lot of switches have fan part numbers for one airflow direction or the other, some vent out the sides, and opening them up is a good way to void the warranty on devices that can exceed $100k in cost, which makes that proposal a non-starter. Frankly, I wouldn't do that even on a low-end $10k switch. Maintaining enterprise support is critical in a DC where you often don't have a lot of physical access after it's deployed.

            A lot of enterprise firewalls are designed to only be oriented in one direction as well.

      • Processing information can't have the power requirements optimized to zero, at some point we hit physical limits on moving things around to store, transmit, and process information. That quantity of energy might be moving an electron over an angstrom, or something else that is equally very very small, but it's not going to be zero.

        While currently in the realm of science fiction, it is not against the laws of physics to do close to zero energy computation. Energy only needs to be used when erasing information.
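
        For scale, that floor is the Landauer limit, roughly kT ln 2 per bit erased. A quick sketch of how small that is at room temperature (illustrative numbers only):

```python
import math

# Landauer limit: minimum energy to erase one bit, E = k_B * T * ln(2)
k_B = 1.380649e-23              # Boltzmann constant, J/K
T = 300.0                       # roughly room temperature, K

e_bit = k_B * T * math.log(2)
print(f"Energy per erased bit at 300 K: {e_bit:.2e} J")     # ~2.9e-21 J

# For comparison, a 700 W GPU running for one second could in principle
# "pay for" erasing ~2.4e23 bits at this floor; real hardware is many
# orders of magnitude away from that limit.
print(f"Bits per second at the limit for 700 W: {700 / e_bit:.1e}")
```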

    • by Rei ( 128717 ) on Wednesday September 20, 2023 @05:17AM (#63862462) Homepage

      Did you miss the part where they said that the key difference with AI training is the importance of low-latency connections? And your solution is to spread training out to home internet connections in every corner of the world?

      To give a sense of why:

      First off, training is hugely memory intensive. Like, it's hard to fit even a single training batch for a 3B model on a top-end consumer GPU with 24GB of VRAM.

      But GPT-3 isn't 3B parameters - it's 175B parameters. And GPT-4 is reportedly 1.76T parameters.

      Oh, but wait, there's more! Because again, there's this thing called batching / microbatching. Basically, the goal of training is to find gradients such that, if the model had been adjusted along them, it would have done better on a given training sample. But if you just take training samples one at a time, the gradients may be wildly different from each other. You're in effect doing simulated annealing, constantly bouncing yourself out of the optimum. So we train in batches: you calculate the gradients of many samples at once, average them, and use the averaged gradient instead, which can be thought of as reflecting broader truths than what you'd get from a single sample.

      Well, GPT-4 is said to have had a batch size of 60 million.

      The short of it is, you're not solving many small problems where it's just a matter of throwing enough compute at them. You're solving a single, massively interconnected problem that's far too large to fit onto a single GPU. So latency is really critical to training. To get a sense of how critical, watch the Tesla AI Day video where they unveiled the Dojo architecture.
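
      To make the batching point concrete, here is a minimal gradient-accumulation sketch (PyTorch; the model, loader, and optimizer names are placeholders, and nothing here is specific to GPT-scale training):

```python
import torch
import torch.nn.functional as F

# Average gradients over many microbatches before taking one optimizer step.
def train_epoch(model, loader, optimizer, accum_steps=64):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = F.cross_entropy(model(x), y)
        # Scale so the accumulated gradient equals the batch-average gradient.
        (loss / accum_steps).backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()        # one update from the averaged gradient
            optimizer.zero_grad()
```

      In a multi-GPU run those per-device gradients also have to be all-reduced (averaged) across every accelerator before each step, which is exactly where the low-latency interconnect described above comes in.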

      • Did you miss the part where they said that the key difference with AI training is the importance of low-latency connections? And your solution is to spread training out to home internet connections in every corner of the world?

        But GPT-3 isn't 3B parameters - it's 175B parameters. And GPT-4 is 1,76T parameters.

        GPT-4 is really more like a dozen or so roughly 100B-parameter models. MoE is a game changer for scaling because it enables specialization across smaller, more tractable models. Still not likely to be something you can distribute over the Internet any time soon.

        The short of it is, you're not solving many small problems where it's just a task of throwing enough compute at it. You're solving a single, massively-interconnected problem that's far too large to fit onto a single GPU. So latency is really critical to training.

        There is another possibility that could enable Internet-scale distribution: model merging, where many parties independently train different aspects on top of the same base model and then merge the results. Think Neo's stack of educational mini-discs from The Matrix.
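
        As a toy illustration of the merging idea (not how any production merge pipeline works), averaging the weights of two fine-tunes of the same base model looks roughly like this:

```python
# Naive model merge: element-wise weighted average of two checkpoints that
# share the same architecture and base weights. Works on dicts of tensors
# or NumPy arrays; purely illustrative.
def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share the same layout"
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Hypothetical usage with two fine-tunes of the same PyTorch base model:
#   merged = merge_state_dicts(model_a.state_dict(), model_b.state_dict())
#   base_model.load_state_dict(merged)
```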

  • by HBI ( 10338492 ) on Tuesday September 19, 2023 @10:41PM (#63862052)

    Liquid as a heat transfer agent has been in use for many, many years. Nothing weird about that. Most significant data centers I've run have had coolant systems with a water tower involved. The big honking air handlers need somewhere to put that heat.

    Perhaps servers will have to bring the coolant closer to the CPUs to handle the greater heat density. The issues with leakage and the complexity of already messy racks probably have prevented this up until this point. Also, the location of PDUs towards the bottom of racks may be an issue - who wants fluid leaking onto those? And oh, they do leak. Many floods in data centers over the years. Looking through a vent tile to see flowing water is not unusual and something you need to keep an eye out for.

    Mostly an engineering problem. But they want to sell more Schneider gear, so unsurprising they'd be writing whitepapers about this.

    • Time to bring in the ammonia cooling systems.

    • by r1348 ( 2567295 )

      Liquid cooling's main problem is scaling. We ran several pilots in the past years testing different approaches, but the main issue is that they exponentially increase maintenance time. Leakage was present but minor.

      • by realxmp ( 518717 )

        Liquid cooling's main problem is scaling. We ran several pilots in the past years testing different approaches, but the main issue is that they exponentially increase maintenance time. Leakage was present but minor.

        This is exactly it. It's still early days with this tech, so we're still feeling our way towards a sensible solution. My DC runs chilled doors as standard on our 20kW racks, and they take a water feed to each rack, so the jump to DLC isn't as far. The problem is that this scale of compute is not sustainable: with 20kW racks you can only fit 2 DGX H100s in them, and that's without a switch! Ultimately we need a low-maintenance solution, just like the jump we made to tool-free servers.

      • by HBI ( 10338492 )

        My response is "fittings break". Usually in stress situations, like cold temperatures outside.

        I remember a situation where I was made to hold half of a fitting together with the other half for the better part of an hour trying to staunch the flow while someone got some replacement parts and the tools necessary to fix it up, as pressurized water shot everywhere. No one wanted to shut the data center down...

        • by r1348 ( 2567295 )

          We didn't have anything that catastrophic, and we definitely wouldn't have asked a person to hold a leaky pipe over live systems for an hour.
          I think the main issue in your scenario is that you didn't have an adequate SOP. And yes, sometimes you need to shut down hardware; that's why you have redundancy and shifting procedures.

          • by HBI ( 10338492 )

            Would you believe this was IBM? It was.

            • by r1348 ( 2567295 )

              I have no difficulty believing it was IBM; I've personally interviewed droves of SoftLayer/IBM Cloud employees willing to jump ship.

  • Sure! (Score:5, Insightful)

    by ArchieBunker ( 132337 ) on Tuesday September 19, 2023 @10:54PM (#63862062)

    Company that sells power transmission equipment and accessories says you will need more power.

  • We're using old technology to support new ideas. There's money to be had for investors who can come up with a disruptive, massively cool alternative to heat loss and radical new approaches to propagation delays.

    • I wonder when someone is finally going to produce actual hardware neurons. Right now we're just simulating neural nets with classic computers. No matter how much they distribute the load over many cores, it's still just processors doing calculations. If we can have the silicon equivalent of actual neurons, that would be a major breakthrough.

  • by Rosco P. Coltrane ( 209368 ) on Tuesday September 19, 2023 @11:32PM (#63862102)

    so that I can type "Make me a photo of Lindsay Lohan naked covered in grey poupon" in DALL-E.

    • Impressive, eh?
      The future must be a grand place. I'd like to see it someday.

      • Impressive, eh?
        The future must be a grand place. I'd like to see it someday.

        I am sometimes amazed at the technology available today and the speed at which new technologies come to market. Others appear to have made the same observation and exclaimed, "we are living in the future, but it's not evenly distributed."

        We can see the future, but we have to travel some to see it in bits and pieces. Travel can also take us into the past, if you want to see what things were like living in some historical period then we can take an airplane ride to see what life was like going back to anythi

    • by necro81 ( 917438 ) on Wednesday September 20, 2023 @08:19AM (#63862668) Journal

      so that I can type "Make me a photo of Lindsay Lohan naked covered in grey poupon" in DALL-E.

      But, curiously, all I got back were images of Natalie Portman. And the Grey Poupon looked an awful lot like hot grits. I guess that's what you get when you train your AI by scraping the internet.

  • by aaarrrgggh ( 9205 ) on Wednesday September 20, 2023 @12:09AM (#63862134)

    For those not familiar with the data center state of the art from 10 years ago, this is pretty good information. I haven't designed any AI data centers, but I have done plenty of facilities over 40-50kW/rack and a couple over 100kW/rack, so many of the warnings are pretty old hat.

    What is important for anybody even tangentially involved with data centers is to understand the impact of the peak:average workload ratio in these facilities. It is a big deal, and many things are not adequately designed to accommodate it. We used to design for a 3:1 peak/average ratio, which would be consistent for electrical and HVAC; for well-designed systems, once you go over 2:1 you will have significant HVAC issues, and over 1.25:1 electrical systems will need major overhauls.
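
    A minimal sketch of that sizing check (the 3:1, 2:1, and 1.25:1 figures are the rules of thumb from the paragraph above, not standards):

```python
# Rough capacity sanity check from an average rack load and an assumed
# peak:average ratio, using the thresholds quoted above.
def check_rack(avg_kw: float, peak_to_avg: float) -> dict:
    return {
        "peak_kw": avg_kw * peak_to_avg,
        "hvac_ok": peak_to_avg <= 2.0,    # above ~2:1, significant HVAC issues
        "elec_ok": peak_to_avg <= 1.25,   # above ~1.25:1, electrical overhauls
    }

print(check_rack(avg_kw=15, peak_to_avg=3.0))
# {'peak_kw': 45.0, 'hvac_ok': False, 'elec_ok': False}
```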

  • For a given system, it's all about how much energy can be converted into heat over a unit of time.

    Watts in → BTUs out.

    There are many ways to optimize the left-hand side of that equation, but handling the right-hand side efficiently and stably remains a challenge. This is where scaling out is going to win.
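
    The conversion itself is simple, since essentially every watt drawn by IT gear comes back out as heat (1 W ≈ 3.412 BTU/hr). As a quick illustrative sketch for a 20 kW rack:

```python
# Convert electrical load into heat-rejection terms.
BTU_PER_HR_PER_WATT = 3.412
TONS_PER_BTU_HR = 1 / 12_000      # 1 ton of cooling = 12,000 BTU/hr

rack_watts = 20_000               # example 20 kW rack
btu_hr = rack_watts * BTU_PER_HR_PER_WATT
print(f"{btu_hr:,.0f} BTU/hr (~{btu_hr * TONS_PER_BTU_HR:.1f} tons of cooling)")
# -> 68,240 BTU/hr, roughly 5.7 tons per rack
```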

  • If we are to get battery electric vehicles to replace internal combustion engines then we need to have a similar discussion.

    The usual comment about BEVs is that it's no big deal: people will drive their BEV home, plug it in, and set a delay timer on the charger so the car charges after supper has been cooked and cleaned up, everyone has finished their homework and nightly TV watching, and all the lights are out and everyone has gone to sleep. The grid can handle this, as can the power plants providing the power.

    • The solution for datacenters already exists. It's a small reactor instead of the diesel generators. The primary problem right now is not IN the datacenter; it is transmission and grid problems, primarily due to green energy investments and the shutdown of nuclear and other base-load plants, which have made the grid a lot more brittle.

      We have to accept that fission and eventually fusion are practically infinite power sources. If you want to live in a future like Star Trek, you have to solve the primary problems Star Trek did.

      • by stooo ( 2202012 )

        >> We have to accept, fission and eventually fusion are practically infinite power sources.
        Yep. I use fission with my PV panels.

  • So we make AI more efficient. Then we just have more of it, and power consumption goes up or stays the same. Or we make compute less power hungry. Then we have more AI because the power's cheaper and available. Either way, we use more power than we have, and demand continues to exceed supply.

  • The hype will be over, nobody will care much anymore, and the remaining few applications will be optimized a lot more and run well on pretty normal hardware.

  • Why is it "news"? (It's not, it's Slashfiller.)

    Data centers are tools. When they cease to serve, replace them.
    There's no shortage of money and the work will be welcomed.
     

  • by ledow ( 319597 ) on Wednesday September 20, 2023 @03:59AM (#63862398) Homepage

    Ah, the old AI adage:

    Just throw more processors at it, it's bound to become intelligent this time round!

  • by GrpA ( 691294 ) on Wednesday September 20, 2023 @06:02AM (#63862490)

    People who design data centers rarely understand what they are doing. 32kW per rack is well and truly achievable in a modern data center with enclosed cold-aisle and hot-aisle plenums, which usually rely on water cooling.

    Problem is, once the pump goes out and the water stops, a data center like that has less than a minute before the temperature inside reaches the point where you can no longer escape. The research that identified the problem originally came from the US military, looking at how long soldiers could remain inside a burning tank before escape was no longer possible (somewhere around 72 degrees C).
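
    A rough back-of-envelope sketch of why the window is so short once coolant flow stops (every number below is an assumption for illustration, not from the research mentioned above):

```python
# Crude adiabatic estimate of how fast enclosed air heats up when the heat
# sink disappears. Ignores the thermal mass of racks, walls, and hardware,
# so real times are longer, but the order of magnitude is the point.
AIR_DENSITY = 1.2          # kg/m^3
CP_AIR = 1005.0            # J/(kg*K)

aisle_volume_m3 = 100.0    # assumed enclosed hot-aisle volume
heat_load_w = 10 * 32_000  # assumed: ten 32 kW racks dumping into that aisle
delta_t_k = 30.0           # e.g. ~42 C ambient rising to ~72 C

air_mass_kg = AIR_DENSITY * aisle_volume_m3
seconds = air_mass_kg * CP_AIR * delta_t_k / heat_load_w
print(f"~{seconds:.0f} s to rise {delta_t_k:.0f} K")   # on the order of ~11 s
```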

    And you thought non-breathable gas extinguishers were bad...

    GrpA

  • They always ask the AI to solve the issue, though it may be constrained by all the kickbacks, corporate welfare, and pork barrel.
  • Then let's not double the AC; let's switch to a DC design for data center racks.

    There are advantages to switching to DC:
    1) DC power distribution can be more energy-efficient
    2) Reduced Heat Generation in racks by eliminating the AC-DC conversion

    What is holding us back is the lack of a DC standard for this. We could use USB as at least a starting point, as it supports negotiated 5V, 9V, 15V, and 20V. We know most of the power will be needed at 5V, but USB PD is not typically used for more than 15 watts.
    • When talking about high input voltages, you gain virtually nothing going from AC to DC. The voltage drop across asynchronous input rectifiers accounts for maybe 1% power loss, and once that's done, you've got the exact same thing as before: a collection of synchronized, synchronous buck converters turning about 280Vdc into 1.x-to-12V dc.

      And you lose the ability of normal power control relays to reliably break contact, because the current through a DC circuit never crosses zero, so you require much more robust DC-rated switchgear.
      • Where did you get your 1% number? There is a reason the efficiency standard on computer servers is named "80-Plus" and power supplies have fans on them ... and I will give you a hint: it is not because they are 99% efficient. Roughly 15% of all of the heat in a data center comes from the power supplies. With a DC power bus you can do the conversion outside of the building.

        And who uses relays for DC? That is so 1950s AT&T. Today we have these things called Power MOSFETs and IGBTs, my thesis invol
        • I did not claim that power supplies are 99% efficient. I stated that you gain 1% for free with DC. This gain is because you lose very roughly 1% of 200V across the input rectifiers when converting AC to DC. At this point that may no longer be the case, if supplies are moving to synchronous input rectification in pursuit of the ever-elusive last few percent (under realistic constraints of component cost and size)... and while the standard may originally have been "80+", the efficiency of computer power supplies has long since moved well past that.
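
          For what it's worth, the arithmetic being argued about is easy to sketch (every efficiency figure below is an assumption for illustration, not a measurement):

```python
# Compare a conventional AC chain against a facility-level DC bus by
# multiplying assumed per-stage efficiencies.
def chain_efficiency(*stages: float) -> float:
    eff = 1.0
    for s in stages:
        eff *= s
    return eff

ac_chain = chain_efficiency(0.97,   # UPS double conversion (assumed)
                            0.99,   # input rectification in the server PSU (the ~1% at issue)
                            0.94)   # downstream DC-DC stages (assumed)

dc_chain = chain_efficiency(0.97,   # centralized AC-to-DC conversion outside the hall (assumed)
                            0.94)   # same downstream DC-DC stages, rectifier stage removed

print(f"AC chain: {ac_chain:.1%}, DC bus: {dc_chain:.1%}")
# The gap is roughly the ~1% rectifier loss the two posts above disagree about.
```
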
  • They spent 3 pages explaining why you can't draw more power than is available; what is this fluff?

"It's the best thing since professional golfers on 'ludes." -- Rick Obidiah

Working...