Some call it an AI boom, others call it an AI bubble. Either way, as AI demand continues to grow, so too does the demand on the infrastructure that supports it. And hyperscalers are doing all they can to keep up.
"The demand for AI infrastructure is at the peak, and it's growing every day," Sudha Raghavan, senior vice president at Oracle Cloud Infrastructure (OCI), said in a recent episode of IT Ops Query.
Research from consulting firm Deloitte estimated that, in the U.S. alone, power demand from AI data centers could reach 123 gigawatts in the next 10 years. That represents a 2,975% increase from 4 gigawatts in 2024.
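That growth figure follows directly from Deloitte's two endpoints; a quick arithmetic sketch:

```python
# Deloitte's projected growth in U.S. AI data center power demand.
start_gw = 4     # estimated demand in 2024, in gigawatts
end_gw = 123     # projected demand roughly a decade out

pct_increase = (end_gw - start_gw) / start_gw * 100
print(f"{pct_increase:.0f}% increase")  # -> 2975% increase
```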
"If you think about how cloud demand evolved, it did grow, but it didn't grow at this pace," Raghavan noted.
How AI is fueling infrastructure innovation
But while AI expansion is driving data center demand, it's also driving innovation in infrastructure, according to Raghavan, who runs Oracle's AI infrastructure platform.
"We talk about all of these AI agents, the agentic infrastructure, the new models that are coming on, and so on and so forth," she said. "That demand for that different type of application, that different type of usage, is driving a lot of innovation at the infrastructure level."
Raghavan cited OCI Zettascale10 as one example of such innovation.
Oracle first introduced Zettascale, a cloud computing cluster supporting just over 130,000 GPUs, in September 2024. About a year later, at the Oracle AI World 2025 conference, the company announced Zettascale10, a supercluster of 800,000 GPUs it calls the "largest AI supercomputer in the cloud."
For comparison, the world's first exascale supercomputer, Frontier, performs at speeds of 1.1 exaFLOPS -- or 1.1 quintillion floating-point operations per second. Zettascale, according to OCI, achieved speeds of 2.4 zettaFLOPS, or 2.4 sextillion FLOPS. Now, Zettascale10 delivers 16 zettaFLOPS to tackle the demands of large-scale AI workloads.
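To put those vendor-reported numbers side by side, here is a rough sketch; note that peak figures like these often reflect different numeric precisions, so this is an order-of-magnitude comparison only:

```python
# SI prefixes: exa = 1e18, zetta = 1e21, so 1 zettaFLOPS = 1,000 exaFLOPS.
frontier = 1.1e18       # Frontier: 1.1 exaFLOPS
zettascale = 2.4e21     # OCI Zettascale: 2.4 zettaFLOPS
zettascale10 = 16e21    # OCI Zettascale10: 16 zettaFLOPS

print(f"Zettascale vs. Frontier:   {zettascale / frontier:,.0f}x")    # ~2,182x
print(f"Zettascale10 vs. Frontier: {zettascale10 / frontier:,.0f}x")  # ~14,545x
```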
As AI scales, so does data center construction
But all that scale calls for ever-larger data centers, which in turn create further demand for the resources that keep them running, such as power and water for cooling. These demands can strain the communities where data centers reside. In 2025, AWS withdrew its plans for a new data center development in Virginia after receiving community pushback.
"We're very particular about it," Raghavan said of OCI's effort to avoid harming the communities where it builds its data centers. She described how the company not only explores new ways of using the resources data centers run on, but it also examines the overall efficiency of that usage.
OCI isn't alone in that effort. Earlier this year, Microsoft announced its "Community-First AI Infrastructure" initiative, a framework providing five areas of commitment to residents of communities where the company's data centers are located. These commitments include efforts to prevent electricity rate hikes and protect water availability in the community.
Raghavan noted that, as data centers come online, "the demand for power and water is going to temporarily increase. And we are trying to make sure that is sustainable."
"What are the things that we need to line up from the rest of the industries, from the environment," she said, "so that we are successful in what we build as a true ecosystem player and not just the AI infrastructure company that we are?"
Watch this episode of IT Ops Query for more on AI infrastructure innovation, plus Raghavan's predictions about how AI scale and demand will evolve in the next year.
Kate Murray is a managing editor with Informa TechTarget's Infrastructure editorial team. She joined the company as an associate managing editor of e-products in 2020.
Transcript - Oracle's Sudha Raghavan on AI's infrastructure renaissance
Beth Pariseau: From Informa TechTarget, I'm Beth Pariseau, and this is IT Ops Query.
This podcast distills the signal from the noise about enterprise software development and platform engineering. Each week, we'll talk to expert guests about the latest tech industry news and trends that engineering and IT leaders need to know.
Don't forget to subscribe to IT Ops Query for more conversations on AI and the future of the enterprise digital workspace.
Sudha Raghavan is senior vice president at Oracle, responsible for running the AI infrastructure platform. In her role, she oversees all expansions and new buildouts for some of the world's largest infrastructure platforms for running GPU clusters, including network design.
In her previous role, Sudha was responsible for Oracle Cloud Infrastructure developer services and container platform, and she also managed Oracle's Kubernetes and serverless platforms.
With over 20 years of experience in software engineering, Sudha serves as a board member of the Cloud Native Computing Foundation and is passionate about culture and inclusion initiatives.
Prior to her time at Oracle, she worked at Microsoft for 14 years, including several years in Bing and Bing Ads.
Welcome, Sudha. Thank you for joining us today.
Sudha Raghavan: Thank you, Beth, for having me on.
Pariseau: So, let's start with the big picture. In general, AI is eating the world, and so how is that changing enterprise infrastructure purchasing decisions compared to previous generations of cloud workloads?
Raghavan: So, AI is everywhere. AI changes everything. This was our theme for Oracle AI World, which just concluded a couple of weeks ago. With AI, the expectation of throughput increase and efficiency increase in every business that you run in the world has multiplied.
However, AI also takes initial investment, which means you now have to plan for the infrastructure to run your AI models, platforms and inferencing solutions before you get to see the benefit. So, companies -- enterprises, specifically -- are looking at longer-term benefits much more than they would for regular cloud purchasing, which was commodity hardware and subscription purchases whose demand they would spike up and down based on, you know, seasonality and things like that.
With AI, we are just beginning. So, everybody is at the peak of their demand, trying to identify the various portions of their businesses that can become more successful as AI gets integrated into their platforms. Because they are at the beginning, they are planning for their first peak.
That's what we, supplying infrastructure for these AI workloads, are seeing. That's why the demand is so high. We've barely started the innovation on what AI can do for real-world scenarios. Right now, four of the top five large AI training models run in our cloud -- and 'our' as in Oracle Cloud Infrastructure -- and we are seeing the demand for using those models in inferencing growing endlessly.
People want to run not just small clusters to do very dedicated work, but also very large training clusters so they can see how AI can benefit their specific use case. Take the generality of a model like ChatGPT, but then add your data to make it your solution -- not a generic ChatGPT solution -- and that is driving a lot of demand. ChatGPT keeps revving its versions, and then the people adapting it to their data also have to rev their versions of it. So, the demand for AI infrastructure is at the peak, and it's growing every day.
This is what we see when we think about building AI infrastructure for the years to come -- not the next three months, not the next six months. With our remaining performance obligation (RPO) commitments, we are seeing at least five years out. We don't even know what AI will accomplish in five years. We just know right now there's demand to see how much more benefit can be gained by using AI.
Pariseau: Is that demand basically, right now, just for scale and more scale? I noticed that a lot of the news out of AI World was about just huge scale.
Raghavan: Exactly. So, GPU clusters are not big enough -- that is what we are hearing from a lot of these large AI model training companies. We just released the 130,000-GPU Zettascale cluster last year and thought, OK, this is going to sustain demand -- but Zettascale10 has come within a year.
Pariseau: Yeah.
Raghavan: If you think about how cloud demand evolved, it did grow, but it didn't grow at this pace. The ask is about what an AI GPU cluster can do -- and more importantly, how fast it can do it. I don't want to wait a month to train my model, because if it's wrong, I've just lost a month of actual productive work.
Pariseau: Right.
Raghavan: Can you do it faster? Can I try things faster? This is the demand that we are seeing in AI.
What we are also seeing: As of last year, we saw a lot of people wanting training workloads. Now we're seeing inference workloads come through, too, which means people are using it. Whatever they have trained, they're finding useful for their businesses, and they are coming back for inferencing infrastructure to actually see the benefit of the data they have trained with.
So, I think this is a very nice cycle that we predict is going to go on for at least a few more years.
Pariseau: Wow. So, we're already talking about gigawatt-scale supercomputers and multi-data-center clusters. But at the same time, costs and power requirements are a concern for all but the most well-funded companies. So, how can IT leaders address those worries?
Raghavan: So, we here at Oracle truly believe in making sure our power and cooling requirements are not harmful and do not negatively affect residential reliability or costs in the regions where we're building these large gigawatt data centers, right? We're very particular about it.

We are also trying to make sure that our data centers use something like closed-loop, non-evaporative cooling to minimize water usage. We are building more and more liquid-cooled GPU racks. So, power is a requirement, but water also becomes a requirement. How do you conserve? We're always looking at newer ways of not just using the resources we need to run these data centers but also improving the efficiency of that usage.

For example, you know, AI workloads run with a lot of power variation, which means when they are running their workloads, they need the highest throughput, but then they also have their, you know, lull periods where they're not using much. So now, if you run a gigawatt data center and you're swinging your power usage by hundreds of megawatts, our power grids cannot take that variance.
Pariseau: Right.
Raghavan: So, how do we adapt to make sure we are storing the excess energy in our data centers -- in our uninterruptible power supplies (UPS) and so on -- so that when the peak happens and the grid is not able to supply it, we can supplement with what we have stored and not swing the grid power?
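As a rough illustration of the buffering idea Raghavan describes here -- not Oracle's actual control system, and with every figure invented -- local storage can flatten what the grid sees even as the GPU load swings:

```python
# Peak shaving with local energy storage: the grid supplies a steady
# average load; a battery/UPS bank absorbs the surplus during lulls and
# releases it during training bursts. All figures are illustrative.
workload_mw = [100, 100, 100, 900, 900, 900, 100, 100]  # hourly GPU draw
grid_supply_mw = sum(workload_mw) / len(workload_mw)    # flat 400 MW draw
stored_mwh = 600.0  # assume the storage bank starts partially charged

for hour, demand in enumerate(workload_mw):
    delta = grid_supply_mw - demand  # surplus charges storage; deficit drains it
    stored_mwh += delta
    action = "charging" if delta > 0 else "discharging"
    print(f"hour {hour}: demand {demand:3.0f} MW, grid {grid_supply_mw:.0f} MW, "
          f"{action}, stored {stored_mwh:.0f} MWh")
```

In this toy run, the grid sees a constant 400 MW draw throughout, even though the workload itself swings by 800 MW.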
We are thinking about end-to-end solutions so that we do not consume more than what is required -- the absolute bare minimum requirements for power, for water, and so on.
Now, this does mean that, in the short term, as we build these data centers across the U.S. and definitely in other parts of the world, the demand for power and water is going to temporarily increase. And we are trying to make sure that is sustainable.
Initially, we may see something unprecedented -- nobody builds a city saying, 'OK, I'm going to get a gigawatt data center tomorrow in this city.' So, initially, it's a big hump. We've already seen this in the places where we have started building. But then, over time, the sustainable growth that these areas see because of these new data centers feeds the economy much better than we would have even predicted.

We've gone to remote places all across the United States. We're trying to do the same in other countries and global economies as well, and we've seen how just the beginning of a GPU data center is great for the local economy of that city, of that county, of that town -- and how it brings sustainable environmental practices, because they suddenly start to think, 'Oh, it's not a short-term thing. This is gonna live. I've gotta know how to run this for years to come.'
Pariseau: Right.
Raghavan: They're also figuring out, OK, if this gigawatt data center is gonna come, what about employment? How many people are going to be needed in those areas to sustain this, again, for years to come? It's not just a construction boom; it's sustainable economic growth over years.

We're seeing all this take place in the few areas where we've started, and we know we have only just begun. At Oracle Cloud, we now have almost templates of what is going to happen to a town or a city when we take our gigawatt data center there and say, 'We're gonna build this here. What are the things that we need to line up from the rest of the industries, from the environment, so that we are successful in what we build as a true ecosystem player and not just, you know, the AI infrastructure company that we are?'
Pariseau: Right. And in the meantime, I know that with Oracle Acceleron, there's also an eye toward efficiency in how networks operate.
Raghavan: Absolutely.
Pariseau: So, can you give us an overview of Acceleron and how it supports AI workloads efficiently?
Raghavan: Acceleron was, again, purpose-built. We know AI workloads run across the network. While we keep talking about GPUs and how much power each GPU can have, the power of AI is in how the GPUs communicate with each other -- it's in the network -- and this has been our specialty.

At Oracle, we've had engineered solutions for databases, like Exadata, built around this need for running large volumes of data and computation across machines in the form of a cluster. We have years of experience doing that at high scale. And if you think about the database, atomicity is what it needs -- high scale and high throughput without losing data consistency.

We've done this for several years at Oracle. When AI clusters came on board, we just had to pivot: OK, it's not just about the database and the data; it's also about computation and the training workloads that are needed. It is at a larger scale, but again using the same pillars of infrastructure, especially around network technology.

So, we had done RDMA and RoCE [RDMA over Converged Ethernet, pronounced 'Rocky'] before this thing became a big deal. We've had that experience; now we've just had to scale it. So, we did RoCEv2 very quickly. We have workloads that run at, you know, petabits and gigabits of network throughput without even flinching. We don't think about them as these large things that you have to snowflake or hand-manage; it just runs. That's OCI. That's the power of our network.
And then, you know, AI, when it comes, it comes with a lot of conformance, regulation and security asks. We thought, OK, there are already so many areas where privacy, data security and compliance are so important. What can we do to help AI developers not have to think about, at least, network security -- so that at least the data being used for training in the network remains in the network? And out came ZPR [Zero Trust Packet Routing], right?

So, it's network security rules defined as policies. That's a very different way of thinking about the network, because developers have always had to think in two different ways. OK, I control who has access to what resources with identity/IAM policies in the cloud. And then my network engineers, who are not me, control what happens in the network -- the network firewall, the firewall rules. And there was a hope that these two would fit hand in glove. Sometimes they did, sometimes they didn't.

So, what did we do? We said, OK, how about we let the AI developers, or the people maintaining AI infrastructure, do this all together? You don't have to bring in your networking counterparts to lay out firewall rules. How about we express the network rules as policies, too, and enforce them in the network? That was ZPR. It's a very different, groundbreaking, almost new paradigm for thinking about network security.
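To make the 'network rules as policies' idea concrete, here is a toy sketch of attribute-based connection checks. It is illustrative only -- the names and rule format are invented and are not OCI's actual ZPR policy language:

```python
# Toy intent-based network policy: connections are allowed based on
# security attributes attached to endpoints, not on IP firewall rules.
# NOT real ZPR syntax; every name here is invented for illustration.
ATTRIBUTES = {
    "10.0.1.5": {"app": "training-frontend"},
    "10.0.2.9": {"app": "model-db"},
}

# Policy: only app=training-frontend endpoints may reach app=model-db on 1521.
POLICIES = [
    {"src": {"app": "training-frontend"}, "dst": {"app": "model-db"}, "port": 1521},
]

def connection_allowed(src_ip: str, dst_ip: str, port: int) -> bool:
    src = ATTRIBUTES.get(src_ip, {})
    dst = ATTRIBUTES.get(dst_ip, {})
    return any(
        all(src.get(k) == v for k, v in p["src"].items())
        and all(dst.get(k) == v for k, v in p["dst"].items())
        and port == p["port"]
        for p in POLICIES
    )

print(connection_allowed("10.0.1.5", "10.0.2.9", 1521))  # True
print(connection_allowed("10.0.9.9", "10.0.2.9", 1521))  # False: untagged source
```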
And that is also part of Acceleron. And then, you spoke about efficiency, you know? You saw -- or hopefully a lot of your viewers also saw -- our Oracle AI World keynote by Clay, which spoke about how, when we came up with bare-metal cloud, we had two network cards: one that was fully for the customer, and one where we as OCI would run our stuff, so that we could do the cleanup and the firmware-level things we need to do without being able to look at what the customer did.

So, you had two NICs. That meant, you know, every NIC that you add is a hop in the network. Sure, it's within that same zip, but it is a hop in the network. Even if it adds only a few nanoseconds, it still adds latency. That's a little bit of performance, right? It gives you security, but there is a cost to it.

When we came up with the Converged NIC, that gave us the best of both worlds -- again, trying to do the perfect thing for the workloads without compromising on security. Security has been a foundational design principle for everything we have built at OCI, so we will never, ever let go of security for the sake of some feature. We're always trying to figure out how to get the best of both worlds, and that is our Converged NIC strategy, which is also part of Acceleron.
So, in the end, we're trying to do things more securely and more efficiently with the best customer experience. That's what OCI Acceleron is all about.
Pariseau: And there's also an efficiency aspect to RoCE in general, right, where you're communicating directly between CPUs?
Raghavan: Correct, we are. So, RDMA, right? Remote direct memory access: You're accessing the CPU memory of another compute tray or computer in the network as if it were attached to your own CPU. That is RDMA over Converged Ethernet. Again, that is a technology we had used in other systems, which we are now expanding.

I think one of the other things we did with that same technology is multiplanar networks. A lot of our competitors use multi-tier networking. So, you know, if you look at a computer in a data center, these computers belong to what we call racks. The racks all have a top-of-rack switch, which is then used for communication outside. Now, the top-of-rack switch is technically a single point of failure: If that thing dies, then the whole rack, pretty much, is unusable, right?

Now, what we did was multiplanar networks, which means you not only have redundancy in the card -- most companies do have multiple -- but we also said they each belong to different planes. So, if one plane breaks because some switch, some network utility, some rack, some spine broke down, you will have another path to get to that other computer. It might be a little bit slower, because you have definitely reduced the bandwidth -- you've lost one of the N planes it operates on -- but the workload will continue.

We've seen that AI workloads are way more sensitive to stoppage than they are to, you know, 'Keep going, but now at a slower pace.' The workload will react to 'OK, I don't have as much throughput,' and we have several quality-of-service protocols through which these things will adjust the workload depending on what throughput the network is able to provide.

But if you shut it down completely, the entire workload has to checkpoint and wait for the thing to come back up and restart. It has to reingest the whole checkpoint. That's way more expensive for large AI workloads.
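A back-of-the-envelope sketch of that trade-off -- all numbers invented for illustration -- shows why a degraded-but-running job beats a full stop:

```python
# Compare losing one network plane (job slows down) with losing the only
# path (job stops, waits for repair, reloads the last checkpoint).
# All figures are illustrative, not measurements.
step_time_s = 10.0   # normal time per training step
outage_steps = 360   # steps affected by the fault (~1 hour of work)

# Case 1: one of 8 planes fails -> ~7/8 of the bandwidth remains,
# so the affected steps run roughly 8/7 slower.
degraded_s = outage_steps * step_time_s * (8 / 7)

# Case 2: full stop -> redo 30 min of progress lost since the last
# checkpoint, wait 30 min for repair, spend 10 min reloading the checkpoint,
# all on top of the hour of work itself.
stopped_s = outage_steps * step_time_s + (30 + 30 + 10) * 60

print(f"degraded run: {degraded_s / 60:.0f} min")  # ~69 min
print(f"full stop:    {stopped_s / 60:.0f} min")   # 130 min
```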
Pariseau: Yeah.
Raghavan: And so we did this thing called multiplanar over RoCE, which, again, was something we had known how to operate for several years.

So, again, a lot of innovation is coming in the infrastructure space. We talk about all of these AI agents, the agentic infrastructure, the new models that are coming on, and so on and so forth. That demand for that different type of application, that different type of usage, is driving a lot of innovation at the infrastructure level, and we at Oracle are very proud to be at the forefront of that innovation.
Pariseau: Yep, what's old is new again, for sure, with infrastructure.
Raghavan: Absolutely.
Pariseau: So, on that topic, what is zettascale? And what's new with Zettascale10 and the AMD zettascale supercluster that were announced today at AI World?
Raghavan: Absolutely, yeah. AI -- it's all about scale. You've spoken about it so many times, right? We just spoke about Zettascale, launched less than a year ago, and now comes Zettascale10. As the name suggests, it's 10x -- not quite 10x, but almost there -- in the number of GPUs that are expected to run as one cluster. That is Zettascale10, pretty much.

The number of machines that need to interact to run one thing -- one job, one training workload, one new AI model -- is now 8x the largest supercluster that we have in the world today. That's the way to think about it.

And that's what Zettascale10 is allowing us to talk about. And you know, these numbers may sound very high, but this is the truth of what we are building today -- not tomorrow. One data center with 1.5 gigawatts. This entire data center within a 1.5-kilometer campus radius. People didn't even know what a zettaFLOP was, and we're talking about 16 zettaFLOPS. Eight hundred thousand GPUs in one cluster. And with Oracle Acceleron, we are talking about exabits per second of RDMA throughput for our back end, with latency guarantees of less than 10 microseconds.

All of these numbers have never been heard of together. Yes, a network can give you 10 microseconds of latency. Yes, you can get a 1.5-gigawatt data center in a much larger space. But not the whole thing working as one supercluster. That's what Zettascale10 is.
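Those headline figures hang together arithmetically: Dividing the quoted aggregate by the GPU count yields a per-accelerator number consistent with modern GPUs' low-precision peaks (a rough check only, with the usual precision caveats):

```python
# Sanity check of the Zettascale10 headline numbers.
total_flops = 16e21   # 16 zettaFLOPS, the quoted aggregate
gpus = 800_000        # quoted cluster size

per_gpu = total_flops / gpus
print(f"{per_gpu / 1e15:.0f} petaFLOPS per GPU")  # -> 20 petaFLOPS per GPU
```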
And to tell you that this is not something we are launching in the future -- we are already working on it -- is truly something that we at Oracle are super proud of. We're already building the things we are announcing to the world as part of AI World.
Pariseau: Wow. So, it's nearly impossible to make any kind of prediction anymore in this space, but I'm gonna ask anyway. Does this just continue with needing more and more scale? Do things start to take better advantage of lower-powered hardware? How do you see this evolving, maybe over the next year, say?
Raghavan: So, look. Like I said, AI workloads are empowering innovation in infrastructure -- power, cooling. Oh, and part of your question was also on the AMD side, which I did not talk about. We are trying, as a cloud service provider, to bring more players into the ecosystem. The demand is so high that Nvidia alone cannot satisfy it.
And every chip provider has their own unique view into what kinds of workloads their chips are good at. And there are enough variations in AI workloads that it is OK for multiple chip providers to come into the same pool of demand. This is why we launched the AMD 50,000-GPU supercluster, which was also announced -- again, a first of its kind for AMD as a provider.
Right now, you are asking, 'Hey, what is going to happen in the next year?' For sure, Zettascale10 is going to be a reality. And more importantly, we're gonna try to build more of these kinds of superclusters in the rest of the world, and in the United States, more efficiently and a lot more templatized.
You know, when you build something for the first time, everybody is learning -- from the construction people figuring out how to build this dense compute space, to your power suppliers, to your cooling suppliers, to the chip makers -- everybody is learning how to build things at this scale. When you go and do a second and a third, the supply chain optimizes very, very quickly, especially when you're building at this scale.

So, I am envisioning for us a lot of supply chain efficiencies and optimizations -- construction and infrastructure optimizations, and then firmware and network optimizations on top. All of this leads to the highest level of AI workloads being run more efficiently, and thereby more sustainably.
That's my view of things. We are seeing demand, and with our RPO, the demand is going to exist for more than one year, for sure.

Now, as we close the gap between supply and demand -- and demand is still far outpacing supply -- we are also hoping the people driving this demand get more efficient at running their workloads and achieving their final business outcomes. Everybody is talking about it.

And so, naturally, we can then provide for a wider variety of demand, and not just the five top companies that train these AI models and use, you know, hundreds of thousands of GPUs all at the same time. We can go for a wider variety because these folks will know how to use the infra they have a little more efficiently. This is our hope.

We are also seeing no reduction in demand from anywhere in the world, not just the U.S., and that's also very fascinating. While the U.S. is at the forefront, we are seeing demand from Europe and from APAC almost equally for this AI infra. Again, I spoke a little bit about security and compliance. All of these different nations want AI, but in their own countries. Data compliance and data residency requirements require you to build the same AI infrastructure locally. Again, yet another type of scale.

What works in the United States may not work physically in another country. How do you handle this variance? How do you build at the same pace across those different boundaries? What are the optimizations you can do? Again, it's innovation waiting to be discovered because of the demand and the scale at which this demand is growing.
Pariseau: Yeah, there's a lot for you to think about. I appreciate you sharing some of your insights from that work with us, and thank you very much for taking the time out.
Raghavan: Thank you again for having me, and I hope you have a good rest of your day.
Pariseau: You too. Thank you.
Thank you for tuning in to IT Ops Query. To learn more about enterprise software development and platform engineering, explore our content on Informa TechTarget sites. Find us on YouTube at our channel, Eye on Tech. Subscribe to our podcast to receive the latest episodes as they drop. And if you liked what you heard today, give us a rating and review on Apple, Spotify or wherever you're listening. Thank you for joining us.