Oracle TV CloudWorld 2024: Scaling Up Your Cloud with Oracle Superclusters

Published: Sep 11, 2024 Duration: 00:23:14 Category: Science & Technology

Thank you, Shaa. And now, to dive into the concept of scaling out with the Oracle Cloud, we are here with Mahesh Thiagarajan, Executive Vice President of Oracle Cloud Infrastructure. How are you, Mahesh?

I'm doing great.

Good, it's great to have you here. We're super excited to have you here. It was a great keynote by Clay, and we're talking a lot about this idea of scalability: scaling out, scaling up, scaling down. We are here to talk with you about scaling up with superclusters. Are you in for that?

I'm excited.

Okay. We're also going to talk about security in a minute, but I wanted to interrupt, because last year you and I had a little chat about security.

That's right.

And now it's the topic.

It is the topic. I think I was just seeding you with a bunch of ideas, and Larry talked about it this year.

That's right. You're ahead of your time, is what I'm trying to say.

Exactly.

Okay, let's get back to superclusters. I wanted to start from a super high level. Get it? I was so proud of myself; I wrote that question.

And I'm having to pretend to laugh at it. I know, I did that just now.

Oh yeah, I'm used to it, don't worry. What is a supercluster?

Look, when you think about the innovation that's happening in AI, be it at the hardware level or the data center level, what Oracle is trying to do is bring all of that technology together so that AI innovation happens at a much more rapid pace. Well, how do you do something at a much more rapid pace? Take, for example, a training job that needs 1,000 GPUs and about three months to do the work. What can we do to make that faster? What Oracle is doing is bringing this concept of superclusters to the next level, and what we're announcing today is this notion of 130,000-plus-GPU superclusters. So these jobs that take so long to do, because when you're running a training job it takes months to
complete, right? With the power of a supercluster, which is thousands of GPUs, hundreds of thousands of GPUs, interconnected, you reduce the time it takes to train and really realize value for our customers. So a supercluster is hundreds of thousands of GPUs put together, clustered for maximum performance, so at the end of the day customers can realize value faster.

Love that. I want to understand, because Kendall and I went to the Salt Lake City data center, and we were in that supercluster. One of the things we were talking to Clay about is that he would say, well, we've been building clusters for a long time, right? Exadata. So to build a supercluster is kind of lots of clusters together. But you say that and it sounds like we just connect them like Legos. How hard does it get when you go from, what did we see, 23,000 when we were there, maybe it was even more than that, to now, where you're talking about 131,072 GPUs? Is it linear or exponential?

No, there is a lot of complexity when it comes to actually powering these things together. You saw that in Salt Lake City; that's only about 23,000.

Yeah, right, only 23,000.

And that's a thing of the past. The hard engineering challenges first come from the networking dimension. When you think about 131,000 GPUs working together in unison, even if one or two GPUs are slow, the whole job slows down. So when you think about operating these GPUs together, your network has to be extremely performant. To give you an idea, we're going to have up to 104 petabits per second of network throughput for these GPUs to talk to each other and operate at extreme performance, so customers can realize every single ounce of capability that exists on these GPUs. And that's really hard. The second challenge comes from cooling them. You have these GPUs that are running so hot. I mean, there's so much
noise in Salt Lake City, right?

Yeah, yeah.

You've got to cool them. So NVIDIA and us are partnering together on the liquid cooling technologies that allow us to cool these GPUs so they can actually run at peak performance. That's the second challenge. The third probably comes from managing them and repairing them at scale. I joke about this with some customers: when humans walk around GPU clusters, performance drops.

So when you walk near it... Kendall, I was very worried about my performance, clicking and clacking in my...

But the good news is you walked around clusters that were not live.

There you go.

But we actually manage these like they're extremely finicky objects, because every single bit of performance matters. So there are a lot of challenges that come with building these clusters, for sure.

The next time I visit, when we have liquid cooling, can we put color in there, so it's like blue?

No, no, that is literally how performance drops. I'm not even going to let you near the cluster. What are you talking about, color?

I used to build my own computers, and you could add the colors to the...

No, not the same thing. No, please, never. Don't do that.

All right, don't let me anywhere near the data center.

Just stay here. You're good here.

But so why the need now? What are some of the customer challenges that we're concerned about, that we would need superclusters at all, let alone 131,000?

Look, if you think about the innovation that is happening in the AI, or model, or LLM industry, there's tons of new innovation happening for those models. That's one. Two, from a training perspective, the industry is moving on to pictures and videos. Things are extremely large scale; we've gone from text to all these advanced things. And obviously we're also going into the code assist and developer portfolio as well. So things are getting more powerful. More innovation on the models means
more power from the hardware, more data centers, more network, more storage. And then it all has to come together, because it only takes one GPU to be bad for this whole 130,000-GPU cluster to not do well.

Right. And then you've got agents now.

Oh yeah. Look, I'm talking about the infrastructure. I'm the engine guy: I build the engines for the cars, but people talk about the cool things. So I'll tell you, what this does is allow these models to be trained, and these large inference customers and large training customers love that. But then these LLMs now power our generative AI services, and enterprise customers now get these generative AI services so they can integrate them into their apps. Then you go one level up: all of the announcements with intelligent agents and everything that Steve Miranda announced this morning is powered by the work that we're doing underneath. Now our Fusion apps get better, your NetSuite apps get better, your emergency response applications in the industry vertical apps get better, risk management in our risk solutions from industry vertical apps, and project management, all of that gets better. So if you're at the infrastructure level innovating on LLMs, you get tremendous benefit. If you're an enterprise looking to leverage the LLMs and customize them, you have products and you see value. And you go all the way up to our industry apps and our Fusion apps. There's value across the board.

Yeah, absolutely. Let me ask one more question about the building of these superclusters. You talked about the networking and the cooling. If we jump over to a little bit more on the networking side, sort of the secret sauce behind the superclusters, can you talk about how it's architected that way?

Oh, for sure. We've been pioneers in RDMA technology for a very long time, because that's something that Exadata actually has
had for a long time. So we're actually one of the early pioneers, one of the technology companies that bet on RDMA technology. We brought that into our database and turbocharged it to be what Exadata is today, powering the world. We took the benefits of doing RDMA and brought that onto our high-performance computing environment, and I think we did that six years ago. Now you fast forward to what we're doing with these GPUs and these AI training models: we've been doing this for a long time. So our ability to keep scaling this up and giving the best performance is a lot of hard work, but it's something we've been doing for decades. It was not new for us. It was just: carry on the advantage and make sure customers realize the value of the hardware. And you can't do that without the networking.

Yep. You've kind of already hit on this, but we have time, so I want to dive in a little bit deeper. What advantages do superclusters offer over traditional deployment architectures for customer workloads?

Yeah, it fundamentally comes down to the performance of them working in unison. Think of it this way: you take a piece of work, you say we have to complete this task, and you give it to 100 GPUs. They all go and say, "I'm done with my part, let's all come together and say we're done with our work, let's pick the next job." But imagine these things are not performant and the last GPU is still doing its work while the others are waiting. I'm paying, I'm using the power. If they don't come back in unison, that actually becomes a problem. And that's where a lot of the innovation we're doing is, to actually eke out every bit of power and use them effectively.

Yeah. So what is Oracle's differentiator in this space in comparison to other cloud providers?

It's on multiple dimensions. One is the scale of it.
Second is definitely our networking technology, our ability to offer that peak performance. Third, it's not just the compute and the network; the storage has to go hand in hand. All of these computers are hitting a storage server to get the next bit of work, they're getting the latest bit of data, and once they're done with the data they need to write it back. Imagine 131,000 GPUs doing all of this work and putting it back. So we are offering advanced storage capabilities. Fourth, when you have such a large thing operating, the ability to maintain and manage it is hard. You need to be able to observe the power utilization, how the GPU is performing, and the storage performance. You need to check how the clients are behaving and that the workers are picking up the work at the right time. So there's a ton of observability work that we're doing. If you really think about our differentiators, it's predominantly the network, our ability to offer high-performance storage, and cluster scale. And lastly, it's also the ability to execute. When a customer comes in and says, "I need to train my model, I've got about four months, I need to realize this value," Oracle has the ability to deliver 10,000 GPUs every 10 days from the time power is available.

Wow.

And that requires a tremendous amount of optimization. We use our own software for project management, our industry vertical apps and Fusion, for preparing our project schedules. But we also execute, which is a secret sauce that is invisible. It's easy to say, but to light up 10,000 GPUs and walk away from the data center in 10 days is something that we do as well.

Yeah. And then when you get to 131,000, sorry, 131,072... Seventy-two, okay. I was close; I just added a zero.

Let's talk about security, but actually let's broaden it out, because I think
sometimes this notion of sovereignty gets commingled with security, data residency, and privacy. These are each different; they are separate pillars. But let's start with security, and specifically sticking with superclusters for a moment: what security features are embedded into OCI to protect sensitive data and applications?

Look, we've had a built-in security model, a security-first approach. Larry talks about it every single time. There's nothing that goes out of OCI that doesn't have thorough security reviews and deep testing from a security perspective. But it goes back to the bread and butter and how we started. We started out building bare metal computers, and we're the only cloud today that offers bare metal GPUs. What that does is enable customers to get them, get the data, and operate on them securely and at the highest performance. So our bare metal computers benefit our AI customers. Then you step back to everything that OCI does: our physical security, all of the products, authentication, authorization, all of the advantages around Cloud Guard. They're all built into OCI. If you're an AI customer, it doesn't matter; you get all of those features for free. It's built in.

Another cool thing that we do: when customers buy these large clusters, 3,000 or 10,000 GPUs, they've got their research team in a corner and they've got their production data. They're not building separate clusters. They're building a giant cluster, because they want the flexibility to take those GPUs if they want to finish a job faster and say, "Researchers, wait for a week, we want this job to run." What Oracle does is give you the ability to isolate clusters that are connected together. So you can say, "I have a 3,000-GPU cluster, but for a week I want 256 GPUs together in unison, and I want the rest in a
different environment, and I want the security of that to be different from this." Those are built-in capabilities. We don't talk about it a lot, but it's built into our RDMA network philosophy in terms of isolation and security primitives. So everything that we do on Oracle and OCI is available for AI, and there are a couple of special secret sauces we've got as well.

Yeah. What else are we doing to support customers who have specific sovereignty requirements or other super complex regulations that they have to comply with?

Look, it starts with our fundamental strategy around our deployment choices. We have customers who operate on our public cloud, our government cloud, our sovereign cloud, isolated regions, dedicated regions, or Alloy. These are essentially various deployment choices, so customers can get Oracle technology wherever they are. When it comes to sovereignty, our philosophy is: we know data is distributed. It's in multicloud, it's in Oracle Cloud, it's on premises, on Exadata; it could be in another place. We believe that AI should sit where data sits. So with all of our deployment choices, we bring our AI infrastructure and AI clustering technology to our sovereign customers as well. Two examples of customers who are super excited to bring this innovation: one is NRI. We're building out AI clusters for NRI in their DRCC and Alloy regions.

Wow.

And Etisalat is another big customer of ours; we're building out GPU clusters for them in their dedicated region. NRI is interested in building a custom LLM for themselves. Etisalat is interested in using our generative AI services and integrating them with their enterprise apps. So everything that we talked about, the value is not just in our public cloud; it
is for all of our customers. All of our deployment-choice customers get AI too.

Wow. I was talking just a few minutes ago with Palantir about their choice of OCI, and it's kind of the same thing: a common set of functionality no matter the deployment model.

That's right.

So you can count on that whether you're on Alloy or you're running in a public cloud that has a supercluster of 20,000 GPUs. So let's talk about some of these examples and how it's allowing them to serve their customer base. NRI is a good one, right? Our customer's customer is the financial institutions of Japan.

That's correct.

And so they're building an AI platform of SaaS applications for their customers.

Correct. It goes back to the models and what NRI is trying to do. NRI is a leading application provider, an integrator, and an infrastructure player in the Japanese market, and their stock exchange is running on top of NRI. So being able to serve a wide variety of enterprise apps and financial vertical apps, and bringing AI to them, is pretty cool. Oracle is an innovator and a partner that allows NRI to thrive in the Japanese market. And it's something that Safra says: we like to be in the shadow. We want our customers to win.

Exactly.

We're always in the shadow. We know that a significant chunk of our 1,400, 14,000 companies run on Oracle technology, and we bring the technology to them. We want to be in the shadow, and we want them and their customers to be successful. That's what we do with Alloy.

I think we overuse the term ecosystem, but it's that ability to allow that ecosystem to expand with our technology behind the scenes. I do want to touch on something Larry talked about on stage, which is kind of the next
wave of cloud security.

That's right.

And he talked specifically about ZPR, Zero Trust Packet Routing. We didn't talk about that last year, you and I.

We did, but we didn't name it. What we did is announce the concept, the inception of the idea: what if we actually changed the paradigm around network and data security? There are two problems. If you think about how internet protocols evolved over time, they had a vision that the internet is focused on plumbing, and all of the security and encryption goes all the way up at the top, and I'll do some ACLs at the bottom. But why is it that way? Nobody challenged it for decades. Larry was like, "Why should it be that way?" We partnered with Applied Invention, and we basically said: what if we do security at the network layer, while allowing applications to evolve, but the network still understands the fact that you told it not to allow communications to happen a certain way?

So the simplest example: imagine you have a computer and you said, "I only want to talk in this language," which is, say, port 80. That's what you did on day one. You allowed more application changes to happen, things were evolving, and then over time a network engineer went and said, "I want to open up another port." But now your computer and everything that you did is open on the internet.

Yes.

But with ZPR, what you can say at the start is, "Go ahead, open port 80, whatever," but you tell our compiler and our policy engine, or network controller, that this computer can never talk to the other network in my own [unclear], that this computer should never be able to talk to the internet. You can make any changes inside. The intent that you described as part of the policy is built deep inside our network, and it
knows that it cannot allow that traffic. So even if someone opened a port, the packet will start traversing and the network will say, "I'm going to this other part of the network. Sorry, I was told that I'm not supposed to allow it." So even if you open a port or make any changes in your network, the cloud protects you with the ZPR technology.

Just to probe a little further on that, for anybody who maybe missed Larry's keynote or hasn't really heard much about this: why are we challenging this right now? Like you said, it has been this way for decades. Why was Larry like, "Let's do this now"?

Partially, I think it comes from us innovating further on our off-box network virtualization, on our network controller technology. If you go back to when we launched bare metal, we separated the physical computer from our software, which runs outside it. We had that, and we've been doing tons of work in that area to make it even better. Then we said, look, we've got the smart network controller that's sitting outside, in every box in our cloud. What could we make better? If you were to go one level up on security, what would that idea be? So it started from taking advantage of our network controller, our off-box computer, and asking how we go to the next level. Partially it also comes from the fact that we're seeing tons of new types of attacks outside. As that continuously rises, you go, "We've got to change the game." If whatever we do today were sufficient, we wouldn't be seeing this. So that's where we said, no, we've got to double down even more. How do we take it to the next level?

Well, speaking of taking it to the next level, there have been some amazing innovations that we've been talking about all week. But what is next for OCI? What can our viewers, customers, and partners expect over the next, let's call it, a year or so?

Look, I think there's
going to be innovation all around, but the one thing that I'm most excited about is that we have 162 data centers and counting. And Larry's vision of having every enterprise, every customer, every school, every hospital have a data center: we're on the way. That's what I think you should expect from OCI.

Wow, that's awesome. Thank you, as always, for joining us here on Oracle TV, Mahesh.

Thank you so much. It was a pleasure. Thank you.
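
The point made in the interview about one slow GPU holding up the whole job can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of how OCI schedules work: it assumes every GPU must finish its share before the next step starts, and the function names and timings are hypothetical.

```python
def step_time(worker_times):
    """Duration of one synchronous step: everyone waits for the slowest worker."""
    return max(worker_times)


def idle_fraction(worker_times):
    """Share of total GPU time spent waiting on the slowest worker."""
    slowest = max(worker_times)
    busy = sum(worker_times)
    return 1 - busy / (slowest * len(worker_times))


# Hypothetical timings in seconds: 100 GPUs, as in the interview's example.
healthy = [1.0] * 100
one_straggler = [1.0] * 99 + [2.0]

print(step_time(healthy), step_time(one_straggler))
print(idle_fraction(healthy), idle_fraction(one_straggler))
```

In this model a single GPU running at half speed doubles the step time and leaves roughly half of the cluster's paid-for GPU time idle, which is why the interview stresses network performance and per-GPU observability.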
