Training the largest LLMs, Cerebras Wafer-Scale Architecture | Keynote 3 | Jean-Philippe Fricker

Published: May 06, 2024 Duration: 00:31:45 Category: Science & Technology

Thank you, thank you so much, and thank you for attending this session. I know you've had maybe four days of hard labor; I'm going to try to show you how that labor could be somewhat simplified with what we've done at Cerebras. My name is Jean-Philippe Fricker, and as you know, when you go to the US you need to shorten things, so I go by JP — they even dropped my last name. I actually went to school at EPFL, so I'm happy to be in this venue, and thank you so much for giving me the opportunity to talk to you tonight.

Cerebras Systems — who are we? We founded Cerebras with four co-founders in 2016 and went on to build and deploy a new class of computer systems, designed for the purpose of accelerating AI and changing the work of AI. You'll see how we do that, and I'm going to explain a little bit of what we had as challenges. Since 2016, I can tell you, we have grown: we are now about 350 people, with offices in Silicon Valley, San Diego, Toronto, and Tokyo, and we have customers across North America, Asia, and Europe. It's been quite an adventure.

About five years ago we saw something like this. You have probably seen this graph many times, but it's a key one to understand: there is an insatiable, unprecedented demand for more compute. In five years we saw that we needed 40,000 times more compute. You cannot get there by just following Moore's law and the traditional methods one would use to design chips, so we had to embark on something different. And when you see this trend, you wonder how we're going to solve the next five years. We believe we are on a good path for that, and let me show you a little bit of it.

We made a very big bet. We asked: why would you want to make GPUs on a large silicon wafer, then cut them, package them, put them into systems, interconnect those systems together, and take on all the overhead that comes with it and the hard problem of communicating between them? We said no — keep them on the same wafer, and put about 4 trillion transistors on it. It's about 46,000 square millimeters, and yes, it really is that size. We don't cut it: it's one single processor with many cores — 900,000 of them — optimized for sparse linear algebra. It's made by TSMC on a 5-nanometer process. It has about 125 petaflops of AI compute capacity, a lot of on-chip memory with 21 petabytes per second of memory bandwidth — beyond anything done before — and a very high-speed, low-latency on-chip fabric of 214 petabits per second. This is the advantage: you take all the cores, almost a million of them, and connect them with very short interconnections on the same chip; you never have to go in and out of a chip.

So let's look at how we made this happen. The first question everyone asks: you have a large wafer, it's never perfect, you have defects — how do you deal with that? It's impossible to yield a full wafer with zero defects; everyone knows that. Today, what you do is test the individual chips: the ones that are bad, you mark them, cut them, and throw them away; the ones that are not so good, you bin them and sell them at a lower price. What we did was different. We understood where the defects were and created an architecture that can deal with them. We made it agnostic to defects by making it programmable, such that if there is a defect you can work around it: you have enough bandwidth on the chip and enough connectivity between the cores that if a link or a core doesn't work, you just route around it. You need to be able to detect this while bringing up a wafer: you test it, detect the defects, and configure the wafer to go around them, and then you can expose to the software layers a pristine wafer that has no defects. The same thing is done in memory — in the SRAM or DRAM in your computers today, and even the flash in your cell phone uses a similar concept. We applied it to compute as well.
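To make the routing idea concrete, here is a minimal sketch of the remapping concept described above — exposing a "pristine" logical grid of cores over a physical grid with known defects. The names and the column-level granularity are my hypothetical illustration, not Cerebras's actual bring-up software; the talk only states that spare connectivity lets the configured wafer route around bad cores.

```python
# Hypothetical illustration of defect-agnostic core mapping: given a
# physical grid of cores where some columns are marked defective at
# wafer bring-up, expose a slightly smaller "pristine" logical grid by
# remapping each logical column to the next healthy physical column.

def build_column_map(num_physical_cols: int, defective_cols: set[int],
                     num_logical_cols: int) -> list[int]:
    """Map logical column -> physical column, skipping defective ones."""
    healthy = [c for c in range(num_physical_cols) if c not in defective_cols]
    if len(healthy) < num_logical_cols:
        raise ValueError("not enough spare columns to hide all defects")
    return healthy[:num_logical_cols]

# Example: 10 physical columns, 2 defects, expose 8 logical columns.
col_map = build_column_map(10, defective_cols={3, 7}, num_logical_cols=8)
print(col_map)  # [0, 1, 2, 4, 5, 6, 8, 9] -- software only ever sees 0..7
```

The point of the abstraction is the last line: the software layers above only ever see the defect-free logical grid.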
Now, how do we make that? The industry is used to the usual process of repeating the same chip across a wafer, cutting the chips apart, and packaging them separately. Where we had to convince TSMC that we could do something different was in creating interconnections between the various reticles that are lithographically printed on a wafer. Convincing them that we could have interconnections between those reticles is technologically not that hard, but from a process perspective — the logistics of getting the chip made — it was pretty hard, and we had to co-design this with the engineers in their fab, working closely with our engineers.

That was not the entire problem. Once you have a very large chip, the chip itself, made of silicon, has a different coefficient of thermal expansion than the rest of the package. The package is typically a printed circuit board with a high coefficient of thermal expansion, while the silicon has a very low one. If you solder or bond those together, everything cracks — and I can tell you we cracked many wafers; we had a lot of mosaics in our labs made of wafers that were completely broken. To compensate for that, we had to invent a connector — one that allows us not only to carry signals to communicate with the wafer, but also to deliver power. No one had such a connector; we had to invent it, and we made it so that it compensates for the differing thermal expansion of the two devices while maintaining connectivity. This is very hard to do.
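To see why the mismatch matters at this scale, here is a rough back-of-the-envelope calculation — my illustrative numbers, not the speaker's. Linear thermal expansion follows \( \Delta L = \alpha L \Delta T \), with handbook values of roughly \( \alpha_{\mathrm{Si}} \approx 2.6\,\mathrm{ppm/K} \) for silicon versus \( \alpha_{\mathrm{PCB}} \approx 17\,\mathrm{ppm/K} \) for FR-4 board material, across the roughly 215 mm wafer edge and an assumed 50 K temperature swing:

\[
\Delta L_{\mathrm{PCB}} - \Delta L_{\mathrm{Si}}
= (\alpha_{\mathrm{PCB}} - \alpha_{\mathrm{Si}})\, L\, \Delta T
\approx (17 - 2.6)\times 10^{-6}\,\mathrm{K}^{-1} \times 0.215\,\mathrm{m} \times 50\,\mathrm{K}
\approx 155\,\mu\mathrm{m}
\]

A differential shift on the order of 150 µm is far more than a rigid solder joint can absorb, which is why a compliant connector was needed instead of bonding.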
In addition, it is made harder because we need to carry a lot of power to the chip — about 20 kilowatts to one chip. For those wondering what that means: it's usually more than your house's worth of power going into a single chip. To power it, we had to deliver power in a very different way. On a typical CPU or GPU in your servers today, all the power converters surround the chip — they have the space — and the printed circuit board carries the current to the chip itself. Because the chip is small, you can reach its center without too much trouble. In our case, the chip is so large that we could not come in from the sides and reach the center, so we had to move all those power converters from around the chip to the bottom of it, and come in from the other side. In the bottom section of the picture you see here, you can see how we had to change the power delivery, because today it is done from the side.

Airflow poses the same problem. You cannot just blow air across the top of the entire chip: by the time it reaches the other end of the chip, it would be too hot. You cannot do that with air, and you cannot do that with water. So what we did is use the third dimension: we moved the power converters from the perimeter of the chip to the bottom of it, and we moved the airflow and water cooling from a laminar, side-to-side crossing of the chip to something that leverages the third dimension. Both for power delivery and for cooling, we use that third dimension. This was not trivial — no one had this available; no one had a power converter dense enough to provide the current we needed. For those who know about power: I told you it's about 20 kilowatts per wafer at slightly less than 1 volt, which is about 44,000 amps that you need to carry through that connector. This is unbelievably high current density, and no one had the capacity to do it, so we had to invent everything. It took us a while — about three years — but we got it made.
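The current figure follows from simple arithmetic — illustrative numbers of mine, taking the speaker's "slightly less than 1 volt" as roughly 0.9 V:

\[
I = \frac{P}{V} \approx \frac{20{,}000\,\mathrm{W}}{0.9\,\mathrm{V}} \approx 22{,}000\,\mathrm{A}
\]

That is the steady-state draw alone; my reading is that the gap to the roughly 44,000 A the speaker quotes covers peak load and design margin, and either way it explains why conventional edge delivery through a PCB was out of the question.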
Next: you can fit a very large problem on a wafer like this — with about a million cores and a lot of memory, you can fit quite a bit. However, you need to be able to scale, and at some point the problem no longer fits on the chip. So we had to create a cluster, and to create a cluster and put multiple of these chips together, we had to partition the system slightly differently than people typically do: we had to decouple the memory from the compute. We disaggregate those two things so we can scale model size and training speed independently, and we do that by streaming weights and gradients in and out of the cluster of our chips. We start with the chip itself; we package it into a chassis so we can power it and cool it; then we attach external memory to that system, and this allows us to train up to 24 trillion parameters on a single chip packaged in a CS-3. Then you can scale even further: we use a special interconnect that allows us to do a broadcast-reduce of the weights and gradients while still accessing the external memory, and we can scale almost linearly up to 2,048 nodes. We can go up to that number because we actually have people asking us to build systems of that size.

Now let's see what this means. It means you can run data-parallel training on all those CS-3s: the weights are broadcast to all CS-3s, and the gradients are reduced on the way back, and you can achieve multi-system scaling with the same execution model as a single system. Programming the entire cluster is as simple as programming one: same system architecture, same network execution flow, and same software user interface.

This is key if you think about cutting experimentation time. If you look at an NVIDIA GPT 175-billion-parameter model, it took about 20,000 lines of code to create that model — you need to do a lot of things to make it happen. I'm sure some of you stayed here the entire Sunday trying to do something like this at a different scale, so you know how hard it can be to make these large language models work, especially with thousands of GPUs. In our case, because the cluster is seen as a single unit, we were able to simplify the abstraction such that you only need to program in Python, in 565 lines of code. If you apply the same ratio, what might have taken you three hours on Sunday you could have done in five minutes and been done with it. And this is just the programming side — I'm not even talking about how fast it executes yet.
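As a rough sketch of the execution model described above — weights held in disaggregated external memory, broadcast to every system, with per-system gradients reduced on the way back — here is a hypothetical, self-contained Python simulation of the idea. Everything here (the variable names, the four "systems" as array shards, linear regression standing in for a network) is my illustration under stated assumptions, not Cerebras's actual software, which the talk says is exposed through ordinary Python/PyTorch.

```python
import numpy as np

# Simplified simulation of weight streaming: parameters live off-wafer in
# "external memory", are broadcast to every system (here: every
# data-parallel shard), and per-shard gradients are reduced before the
# update is applied back in external memory.

rng = np.random.default_rng(0)
external_weights = np.zeros(4)              # disaggregated parameter store
X = rng.normal(size=(64, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])     # synthetic targets

def shard_gradient(w, Xs, ys):
    """Gradient of the MSE loss, computed locally on one system's shard."""
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

for step in range(200):
    shards = zip(np.array_split(X, 4), np.array_split(y, 4))
    w = external_weights.copy()             # "broadcast" to all 4 systems
    grads = [shard_gradient(w, Xs, ys) for Xs, ys in shards]
    reduced = np.mean(grads, axis=0)        # "reduce" on the way back
    external_weights -= 0.1 * reduced       # update applied off-wafer

print(np.round(external_weights, 3))        # converges to ~[1. -2. 0.5 3.]
```

The design point this mimics is that compute nodes stay stateless with respect to the parameters: because the weights never permanently live on any one system, adding systems changes throughput but not the program.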
All right, so putting this all together: what we've been able to do is create a very large chip, which we package into a chassis — you can see it here from the front, so that we can make it good-looking, and you can see a little bit of the inside. Most of it is actually power delivery and cooling; it's water-cooled, and part of it is also air-cooled — anyway, I don't want to go into too many details here. We first deployed it in our own data center in Silicon Valley. Here is another, larger example — this was Condor Galaxy 1, also in Silicon Valley — and then we built a 64-node Condor Galaxy 2 in Stockton, which is actually on a barge on a river, so we can take advantage of the river water for cooling. It's a pretty cool installation. We keep building these and getting better and better at it, to the point where we are now starting, with our partner G42, to create an 8-exaflop (FP16) supercomputer. It has 58 million AI cores across 64 interconnected systems — a monster supercomputer — and we are deploying it in Dallas, Texas. So far we have completed two Condor Galaxies, one and two, in Santa Clara and Stockton, with the next one in Dallas, and there are more to come — I can tell you I've been quite busy trying to get the other ones further along.

So what do we do with these? Here's a good example: together with G42 we trained a world-leading Arabic LLM. There are 400 million people who speak Arabic and had no model trained for them, so we were able to do that for them, and it has been adopted by Microsoft as a core LLM offering in the Middle East; it's available on Azure. That's just one example. Another is our partnership with Mayo Clinic, where we do a lot of work to help them use the fabulous dataset they have. They have a lot of data, and they finally got to the point where the data can be used in a secure and controlled manner, so we can develop and enhance healthcare tools that rely on large language models. This is a fantastic opportunity, and I hope that in Europe we might have similar endeavors where we can make this type of tool and infrastructure available to you — most of you here could actually benefit from it. There are also people in the energy sector, which is a big part of GDP if you think about it, who started paying attention. They tried it on some of their problems and got quite a speedup — not one but two orders of magnitude — which I think is impressive; they were impressed too, and also extremely pleased by the ease of use and ease of scalability of this type of solution. Another customer has been borrowing 48 of our systems to run experimental workloads, and they found it outperformed 37,000 GPUs: they ran a workload where they were able to outperform the number one fastest supercomputer in the world. This is eye-opening for some people, especially when you can achieve it while investing only about 1% of the cost of a traditional supercomputer.

This has been a tremendous opportunity for us to grow the company. We actually have more orders than we can build today, and we have huge growth ahead of us — but there were plenty of challenges in the past too, and a lot of good, talented people were able to join us on this. It's been a fantastic adventure.

I want to conclude here — I don't want to make it too long, because I know we have a nice aperitif after this and I'd like to taste some of what is to come, so I don't want to bore you. As a summary: we want to enable everyone to train very large language models with ease. The models continue to grow in size — exponentially, actually — and very few companies can afford to train the very largest models today; only the five or six biggest can do that. We want to democratize that and enable anyone to get access and train these models. That's where the Cerebras architecture can shine: by making this accessible and not too hard — you don't need to hire 200 engineers to sustain your LLM cluster. That concludes my quick talk; I hope you enjoyed it. There are a couple of QR codes here if you want to learn more, and I'm happy to answer any questions you might have now or after the talk. Thank you so much. [Applause]

Good — so after these dizzying numbers, I think we have to sit down. Please, thank you. And we have the QR code up, so — a lot of things to talk about. First of all, congratulations on what seems like an incredible trajectory, at least from the slides you just showed. We also talked a lot beforehand, and as you said — I should have mentioned this in the introduction as well — you are an EPFL alumnus. When was that?

'95? '97? — '93, maybe. I'm going to stop there. A long time ago.

It's a bit of an obvious question, but — we always talk about this, right? This situation where we say: we teach the students, but frankly the world is changing so fast that in ten years, who knows what they will do. So here we are, some 23 or 24 years later, in which the world has completely changed. Looking back, what do you think about your training?

To me, the training has been essential to get where I am. There wasn't a day where I wasn't thinking: gosh, I remember that class, I remember that professor I never wanted to listen to — see, it's actually useful now. And it's only much later that you realize it. I never liked heat exchange and thermodynamics — I didn't like them at the time — and it turns out they were super useful for getting this off the ground. You never know at the time you learn something; you typically know when you actually use it. And the type of training I got here, I found, was very different from the training of the people I had the opportunity to hire in the US, where the training had been either far more specialized, or too thin and broad — not, I would say, up to the task.
In between, with my training, I was able to span multiple domains. If you think about what we had to do in this particular endeavor: you need thermal engineers, mechanical engineers, electrical engineers, and software engineers — and you need to speak all four languages, because they really do speak different languages. Being able to interact, understand the problem, and facilitate the discussions has been a strength, and it helped a lot.

Shockingly, reality is still based on physics, is what you're saying.

Yes — physics, and very often money. Both. That's another type of energy.

Good, so let's dive right into the questions. You can see from the ranking there's one you probably hear all the time, so let's just get it out of the way: how does this compare — if that's the right word — to NVIDIA? Obviously, as we all know, the bigger context is that computation is the main constraint right now, and NVIDIA is the elephant in the room. What's the dynamic here?

First of all, just to settle the question: we don't publish performance results — that's first. Second, I have an anecdote. I was not directly involved, but I had the opportunity to hear what my co-workers said about a customer that came to us. On a Friday afternoon they came to us, panicked: "We can't train this network, we can't train this." OK, what's happening? "Well, we've been working on it — we have a team of about 35 people, and we've been trying to train this for about a year, and we cannot make it converge." We said: OK, give us your model and a little bit of data. They gave us the model and some data by Sunday afternoon, and by Tuesday we had it trained, and we asked the customer: look, is this the problem you're trying to solve? They were shocked, because they had had thousands of GPUs available for a year and could not make it work. So the question is valid; however, it is often not the performance of a given chip that makes the difference. Very often it's: how do you make them work as a cluster, and how do you keep them working long enough to completely train a given network? Some networks take two or three months to train, and it's hard to maintain an infrastructure where failures happen pretty much every day. So yes, performance is one thing, and for a given workload I can show you numbers like 236 times — I mentioned one in the slides — and we've had some opportunities beyond a thousand times faster; for other workloads it's less, more of a single-digit improvement. It all depends on the workload. But for large language models, I can tell you: we beat them any time, handily.

OK. There are a couple of questions about cost that I want to merge, in the interest of time. Cost in terms of the chip; cost also in terms of energy — what's the energy use of these things? — and then cost in terms of the software framework. People think about buying chips, but there's also the development around them, and NVIDIA obviously has the moat with CUDA. Can you walk us through that?

The cost of the chip we don't disclose, but it is actually quite a bit cheaper to do this, and it all depends on the performance you get in return.
If you get high performance on our chip, suddenly NVIDIA might look extremely expensive compared to us; in some other cases it's somewhere in the vicinity. But we tend to be cheaper in many respects because we outperform them, so you need fewer of their units to match. And the cost of the chip is only a portion of the entire system — it is not the most expensive part; usually the power delivery is what costs the most.

Then the question about software. Eighty percent of the company was, for a long time, software engineers. You might think that doing hardware is hard; doing software is harder, and takes a lot more effort, patience, and smart people. We were lucky that we could attract smart people to help us, and we have a fantastic team that has been able to abstract away the complexity of such a chip, so that when you program it you can use PyTorch — you don't have to use assembly languages or other weird languages to put it together. We have abstracted all the complexity of the cluster, so you can see this entire supercomputer as a single, GPU-like component. But we had to take on that burden ourselves and abstract it at every layer, from testing every individual chip all the way to having a tool that lets you debug it.

So you did it. Last question: you said in the beginning that you made a big bet on this particular architecture, on this future. Can you — probably not fully disclosing — speculate a bit on the bets you're making now for the next five years? Because one is curious: do you expect this to continue, and your architecture to remain the same, or are you making new bets, given that it has worked so well for you?

It feels to me that the bet of having a wafer-scale engine — a wafer that is not singulated, not reconstructed as some might do — was the right choice. At every generation we ask ourselves: is this really what we should do, should we continue this way? And yes, it has been the way to outperform all the other architectures. Going forward, you might say: what do you do now that you've reached the size of a wafer? Well, there are other ways to pack in even more compute that we can think about: how you optimize the amount of memory, how you optimize the I/O, and how you optimize power and cooling for such an architecture. So yes, I see us being able to keep up. It's going to be hard for everyone to do that, but I think we are well positioned.

Well positioned indeed. I said yesterday that one of the biggest compliments people can give us at AMLD is when they come back and say, "The first time I heard about this was at AMLD." I'm sure this is probably one of those cases, because your trajectory is amazing — if we can just extrapolate it a little, it's an absolutely crazy success story in the making.

Yes — especially if you think about the time it took, because we had to convince our investors: stay with us another year, please; we just failed for two years to make anything, so please have confidence in us. They kept going with us, and now, for the last three or four years, the company has grown, as I showed you, from packaging one chassis to deploying multiple data centers across the world. We are growing, and it's a fantastic place to be, where there is such high demand
and we have a product and a solution that is well tailored for that market — so it's awesome.

Well, congratulations, and it's been a huge privilege to have you here. As you said, you'll also be around at the aperitif to discuss further the questions I'm sure you will get. Thank you so much.

Thank you so much — thank you, thank you.
