Is Groq's Reign Over? Cerebras Sets a New Speed Record!

Published: Aug 29, 2024 Duration: 00:10:41 Category: Science & Technology

Let's see how fast this is compared to Groq: 1,800 tokens per second, which is much, much faster than Groq. That is a surprising thing to say. So far, Groq was the gold standard when it comes to inference speed, but that changed a couple of days ago when Cerebras introduced Cerebras Inference. Cerebras is also known for its custom hardware for LLM workloads, but unlike Groq, they have been training their own models, or models for their customers. Their inference endpoint can provide up to 450 tokens per second for the 70-billion-parameter version of Llama 3.1, which they say is 20 times faster than H100 GPUs on hyperscale clouds, at one fifth of the cost.

To showcase their fast inference speed, they released a voice demo that goes roughly like this: "Hi there, how are you doing today?" "I'm good, how are you?" "I'm functioning properly, thanks for asking. I don't really have feelings like humans do." "Can you tell me what is the fastest API endpoint available for LLM inference today?" They connected their inference API endpoint with LiveKit, which you can use to build voice assistants. You can also use my own open-source voice assistant project to do the same.

We're going to have a direct comparison with Groq, but before that, let's look at some technical details. They claim their wafer-scale technology gives you the world's fastest inference: twice the number of tokens per second compared to Groq, or 20 times compared to an H100, and because of that high throughput they are able to achieve a lower cost. But I think the most important part is that they do inference in full 16-bit precision, which is not the case for most other inference API providers, and there is research, which we'll look at later in the video, showing that the precision at which you do inference is extremely important.
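To put those throughput numbers in perspective, here is a quick back-of-the-envelope sketch in plain Python. The figures come from the claims above (450 tokens/s on Cerebras for Llama 3.1 70B, "20x faster than H100s", and Groq's roughly 250 tokens/s measured later in this video); the implied H100 rate is just the stated ratio, not a measurement. It shows why throughput matters for real-time voice: the time to generate a 100-token spoken reply.

```python
def reply_latency_s(n_tokens: int, tokens_per_second: float) -> float:
    """Time (seconds) to generate a reply of n_tokens at a given throughput."""
    return n_tokens / tokens_per_second

# Throughput figures quoted in the video for Llama 3.1 70B.
cerebras_tps = 450          # Cerebras Inference, claimed
groq_tps = 250              # Groq, as measured later in the video
h100_tps = 450 / 20         # implied by "20x faster than H100" (= 22.5)

for name, tps in [("Cerebras", cerebras_tps), ("Groq", groq_tps), ("H100", h100_tps)]:
    print(f"{name}: {reply_latency_s(100, tps):.2f} s for a 100-token reply")
```

At these rates a 100-token reply takes well under half a second on Cerebras or Groq, but several seconds on the implied H100 baseline, which is the difference between a conversational voice assistant and an awkward pause.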
The Cerebras team borrowed this plot from Anyscale. On the x-axis we have output speed (output tokens per second), and on the y-axis we have million tokens per USD. You can see that Cerebras gives you the best inference speed on the market today, and the cost is extremely competitive as well; in fact, it's only 60 cents per million tokens for Llama 3.1 70B. Cerebras also has some very impressive people behind them.

First we will look at the inference speed between Groq and Cerebras Inference, and later in the video I'll show you the accuracy of inference, which is also pretty surprising. In both cases we're going to start with the 8-billion model. On Groq I select Llama 3.1 8B Instant, and the prompt I'm going to use asks it to write the names of all US governors of all states from 1920 until 2024. I want to see the maximum number of tokens both of these models can actually generate. If we try this, we can see we already run into context-limit issues. We are getting 750 tokens per second for the 8-billion model. Something you want to pay close attention to is the number of input tokens, which is 65 in this case, and the output is capped at 2,048 tokens; that's why we don't get the whole list, because the actual list is well beyond 4,000 tokens.

Now we run the same test on Cerebras Inference. We select Llama 3.1 8B and paste the same prompt. Again, I don't think it gives us a complete list, because it already reached the context limit, or the maximum number of tokens they allow, but it gives us an impressive speed of 1,800 tokens per second, which is more than twice what Groq was able to do. In this case it's using 45 input tokens, so I think the prompt templates that Groq and Cerebras are using are very different, because there is a difference of about 20 tokens. It also gives you almost twice the number of output tokens compared to Groq, at least on the chat interface. When it comes to the API, it's a different story: Cerebras limits it to about 8,000 tokens on the free account, and I believe Groq does something very similar.

Next we try the 70-billion version with the same prompt. Again we reach the context limit, and Groq gives us 250 tokens per second, using 65 input tokens compared to the 45 we saw on Cerebras. If we run the same test on Cerebras, I expected very similar results, but in this case, for some reason, it says "what a monumental task you have asked of me", and instead of giving us the whole list it just gives us a few resources where you can find the information. Still, the number of tokens per second it generates is almost twice that of Groq. With a little bit of prompt engineering, just by telling it "yes, you can give me the complete list", we get a list. It's not complete, because it also runs into the maximum context it can generate, but at least we get a list very similar to what we were getting from Groq.

The question is: why would the same model, served on two different platforms, give different results for the same prompt? It can depend on some of the hyperparameters the providers have set, and it can also depend on the quantization level being used, a topic the Cerebras team has tried to address; it's a very important topic if you're putting any LLM in production. They have a blog post titled "Llama 3.1 Model Quality Evaluation", comparing Cerebras-hosted Llama 3.1 with Groq, Together, and Fireworks AI. According to this blog post, not all Llama 3.1 models are created equally. There are a couple of very interesting papers they cite, and these papers show that the quantization level you're using can have severe impacts on the performance you expect from these LLMs.
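To see why serving precision matters, here is a tiny, self-contained sketch, not tied to any particular provider, of symmetric 8-bit quantization: weights are rounded onto 255 integer levels and then dequantized, and the round-trip error per weight (bounded by half the quantization step) is exactly the kind of noise that accumulates across billions of parameters, while 16-bit serving avoids this extra rounding.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to floats; information lost in rounding stays lost."""
    return [x * scale for x in q]

weights = [0.013, -0.402, 0.771, -0.995, 0.0004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

# Worst-case per-weight error is half the quantization step (scale / 2).
print(f"max round-trip error: {max_err:.6f} (bound: {scale / 2:.6f})")
```

The error on any single weight is tiny, but the benchmark gaps in the Cerebras blog post are consistent with these small perturbations compounding through a 70-billion-parameter forward pass.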
So they ran different benchmarks against the APIs from these providers. The plots show Llama 3.1 70B Instruct and 8B Instruct served through different API providers, including Fireworks, Together AI, Groq, and Cerebras. Together AI and Fireworks do use a quantization scheme; Groq is probably also using some sort of quantization, but we don't really know the details. For the same models served at different quantization levels by different providers, you actually see some stark differences. For the 8-billion model, Llama 3.1 8B served through Together AI consistently performs worse than the other API providers, and for some of these benchmarks there is a trend: for example, on the MMLU or the math benchmarks, Together AI and Groq do worse than Cerebras Inference. For code evaluation, or coding-related benchmarks, I think this is even clearer; there is a consistent trend of the other providers performing worse on the same benchmarks, even though they are serving the same models.

They also did an evaluation for multi-turn conversations. For the 70-billion model, Together and Groq do pretty badly on single-turn conversations but better on multi-turn ones; for the 8-billion version it's actually the opposite, so on single-turn this model does pretty well even when served through Groq, Together, or Fireworks, but on multi-turn conversations there is a drastic reduction in performance with Groq, Together, and Fireworks. These results show that the same models, served at different quantization levels and with different inference hyperparameters, can give you very different responses, and hence different results on evaluation metrics. You definitely want to consider these things if you're putting these models in production.

Okay, at the end of the video let's talk about the API. Cerebras also serves these models through an API. You are limited to a context window of 8,000 tokens, because that is what they offer on the free tier. The API follows the same standard as OpenAI's, so it can be a drop-in replacement. Personally, I haven't got access to the API yet, so I won't be able to show you any code examples, but if anybody can hook me up with an API key, that would be highly appreciated. You will have to join the waitlist for this; I have already done so, and if there is anybody from Cerebras watching this video, please reach out.

It's great to see there is competition when it comes to inference speed. Higher inference speeds definitely open up a lot more possibilities, especially for near real-time interactions. Groq was the leader in this space, and hopefully they will come up with a new update. Anyways, I hope you found this video useful. Thanks for watching, and as always, see you in the next one.
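Since the API is described as OpenAI-compatible, here is what a request to it could look like, as a sketch only. I'm constructing the request payload without sending it, because (as noted above) there is no API access yet to test against; the base URL and the model name below are assumptions, not confirmed values, so check the official Cerebras documentation before using them.

```python
import json
import urllib.request

# ASSUMED values: endpoint URL and model name are guesses at the
# OpenAI-compatible interface; verify against the official Cerebras docs.
BASE_URL = "https://api.cerebras.ai/v1/chat/completions"
MODEL = "llama3.1-8b"

def build_request(prompt: str, api_key: str, max_tokens: int = 2048):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # free tier caps context at ~8,000 tokens
        "stream": True,            # stream tokens to measure tokens/sec yourself
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_request("List all US state governors from 1920 to 2024.", "sk-...")
print(req.full_url, json.loads(req.data)["model"])
```

Because the payload shape matches the OpenAI chat-completions format, swapping an existing client over should mostly be a matter of changing the base URL, the model name, and the API key.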
