NEW: OpenAI o1 & o1 Mini vs Claude Sonnet 3.5 🤖🏆 Testing Which Model Is Best (o1-preview - PHD LLM)

Introduction: Testing OpenAI o1 & o1 Mini vs Claude Sonnet 3.5

What's going on everybody, Josh Pocock here, and in today's video we are going to be testing ChatGPT, OpenAI's new o1 and o1-mini, versus Claude Sonnet 3.5. We're going to see which models are the best. These are the top frontier models right now, so let's see how they do against each other. Let's dive right in.

[Music]

Overview of ChatGPT o1, o1 Mini, and Claude Sonnet 3.5

All right guys, so here we have ChatGPT o1-preview. Like I said, we're going to be doing some tests with Claude Sonnet and then ChatGPT o1-mini as well. Before we get into it, I do want to say that if you haven't been caught up on o1, o1-mini, and the new paradigm of these models, check out the video down below that I did a couple of days ago. And if you want to see me use some of these models in Cursor, I did a video yesterday you can check out. I'm going to do more as well, so make sure to stay tuned.

API Access & Model Availability: How to Use o1 and o1 Mini

Long story short, to actually use these models through the API right now you pretty much need tier five. If you don't have tier five, there are some ways you can access them too. There are probably other platforms, but I do know OpenRouter has access to the preview as well as the mini, and if you are on Cursor you can access the mini as well, and maybe the preview, I just don't know 100%. Those are a few ways you can access the models through the API. Now, if you are on Plus or premium you can access the models, but you only get about 50 per week for mini, I believe, and 30 per week for preview, so it's very limited.

Funny Jailbreak Tweet: OpenAI's New Safety Features

Also, I just wanted to highlight kind of a funny tweet here, and if you're not following me on Twitter you can check that out. This is Pliny the Liberator, and he tweeted: jailbreak alert, OpenAI pwned, blah blah
blah blah blah, F your limits. Basically, we've seen that OpenAI's new model here has a lot more safety limits, and long story short, you can see in the prompts that he was able to jailbreak it already and get it to tell him how to create drugs, meth right here. So that's kind of funny, because I've seen a lot of people complain about how OpenAI's new safety is just over the top and whatnot.

Anyways, now that we got that out of the way, let's get into these test questions. I've got a bunch of questions here and we'll see how many we can get through. I don't want to make this video too long, but we'll see if we can get through as many as possible.

Test 1: Capital of Canada - Results from All Models

Now we're going to start off with some very basic, easy ones, and I'm assuming that every one of these models should get these correct, so we're going to try this first one out. Okay, so: what is the capital of Canada? And they both got it right. We can see here mini gave a bit more of a descriptive, longer answer. Then let's take a look at Claude, and Claude got that right too. So all three models pass on probably the easiest question we've got here.

Test 2: Fibonacci Sequence Python Function - Speed and Accuracy

Okay, the next question is: write a Python function to generate the Fibonacci sequence up to the 10th number. All right, so we're going to run that on both of these models. Okay, and both of them did this fairly quickly. Let's just take a look: this one took 5 seconds for preview, and this one only took a couple of seconds for mini. Now we're going to test these out. All right, so here is the output. We've got Claude's right here, we've got mini's right here, which is a little bit longer and has some comments, and then we've got preview's. Let's go ahead and run these. To start off, we run mini's, and boom, we get the Fibonacci sequence up to the 10th number. Let's go ahead and run preview's, and that passed as well. Now let's check out Claude.
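For reference, here is a minimal sketch of the kind of function the Test 2 prompt asks for. It is not the output of any of the three models in the video; the function name `fibonacci` and the choice to start the sequence at 0 rather than 1 are assumptions made for illustration.

```python
def fibonacci(count: int = 10) -> list[int]:
    """Return the first `count` Fibonacci numbers.

    Assumption: the sequence starts at 0 (0, 1, 1, 2, ...); some
    definitions start at 1 instead.
    """
    sequence = []
    current, following = 0, 1
    for _ in range(count):
        sequence.append(current)
        # Slide the window forward: the next value is the sum of the last two.
        current, following = following, current + following
    return sequence


if __name__ == "__main__":
    # Prints the sequence up to the 10th number, as in the test prompt.
    print(fibonacci(10))
```

Passing this test only requires the printed list to contain the first ten numbers of the sequence, so an iterative loop like this one is enough; no recursion or memoization is needed at this size.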
Okay, so all three passed again.

Test 3: Batch Script Creation - In-depth Responses from Models

Okay, next is: generate a batch script to create 10 files with the names a1.txt through a10.txt. All right, we're going to run this, and both of them finished in a couple of seconds. So let's go ahead and test these. Also, I do want to point out that if we look at ChatGPT o1-mini, it's a lot more in-depth of an answer. It gives pretty much everything: how to run it, how to verify the files, an explanation of the script. Same thing with preview, although preview is actually a little bit shorter, and then Claude is very short. Okay, so I've got the scripts loaded up and now we're going to run these. Every single one of these passed, and as you can see it actually generated the exact same code for all of them. I know these are easy tests, and we're about to get to some more challenging ones, but currently they have all passed every single test. Let's get into the next one.

Test 4: Days in a Week vs Continents - Explanations from Each Model

So: if you multiply the number of days in a week by the number of continents, what do you get? Okay, and as you can see they all got 49. It looks like mini has been explaining things a little bit more, or longer, than the other two. If we look at Claude, it's pretty short, preview is actually keeping it very concise, and then mini is just explaining things a little bit more. So currently they have all passed.

Test 5: Math Equation Solving - Output Comparison

All right, so the next question is: solve for x, 3x + 7 = 22. The correct answer is five, so let's see if they can get it. Okay, and just as I thought, both o1-mini and o1-preview got the answer of five, and so did Claude. It looks like o1-mini and o1-preview both explain things a little bit here, and o1-mini seems to be outputting a lot more text by default, but let's go ahead and move on to the next question.

Test 6: Widget Production Rate Problem - Step-by-step Solutions

So here we have a question: a factory produces widgets at a rate that doubles every 3 days. If it produces 100 widgets on day one, how many widgets will it produce on day 19? Please explain your reasoning step by step. All right, so let's run these. Okay, so the correct answer is 6,400, and each one of these, Claude and both ChatGPT o1-mini and o1-preview, got this right.

Test 7: SVG Tree Generation - Visual and Code Output Review

All right, the next question is: generate an SVG of a tree. So we ran that, and we got the code here with o1-mini and o1-preview as well as Claude. Claude, obviously, we like because it has artifacts, so I can see right here that Claude has passed, but let's go ahead and check o1 and o1-mini. All right, so here is o1-mini, which is a pass, and o1 is a pass as well. Both of them are okay. I'd say maybe o1-mini's even looks a little bit more tree-like, but maybe that's just personal preference. Okay, so a pass for all three.

Test 8: Generating Sentences with 10 Words - All Models Fail

The next question is: generate 13 sentences that have 10 words and end in the word monkey. Okay, so we're generating. As you can see, o1-preview is actually taking a little bit longer. We've got o1-mini generating it in a couple of seconds; this one, we see it's rechecking answers, and it took 11 seconds. Let's see if it is right. Okay, so they actually all failed right off the bat. This one, Claude, I believe has 11 here, this one has nine at the start, and this one has nine as well, and then some of these have like 11 or 13. So they actually all failed this question. I do tend to find that when it comes to counting words or doing these simple things, these models are not that great, at least as of now, but when it comes to solving a complicated math problem they can actually surprise you and do really well. So that's kind of interesting. All right, the next question is: generate me the Pong
game using Python.

Test 9: Python Pong Game - Impressive Output, Claude Adds AI Opponent

All right, so we'll start with o1-preview here. Wow, okay, so this is actually a really nice Pong game. Obviously it's two-player, so we can't really play too well, but this is actually a really good one. It has a score tracker here, it has the ball, so yeah, this one passes. All right, now we'll do o1-mini. Okay, so that's a pass too. It's a little bit different: we've got a slightly different font up top, and there's no split or border in the middle. And finally, let's do Claude. Oh wow, this is really cool. I didn't even tell it to do this, because as I ran this I was thinking Pong is kind of not the best game to test, since it's a two-player game. So one thing I'll give Claude a lot of credit for is that it put in a computer to play against me. Now I'm curious whether it's going to give a point if I let it score on me. So I'll give credit to Claude for putting in a computer opponent without me even asking, but it definitely should have a score in it, so you get some pros and cons here when it comes to actually building out the game. I'm pretty impressed, though, because most of the models I've asked to do Pong have never automatically given me a computer opponent to play against. Either way, they all pass.

Test 10: Riddle Solving - Odd Number and River Crossing

Next: I am an odd number. Take away one letter and I become even. Okay, yep, they all got seven, which is correct. Here's the next question: a man stands on one side of the river, his dog on the other side. The man calls his dog, who immediately crosses the river without getting wet and without using a bridge or a boat. How did the dog do it? These are a little bit more reasoning questions, so I'm curious to see how these new models do. Okay, so mini got it in a few seconds: the river was frozen, allowing the dog to walk across. And here: the dog crossed the river because there was ice; the river was frozen.
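The odd-number riddle from Test 10 can be checked mechanically. The sketch below is only an illustration of why "seven" is the accepted answer, not something shown in the video; the names `ODD_NUMBER_WORDS`, `drop_one_letter`, and `solve_riddle` are hypothetical, and limiting the search to the odd numbers from 1 to 19 is an assumption.

```python
# Spelled-out odd numbers to search. Restricting the range to 1-19 is an
# assumption made to keep the example short.
ODD_NUMBER_WORDS = {
    1: "one", 3: "three", 5: "five", 7: "seven", 9: "nine",
    11: "eleven", 13: "thirteen", 15: "fifteen",
    17: "seventeen", 19: "nineteen",
}


def drop_one_letter(word: str) -> set[str]:
    """Return every string obtained by deleting exactly one letter."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def solve_riddle() -> list[int]:
    """Find odd numbers whose name becomes 'even' after removing one letter."""
    return [
        number
        for number, name in ODD_NUMBER_WORDS.items()
        if "even" in drop_one_letter(name)
    ]


if __name__ == "__main__":
    print(solve_riddle())
```

The riddle works on spelling rather than arithmetic, which is why the search runs over the number words and not the numbers themselves.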
Okay, so preview and mini both got it. Let's take a look at Claude. Okay, so Claude got it as well. It broke down the different steps for its reasoning, and it basically said the river is frozen.

Test 11: Light Bulb and Switch Puzzle - Reasoning Test Success

All right, we're going to try this one: you are in a room that has three switches and a closed door. The switches control three light bulbs on the other side of the door. Once you open the door, you may never touch the switches again. How can you definitely tell which switch is connected to each of the light bulbs? All right, the answer is: turn on the first two switches and leave them on for 5 minutes. Once 5 minutes have passed, turn off the second switch, leaving one switch on. Now go through the door. The bulb that is still on is connected to switch one, whichever of the other bulbs is warm to the touch is connected to the second switch, and the bulb that is cold is connected to the third switch, which was never turned on. And all three of them actually got this correct.

Test 12: Hiker and Bear Problem - All Models Correct

Okay, the next question is: I left my campsite and hiked south for 3 miles, then I turned east and hiked for 3 miles. I turned north and hiked for 3 miles, at which time I came upon a bear inside my tent eating food. What color was the bear? Okay, and the answer is white. The only place you can hike 3 miles south, then east, then north and end up back at the starting point is the North Pole. Polar bears are the only bears that live at the North Pole, and they are white. And all three of these models actually got this correct again.

Test 13: Beauty Store Website Creation - HTML/CSS/JS Output Reviewed

All right, the next question is: create a landing page using CSS, JS, and HTML. It should be a website for a beauty store that has a header, banner, features, testimonials, and checkout section. Make it look professional and very modern. Okay, and for the coding, we can see that o1-mini is already done; it's
really only taken a couple of seconds, and o1-preview took 28 seconds, so o1-mini was done before o1-preview even started. As we can see here, o1-mini put it all into one file, it looks like, and o1-preview has split it up, so we'll see how these work. One thing I really like about Claude is that I don't even have to run this code. I can literally just see in the artifact here that it basically passed. Obviously it's not perfect, but we have our GlowUp beauty store right here. We have the features section, the testimonials section, the bestsellers section, and the checkout: shop, testimonials, features. So pretty cool stuff.

OpenAI o1 Limitation: Lack of Image Uploads Compared to Claude

One downside with the new o1 models right now is that you can't do images. It's just text, so you can't upload images or whatnot. When you're using Claude it's good, because you can upload images and say, hey, make something that looks like this. For now you can't do that with o1. I know we'll get there, probably fairly soon, hopefully, as long as OpenAI doesn't take forever to ship new features and whatnot, but that's just one thing right there.

All right, so here is o1-mini. We got Beauty Bliss, and it actually used some icons here, and it even used testimonial images. I would definitely say they all, even Claude's, looked pretty good, but I'll give some points to o1-mini for going the extra mile here and getting some images, which is pretty cool. Okay, now here is o1-preview. So, beauty store: you can see here products, customer testimonials. None of them are really amazing looking, but honestly I would probably say o1-mini's or Claude's was pretty nice. I do like how Claude has the scrolling testimonials here, and I like how you can see it in an artifact, but at the end of the day they all pass.

Final Word Count Test

All right guys, so we did about 15 or 14 questions here.
All in all, they pretty much all passed everything except for this one right here. I've seen different variations of people asking things like, hey, how many words are in this answer? For example, let's try: how many words are in your answer here? I've seen this question fail before with some of the OpenAI models. Okay, so: "My answer contains exactly six words." So o1-mini actually got it right here. o1-preview said, "This answer contains five words." And then for some reason Claude, out of nowhere, just said 15, and I wouldn't have expected it to get that wrong. I don't know if that's just a one-off situation, but if we want to count one more test question here, we would give o1-mini and o1-preview a pass and Claude a fail for this.

So all in all, these questions obviously aren't perfect. They're not super in-depth or crazy questions like maybe some of the questions OpenAI is running its tests on, but I think you have to actually start using these models yourself for real-world situations. I'm going to be doing more videos on using them with Cursor and using them for AI coding. I did a video yesterday.

Coding Performance: Claude Sonnet 3.5 vs OpenAI o1 for Dev Tasks

And I will say, because I got some people asking what I prefer for actual coding: Claude Sonnet is faster in general, so for the day-to-day dev experience, right now I would say Claude Sonnet is better if you just want to get through things quicker. Now, I haven't fully gotten to experience everything, the beauty, I guess, that's supposedly behind o1. From what I've seen so far I am definitely impressed, but for harder questions that may need more reasoning, that's where I would focus on using o1-preview, and o1-mini is good
just for faster stuff that still needs some sort of reasoning.

Conclusion: OpenAI o1 & Mini Win, but Claude Still Impressive

All right guys, so in total, all three of these models are really good. All of them failed that one test right here, and then for some reason Claude failed the last test pretty badly, so I guess I'll give the win to o1-preview and o1-mini, but I still really like Claude Sonnet 3.5.

Recap of Failures and Model Hallucinations in Simple Tasks

I do think it is interesting how these models can get questions that are a little bit more complex correct, and then on questions where we may think, oh, that's a stupid question, or that's a very easy question, they can completely fail and hallucinate. I do see that in coding as well. For example, if I'm styling a specific element and I just want to do some basic styling or something that's fairly simple, that's when I'll actually tend to get these models starting to hallucinate. I've gotten that with Claude Sonnet 3.5, o1-mini, o1-preview, all these models. To this day I've had cases where they will hallucinate on some stupid styling issue that is actually very simple, when they can code out something that's a lot more intricate and complex without any issues.

Other than that, guys, if you want to stay up to date with me testing these models more, doing coding projects, and teaching you guys everything that I learn, then make sure to smash that like button and smash that subscribe button. We upload videos every day on AI automation, business growth, etc.
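The two counting tests in this video (13 ten-word sentences ending in "monkey", and "how many words are in your answer") are easy to grade by script instead of by eye. The following is a rough sketch under stated assumptions: the helper names `split_words`, `count_words`, and `check_monkey_sentences` are hypothetical, and treating a word as any run of letters, digits, or apostrophes is a simplification that other graders might not share.

```python
import re


def split_words(text: str) -> list[str]:
    """Split text into words.

    Assumption: a word is any run of letters, digits, or apostrophes,
    so punctuation is ignored and hyphenated terms count as two words.
    """
    return re.findall(r"[A-Za-z0-9']+", text)


def count_words(text: str) -> int:
    """Count the words in a model's answer."""
    return len(split_words(text))


def check_monkey_sentences(sentences: list[str], expected_sentences: int = 13,
                           words_per_sentence: int = 10,
                           last_word: str = "monkey") -> bool:
    """Check the sentence-generation test: right count, length, and ending."""
    if len(sentences) != expected_sentences:
        return False
    for sentence in sentences:
        words = split_words(sentence)
        if len(words) != words_per_sentence:
            return False
        if words[-1].lower() != last_word:
            return False
    return True


if __name__ == "__main__":
    # The two self-describing answers quoted in the transcript.
    print(count_words("My answer contains exactly six words."))
    print(count_words("This answer contains five words."))
```

Run against the two answers quoted above, the counter agrees with what each model claimed about its own sentence, which matches the pass given to o1-mini and o1-preview on that question.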
