Predict Baseball Stats using Machine Learning and Python

Introduction

Hi, my name is Vik, and today we're going to learn how to predict future baseball stats given a player's historical data. We'll start by downloading baseball stats using Python, clean the data and get it ready for machine learning, pick which columns we want to use as predictors in our model, train a model, and evaluate how well it performed. We'll end with next steps you can use to keep improving the model on your own. By the end we'll have a model that can predict next-season stats for a baseball player, plus diagnostics we can use to keep improving that model.

What we're going to try to predict is the wins above replacement, or WAR, that a player generates in a given season. WAR measures how much better or worse a player was than a replacement-level player who could have filled their spot: a positive WAR indicates a player is good, and a negative WAR indicates a player is worse than a replacement-level player. For example, if we look at the season stats of Aaron Judge, now one of the best baseball players in the world, his WAR in 2016 was negative, but in 2017 it was 8.8, a very high WAR indicating he was one of the best players in baseball that year. What we're going to do is use the stats from one season to predict a player's WAR the next season.

Download the data

Let's dive in and start coding. There are a few libraries I'm going to import: the os library, a Python library that lets us interact with parts of the system; pandas as pd, a data manipulation and analysis library; numpy as np, which lets us create and work with arrays of data; and pybaseball, a Python package that lets us download baseball stats from various websites.

We'll be working with baseball stats from the 2002 season through the 2022 season, so I'll define variables called start and end that set that range. Then I'll load the data by calling the batting_stats function, passing in our start and end seasons along with a parameter called qual. qual determines the minimum number of plate appearances a batter needs to be included in our data; we're only taking batters with at least 200 plate appearances. This can take a moment to run. After it runs, call to_csv to dump the data to a CSV file so you're not constantly re-downloading it from the internet, which can take a while and isn't very nice to the server you're downloading from; on later runs you can just load the CSV.
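Here's a minimal sketch of this setup, assuming pybaseball is installed; the cache file name batting.csv is our own choice:

```python
import os
import pandas as pd
import numpy as np
from pybaseball import batting_stats

START, END = 2002, 2022

# Load from the cached CSV if it exists; otherwise download from
# FanGraphs via pybaseball and cache the result.
if os.path.exists("batting.csv"):
    batting = pd.read_csv("batting.csv", index_col=0)
else:
    # qual=200 keeps only batters with at least 200 plate appearances
    batting = batting_stats(START, END, qual=200)
    batting.to_csv("batting.csv")
```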
The first thing we want to do is remove players we only have one season of data for. The reason is that we want to predict a player's WAR in their next season, and with only a single season for a player we can't make that prediction. So I'm going to group my data by IDfg, which stands for the player's ID on FanGraphs, the website batting_stats downloads our data from. I'll pass group_keys=False; if you use groupby without it, pandas changes the index to include the values from the field you're grouping on, and this just avoids messing with the index. What groupby does is split our DataFrame into groups based on the value in a column, and since IDfg is a unique ID for each player, each player gets their own group. Then we run the filter method, which removes some of the groups: we evaluate a function for each group that checks whether we have at least two seasons of data, keeping the group if we do and dropping it if we don't. After running that, batting has 6737 rows; these are all the batters from 2002 to 2022 with at least 200 plate appearances and at least two seasons of data, including names you may know if you know baseball, like Barry Bonds and Mookie Betts.

Creating an ML target

Now we need to set up the target we're trying to predict. As you'll remember, we're trying to predict the wins above replacement for the next season. To do that, we need to split our data up by player again, and for each player backfill the WAR value from the following season as our target. We'll write a function that takes data for a single player, first sorts it by season, and then creates a column called Next_WAR, which is just wins above replacement shifted back one row. I get a lot of questions about shift, so I'll show what it actually does in a moment. To run this function, we again use groupby on IDfg with group_keys=False and apply our next_season function: the DataFrame is split into groups based on the player ID, and for each group we compute Next_WAR for each season.

Let's look at player name, season, WAR, and Next_WAR to see what the shift method is doing. For the first player, Alfredo Amezaga, in 2006 his WAR was 1.1 but his Next_WAR was 2.0; that's because his WAR in 2007 was 2.0, so we pulled that value back to 2006. Same thing for 2007: his Next_WAR is 1.2, because we pulled that WAR value back from 2008. You'll also notice a missing value for Next_WAR in 2008, because we don't have any data for Alfredo Amezaga past 2008; maybe he didn't play in 2009, or he had fewer plate appearances than necessary to qualify. We'll have to deal with that missing value later, because we can't pass targets with missing values into machine learning algorithms.
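A sketch of the filtering and target creation, assuming the FanGraphs column names IDfg, Season, and WAR that batting_stats returns:

```python
# Keep only players with two or more qualifying seasons
batting = batting.groupby("IDfg", group_keys=False).filter(
    lambda x: x.shape[0] > 1
)

def next_season(player):
    player = player.sort_values("Season")
    # shift(-1) pulls next season's WAR back onto the current row;
    # the player's final season gets a missing value
    player["Next_WAR"] = player["WAR"].shift(-1)
    return player

batting = batting.groupby("IDfg", group_keys=False).apply(next_season)
```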
Cleaning the data

When we go back to our data we have 319 columns, and you may have noticed a lot of missing values in some of them. That's because some of these statistics require specialized tracking equipment in a stadium to compute, and that equipment wasn't available in every stadium in every season, so these stats are missing for some players in some seasons. We're going to get rid of the columns with a lot of null values, because imputing them would be tricky given that they weren't available in every place at every time point. You could impute these values on your own if you want, but it's more than I want to show in this video.

First we define a variable called null_count, which counts how many missing values are in each column. Looking at it, some columns have no missing values, some have quite a few, and our Next_WAR column also has a few, as I showed before. Then we create a list called complete_cols: from all the columns in the batting DataFrame, we select only those where null_count is zero, and convert the result to a list (by default it's a pandas Index, and lists are a little easier to work with for some purposes). We end up with a big list of complete columns, columns without any missing values. Then we remove all the null columns by indexing the batting DataFrame with complete_cols plus the Next_WAR column; remember that Next_WAR has some missing values, but it's the target we want to predict, so we need to keep it. We also create a copy, which helps us avoid SettingWithCopy warnings later; if you've run into those, they can be pretty annoying, and copying the DataFrame after doing a lot of operations on it helps. Checking the batting DataFrame now, we have 132 columns, fewer than before, and every column except Next_WAR has no missing values, which is good for machine learning.
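A sketch of the null-column cleanup:

```python
# Count missing values per column
null_count = batting.isnull().sum()

# Keep only columns with zero missing values, plus the target
complete_cols = list(batting.columns[null_count == 0])
batting = batting[complete_cols + ["Next_WAR"]].copy()
```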
Now let's check the data types of our columns. Some have an object data type, which means strings, but most are integer or float data types, which are numbers. We can only use numbers in machine learning; most algorithms don't work with strings, so we need a way to handle the string columns. Finding all the columns with an object data type, we have a few: name, team, Dol, and age range.

The Dol column is where FanGraphs essentially tries to assign a dollar value to the player based on their stats. It's kind of a weird column, and we don't really need it, because all the underlying stats used to compute it are already in our other columns, so we can just delete it. The other odd one is Age Rng, the age range the player had during the season: some players stay the same age for the whole season, and some have their birthday during the season and turn a different age. It doesn't give us any useful information beyond the age column, so we'll delete it too.

Then we can process the team name. As you saw earlier, team names are strings like FLA for Florida, ANA for Anaheim, CLE for Cleveland, and so on, but we can turn them into numbers by assigning a number to each team name. We'll create a column called team_code: we take the team name and convert it to a categorical type in pandas with astype("category"), which gives us 35 different categories with each team name stored as a separate category, and then we convert each category to a number with cat.codes. Looking at the result, Florida is code 12 while Anaheim is code 1. This gives us a way to turn team names into a set of numbers we can pass into a machine learning algorithm.

Now we'll make a copy of our batting data to another variable. The reason is that we're about to drop any rows where Next_WAR is missing; we can't use those rows to train an algorithm, but they can be very useful for predicting the future. After the 2022 season ends, you can use the rows from 2022 to predict what's going to happen in 2023, so we keep a copy of that data around in case we want to forecast next season. Then we go ahead and drop the rows where Next_WAR is null.
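A sketch of this step; the column names Dol, Age Rng, and Team are the FanGraphs names as they appear in the downloaded data, and batting_full is our own name for the forecasting copy:

```python
# Drop redundant string columns
del batting["Dol"]       # FanGraphs dollar value; derived from other stats
del batting["Age Rng"]   # age range; the Age column already covers this

# Encode each team name as an integer category code
batting["team_code"] = batting["Team"].astype("category").cat.codes

# Keep a full copy (including each player's final season) for forecasting,
# then drop rows with no target for training
batting_full = batting.copy()
batting = batting.dropna(subset=["Next_WAR"]).copy()
```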
Selecting useful features

Now we're getting into the fun part, where we can apply some machine learning models. But we have a problem: we have 132 columns, and I don't want to feed all 132 into an algorithm, because some of them could cause the model to overfit or have other issues like multicollinearity. So we're going to run a feature selector that picks a subset of features that helps the model optimize its accuracy.

First we need a model to use with the feature selector; we'll use a ridge regression model. Then we import the feature selector from scikit-learn, called SequentialFeatureSelector, along with something called TimeSeriesSplit, which we'll use as part of the feature selection.

We initialize our ridge regression model. There's one parameter you might want to play around with: alpha. If you're familiar with ridge regression this is usually called lambda, but it's called alpha in Python because lambda is the keyword for an anonymous function. Setting it higher reduces overfitting, because it penalizes the ridge regression coefficients more; setting it lower gets you closer to a plain linear regression. Play with alpha and see what it does for you.

Then we initialize our splitter, a TimeSeriesSplit. This splits our data into three parts and makes predictions for those parts, but in a time-series-aware way. We have time series data: looking at Alfredo Amezaga again, he played in three seasons, and we don't want to use data from the future, say 2008, to predict what he did in the past in 2006, because that would give the model unfair knowledge of the future. In the real world, if we used this model to predict what's going to happen next year in 2023, we wouldn't know what happens in 2024 and 2025, so we want to set up our model in a way that mirrors how we would use it in the real world.

Then we initialize the SequentialFeatureSelector. We pass in our ridge regression model; define how many features we want it to select, 20, though you can play around with that number; say direction="forward"; pass cv=split so it uses our time series split for cross-validation; and set n_jobs=4, which isn't required but makes it run faster by using multiple processor cores or threads. One thing worth explaining is how this works. Since we specified direction="forward", it starts with zero features and keeps evaluating all the features: it goes through all of them, finds the best one, goes through them again, finds the next best one, and so on until it has 20. You could also use direction="backward", in which case it starts with all 132 features and keeps removing the least valuable ones until it gets down to 20. I'm going to do forward because it runs faster, but you can also try backward, and you can try changing n_features_to_select. We'll initialize all of these, but we're not going to run anything yet.
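A sketch of the model, splitter, and selector setup with scikit-learn; alpha=1 is just a starting value to tune:

```python
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import TimeSeriesSplit

rr = Ridge(alpha=1)                  # alpha is the ridge penalty (lambda)
split = TimeSeriesSplit(n_splits=3)  # time-ordered cross-validation folds

sfs = SequentialFeatureSelector(
    rr,
    n_features_to_select=20,   # try other values here
    direction="forward",       # or "backward", which is slower
    cv=split,
    n_jobs=4,                  # optional: parallelize across cores
)
```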
We've initialized everything, but before we run the sequential feature selector we need to do a few things. First, the selector can't work with certain columns: we don't want to pass in our target, the thing we're trying to predict; we don't want to pass in text columns, because machine learning models can't work with them; and we'll take out a couple of metadata fields, the player ID and the season, to help avoid overfitting. We don't want the model to overfit to a particular season or a particular player; we want to build a general model. So we create a list called selected_columns, which is all of our columns except the ones we decided to remove: take the full list of 132 columns from batting.columns and pick all the columns not in the removed list.

Once we have our selected columns, the next thing is to scale the data so the ridge regression model works effectively. Standard scaling makes the mean zero and the standard deviation one, but we're going to use a slightly more aggressive form called min-max scaling, importing MinMaxScaler from scikit-learn. The reason is that we want to compute some ratios between columns later, and scikit-learn's standard scaler produces negative values, which makes those ratios harder to work with; MinMaxScaler puts all values between 0 and 1, so we avoid issues with the ratios. Once the scaler is initialized, we run it with batting.loc[:, selected_columns] = scaler.fit_transform(batting[selected_columns]); the colon selects all rows, and selected_columns selects the columns.

If we run this as is, we'd get a warning, so before running it, note that back where we dropped the missing targets I added .copy() after dropna. That helps us avoid a SettingWithCopy warning, which is a very confounding error at times: it happens when you manipulate your DataFrame many times and assign it back to itself without using .copy(), which creates a totally fresh DataFrame without any views inside. Run that cell again, then run the scaling cell. Once it finishes, our scaling is complete and we can get ready to fit the sequential feature selector.
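A sketch of the column selection and scaling; removed_columns reflects the fields called out above (target, text columns, and the two metadata fields), with the exact names assumed from the FanGraphs data:

```python
from sklearn.preprocessing import MinMaxScaler

removed_columns = ["Next_WAR", "Name", "Team", "IDfg", "Season"]
selected_columns = batting.columns[~batting.columns.isin(removed_columns)]

scaler = MinMaxScaler()
# Rescale every candidate predictor into the [0, 1] range in place
batting.loc[:, selected_columns] = scaler.fit_transform(batting[selected_columns])
```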
Now we can take a look at the batting DataFrame and see that some of the columns have been scaled: the first few columns, which weren't in our selected columns, obviously weren't scaled, but the other columns look very different from before. We can use the .describe() method to see the changes in a more summarized way: starting with the age column, the minimum value is now zero and the maximum value is one. That's what MinMaxScaler does; it scales all our values to be between zero and one.

Now that we've scaled the data, we can apply sequential feature selection. We call the fit method of the selector, which fits it to the data; fitting means it picks the 20 predictors that give us the greatest accuracy with a ridge regression model. We pass in our selected columns and our Next_WAR column and run it. It may take a moment, and when it finishes we can extract the list of predictors using sfs.get_support(). That returns a big array of True/False values, where the Trues mark the columns we want to select, so we index our selected_columns list with that call, giving us a pandas Index of the columns we actually want to use as predictors, and then convert it into a flat Python list, which is a little easier to work with. We assign this to a variable called predictors.
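A sketch of fitting the selector and extracting the chosen predictors:

```python
# Pick the 20 columns that best help ridge regression predict Next_WAR
sfs.fit(batting[selected_columns], batting["Next_WAR"])

# get_support() is a boolean mask over selected_columns;
# True marks a column the selector kept
predictors = list(selected_columns[sfs.get_support()])
```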
Making predictions with ML

We have our predictor list and our model, so we're almost ready for machine learning, but we need to code up one more function, and it's really critical. It's called backtest, and it's what will actually generate our predictions. You may be familiar with typical cross-validation, which we use to validate machine learning models on regular datasets: we split the dataset into several groups, say three, and make predictions on the whole dataset while keeping the groups separate. For group one, we train on groups two and three, then predict group one; for group two, we train on one and three and predict group two; for group three, we train on one and two and predict group three. That way we get predictions for the whole dataset.

Cross-validation like that doesn't work with time series data. Our data is in order by season, and we don't want to use data from future seasons to predict what happened in a past season; that causes a lot of issues, because it doesn't match what happens in the real world. If we're trying to predict 2023, we don't have data from 2024 and 2025 to train the algorithm; we can only use past data. So when we evaluate the algorithm to estimate its error, we want to use the exact same mechanism we would use in the real world: only past data to predict future data.

First we create a list called all_predictions; each element will be the predictions for a single season. Then we build our list of seasons: we find all the unique seasons in the DataFrame and sort them, giving us a list in order from 2002 to 2021, assigned to a variable called years (inside the function we swap batting for data, since the function's argument is called data). Then we loop: for i in range(start, len(years), step). Each time through the loop we use historical data to predict a single season. We start with the 2007 season, because we want a good amount of training data to make our first prediction; if we started in 2003 we'd only have one season of training data and our predictions probably wouldn't be very good. So the first time through we use everything from 2002 to 2006 as training data and predict 2007; the second time, 2002 to 2007 to predict 2008; the third time, 2002 to 2008 to predict 2009; and so on, until we have predictions from 2007 to 2021.

The first thing we do in the loop is get the current year: i will be 5, 6, 7, 8, and so on, but we want the actual year of the season, 2007, 2008, etc. Then we split our training and test sets based on that year: the training set is everything where the season is less than the current year, and the test set is everything where the season equals the current year. So the first time through the loop, the test set is the 2007 season and the training set is everything from 2002 to 2006.
Then we fit our model using our training predictors and the target we're trying to predict, the Next_WAR column. After fitting, we use the predict method to generate predictions on the test set. The reason we predict on the test set is that if we generated predictions on the training set, the algorithm would be predicting on the same data it was trained on; it'd be kind of like taking an open-book test, and it doesn't tell you much about the quality of the model. So we evaluate on a different set, the test set.

By default, model.predict returns a numpy array, but we turn it into a pandas Series to make it easier to work with; you can think of a Series as a single column of a DataFrame, where a DataFrame is a full table of data. Then we combine our predictions with the actual values using the concat method so we can compare the two: we pass in the actual values, the Next_WAR column containing the actual next-season wins above replacement, then our predictions, and concatenate on axis=1. Concatenating on axis one means each Series is treated as a separate column; axis zero would stack them into one really long column instead. Then we assign the column names, actual and prediction, and append the combined frame to all_predictions. As we iterate through the loop, we make a prediction for each season and append it, and by the end all_predictions is a list of DataFrames, one per season from 2007 to 2021. We don't want to return a big list of DataFrames, though; we want a single DataFrame, so we use the concat function again to combine everything. By default pandas concatenates on axis zero, so we don't need to pass it in, and that combines all the DataFrames vertically into one long DataFrame.
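A sketch of the full backtest function as described, with start=5 so the first predicted season is 2007:

```python
def backtest(data, model, predictors, start=5, step=1):
    all_predictions = []
    years = sorted(data["Season"].unique())

    for i in range(start, len(years), step):
        current_year = years[i]
        # Train only on seasons strictly before the one being predicted
        train = data[data["Season"] < current_year]
        test = data[data["Season"] == current_year]

        model.fit(train[predictors], train["Next_WAR"])

        preds = pd.Series(model.predict(test[predictors]), index=test.index)
        combined = pd.concat([test["Next_WAR"], preds], axis=1)
        combined.columns = ["actual", "prediction"]
        all_predictions.append(combined)

    # Stack the per-season frames vertically into one DataFrame
    return pd.concat(all_predictions)

predictions = backtest(batting, rr, predictors)
```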
Now we can run the function to backtest, passing in batting, our ridge regression model, and our predictors. Looking at predictions, it's a DataFrame with 4115 rows, where the first column is the actual next-season wins above replacement and the second column is the predicted value. It's hard to tell from the table alone whether the algorithm is any good, so we'll use a summary statistic as an error metric: from sklearn.metrics we import mean_squared_error. This gives us a single number telling us how high the error of our model is. We call mean_squared_error, passing in our actual values and then our predictions; it subtracts the prediction from the actual value, squares the difference, and averages the squared differences across all 4115 rows. We get 2.76.

How do we know if that's any good? One thing I like to do is describe the thing we're trying to predict, Next_WAR: the mean is 1.78 and the standard deviation is 1.98. As a rule of thumb, I like the square root of the mean squared error to be lower than the standard deviation; it's not something you have to do, but it generally indicates the model is doing better than random guessing. Raising 2.76 to the power 0.5 takes the square root and gives 1.66, which is lower than the standard deviation of our Next_WAR column. It's not great, it's actually pretty close to the standard deviation, so it's not a perfect prediction, but it's okay, and we can make it better.
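A sketch of the evaluation:

```python
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(predictions["actual"], predictions["prediction"])
print(mse)         # ~2.76 here
print(mse ** 0.5)  # RMSE ~1.66, vs. a Next_WAR standard deviation of ~1.98
```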
Improving accuracy

Now let's take some steps to improve the prediction. To do this, we're going to give the algorithm some information about how the player did in previous seasons. Right now we're only telling it how the player did in the current season, but if a player improved their WAR significantly from the prior season to the current one, that might be useful to the algorithm; if it went down significantly, maybe the player is on the decline. Information about how the player did previously can help the model make better predictions.

We'll create a function called player_history that takes in data for a single player, and just like before, it first sorts by season so everything is in order. Then we create a few different predictors. The first is really simple: player_season, a number indicating which season this is for the player, their first, their second, and so on. Then we create a column called war_corr, a wins above replacement correlation. This one is a little complicated, so I'm going to show it outside the function before coding it in.

First, let's find a player with a decent number of seasons: Garrett Anderson, whose FanGraphs ID number is 2. We create a DataFrame called ga by selecting only his seasons, where IDfg equals 2, and we have seven seasons of data for him (copy the selection when you create it to avoid a SettingWithCopy warning). We add the player_season column using the exact same code as in the function, then select just two columns, player_season and WAR.

Now we call expanding. What expanding does is create growing groups of the DataFrame: the first group is just the first row, the second group is the first two rows, the third group is the first three rows, the fourth is the first four rows, and so on. For each of these groups, we find the correlation between player_season and WAR using the .corr() method; for example, in the third group, we look at the correlation between season numbers 0, 1, 2 and the corresponding WAR values. It looks like there's a slight downward trend, and the correlation value tells us for sure.

When we run the correlation, it actually returns four numbers for each row: the result has two columns and a MultiIndex on the rows, where the first level is the original row index (numbers like 116, 886) and the second level holds player_season and WAR. We want a single number per row, so we use the .loc indexer, specifying two index values for the rows: slice(None), which selects all values from level one of the index, and player_season from level two; then we select only the WAR column. That takes us from four numbers per row to one. Finally, we convert the result to a list to get rid of the index. So for every season Garrett Anderson played, this gives us the correlation between his season number and his wins above replacement across all previous seasons. That's a big statement, but hopefully breaking it down made it clearer what's actually happening.
In the function, we replace ga with df, and this becomes our war_corr column. You might notice there's a missing number in the first row, since a single row has no correlation; in that case we fill the NA with 1, which implies a one-to-one correlation.

Then we compute another column called war_diff: the ratio between the current season's wins above replacement and the previous season's. My first instinct was to use Next_WAR divided by WAR, but that would leak information from the target; instead we use WAR divided by WAR.shift(1). I was hoping to do it in a conceptually simpler way, but that doesn't work. Remember, shift(-1) brings next season's value back to the current season, while shift(1) brings the previous season's value forward to the current season, so this is the ratio of this season's WAR to last season's.

Some of these values will be missing or infinite. We fill the NA values with 1; that's what fillna does, it looks for all the missing values in a column and replaces them with the value you pass in. The value is missing when there was no previous season, and filling with 1 kind of assumes WAR has been constant between seasons, which isn't the best thing to do but is better than filling with 0, which would assume WAR changed significantly. Some of the values will also be infinite, because division by zero can produce an infinite value, which shows up as np.inf; we find any infinite values in the war_diff column using indexing and replace them with 1. Then we return the DataFrame, and we run batting.groupby("IDfg", group_keys=False).apply(player_history): this splits our batting DataFrame into groups by player and calls player_history on each player's data.
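A sketch of the whole player_history function:

```python
def player_history(df):
    df = df.sort_values("Season")
    # 0, 1, 2, ... numbering the player's qualifying seasons
    df["player_season"] = range(0, df.shape[0])

    # Expanding correlation between season number and WAR so far:
    # a crude trend signal (improving vs. declining career)
    df["war_corr"] = list(
        df[["player_season", "WAR"]]
        .expanding()
        .corr()
        .loc[(slice(None), "player_season"), "WAR"]
    )
    df["war_corr"] = df["war_corr"].fillna(1)

    # Ratio of this season's WAR to last season's WAR
    df["war_diff"] = df["WAR"] / df["WAR"].shift(1)
    df["war_diff"] = df["war_diff"].fillna(1)
    df.loc[df["war_diff"] == np.inf, "war_diff"] = 1
    return df

batting = batting.groupby("IDfg", group_keys=False).apply(player_history)
```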
Then we write another function called group_averages, which finds averages across a whole season and compares each player to them: it returns the player's WAR divided by the general WAR for that season, telling us whether the player performed better or worse than the average player in that season. In some seasons, like seasons that had a lockout, players played fewer games, which makes WAR lower, not because the players were worse but because they played less; this helps correct for that. We create a column called war_season, and this time we group by season, so the DataFrame is split into one group per season, again with group_keys=False to avoid messing with our index, and we apply the group_averages function.

Then we create a list called new_predictors, which is predictors plus player_season, our war correlation, our war_season ratio, and war_diff, the ratio between the previous season and the current season. Then we generate our predictions again with the exact same backtest call as before, but passing in new_predictors instead of the old predictors (remember to run the cell that defines new_predictors first, which is important). Copying our mean_squared_error call from before, the error is now 2.67, a little better than the 2.76 from earlier, so the new predictors improved the model.
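A sketch of the season-relative feature and the re-run:

```python
def group_averages(df):
    # Player WAR relative to the league-wide mean WAR that season
    return df["WAR"] / df["WAR"].mean()

batting["war_season"] = batting.groupby("Season", group_keys=False).apply(
    group_averages
)

new_predictors = predictors + ["player_season", "war_corr", "war_season", "war_diff"]

predictions = backtest(batting, rr, new_predictors)
mean_squared_error(predictions["actual"], predictions["prediction"])  # ~2.67
```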
Diagnosing issues with the model

We can look at how much each predictor impacts the model by examining the coefficients of our ridge regression model. We combine the coefficients with the names of our predictors to make it a little easier to tell what's going on, and then sort them so it's obvious which ones contribute a lot to the model and which don't: anything with a very small coefficient the model isn't really taking into account, while anything with a large coefficient the model relies on heavily to make its predictions. This is a diagnostic you can use to try to improve the model: which predictors matter, which don't, and can you tweak the predictors that do matter to give the model more information?

The other diagnostic is to look at the difference between our actual values and our predictions, which is just subtracting the predictions from the actual values. To view this in a nicer way, we can merge the predictions DataFrame with the batting DataFrame, so we see the difference along with all the other stats for a player: predictions.merge(batting, left_index=True, right_index=True) uses the index to merge our predictions with the batting data. Then we create our diff column again, this time taking the absolute value of the difference, because we care about the size of the error rather than its direction. Looking at merged, we can see the actual value, the prediction, the difference between the two, and all the other facts about the player, which can help us see which players are being systematically mispredicted and whether there's information we could add to the model that would help. We can filter the columns to make this a little easier to read, since it's a little unwieldy if you have to look at all the columns, and then sort the values by the difference.
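A sketch of both diagnostics; the displayed column subset is one reasonable choice, not the exact set shown in the video:

```python
# Coefficients: how heavily the ridge model leans on each predictor
pd.Series(rr.coef_, index=new_predictors).sort_values()

# Merge predictions back onto the stats and rank by absolute error
merged = predictions.merge(batting, left_index=True, right_index=True)
merged["diff"] = (predictions["actual"] - predictions["prediction"]).abs()
merged[["IDfg", "Season", "Name", "WAR", "Next_WAR", "prediction", "diff"]].sort_values("diff")
```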
Wrap-up and next steps with the model

This shows us the players we predicted most effectively and least effectively: for some players we were really, really close to the actual value, and for some we were really far off. The big misses look like mostly players in their rookie season who were just breaking out, or players coming back after injury; these are star players who were mispredicted for some reason. From looking at this, figuring out how to handle injuries better could help: right now we have a plate-appearance threshold, so we only look at players with at least 200 plate appearances, and if someone was injured, maybe there's a way to fill in their stats or indicate they were injured in a given season, which is why their WAR was low. We could also add in data from the minor leagues, so that a player's first season can be accurately predicted and we have some historical data for rookies.

We've done a lot in this video. We started with nothing, downloaded all the data, cleaned it up, and got it ready for machine learning; we used sequential feature selection to pick the best features for our model; we built a backtesting system to create honest predictions; and we evaluated the error of our predictions and then improved them. I also gave you some next steps. Use better data, like data from the minor leagues or about player injuries. Try more feature engineering: in player_history we looked at the trend of someone's WAR linearly, but a lot of players have a peak and then a slump, so you could try to take that curve into account, or which side of the curve the player is on. Try a different model; ridge regression runs really fast but isn't always the most accurate. You could also try different feature selection strategies to pick different features. Generally, though, your best bet is adding better data or doing better feature engineering. I hope you enjoyed this walkthrough. Thanks a lot.
