Predict NFL Touchdowns - Create Your First Predictive Model in Python (Step by Step Tutorial)

What up, YouTube, it's your boy Nick Wayne. I'm here with a YouTube tutorial on how to create your first NFL analytics predictive model. We're going to be predicting touchdowns. It's gonna be really easy, I'm gonna walk you all through it, and if you want to follow along you can follow not just this video but also open it up at the GitHub link, which is probably down here somewhere. If you don't want to use the GitHub link and you just want to hear me talk about it, you can do that too. Conceptually you can bring this over into R, you can bring this over to Rust, you can bring this over to Julia, wherever; I'm sure it's gonna work perfectly there too.

So let's get started. You can see here we're starting off with where the data is. You can actually go to this link right here and download the data. There are a few different ways to do that: you could click and download it from this website, or you could set up some sort of git pull for the data, whatever you want to do. I'm not going to go over that in this tutorial, so I'm just going to assume you have this data somewhere in your Google Drive. I'm going to be using Colab for this, and if you're not using Colab that's okay. This can work in JupyterLab, you could do this in Jupyter, you could do this in VS Code, whatever you want to do this in. So long as you have your paths right and you know where you've got to go to retrieve your data, you should be fine.

Some packages that we're going to load in initially: because I'm working in Google Colab, I'm pulling my data out of my Google Drive, so I'm going to need os to access my operating system. That way I can go and locate where my files are and then bring them in. We're going to be using pandas and numpy. These are packages that allow you to bring in data in a lot of different formats. We're using CSVs today, so we're going to be
using a ton of pandas. We're gonna be using matplotlib and seaborn for our data visualization. This will help us keep our heads on straight when we're looking at different correlations or different trends in the data once we have our data loaded in.

So we're opening up our data. What I want to do is navigate to my project directory, which in this case is in my Google Drive. If you are using Google Drive, drive/MyDrive is how you get there. If you're getting some sort of weird error that says there's no such location, you might want to check whether you've mounted your drive. That's this button right here: it will say Mount Drive, and if it has an X through it, it means it's mounted already. I'm going to be looking at my Google Drive and I actually want to explore my drive, so I'm going to type in my project directory and a few different layers of folders. This is just like navigating to a folder on your desktop or your local computer, and I'm listing out these different folders. You can see we have over 20 seasons of data in all of these different folders. If we go into a single season folder, you'll see that we actually have play-by-play data. In this case we went into the season equals 1999 folder, and inside that folder I'm finding a play-by-play 1999 CSV. That's comma-separated values; this is just tabular data. You can load this into Excel, you can load this into Matlab, you could load this into SPSS, and you can also load it here in Python with pandas.

For our purposes we're just going to be using three years of data, not all 20 bajillion years of data. We're just going to use three years: 2019, 2020 and 2021. By the way, if you know a better way to write this list comprehension, type it down there in the comments section. I don't know a better way
to search for individual strings like this. There's probably a better way, but I kind of shotgunned this tutorial and I don't really know off the top of my head what the best way is, so if you have a better way just let me know in the comments.

When I'm loading my files, the first thing I need to do is say what the entire path is. Remember, our path is in Google Drive, so we're using drive/MyDrive; that's accessing my Google Drive. I have a bunch of different parent folders for where my data lives, and then I access my data. But I'm not going to know where my data is just based off the file name; I'm going to need the full path, so I put in the full path here. That's what this data files does.

Once you have the paths to your files, you're going to open them. In this case we have to do a few different steps. The first one is to initialize your data frame. The reason you do that is because your data doesn't know where to load into, right? If you just open a file, it's going to open into a data frame, but it doesn't know that you're trying to append a bunch of stuff to that. So the easiest way to do this is to initialize a data frame first, and then we're going to loop through our files and append, or add, or concatenate, or join, or union, whatever you want to call it (it's append in pandas), to the bottom of this data frame. For each CSV we read in, we're just going to add that to the bottom of our data frame, and that's what this does. So we append and we pump out this new data frame. But because each CSV that comes in has its own index, like zero to, let's say, 30,000, we don't want repeats in the row names or the index numbers, so we're gonna reset that index right here. If you don't know why that's happening or if you don't
understand why that's important, that's okay. You're not really going to run into any issues with this in this tutorial, but in other things you might have issues with duplicate indices, and this is just an easy way to avoid that.

Once I load in all the data, I also print my data frame just to see how much data we've loaded in. In this case, for three seasons of data, we're looking at just shy of 150,000 rows. If you're trying to do this in Excel, or in a program that might not support a lot of rows like this, then you might have a lot of slowness. When you're doing this in Python, you're going to find that it's actually pretty quick to load in all your data and also to work your way through the data, as we're about to do as we explore it.

So once the data is loaded in, you can actually explore a little. I look at the head, the top 10 rows; you can see the top 10 rows that we've loaded in here. I also look at a random sample of five rows. Sometimes the top 10 might be a little misleading, so I do like to look at a random sample of rows just to get a better idea of these values, whether we have any missing data, or whether some data looks different than other rows. In this case it all looks pretty good. At the same time I also want to see how many columns we're working with, and in this case, as the shape of the data set says, and as the head and the sample say down here, there are 372 columns. That's a lot of columns. If you're overwhelmed by the amount of data that we're loading in, don't worry about that, because in the next section we're going to talk a little more about a smaller set of data that we're going to be using, related specifically to quarterbacks.

Okay, so if you made it this far, we've loaded in a bunch of data and it's all in our pandas data frame. Now we're just going to explore our data and see what stuff correlates with touchdowns. So the
project that I'm thinking we should do is just see how many touchdowns a quarterback is gonna throw. I want to take a look at a few different things about a quarterback, but the first thing we've got to do, since this is play-by-play data and not season-level data, is aggregate, or group by, our data. We're going to take a bunch of different quarterback features: in this case the season, who the passer (the quarterback) is, and some of the statistics that we care about, so passing, whether they were intercepted or sacked, the amount of yards gained when they threw the ball, and then of course how many touchdowns they threw.

If you're not familiar with group by statements, there's probably a great tutorial out there that talks all about group bys. In this case, if you are a little familiar, we're grouping by the season and the player, who in this case is the quarterback. So if you can imagine it in your head, we're taking the season and the quarterback and we're just adding together all these different statistics: how many passes they threw in the season, how many touchdowns the quarterback threw during the season, et cetera. You can see that group by statement right here. All we're doing is taking those quarterback features we set in the list up here, and we're grouping by the group by features we set above as well. You can see right here this is a sum; we're aggregating, summing over each of these seasons.

Once I've done that, I also take a look at a sample, so again I print out a random sample of 10, and I can see how many touchdowns a particular quarterback threw. Like here, Nate Mullins in 2020 threw 15 touchdowns, whereas Matt Barkley threw one touchdown.

After we've created this data frame, what I want to know is what correlates with touchdowns, because at the end of this we want to make some sort of prediction on touchdowns. We can see that over the course
of one season, yards gained predicts touchdowns, the amount of completed passes predicts touchdowns, the amount of passes thrown in general predicts touchdowns, and some other things predict touchdowns which are a little interesting but make sense if you think about it. The number of interceptions thrown correlates with touchdowns, and the number of sacks taken predicts touchdowns too. It kind of makes sense, because the more pass attempts you're throwing, the more likely you are to throw interceptions, and the more sacks you take, the more likely it is that you're the primary quarterback and not just a random person throwing a trick play. Of course taking more sacks is not good, but it is maybe an indication that you're the primary quarterback.

Okay, so that's all interesting, but that's if we know what happened in one season already. For instance, if we know the amount of yards gained in the season, then we could predict touchdowns pretty well. But what we want to know is the future: does what a player did last season correlate with what they're going to do the next season? So we're looking for year-over-year trends.

In order to get that information we're going to have to do one more manipulation of our data. We're going to copy the data set we just created. The next thing we're going to do is take the season variable that's in there and add one to it. The reason we do that is because we're treating our information for one season as the previous season's information, and that's going to make sense when we join, if it doesn't make sense right now. So now we're going to create a new data frame: we're going to merge in this data set which has season plus one, and we're going to merge on season and passer, which is just our identifying information for the
quarterback season. We're going to label our suffixes, and what's happening here is this data set is going to get no suffix, and the data set that I'm merging in, this new data set, is the previous season's data, right? We merge on season, so it's looking for this column, and because the years match and this one is offset by one year, all of the information in this copied data frame is actually from the previous season. So we're calling this underscore prev, which should indicate that it's the previous season. And we're going to use a left join instead of an inner join, because the left join is going to allow us to also see seasons where we didn't have previous data, or seasons where we don't have predictions yet.

If we do another sample of our new data frame, you'll see that we have all the information that we had in that original data set, with pass, complete pass, interception, sack, yards gained, and touchdown, but now we see a bunch of columns after it with the suffix prev, for previous season, so what the player did the season before. Sometimes you're gonna get NaNs, or null values, because we only brought in 2019, 2020 and 2021. So whenever you see 2019, since we don't have 2018 data, it's gonna say NaN. But for someone like Mitch Trubisky in 2020, we actually do have data from the previous season, so you can see here Mitch Trubisky threw 19 touchdowns, and the season before that, in 2019, he threw 17 touchdowns.

Okay, so now that we have a new data frame with previous season information, we're able to actually look at year-to-year correlation. That's to say, did the thing in the previous season help predict the touchdowns for the following season? You can see that it's not as clean as the correlations we saw before, but there are correlations here. For instance, the amount of touchdowns that you threw the previous season does have some correlation with the
touchdowns you're going to throw in the next season. The yards gained in the previous season does correlate with touchdowns. You can see that with complete pass too, the amount of passes you threw last season, and you can see some weaker correlations for interceptions here. Whereas in our previous correlation we saw it was pretty strongly correlated with touchdowns, interceptions thrown the year before don't correlate with touchdowns that well. And then we can also see that sacks have a relatively weak correlation, but there is some correlation here for sacks as well.

All right, so you made it this far. We've loaded in our data, we've taken an exploratory look at it, we've joined and made some new information so our data can now look at year-to-year correlations, and we've explored those. Now we're going to make our predictive model, our machine learning model, and this is how we're going to do it.

First we're going to bring in some new packages. The new packages we're going to load in are from scikit-learn and scipy. These are things that you should already have if you're using, say, the Anaconda distribution, or something like Colab, which is what I'm using. If you have to install them, feel free; it's just a pip install, or if you're using conda you could probably conda install this stuff.

So we'll take a brief look at what the data set we're going to use for our model looks like. Again, we're trying to predict touchdowns, and we're going to use previous season information to predict the next season's worth of touchdowns, so our data frame is set up for that. We see that we have touchdowns here and we have a bunch of columns ending in underscore prev, which indicates the previous season's performance.

We're gonna use something called a train test split. If you're not familiar with what a train test split is, Codebasics does have a pretty good tutorial and theory on what a train test split
is. This is a really important principle of machine learning. TL;DR: you don't want to test on the things you train on, because an overfit model will look better than it really is. If you're not familiar with it, I would say brush up on your machine learning a little and come back to this when you do understand a little about train test splitting.

All right, so again we're going to use previous year performance to predict touchdowns. We have our feature set here, which is all the previous-season performance statistics that quarterbacks had, and our target, the thing we're trying to predict, is touchdowns. Sometimes you see this as y and X, sometimes you see it in other ways; I'm using features and target. You might see different stylings or formats on Kaggle or in other Jupyter notebooks, but I'm using features and target here.

We're going to create a subset of data called model data, and that's because not all of our features and targets have values, right? Remember, for 2019 rows we're going to have a lot of null values for the previous-season data because we didn't bring in 2018. So we're going to eliminate all the rows that don't have numbers, the null values, from our data set, and that's what this dropna does. We drop all the nulls in our data set, and the subset is our feature set plus our target, so any row that's null in any of those gets dropped.

So we're going to set up our train data set, and that's looking at our 2020 data. We're just going to say, let's look at season, and all the rows where the season is 2020 we're going to use as our training data. Then for test data we're going to look at only 2021. So we're training on 2020 data; basically we're saying, given the previous season, 2019, what did someone do in 2020? And then in our test data we're going to give the model the 2020 data, the previous year for 2021.
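The null-dropping and season-based split described above can be sketched roughly as follows. This is a minimal illustration on a toy season-level table, not the notebook's actual code: every number is invented, the column names (`season`, `passer`, `yards_gained`, `touchdown`) are assumed from what is read aloud in the video, and the names `season_df`, `prev_df` and `merged` are hypothetical. The previous-season columns are rebuilt first (season plus one, then a left merge) so the block runs on its own.

```python
import pandas as pd

# Toy season-level table; every number below is invented for illustration.
season_df = pd.DataFrame({
    "season": [2019, 2020, 2021, 2019, 2020, 2021, 2021],
    "passer": ["A", "A", "A", "B", "B", "B", "C"],
    "yards_gained": [3800, 4100, 3900, 2500, 2900, 3100, 1200],
    "touchdown": [25, 30, 27, 14, 18, 20, 6],
})

# Previous-season columns: copy, shift season forward by one, then left-merge
# so that each row picks up the same passer's stats from the year before.
prev_df = season_df.copy()
prev_df["season"] = prev_df["season"] + 1
merged = season_df.merge(
    prev_df, on=["season", "passer"], how="left", suffixes=("", "_prev")
)

features = ["yards_gained_prev", "touchdown_prev"]
target = "touchdown"

# Drop rows with a null in any feature or in the target
# (all 2019 rows, plus passer C, who has no previous season here).
model_data = merged.dropna(subset=features + [target])

# Train on 2020 rows (2019 stats as features), test on 2021 rows (2020 stats).
train_data = model_data[model_data["season"] == 2020]
test_data = model_data[model_data["season"] == 2021]

print(train_data[["season", "passer"] + features + [target]])
print(test_data[["season", "passer"] + features + [target]])
```

Splitting by season rather than at random keeps the test set strictly in the future relative to the training set, which matches how the model would actually be used.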
Given the 2020 data, what they did in 2020, how are they going to do in 2021?

The next part is probably the easiest part about machine learning. A lot of people think, oh, this must take a lot of coding just to begin machine learning. It is one line, and here it is: model equals LinearRegression. That's all you need to do to initialize your first machine learning model.

The next thing you want to do is fit, or train, your model on data. In this case the data that we're training on is that train data set that we specified before, that 2020 data. The features we're going to train on are the features we also defined previously, all the previous season statistics from quarterbacks, and our target is touchdowns for the current season that we're looking at. So in this case, 2020 touchdowns: we're going to use 2019 data to predict 2020 touchdowns.

This next part is the prediction, which is kind of the exciting part of machine learning. You're going to take your test data, data that your model hasn't seen before because it wasn't trained on it, and we're going to use the same features, the previous season features. But remember, the model hasn't seen 2020 stats as inputs; it only knows about 2019, and it has predicted 2020.
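As a rough sketch of the initialize-and-fit step just described: the `train_data` frame below is a hypothetical stand-in for the 2020 training rows, with invented numbers and assumed column names, and only two of the previous-season features are used to keep it short.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the 2020 training rows; all numbers are invented.
train_data = pd.DataFrame({
    "yards_gained_prev": [3800, 2500, 4200, 1500, 3100],
    "touchdown_prev": [25, 14, 31, 8, 19],
    "touchdown": [30, 18, 33, 10, 22],
})

features = ["yards_gained_prev", "touchdown_prev"]  # previous-season stats
target = "touchdown"                                # current-season touchdowns

# Initializing the model is the one line mentioned in the video.
model = LinearRegression()

# Fit (train) on the training rows only.
model.fit(train_data[features], train_data[target])

# One learned coefficient per feature, plus an intercept.
print(dict(zip(features, model.coef_)))
print(model.intercept_)
```

Passing a two-dimensional selection (`train_data[features]`, a list of columns) for the features and a one-dimensional column for the target is the shape scikit-learn estimators expect.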
So now we're going to see if this holds up for the next season, in 2021. We're going to use our test data, so 2020 previous performance, to predict 2021 performance, and we're going to put that into a variable called preds, or predictions.

Whenever you create predictions, something that's really safe to do, especially if you're using pandas for all of your analysis, is to make sure you're setting your index back onto your predictions. When these predictions come out originally, they come out as a numpy array, and that numpy array doesn't have any of your pandas index values on it. An easy way to scramble up your predictions in a way that you might not want is switching between something that's indexless and something that does have an index. So a really safe thing that I do all the time is I take my predictions, I put them into a series, and I set my index. If that's a little confusing, I do recommend a tutorial on pandas data frames. I have one on my channel; you could take a look at that on sliced Basics.

After you make your predictions, and once you set your index on your pandas series here, you can actually just assign that to a column on your test data set. You don't need a join, you don't need a merge statement; it's just going to implicitly find the index values and insert those values into each of those rows.

Once we've trained our model, once we've made our predictions, once we've set those predictions back onto our test data set, we're functionally done with all the machine learning. All we've got to do now is the statistics part.

Statistics for linear regression are pretty straightforward. You're going to be using two statistics for the most part. There are a lot of statistics out there that you can use for linear regression or for any machine learning model metrics; there's a ton of them, but for linear regression two that people
commonly use are root mean squared error and r squared, which here is just a Pearson r, squared. Now, the root mean squared error is the mean squared error with the square root taken, or you can raise it to the 0.5; that gets you root mean squared error, and what we want to see is a really low number for that. For r squared we're taking Pearson r in this case. You could use Spearman r if you think your model is non-parametric, but if you don't know what either of those terms mean then I do suggest taking a look at parametric and non-parametric statistics. If you don't care what all this means, if you're like, you know, I'm good, Pearson is typically a fine choice for these problems. So for Pearson r in this case we're taking our actuals and our predictions, and we're going to square that, raise it to the power of 2. So we're going to have our RMSE and we're going to have an r squared, and we're just going to print those results.

The RMSE here is roughly 8.35, so we're talking about a plus or minus of about eight touchdowns, which sounds kind of big. Especially if you're trying to win a fantasy league or something, this is probably too much error. But this is a pretty simple model, so if you try to improve it with different features, or by learning from more history instead of just three seasons (there are 20 plus seasons of data out there to train on), you could probably bring that error down a lot. Root mean squared error is a measure of how much loss there is. And then r squared in this case is explained variance, so we are explaining 70% of the variance with the features that we include in this model, which is really good. Usually if you're able to explain 70% of the variance in any problem, you're really on to some strong signal there.

If we want to visualize our predictions and how they look compared to the actuals, this is what the graph looks like. The actual touchdowns are the
x-axis and the predictions are the y-axis, and you can see that it actually lines up pretty well, even though there are some pretty strong deviations here. Like, we're saying one guy's supposed to throw 30 touchdowns and he threw less than 20, and other times we're saying someone's gonna throw 25 touchdowns but they ended up throwing nearly 60. For the most part we're on the line, and some of these are right on the line, which indicates that for some of the quarterbacks we actually guessed pretty well.

If we take a look at the top 10 quarterbacks in terms of the touchdowns they threw, we can see what their predictions looked like. You can see here Matt Stafford; that was this prediction here, where we only predicted 28 touchdowns and he ended up throwing 54 touchdowns this season, so that was pretty off. But here's Pat Mahomes, where we predicted him to throw about 48 touchdowns and he ended up throwing 52, so that would be this one right here, where we're seeing 48 and he ended up with 52.
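The predict, re-index and score steps from the last few minutes can be sketched together like this. It is a minimal illustration on invented numbers, assuming a single hypothetical feature column `touchdown_prev`; the frames stand in for the 2020 training rows and 2021 test rows, and the printed metrics will not match the 8.35 and 70% quoted in the video.

```python
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

features = ["touchdown_prev"]
target = "touchdown"

# Invented stand-ins for the train (2020) and test (2021) rows.
# The non-consecutive index values mimic rows left over after filtering.
train_data = pd.DataFrame(
    {"touchdown_prev": [25, 14, 31, 8, 19], "touchdown": [30, 18, 33, 10, 22]},
    index=[3, 7, 11, 15, 19],
)
test_data = pd.DataFrame(
    {"touchdown_prev": [30, 18, 33, 10], "touchdown": [27, 20, 36, 9]},
    index=[4, 8, 12, 16],
)

model = LinearRegression().fit(train_data[features], train_data[target])

# predict() returns a plain numpy array with no pandas index attached...
preds = model.predict(test_data[features])

# ...so wrap it in a Series carrying the test index before assigning it back.
preds = pd.Series(preds, index=test_data.index)
test_data["preds"] = preds  # aligned by index, no join or merge needed

# Root mean squared error: square root of the mean squared error.
rmse = mean_squared_error(test_data[target], test_data["preds"]) ** 0.5

# r squared as used in the video: Pearson r between actuals and predictions, squared.
r, _ = pearsonr(test_data[target], test_data["preds"])
r2 = r ** 2

print(f"RMSE: {rmse:.2f}")
print(f"r squared: {r2:.2f}")
```

One caveat worth knowing: a squared Pearson r measures how well predictions track the actuals linearly, so it is not always the same number as `sklearn.metrics.r2_score`, which also penalizes predictions that are systematically too high or too low.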
So a lot of these do get pretty close and some of them are pretty far off, but overall, for our first simple predictive linear model, we did pretty well. We used a really simple set of features from last year's performance to predict the following season's touchdown totals, and we ended up having a decent model. Again, you can improve this model in a million different ways. You could use different features, you could use more data, you could use different modeling techniques like random forest or gradient boosted regressors. You could even use completely different things that we aren't showing in this tutorial, like different ways to set up priors if you're going to use a Bayesian technique, or different assumptions in how you tune your model's hyperparameters. Whatever you do, you can definitely beat this model. I'm interested to see how close people can get to these numbers. If you can beat me in RMSE or r squared, just list what you got below, and if you have code, post it so I can verify that it's true. You know, no lying in the comments section, all right? No lying in YouTube comments sections.

Thanks for hanging out with me through that tutorial. If you liked that content, and if you liked hanging out with me and my chat tonight, then please like, subscribe, all that good YouTube stuff. But if you really want to hang out with us and party, come check us out over on Twitch: that's twitch.tv slash Nick Wayne underscore data science; the link's down below. Also down below you can find the code for this tutorial. It's on GitHub, and you could also rip a clone of this notebook from Colab. Especially if you're using something like nflfastR, which is what we got this data from, you can absolutely get the data and have it in the exact same structure. You can find that code down below.
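For anyone taking up the challenge, here is one hedged sketch of the "different modeling techniques" suggestion: fitting a random forest alongside the linear model and comparing RMSE on the same held-out rows. The data, the column names and the `candidates` and `results` dictionaries are all hypothetical, and on a toy table this small the comparison says nothing about which model would do better on the real data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

features = ["yards_gained_prev", "touchdown_prev"]
target = "touchdown"

# Invented stand-ins for the train and test rows.
train_data = pd.DataFrame({
    "yards_gained_prev": [3800, 2500, 4200, 1500, 3100, 2800],
    "touchdown_prev": [25, 14, 31, 8, 19, 16],
    "touchdown": [30, 18, 33, 10, 22, 17],
})
test_data = pd.DataFrame({
    "yards_gained_prev": [4100, 2900, 4300, 1700],
    "touchdown_prev": [30, 18, 33, 10],
    "touchdown": [27, 20, 36, 9],
})

# Same train/test rows, different estimators behind the same fit/predict calls.
candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, candidate in candidates.items():
    candidate.fit(train_data[features], train_data[target])
    preds = pd.Series(candidate.predict(test_data[features]), index=test_data.index)
    results[name] = mean_squared_error(test_data[target], preds) ** 0.5

for name, rmse in results.items():
    print(f"{name}: RMSE {rmse:.2f}")
```

Because scikit-learn estimators share the same fit and predict interface, swapping in a gradient boosted regressor is a matter of adding one more entry to the dictionary.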
