Time Series Forecasting with XGBoost - Use Python and machine learning to predict energy consumption

Intro

Hey YouTube, in this video we're going to talk about time series forecasting using machine learning and Python. Time series forecasting is a very common problem that you face as a data scientist: you have historic data and you want to predict into the future. In this video we're specifically going to be using a machine learning algorithm called XGBoost, and depending on who you ask, many believe XGBoost is one of the best out-of-the-box machine learning models for tabular data, and even for time series problems like this. What's even better, we're going to be working completely in a Kaggle notebook, so you can just click the copy button at the top of it and, if you have a Kaggle account, edit the same code that we're working with today and explore it yourself. My name is Rob, I make videos about data science, machine learning, and coding in Python. If you enjoyed this video please consider liking and subscribing; it really encourages me to keep making videos like this in the future. Okay, let's get into the code.

A little background on the data set we're going to be using for this tutorial: it's an hourly energy consumption data set that I uploaded to Kaggle a while ago. It's been very popular, and what it has is energy consumption for different regions in a portion of the country, with values at an hourly granularity for over 10 years. So here we are in a Kaggle notebook, and if I look over on the right side I can show you that we have this hourly energy consumption data set imported, which we'll be using a little bit later. But let's get started by doing some imports. We're going to import pandas as pd, import numpy as np, import matplotlib.pyplot as plt, and let's also import seaborn as sns. Then for our modeling we're going to import xgboost as xgb. This is going to be the model that we use for our forecasting.

Now, before we even get into the data, we need to talk about how there are different types of time series data. If we had time series data that was completely random, there'd be no point in modeling at all, but there can also be other trends in our data that we want to account for. There can be exponential growth, something like you might see in the stock market; an increasing or decreasing linear trend; seasonal patterns; and also a combination of any of these, such as seasonal patterns with growth. The type of data that we're going to look at today, we will see, is mainly seasonal. Sometimes people refer to this as the time series being stationary or non-stationary, but most time series won't actually fall into exactly one bucket. The XGBoost model we're going to use works pretty well with changes to the data over time, but you're going to have to account for this depending on what your data set looks like.

Data prep

Now let's go ahead and read in this data set using pandas' read_csv. We're going to open up the hourly energy consumption folder, read PJME_hourly.csv, and call the result df for data frame. If we quickly run a head command on this we can see the first few rows, and we see that we have the hourly energy consumption value going back to 2002; if I run a tail command on it I can see it goes all the way up to 2018.
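Putting that setup together, the notebook cell looks roughly like this. This is a minimal sketch that assumes the Kaggle dataset's file layout, with PJME_hourly.csv containing a Datetime column and a PJME_MW target column; check the file listing on your own copy of the notebook:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

# First color of the seaborn palette, reused in later plots
color_pal = sns.color_palette()

# Path assumes the hourly energy consumption dataset is attached to the notebook
df = pd.read_csv('../input/hourly-energy-consumption/PJME_hourly.csv')

df.head()  # first rows, going back to 2002
df.tail()  # last rows, up to 2018
```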
With pandas it's pretty common to set our index to be the datetime column, since that will be consistent in this time series data set, so let's go ahead and save that off. Then it's a good idea to actually visualize it, so let's make a plot with a style of dots instead of a line and give it a figure size. Let's also pull the color palette from seaborn so we can use it when we're plotting, put that up by our imports, and make the color the first color in that palette. We can give it a title, PJME Energy Use in MW, and show it. Now, another thing I'm noticing is that this data frame's index is just an object type, so we want to cast it as a datetime by running pandas' to_datetime on the index. Now our index dtype is actually a datetime64 type, which is much better for this case than just having it be a string. We'll do this when we load in our data, and now if we plot it, the x-axis looks a little easier to read because it's formatted as a datetime instead of a string value. I'm just going to split these lines to make them a little easier to read, and then we're going to go into the train/test split.

Now, if you're really building a model that you're going to productionize, there are ways to do full cross validation on a time series data set. We're not going to go into that in detail in this video, but I do have another video about cross validation that you should check out. Here, just for learning's sake, we're going to take our data and split on the date January 2015: everything prior to January 2015 will be our training data, and the following dates will be our test data. We can do that pretty easily by taking the index, finding where it's less than January 1st, 2015, locating those rows in our data set, and calling this train; similarly, we'll call it test where the index is greater than or equal to that date. Just to visualize this, I'm going to plot both of these in the same plot by making a matplotlib subplot. We're going to plot train and plot test, add labels to each, so the first one is the training set and the second one is the test set, and do plt.show. The colors will be different, and we can see that we have a split at January 2015. Let's also make a line on that date, with the color being black and the line style being dashed, and I'm going to add in a legend so the names look correct. Let's add a title that I'll call Train/Test Split. Beautiful.

Now, another thing we might want to look at while we're exploring this data is what one single week of data looks like. Let's take January 1st, 2010 up until January 8th, 2010; this should be just one week of data, so let's go ahead and plot it. We can notice a few things here. It looks like within each day there are two different peaks, which is pretty common in energy consumption, and there are also valleys during the nights. It also looks like there's a weekend effect, where each of these days (actually, January 1st would even be a holiday) will be affected by whether it's a weekday or a weekend. So that brings us to our next step.
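In code, the indexing, plotting, and split described above might look like this; a sketch that reuses the df and color_pal from the loading step:

```python
# Use the datetime column as the index and cast it to datetime64
df = df.set_index('Datetime')
df.index = pd.to_datetime(df.index)

# Plot the full series as dots
df.plot(style='.',
        figsize=(15, 5),
        color=color_pal[0],
        title='PJME Energy Use in MW')
plt.show()

# Train/test split on January 1st, 2015
train = df.loc[df.index < '01-01-2015']
test = df.loc[df.index >= '01-01-2015']

# Visualize the split with a dashed line at the cutoff date
fig, ax = plt.subplots(figsize=(15, 5))
train.plot(ax=ax, label='Training Set', title='Data Train/Test Split')
test.plot(ax=ax, label='Test Set')
ax.axvline(pd.to_datetime('01-01-2015'), color='black', ls='--')
ax.legend(['Training Set', 'Test Set'])
plt.show()
```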
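And the single-week view, again assuming the same frame:

```python
# Zoom in on one week to see the daily double peaks and overnight valleys
df.loc[(df.index > '01-01-2010') & (df.index < '01-08-2010')] \
    .plot(figsize=(15, 5), title='Week of Data')
plt.show()
```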
Feature creation

We're going to create some features with this data using the time series index. Luckily, pandas makes this very easy for us. If we take just the index, we see we have a list of all the dates, but we can actually use .hour on it and get a number value for each of these dates, which is just the hour component. So we're going to add this as a new column in our data set called hour. We'll do the same thing for the day of week with df.index.dayofweek. Now, these values will start, I think, on a Monday, but we can always look this up in the documentation: you see that for day of week, Monday is a zero and Sunday is a six. We can pull out the quarter, which splits the year into four groups, and then of course the month; we can do the year, and we can even do the day of year. So let's go ahead and add these in as features, and just to clean this up, we're going to make it into a function called create_features that takes in a data frame and returns the data frame with the features added. Let's also give it a quick docstring that says "Create time series features based on time series index", and we'll run this function on our data frame.

Now let's visualize our feature-to-target relationship. One of the ways we can visualize a feature versus our target is by using seaborn's box plots. Box plots are nice because they give you an idea of the distribution of the data. So we're going to give it the data of this data frame; our x variable is going to be the hour and our y variable is going to be PJME_MW. Let's give it a bigger fig size and a title, MW by Hour. We can see that early in the morning there seems to be a dip in energy use, and it tends to get higher later in the evening. Now we can do the same thing with month. Let's give it a different color palette, and there we can see that the megawatt usage by month tends to peak twice: once in the winter season (with the fall and spring lower), and another peak in the middle of summer when everyone's running their AC units.

Okay, now that we've created features, we know our target, and we have some idea of the relationship between the two, we're going to create our model. We're going to build it on the training data and evaluate it on the test data set. So let's actually import a metric; I forgot to do this earlier, but from sklearn.metrics we're going to import mean_squared_error as our metric. Mean squared error penalizes a prediction that's way off much more than one that's just a little bit off, but the type of metric you'll want for your own data set will really depend on what you're looking to do.

Model

Now, this is a regression task, so we're going to create a regression model using XGBoost's regressor, XGBRegressor. There are a lot of things you can tune in XGBoost, but we're going to start with the number of estimators; that's basically how many trees this boosted tree algorithm will create. We're going to set that to a thousand, and then we're going to fit this on our training set. But before we do, we need to take our training and test data sets and run them through the create_features function. I'm going to add df.copy() here in create_features; that'll make sure we're actually editing a copy of our data frame when we run it through, and it gets rid of that pandas warning. Then let's also define our features, which are all of the time series columns we created, and our target, which is the PJME_MW column.
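The feature function and box plots might look like this; a sketch reusing the df from above (the exact figure sizes and palette are just the choices from the walkthrough):

```python
def create_features(df):
    """
    Create time series features based on time series index.
    """
    df = df.copy()  # work on a copy to avoid the pandas SettingWithCopy warning
    df['hour'] = df.index.hour
    df['dayofweek'] = df.index.dayofweek  # Monday=0, Sunday=6
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year
    df['dayofyear'] = df.index.dayofyear
    return df

df = create_features(df)

# Distribution of the target by hour of day
fig, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(data=df, x='hour', y='PJME_MW', ax=ax)
ax.set_title('MW by Hour')
plt.show()

# Distribution of the target by month, with a different palette
fig, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(data=df, x='month', y='PJME_MW', palette='Blues', ax=ax)
ax.set_title('MW by Month')
plt.show()
```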
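And the metric import, feature list, and target definition, under the same column-name assumptions:

```python
from sklearn.metrics import mean_squared_error

# Run both splits through the feature creation function
train = create_features(train)
test = create_features(test)

FEATURES = ['hour', 'dayofweek', 'quarter', 'month', 'year', 'dayofyear']
TARGET = 'PJME_MW'
```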
Now we're going to make a features data set from our training data and call it X_train; that's just all of the feature columns from our training set. We're also going to make a y_train, which is just the target column. And we'll do the same thing with our test set. Now we can feed this through our model. We'll give our fit method X_train and y_train, and we're also going to give it an evaluation set, which will be both our (X_train, y_train) and (X_test, y_test). We'll have the model training stop early if the score on the test set does not improve after 50 trees, and we'll make this verbose. Actually, it's telling us that we need to set the early stopping rounds when we create the model itself. Now, as we train, we can see that the root mean squared error on the training set goes down as trees are added to the model, and the root mean squared error on the test, or validation, set starts to go down too, but then the validation score starts to get worse. This is overfitting, and that's what we'd like to avoid. Early stopping will halt model training once it sees this occur, since we've given it an evaluation set; another thing we can do is lower our learning rate to make sure it doesn't overfit too quickly. Let's try this again, and for verbose we can give it a number instead of True, which tells it to print the training and validation scores only every 100 trees built. You can see it stopped after 436 trees; that's because our test, or validation, set actually started to get worse after that many trees were built.

Feature Importance

Now, one of the nice things about our model, now that it's trained with XGBoost, is that we can check out the feature importances. We do that by accessing feature_importances_ on the regressor we've created, and that gives us importance values based on how much each feature was used across the trees built by the XGBoost model. These values by themselves aren't very helpful, so let's make a pandas data frame where the data is these feature importances and the index is the feature names. Let's also name the column importance and call this data frame fi for short. We can sort the values by importance and plot them as a horizontal bar plot with the title Feature Importance. Now we can see that the model has really been using the hour and month features, the day of week and day of year features less, and year is down at the bottom. There's some overlap in these types of features; if we removed month, day of year would just be used in its place. So keep in mind that when you have highly correlated features, this feature importance functionality really won't tell you exactly how important each feature is individually, more so how they're used collectively in this particular model. There are other packages out there for exploring feature importances further, but this gives us a good idea of what our model is using.
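Here's what the training cell might look like. Note that where early_stopping_rounds goes has moved between xgboost versions; recent versions take it in the constructor, as sketched here:

```python
X_train = train[FEATURES]
y_train = train[TARGET]

X_test = test[FEATURES]
y_test = test[TARGET]

# Recent xgboost versions take early_stopping_rounds on the model itself
# rather than as a fit() argument
reg = xgb.XGBRegressor(n_estimators=1000,
                       early_stopping_rounds=50,
                       learning_rate=0.01)
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=100)  # print train/validation RMSE every 100 trees
```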
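And the importance plot; a sketch that indexes the importances by the FEATURES list defined earlier:

```python
# Wrap the raw importance values in a data frame for plotting
fi = pd.DataFrame(data=reg.feature_importances_,
                  index=FEATURES,
                  columns=['importance'])
fi.sort_values('importance').plot(kind='barh', title='Feature Importance')
plt.show()
```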
Forecast

All right, now we're going to forecast on the test set with our trained model. We can do this simply by taking our regressor and calling predict on our X_test set, and we're given back a NumPy array of all the predictions for the test set. Let's take our test data and make a new column called prediction where we'll store these predictions. Then, because I'd like to see these next to all of the training data, let's merge this onto the main data frame. We'll do how equals left, and left_index and right_index both True, to say that we'll merge these two data frames on their index columns; we do that with the test set, and since we don't want to copy over all the features, we'll merge over just the prediction column. Now, in the main data set we started with, we have a prediction column for our test period, and if we plot PJME_MW and plot our prediction, we can see our raw data and predictions together. Let's give it a legend. Putting this all together, we can see our predictions plotted on top of the training data, and similar to what we did before, let's take a look at one week of predictions, but we'll have to use 2018 because that's in the test set. So what I've done here is plot the predictions and the ground truth over one week, and you can see the model isn't perfect; there's a lot of improvement that can be made. Some ideas include better parameter tuning (we did not tune this model at all), and adding features for specific days of the year, like holidays, which might either increase or decrease the energy use predicted for those days. There's a lot that can be done to make this better, but you can see that our predictions on the test set for this week do follow the trend you'd expect, going up and down and dipping during the nights.

We can even run our evaluation metric on these test predictions. Let's use mean_squared_error, which takes first the true values and then the predictions, so that's our test PJME_MW column and then our prediction column. I'm actually going to take the square root of the mean squared error; this gives us the root mean squared error, the same RMSE metric we were using to evaluate the model as it trained. Our root mean squared error on the test set is 3,714. To improve this model we'd want to reduce that score, so I'm going to print it out here, formatted to two decimal places, which looks pretty good.

Another thing we can do is just calculate the error. Let's take our test data set's target value, subtract our prediction column, and then take the absolute value, so that negative and positive don't matter and we get a general error value for each of our predictions. Then let's look at the worst and best predicted days. What I'm going to do in the test set is take the index and pull out the date, so each timestamp maps to its calendar day, and make that a new column. If we group by date, take our error, and compute the mean, this gives us the average error for each day we've predicted. If we sort values with ascending equals False and take the first five, we can see that the worst predicted days all seem to be in the middle of August 2016; if I sort the opposite way, with ascending equals True, we can see that some of the best predictions were also made in 2016. By calculating error we can see which dates we actually predict worst and try to improve those going forward.
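The prediction, merge, plotting, and scoring steps might look like this; a sketch that continues with the frames and names from above (the specific 2018 week shown is an arbitrary choice, since the transcript doesn't pin one down):

```python
# Predict on the test features and store the result on the test frame
test['prediction'] = reg.predict(X_test)

# Merge only the prediction column back onto the full data set by index
df = df.merge(test[['prediction']], how='left',
              left_index=True, right_index=True)

# Plot raw data with predictions overlaid
ax = df[['PJME_MW']].plot(figsize=(15, 5))
df['prediction'].plot(ax=ax, style='.')
plt.legend(['Truth Data', 'Predictions'])
ax.set_title('Raw Data and Predictions')
plt.show()

# Zoom in on one week inside the test period (week chosen arbitrarily)
week = df.loc[(df.index > '04-01-2018') & (df.index < '04-08-2018')]
ax = week['PJME_MW'].plot(figsize=(15, 5), title='Week of Data')
week['prediction'].plot(ax=ax, style='.')
plt.legend(['Truth Data', 'Prediction'])
plt.show()

# RMSE on the test set (square root of mean squared error)
score = np.sqrt(mean_squared_error(test['PJME_MW'], test['prediction']))
print(f'RMSE Score on Test set: {score:0.2f}')
```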
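And the per-day error analysis:

```python
# Absolute error per row, then averaged per calendar day
test['error'] = np.abs(test[TARGET] - test['prediction'])
test['date'] = test.index.date

# Five worst predicted days
test.groupby('date')['error'].mean().sort_values(ascending=False).head(5)

# Five best predicted days
test.groupby('date')['error'].mean().sort_values(ascending=True).head(5)
```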
Now, in terms of next steps if you were actually running this yourself: you'd want to create a more robust cross validation, and add more features if you can get them from external sources, like the weather forecast or holidays, and see how they improve things. Thanks so much for watching this quick tutorial on how to use machine learning for time series forecasting. If you liked this video, please consider liking and subscribing; that way you'll get alerted every time I create a new video. Let me know in the comments if you have any feedback or things you'd like to see me make videos about in the future. See you next time!
