Time Series Forecasting With XGBoost: Predicting Energy Consumption

1/12/2023
4-minute read

A project from the DATA 602 course at Drew University.

Project Overview: This project demonstrates the creation of a sophisticated Python-based machine learning model using XGBoost, specifically designed to predict energy consumption in Germany. By focusing on five years of detailed data (DE columns), the model provides precise insights into German energy usage trends.

Key Features: Link to my Colab notebook

Specialized Dataset: Centered on the German energy sector, the model harnesses data from the DE columns of a comprehensive dataset, available here, ensuring targeted and relevant predictions.
Time Series Analysis in Focus: Applied advanced time series analysis techniques to decipher the unique patterns and characteristics of German energy consumption data.
Data Pre-processing and Feature Engineering: Conducted thorough data pre-processing and feature engineering, tailoring these processes to the specifics of the German energy dataset, thereby enhancing the model’s accuracy and relevance.
XGBoost for Precision: Utilized the robust XGBoost algorithm, recognized for its outstanding performance in various machine learning applications, particularly in scenarios involving complex datasets.
Visualizing German Energy Trends: Presented the findings through compelling data visualizations, offering clear insights into the nuances of Germany’s energy consumption patterns.

I began this project by diving into the data loading and cleaning process, which was both intricate and crucial. The dataset, featuring 15-minute intervals, presented the challenge of missing values. To address this, I employed backfill and interpolation techniques, ensuring the data’s integrity and usability. This set the stage for my in-depth time series analysis.
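A minimal sketch of the gap-filling step, assuming pandas and a synthetic 15-minute series in place of the real DE columns (the name `load` and all values are made up for illustration). Time-based interpolation handles interior gaps, and backfill covers a gap at the very start of the series, where there is no earlier value to interpolate from.

```python
import numpy as np
import pandas as pd

# Synthetic one-day series at 15-minute resolution (hypothetical stand-in data).
idx = pd.date_range("2019-01-01", periods=96, freq="15min")
load = pd.Series(
    50_000 + 5_000 * np.sin(np.arange(96) * 2 * np.pi / 96),
    index=idx,
    name="load",
)

# Knock out the first value and a short interior run to mimic missing readings.
load.iloc[0] = np.nan
load.iloc[10:14] = np.nan

# Interpolate along the time axis; interior gaps are filled, leading ones are not.
filled = load.interpolate(method="time")

# Backfill whatever is still missing at the start of the series.
filled = filled.bfill()

print(filled.isna().sum())
```

The order is a choice, not a rule: interpolating first keeps backfill from flattening interior gaps into step functions.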

I resampled the data across various time frames - hourly, daily, weekly, and monthly. This approach was instrumental in revealing the underlying trends in energy generation from wind and solar sources, as well as patterns in energy consumption.
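The resampling step might look roughly like this in pandas. The column names (`load`, `solar`, `wind`) and the random values are placeholders, not the dataset's real columns, and taking the mean assumes the readings are power values rather than energy totals (for totals, `.sum()` would be the natural choice).

```python
import numpy as np
import pandas as pd

# 60 days of synthetic 15-minute readings (hypothetical columns and values).
idx = pd.date_range("2019-01-01", periods=4 * 24 * 60, freq="15min")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "load": rng.normal(55_000, 5_000, len(idx)),
        "solar": rng.uniform(0, 20_000, len(idx)),
        "wind": rng.uniform(0, 30_000, len(idx)),
    },
    index=idx,
)

# Average the 15-minute readings within each coarser time bucket.
hourly = df.resample("h").mean()
daily = df.resample("D").mean()
weekly = df.resample("W").mean()
monthly = df.resample("MS").mean()  # "MS" labels each bucket by month start

print(len(hourly), len(daily), len(weekly), len(monthly))
```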

A particularly fascinating aspect of my analysis was exploring the correlation between solar energy generation and the daylight trends specific to Germany. This exploration not only highlighted the nuances of renewable energy sources but also shed light on the potential and limitations of solar power in the German context. When I first laid eyes on the dataset, I was dealing with a jigsaw puzzle of energy numbers, all crammed into 15-minute slots. It felt like trying to read a novel through a keyhole.

By applying a seasonal decomposition over a 7-day period, the trend became much clearer, revealing the ebb and flow of energy consumption with precision.

This clarity was crucial as I constructed a predictive model. I chose the TimeSeriesSplit from sklearn for its effectiveness in validating time series forecasts: it splits the dataset into successive training and testing folds in which each test set comes after the data used for training, so the model is never evaluated on the past. My exploration of forecasting methods was thorough, ranging from naive approaches to sophisticated exponential smoothing techniques.
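To make the splitting behavior concrete, here is a small sketch with 100 dummy observations and `n_splits=5` (both numbers are arbitrary choices for illustration). Each fold trains on an expanding window and tests on the block that follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 100 time-ordered dummy observations (placeholder for the real feature matrix).
X = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(folds):
    # Training indices always precede test indices, so there is no look-ahead.
    print(
        f"fold {fold}: train 0..{train_idx[-1]}, "
        f"test {test_idx[0]}..{test_idx[-1]}"
    )
```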

The comparison of methods was illuminating:

The Naive method served as a simple benchmark but with notable room for improvement.
The Simple average method improved upon the Naive, though it still left some accuracy to be desired.
The Simple moving average forecast offered a dynamic look at trends but fluctuated with a higher margin of error.
The Simple exponential smoothing forecast homed in on the level of the series, achieving a better balance between accuracy and responsiveness.
Finally, the Holt-Winters additive method, which incorporates both trend and seasonality, yielded the most promising results, significantly reducing error margins and offering a model that could be trusted for its foresight.

Still somewhat skeptical about it all, I decided to give SARIMA a shot. To my surprise, it worked out really well. The predictions were on point, way better than I expected.

When it came to predicting with XGBoost, I started off by sticking to the basics, using the default machine learning settings. I tried tweaking the settings here and there, doing hyperparameter tuning, but it felt like I was getting nowhere, even after adding daily lag features to the dataset. I was aiming to forecast next year’s energy consumption based on the last five years of data, but it turned out I was a bit out of my depth, just spinning my wheels with hyperparameter tuning.
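Daily lag features of the kind mentioned above can be built with `shift` in pandas. This is a sketch on synthetic hourly data; the helper `add_daily_lags`, the column names, and the choice of 1-, 2-, and 7-day lags are all hypothetical.

```python
import numpy as np
import pandas as pd

# 30 days of synthetic hourly consumption (hypothetical stand-in data).
idx = pd.date_range("2019-01-01", periods=24 * 30, freq="h")
rng = np.random.default_rng(4)
load = pd.Series(rng.normal(55_000, 5_000, len(idx)), index=idx, name="load")


def add_daily_lags(series, days=(1, 2, 7)):
    """Return a frame with the series plus its value d days earlier, per lag."""
    frame = series.to_frame()
    for d in days:
        # On hourly data, d days back is 24 * d rows back.
        frame[f"lag_{d}d"] = series.shift(24 * d)
    # The first 7 days have no complete lag history, so drop them.
    return frame.dropna()


features = add_daily_lags(load)
print(features.head())
```

One thing to keep in mind with this design: a lag is only available at prediction time if it reaches back at least as far as the forecast horizon, so short lags cannot be computed directly for dates a full year ahead.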

Then my professor dropped a tip: why not just use the data from the most recent year? I was skeptical, but I took the advice and it worked wonders. Skipping the lag features and choosing the exact hourly features indicated by the feature importance selection made all the difference. By the end of the semester, I had learned something priceless about machine learning: sometimes less is more, and the right data is better than more data. Thank you for visiting this site. Happy reading!
