Data collection and pre-processing
Data Sources
This study utilizes a combination of
publicly available and proprietary datasets to ensure a comprehensive analysis
of energy consumption patterns and sustainability indicators. Primary data
sources include energy consumption datasets from the U.S. Energy Information
Administration (EIA), the European Energy Exchange (EEX), and renewable energy
production datasets from the International Renewable Energy Agency (IRENA).
Additionally, smart meter datasets, containing high-frequency energy usage
records from residential, commercial, and industrial sectors, are incorporated
to capture fine-grained consumption behaviours. Climate-related data, such as
temperature, humidity, and solar irradiance, are also collected from sources
like NASA’s POWER Project to support renewable energy forecasting models. Where
necessary, synthetic datasets are generated through data augmentation
techniques to simulate underrepresented scenarios, such as rare energy demand
spikes or supply failures. This approach ensures the models are exposed to a
wide range of operational conditions during training and evaluation.
Data Pre-processing
Prior to model development, extensive data
pre-processing steps are performed to enhance data quality and ensure model
reliability. Initially, missing values are handled using a combination of
imputation techniques such as forward filling, backward filling, and K-nearest
neighbours (KNN) imputation, depending on the nature and distribution of the
missing data. Outliers and anomalies in consumption patterns are detected using
statistical methods like the Interquartile Range (IQR) method and isolation
forests, and are either corrected or removed based on their impact on model
learning. Categorical variables such as building type, location, and energy
source are encoded using one-hot encoding for non-ordinal categories and label
encoding for ordinal categories. Continuous variables are normalized using
Min-Max scaling to bring all features into a uniform range, facilitating faster
model convergence.
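The cleaning steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the study's actual pipeline: the function names, the sample readings, and the 1.5 multiplier on the IQR are assumptions chosen for the example, and KNN imputation and isolation forests are omitted.

```python
from statistics import quantiles


def fill_forward_backward(values):
    """Forward-fill None gaps, then backward-fill any gaps left at the start."""
    filled = list(values)
    last = None
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = last
        else:
            last = v
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """One-hot encode non-ordinal labels; columns follow sorted category order."""
    categories = sorted(set(labels))
    return categories, [[int(lab == c) for c in categories] for lab in labels]


# Illustrative (made-up) hourly readings with gaps and one spike.
readings = [None, 10.0, None, 12.0, 11.0, 95.0, 13.0, None]
filled = fill_forward_backward(readings)
mask = iqr_outlier_mask(filled)
cleaned = [v for v, is_outlier in zip(filled, mask) if not is_outlier]
scaled = min_max_scale(cleaned)
categories, encoded = one_hot(["office", "retail", "office"])
```

Here the outlier is removed before scaling, because a single extreme value would otherwise compress all remaining values towards zero under Min-Max scaling.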
Furthermore, time-series data undergo feature
engineering to extract relevant temporal features such as hour of the day, day
of the week, and seasonality indicators. Lag features and rolling averages are
generated to capture temporal dependencies in energy usage. In cases where the
datasets exhibit significant class imbalance, particularly in predictive
maintenance and anomaly detection tasks, Synthetic Minority Over-sampling
Technique (SMOTE) is applied to balance the training data. Finally, the
datasets are split into training, validation, and testing sets, typically
following a 70:15:15 ratio, ensuring that model evaluation is based on unseen
data to provide a realistic estimate of generalization performance.
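The temporal feature engineering and the 70:15:15 split can be sketched as follows. The helper names, the choice of lags (1 and 24 steps) and the 3-step rolling window are hypothetical. The split here is chronological, which is an assumption: the text does not state how rows are assigned to each set, but keeping time order is a common way to stop lagged features leaking future values into training.

```python
from datetime import datetime, timedelta


def build_features(timestamps, load, lags=(1, 24), window=3):
    """Return one feature dict per time step that has full lag and rolling history."""
    start = max(max(lags), window)
    rows = []
    for t in range(start, len(load)):
        row = {
            "hour": timestamps[t].hour,
            "day_of_week": timestamps[t].weekday(),  # Monday = 0
            "is_weekend": int(timestamps[t].weekday() >= 5),
            "target": load[t],
        }
        for lag in lags:
            row[f"lag_{lag}"] = load[t - lag]
        # Rolling mean over the previous `window` values only (excludes step t),
        # so the feature never contains the value being predicted.
        row[f"roll_mean_{window}"] = sum(load[t - window:t]) / window
        rows.append(row)
    return rows


def chronological_split(rows, train_pct=70, val_pct=15):
    """Split time-ordered rows into train/validation/test without shuffling."""
    n = len(rows)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return rows[:i], rows[i:j], rows[j:]


# Illustrative hourly series: 124 hours starting on a Monday (made-up values).
start_time = datetime(2024, 1, 1)
timestamps = [start_time + timedelta(hours=h) for h in range(124)]
load = [float(h) for h in range(124)]

rows = build_features(timestamps, load)
train, val, test = chronological_split(rows)
```

If SMOTE or any scaler is used, fitting it on the training portion only (after the split) avoids leaking information from the validation and test sets.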
Exploratory Data Analysis (EDA)
To gain initial insights into the energy consumption
data and understand the underlying patterns, an extensive Exploratory Data
Analysis (EDA) was conducted. This step helps in identifying trends, anomalies,
correlations, and structures within the dataset, thereby informing subsequent
modelling decisions. Visualizations such as distribution plots, time-series
plots, heatmaps, and correlation matrices were generated to systematically
explore the data.
The histogram and Kernel Density Estimation (KDE)
curve reveal that the energy consumption distribution is slightly right-skewed
(Figure 1). Most consumption values cluster around the lower to mid-range, with
fewer instances of very high consumption. This skewness suggests that while
typical usage is moderate, there are occasional peaks possibly due to external
factors such as extreme weather or industrial activities. Recognizing this
distribution is crucial because skewness may necessitate log-transformations or
normalization during model development to improve prediction accuracy. The time
series plot (Figure 2) demonstrates noticeable cyclical patterns in energy
consumption. Peaks and troughs correspond to daily, weekly, or seasonal cycles,
indicating a strong temporal dependency. This observation justifies the use of
time-series forecasting models such as LSTM and ARIMA in the model development
phase. Additionally, periodic drops and spikes suggest the need to incorporate
holiday calendars or weather data to explain certain anomalies.
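As a small illustration of why a log-transformation is a candidate for right-skewed consumption data, the sketch below computes the moment-based sample skewness before and after applying log(1 + x). The consumption values are made up for the example and are not taken from the study's datasets.

```python
import math


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (positive means a longer right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Illustrative values: mostly moderate usage with a few high peaks.
consumption = [12, 14, 15, 15, 16, 18, 20, 22, 35, 80]
log_consumption = [math.log1p(v) for v in consumption]

raw_skew = sample_skewness(consumption)
log_skew = sample_skewness(log_consumption)
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

The transformed series remains right-skewed in this toy example, but less so, which is the effect the transformation is intended to have.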
The correlation heatmap (Figure 3) highlights the relationships
between different variables in the dataset. Strong positive correlations are
observed between energy consumption and external temperature variables,
suggesting that heating and cooling demands significantly impact consumption
patterns. Similarly, occupancy rates (for building-related datasets) or
production rates (for industrial datasets) show moderate correlation with
energy use. Features that correlate strongly with energy consumption will be
prioritized during feature selection, while features that are highly collinear
with one another may be removed to avoid redundancy and multicollinearity
issues. The boxplot (Figure 4) reveals significant
differences in energy consumption patterns across days of the week. Weekdays
(Monday to Friday) exhibit higher and more variable energy usage
than weekends. This is expected in organizational or commercial
environments where energy demand drops on non-business days. Such a pattern
reinforces the idea that calendar-based features (like day of week or holiday
indicators) are important predictors for the machine learning models. The
seasonal decomposition (Figure 5) separates the time series into trend,
seasonal, and residual components. The trend component shows a steady rise in
energy consumption, possibly due to growth in operations or environmental
changes. The seasonal component uncovers recurring patterns, such as higher
consumption in summer or winter months due to HVAC usage. Understanding these
components separately provides strong motivation for seasonally aware
predictive modelling, such as seasonal ARIMA or Prophet models, that explicitly
captures these periodicities.
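The correlation-based feature selection discussed above can be sketched as a simple greedy filter. The feature names, the toy values, and the 0.9 collinearity threshold are assumptions made for illustration; the study does not specify a threshold.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(features, target, collinear_threshold=0.9):
    """Rank features by |r| with the target, then skip any feature that is
    highly collinear with a feature that has already been kept."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(
            abs(pearson(features[name], features[k])) < collinear_threshold
            for k in kept
        ):
            kept.append(name)
    return kept


# Illustrative data: temperature in two units is perfectly collinear.
temp_c = [20, 22, 25, 27, 30, 32, 35, 33]
features = {
    "temp_c": temp_c,
    "temp_f": [c * 9 / 5 + 32 for c in temp_c],
    "occupancy": [0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.5],
}
consumption = [40, 43, 52, 55, 63, 66, 74, 70]

kept = select_features(features, consumption)
```

Only one of the two temperature columns survives, since the second adds no information once the first is included.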
The scatter plot
(Figure 6) demonstrates a U-shaped relationship between temperature and energy
consumption. Energy usage increases significantly during both very low and very
high temperatures, likely due to heating and cooling demands, respectively.
This nonlinear behaviour implies that simple linear models might not capture
the relationship accurately, and hence non-linear models or polynomial terms
could enhance predictive performance. The missing values heatmap (Figure 7)
provides a visual assessment of data completeness. While most variables show
minimal missingness, a few intermittent gaps are observed in temperature and
occupancy data. Appropriate imputation strategies such as interpolation for
time-series data or median replacement for categorical periods are necessary to
ensure that the models are not biased or degraded by incomplete
records.
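To illustrate why a U-shaped temperature response is poorly served by a purely linear term, the sketch below compares the Pearson correlation of consumption with raw temperature against its correlation with the squared deviation from a balance temperature. The synthetic data and the 18 °C balance point are assumptions for the example only, not values from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


BALANCE_TEMP_C = 18  # assumed temperature at which HVAC demand is lowest

# Synthetic U-shaped response: demand rises on both sides of the balance point.
temps = list(range(-5, 36, 5))  # -5, 0, ..., 35 degrees Celsius
consumption = [30 + 0.1 * (t - BALANCE_TEMP_C) ** 2 for t in temps]

# Polynomial feature: squared deviation from the balance temperature.
temp_dev_sq = [(t - BALANCE_TEMP_C) ** 2 for t in temps]

r_linear = pearson(temps, consumption)
r_quadratic = pearson(temp_dev_sq, consumption)
print(f"r with temperature: {r_linear:.2f}, r with squared deviation: {r_quadratic:.2f}")
```

A weak linear correlation therefore does not mean temperature is uninformative; the squared term (or a non-linear model) recovers the relationship.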

Figure 1: Distribution of energy consumption.

Figure 2: Time Series Plot of Energy Consumption over Time.