一、ARIMA模型概述
ARIMA 模型是一种流行且广泛使用的时间序列预测统计方法。ARIMA 是一个首字母缩写词,代表 AutoRegressive Integrated Moving Average。
ARIMA 的 AR 部分显示时间序列是基于其自身过去的数据进行回归的。ARIMA 的 MA 部分表示预测误差是过去各自误差的线性组合。ARIMA的I部分表明数据值已被替换为d阶的差分值以获得平稳数据,这是 ARIMA 模型方法的要求。ARIMA 模型可以有效地使用这种组合方法拟合过去的数据,并有助于预测时间序列中的未来点。
二、数据集可视化
数据取自 GEFCom2014 预测竞赛。它由 2012 年至 2014 年间 3 年的每小时电力负荷和温度值组成。
1、数据集的下载地址
链接:https://pan.baidu.com/s/1k_7wlGjScYLM8wctryC4mQ
提取码:qe7m
2、首先创建工具类utils.py,我这里是放在common文件夹下。
import numpy as np
import pandas as pd
import os
from collections import UserDict
def load_data(data_dir):
"""Load the GEFCom 2014 energy load data"""
energy = pd.read_csv(os.path.join(data_dir, 'energy.csv'), parse_dates=['timestamp'])
# Reindex the dataframe such that the dataframe has a record for every time point
# between the minimum and maximum timestamp in the time series. This helps to
# identify missing time periods in the data (there are none in this dataset).
energy.index = energy['timestamp']
energy = energy.reindex(pd.date_range(min(energy['timestamp']),
max(energy['timestamp']),
freq='H'))
energy = energy.drop('timestamp', axis=1)
return energy
def mape(predictions, actuals):
"""Mean absolute percentage error"""
predictions = np.array(predictions)
actuals = np.array(actuals)
return (np.absolute(predictions - actuals) / actuals).mean()
def create_evaluation_df(predictions, test_inputs, H, scaler):
"""Create a data frame for easy evaluation"""
eval_df = pd.DataFrame(predictions, columns=['t+'+str(t) for t in range(1, H+1)])
eval_df['timestamp'] = test_inputs.dataframe.index
eval_df = pd.melt(eval_df, id_vars='timestamp', value_name='prediction', var_name='h')
eval_df['actual'] = np.transpose(test_inputs['target']).ravel()
eval_df[['prediction', 'actual']] = scaler.inverse_transform(eval_df[['prediction', 'actual']])
return eval_df
class TimeSeriesTensor(UserDict):
"""A dictionary of tensors for input into the RNN model.
Use this class to:
1. Shift the values of the time series to create a Pandas dataframe containing all the data
for a single training example
2. Discard any samples with missing values
3. Transform this Pandas dataframe into a numpy array of shape
(samples, time steps, features) for input into Keras
The class takes the following parameters:
- **dataset**: original time series
- **target** name of the target column
- **H**: the forecast horizon
- **tensor_structures**: a dictionary describing the tensor structure of the form
{ 'tensor_name' : (range(max_backward_shift, max_forward_shift), [feature, feature, ...] ) }
if features are non-sequential and should not be shifted, use the form
{ 'tensor_name' : (None, [feature, feature, ...])}
- **freq**: time series frequency (default 'H' - hourly)
- **drop_incomplete**: (Boolean) whether to drop incomplete samples (default True)
"""
def __init__(self, dataset, target, H, tensor_structure, freq='H', drop_incomplete=True):
self.dataset = dataset
self.target = target
self.tensor_structure = tensor_structure
self.tensor_names = list(tensor_structure.keys())
self.dataframe = self._shift_data(H, freq, drop_incomplete)
self.data = self._df2tensors(self.dataframe)
def _shift_data(self, H, freq, drop_incomplete):
# Use the tensor_structures definitions to shift the features in the original dataset.
# The result is a Pandas dataframe with multi-index columns in the hierarchy
# tensor - the name of the input tensor
# feature - the input feature to be shifted
# time step - the time step for the RNN in which the data is input. These labels
# are centred on time t. the forecast creation time
df = self.dataset.copy()
idx_tuples = []
for t in range(1, H+1):
df['t+'+str(t)] = df[self.target].shift(t*-1, freq=freq)
idx_tuples.append(('target', 'y', 't+'+str(t)))
for name, structure in self.tensor_structure.items():
rng = structure[0]
dataset_cols = structure[1]
for col in dataset_cols:
# do not shift non-sequential 'static' features
if rng is None:
df['context_'+col] = df[col]
idx_tuples.append((name, col, 'static'))
else:
for t in rng:
sign = '+' if t > 0 else ''
shift = str(t) if t != 0 else ''
period = 't'+sign+shift
shifted_col = name+'_'+col+'_'+period
df[shifted_col] = df[col].shift(t*-1, freq=freq)
idx_tuples.append((name, col, period))
df = df.drop(self.dataset.columns, axis=1)
idx = pd.MultiIndex.from_tuples(idx_tuples, names=['tensor', 'feature', 'time step'])
df.columns = idx
if drop_incomplete:
df = df.dropna(how='any')
return df
def _df2tensors(self, dataframe):
# Transform the shifted Pandas dataframe into the multidimensional numpy arrays. These
# arrays can be used to input into the keras model and can be accessed by tensor name.
# For example, for a TimeSeriesTensor object named "model_inputs" and a tensor named
# "target", the input tensor can be acccessed with model_inputs['target']
inputs = {}
y = dataframe['target']
y = y.as_matrix()
inputs['target'] = y
for name, structure in self.tensor_structure.items():
rng = structure[0]
cols = structure[1]
tensor = dataframe[name][cols].as_matrix()
if rng is None:
tensor = tensor.reshape(tensor.shape[0], len(cols))
else:
tensor = tensor.reshape(tensor.shape[0], len(cols), len(rng))
tensor = np.transpose(tensor, axes=[0, 2, 1])
inputs[name] = tensor
return inputs
def subset_data(self, new_dataframe):
# Use this function to recreate the input tensors if the shifted dataframe
# has been filtered.
self.dataframe = new_dataframe
self.data = self._df2tensors(self.dataframe)
3、然后创建visualize_data.py文件。
import os
import matplotlib.pyplot as plt
from common.utils import load_data
data_dir = './data'
energy = load_data(data_dir)[['load']]
print(energy.head())
数据的前几条长下面这个样
load
2012-01-01 00:00:00 2698.0
2012-01-01 01:00:00 2558.0
2012-01-01 02:00:00 2444.0
2012-01-01 03:00:00 2402.0
2012-01-01 04:00:00 2403.0
绘制曲线图
绘制2014年第一周
energy['2014-07-01':'2014-07-07'].plot(y='load', subplots=True, figsize=(15, 8), fontsize=12)
plt.xlabel('timestamp', fontsize=12)
plt.ylabel('load', fontsize=12)
plt.show()
三、使用ARIMA模型
1、Statsmodels简介
包含在statsmodels中的一些模型:
线性模型,广义线性模型和鲁棒线性模型
线性混合效应模型
方差分析(ANOVA)方法
时间序列过程和状态空间模型
广义的矩量法
2、参考训练代码
创建一个名为statsmodels_train.py的文件。
import os
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import datetime as dt
import math
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.preprocessing import MinMaxScaler
from common.utils import load_data, mape
from IPython.display import Image
pd.options.display.float_format = '{:,.2f}'.format
np.set_printoptions(precision=2)
warnings.filterwarnings("ignore") # specify to ignore warning messages
energy = load_data('./data')[['load']]
energy.head(10)
energy.plot(y='load', subplots=True, figsize=(15, 8), fontsize=12)
plt.xlabel('timestamp', fontsize=12)
plt.ylabel('load', fontsize=12)
plt.show()
train_start_dt = '2014-11-01 00:00:00'
test_start_dt = '2014-12-30 00:00:00'
energy[(energy.index < test_start_dt) & (energy.index >= train_start_dt)][['load']].rename(columns={'load':'train'}) \
.join(energy[test_start_dt:][['load']].rename(columns={'load':'test'}), how='outer') \
.plot(y=['train', 'test'], figsize=(15, 8), fontsize=12)
plt.xlabel('timestamp', fontsize=12)
plt.ylabel('load', fontsize=12)
plt.show()
train = energy.copy()[(energy.index >= train_start_dt) & (energy.index < test_start_dt)][['load']]
test = energy.copy()[energy.index >= test_start_dt][['load']]
print('Training data shape: ', train.shape)
print('Test data shape: ', test.shape)
scaler = MinMaxScaler()
train['load'] = scaler.fit_transform(train)
train.head(10)
energy[(energy.index >= train_start_dt) & (energy.index < test_start_dt)][['load']].rename(columns={'load':'original load'}).plot.hist(bins=100, fontsize=12)
train.rename(columns={'load':'scaled load'}).plot.hist(bins=100, fontsize=12)
plt.show()
test['load'] = scaler.transform(test)
test.head()
# Specify the number of steps to forecast ahead
HORIZON = 3
print('Forecasting horizon:', HORIZON, 'hours')
order = (4, 1, 0)
seasonal_order = (1, 1, 0, 24)
model = SARIMAX(endog=train, order=order, seasonal_order=seasonal_order)
results = model.fit()
print(results.summary())
test_shifted = test.copy()
for t in range(1, HORIZON+1):
test_shifted['load+'+str(t)] = test_shifted['load'].shift(-t, freq='H')
test_shifted = test_shifted.dropna(how='any')
test_shifted.head(5)
training_window = 720 # dedicate 30 days (720 hours) for training
train_ts = train['load']
test_ts = test_shifted
history = [x for x in train_ts]
history = history[(-training_window):]
predictions = list()
order = (2, 1, 0)
seasonal_order = (1, 1, 0, 24)
for t in range(test_ts.shape[0]):
model = SARIMAX(endog=history, order=order, seasonal_order=seasonal_order)
model_fit = model.fit()
yhat = model_fit.forecast(steps = HORIZON)
predictions.append(yhat)
obs = list(test_ts.iloc[t])
# move the training window
history.append(obs[0])
history.pop(0)
print(test_ts.index[t])
print(t+1, ': predicted =', yhat, 'expected =', obs)
最后预测时打印预测结果和真实结果。
predicted = [0.79 0.73 0.63 0.49] expected = [0.7721575649059982, 0.7023276633840643, 0.6195165622202325, 0.5425246195165621]
四、使用支持向量回归器
同样使用SVR应用在当前数据集上,参考代码如下。
评论(0)
您还未登录,请登录后发表或查看评论