Data structures and containers to use for aeon estimators¶
aeon includes algorithms for time series forecasting and machine learning. These two communities have different conventions for how to store data and what to call data structures. Some of the differences are:
Forecasters almost always store data in pandas data structures, whereas machine learners use numpy arrays almost exclusively.
In forecasting, 2D data almost always has shape (n_timepoints, n_timeseries), whereas in machine learning we would tend to store data in an (n_timeseries, n_timepoints) array.
In forecasting, a variable y refers to a time series for which we are attempting to make a forecast, hence y is assumed to be ordered. In machine learning, y is a list of either class labels (for classification) or observations of a response variable (for regression). The ordering of values in y is determined by the ordering of the X input.
Because of these sources of confusion, we recommend that you store data in pandas data structures for forecasting and in numpy arrays for machine learning. We support other data containers; see the data conversion page for more information.
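The difference in shape conventions can be made concrete with a small sketch. It uses only numpy and pandas; the variable and column names are illustrative and are not part of aeon.

```python
import numpy as np
import pandas as pd

# Two series observed at five time points, stored the machine learning way:
# shape (n_timeseries, n_timepoints)
ml_style = np.array(
    [
        [20.0, 40.0, 60.0, 80.0, 100.0],
        [100.0, 90.0, 80.0, 70.0, 60.0],
    ]
)

# The forecasting convention is the transpose, (n_timepoints, n_timeseries),
# with one column per series. The column names are made up for this sketch.
forecasting_style = pd.DataFrame(ml_style.T, columns=["series_0", "series_1"])

print(ml_style.shape)           # (2, 5)
print(forecasting_style.shape)  # (5, 2)
```

Keeping track of which axis is time is the main thing to check when moving data between the two kinds of estimator.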
Forecasting data¶
aeon forecasting uses pd.Series, pd.DataFrame and pd.MultiIndex to store data. A pd.Series is used to store a univariate time series, with entries corresponding to different time points.
[2]:
# Forecasting data in a pandas.Series
import numpy as np
import pandas as pd
from aeon.forecasting.trend import TrendForecaster
y = pd.Series([20.0, 40.0, 60.0, 80.0, 100.0])
forecaster = TrendForecaster()
forecaster.fit(y) # fit the forecaster
forecaster.predict(fh=[1, 2, 3]) # forecast the next 3 values
[2]:
5 120.0
6 140.0
7 160.0
dtype: float64
A pd.DataFrame is used to store multiple time series, where each column is a time series and each row corresponds to a distinct time point. The index holds the time points and should be monotonic. The example below creates two series called Sales and Temperature and stores observations for time points 0, 1, 2, 3, 4, 5.
[3]:
ice_creams = {
"Sales": [111, 100, 90, 80, 65, 89],
"Temperature": [26, 21, 19, 14, 12, 22],
}
# Create DataFrame
ice_creams = pd.DataFrame(ice_creams)
print(ice_creams)
from aeon.forecasting.exp_smoothing import ExponentialSmoothing
forecaster = ExponentialSmoothing()
forecaster.fit(ice_creams)
forecaster.predict(fh=[1, 2, 3])
Sales Temperature
0 111 26
1 100 21
2 90 19
3 80 14
4 65 12
5 89 22
ModuleNotFoundError: ExponentialSmoothing requires package 'statsmodels' to be present in the python environment, but 'statsmodels' was not found. 'statsmodels' is a soft dependency and not included in the base aeon installation. Please run: `pip install statsmodels` to install the statsmodels package. To install all soft dependencies, run: `pip install aeon[all_extras]`
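Because ExponentialSmoothing depends on the optional statsmodels package, it can help to check for the package before constructing the forecaster. The following is a minimal sketch using only the standard library; the helper name has_package is hypothetical and is not an aeon function.

```python
from importlib.util import find_spec


def has_package(name):
    """Return True if a top-level package can be found in this environment."""
    return find_spec(name) is not None


if has_package("statsmodels"):
    print("statsmodels is available")
else:
    print("statsmodels is missing; install it with: pip install statsmodels")
```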
You can add a date-time index, and this is required by some forecasters (e.g. Prophet).
[ ]:
ice_creams["datetime"] = pd.to_datetime(
[
"01-06-2018 23:15:00", # Creating data
"02-09-2019 01:48:00",
"08-06-2020 13:20:00",
"07-03-2021 14:50:00",
"07-06-2022 11:50:00",
"03-05-2023 16:50:00",
]
)
ice_creams = ice_creams.set_index("datetime")
print(ice_creams)
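Since the index should be monotonic, it may be worth checking this after setting a date-time index. The sketch below uses only pandas, with a shortened, made-up version of the Sales data and ISO 8601 date strings to avoid day/month ambiguity.

```python
import pandas as pd

# ISO 8601 date strings are parsed unambiguously as year-month-day
index = pd.to_datetime(["2018-01-06", "2019-02-09", "2020-08-06"])
sales = pd.Series([111, 100, 90], index=index, name="Sales")

print(sales.index.is_monotonic_increasing)  # True

# Reversing the rows breaks the ordering; sort_index restores it
shuffled = sales.iloc[::-1]
print(shuffled.index.is_monotonic_increasing)  # False

restored = shuffled.sort_index()
print(restored.index.is_monotonic_increasing)  # True
```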
A pd.DataFrame can also have a multi-level index (pd.MultiIndex), which can be used to represent what is called panel data or hierarchical data in forecasting. A panel is a collection of (possibly multivariate) time series.
[4]:
from aeon.testing.utils.data_gen import _make_hierarchical
y = _make_hierarchical()
y.head()
[4]:
| h0 | h1 | time | c0 |
|---|---|---|---|
| h0_0 | h1_0 | 2000-01-01 | 2.211608 |
| | | 2000-01-02 | 3.068498 |
| | | 2000-01-03 | 3.925924 |
| | | 2000-01-04 | 2.900095 |
| | | 2000-01-05 | 4.324984 |
[5]:
forecaster.fit(y, fh=[1, 2]).predict()
[5]:
| h0 | h1 | time | c0 |
|---|---|---|---|
| h0_0 | h1_0 | 2000-01-13 | 1.812590 |
| | | 2000-01-14 | 1.668527 |
| | h1_1 | 2000-01-13 | 3.445072 |
| | | 2000-01-14 | 3.419808 |
| | h1_2 | 2000-01-13 | 3.068927 |
| | | 2000-01-14 | 2.974026 |
| | h1_3 | 2000-01-13 | 3.888089 |
| | | 2000-01-14 | 3.980505 |
| h0_1 | h1_0 | 2000-01-13 | 4.068846 |
| | | 2000-01-14 | 4.208284 |
| | h1_1 | 2000-01-13 | 3.588580 |
| | | 2000-01-14 | 3.560109 |
| | h1_2 | 2000-01-13 | 2.493876 |
| | | 2000-01-14 | 2.344370 |
| | h1_3 | 2000-01-13 | 2.224848 |
| | | 2000-01-14 | 2.020582 |
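To show how such a multi-indexed frame is laid out, here is a small sketch that builds one by hand with pandas alone. The level names h0, h1 and time and the column c0 mirror the output above; the values are arbitrary placeholders, and this is not how _make_hierarchical generates its data.

```python
import numpy as np
import pandas as pd

# Three index levels: two hierarchy levels plus the time index
index = pd.MultiIndex.from_product(
    [
        ["h0_0", "h0_1"],
        ["h1_0", "h1_1"],
        pd.date_range("2000-01-01", periods=3, freq="D"),
    ],
    names=["h0", "h1", "time"],
)
y_hier = pd.DataFrame({"c0": np.arange(len(index), dtype=float)}, index=index)

# Fixing the two outer levels selects one series, indexed by time only
one_series = y_hier.loc[("h0_0", "h1_0"), :]
print(one_series.shape)  # (3, 1)

# Number of individual series in the hierarchy
n_series = y_hier.groupby(level=["h0", "h1"]).ngroups
print(n_series)  # 4
```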
np.ndarray can be used with the forecasters in aeon, although we recommend using pandas. A one-dimensional np.ndarray is treated as a single time series. A 2D numpy array is treated as multiple series of shape (n_timepoints, n_timeseries), with one column per series, which is why the 2D example below transposes the array before fitting. Forecasters fit independently on each series.
[6]:
y = np.array([20.0, 40.0, 60.0, 80.0, 100.0])
forecaster = TrendForecaster()
forecaster.fit(y) # fit the forecaster
forecaster.predict(fh=[1, 2, 3]) # forecast the next 3 values
[6]:
array([[120.],
[140.],
[160.]])
[7]:
y = np.array([[20.0, 40.0, 60.0, 80.0, 100.0], [100.0, 90.0, 80.0, 70.0, 60.0]])
y = y.transpose()
forecaster = TrendForecaster()
forecaster.fit(y) # fit the forecaster
forecaster.predict(fh=[1, 2, 3]) # forecast the next 3 values
[7]:
array([[120., 50.],
[140., 40.],
[160., 30.]])
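What fitting independently on each series means can be illustrated with plain numpy: fit one straight line per column and extrapolate it. This is only a sketch of the idea under the assumption of a linear trend; it is not aeon's TrendForecaster implementation.

```python
import numpy as np

y = np.array(
    [[20.0, 40.0, 60.0, 80.0, 100.0], [100.0, 90.0, 80.0, 70.0, 60.0]]
).T  # shape (n_timepoints, n_timeseries) = (5, 2)

t = np.arange(y.shape[0])  # integer time index 0..4

# With a 2D y, np.polyfit fits each column separately and returns
# coefficients of shape (deg + 1, n_timeseries), highest degree first
slope, intercept = np.polyfit(t, y, deg=1)

fh = np.array([1, 2, 3])  # forecasting horizon, steps after the last point
t_future = t[-1] + fh     # time points 5, 6, 7

# One row per horizon step, one column per series
forecast = intercept + np.outer(t_future, slope)
print(forecast.shape)  # (3, 2)
```

Up to floating-point rounding, the extrapolated values agree with the forecasts in the output above, because both example series are exactly linear.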
Machine learning data¶
Machine learning algorithms generally use collections of instances or cases stored as numpy arrays. Like scikit-learn, pytorch and keras, we primarily use numpy arrays. A collection contains a number of time series cases (or just cases), which we refer to in code as n_cases. Each case contains a number of time series observations, which we denote n_timepoints. Each case can have one or more channels (n_channels), so the examples below store a collection of equal length series as a 3D array of shape (n_cases, n_channels, n_timepoints).
[8]:
X = np.array(
[
[[20.0, 40.0, 60.0, 80.0, 100.0]], # Univariate series as 3D array
[[100.0, 90.0, 80.0, 70.0, 60.0]],
]
) # n_cases = 2, n_channels =1, n_timepoints = 5
print("X shape = ", X.shape, " First series =", X[0], "second series = ", X[1])
X shape = (2, 1, 5) First series = [[ 20. 40. 60. 80. 100.]] second series = [[100. 90. 80. 70. 60.]]
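If univariate data starts out as a 2D array of shape (n_cases, n_timepoints), a channel axis can be added with numpy to reach the 3D layout shown above. This is a small sketch with illustrative variable names.

```python
import numpy as np

# Univariate collection stored as 2D: (n_cases, n_timepoints)
X_2d = np.array(
    [
        [20.0, 40.0, 60.0, 80.0, 100.0],
        [100.0, 90.0, 80.0, 70.0, 60.0],
    ]
)

# Insert a channel axis of length 1 in the middle
X_3d = X_2d[:, np.newaxis, :]

n_cases, n_channels, n_timepoints = X_3d.shape
print(X_3d.shape)  # (2, 1, 5)
```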
[9]:
X = np.array(
[
[[20, 40, 600, 55], [10, 11, 12, 11], [-4, 1, 6.6, 2]],
[[10, 90, 80, 100], [14, 70, 60, 22], [49, 49, 66, 9]],
[[14, 6, 10, -401], [44, 70, 60, 22], [49, 52, 33, 49]],
[[22, 93, 18, 100], [34, 170, 0, 87], [49, 49, 33, 49]],
]
)
# n_cases = 4, n_channels =3, n_timepoints = 4
print("X shape = ", X.shape, "\n First series =\n", X[0], "\nsecond series = \n", X[1])
from aeon.clustering import TimeSeriesKMeans
kmeans = TimeSeriesKMeans(distance="euclidean", n_clusters=2)
kmeans.fit(X)
kmeans.predict(X)
X shape = (4, 3, 4)
First series =
[[ 20. 40. 600. 55. ]
[ 10. 11. 12. 11. ]
[ -4. 1. 6.6 2. ]]
second series =
[[ 10. 90. 80. 100.]
[ 14. 70. 60. 22.]
[ 49. 49. 66. 9.]]
[9]:
array([0, 1, 1, 1])
The target variable for classification should be stored as a np.ndarray of integers or strings.
[10]:
y = np.array([1, 1, 0, 0])
y2 = np.array(["pass", "pass", "fail", "fail"])
from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier
knn = KNeighborsTimeSeriesClassifier(distance="dtw")
knn.fit(X, y)
knn.fit(X, y2)
knn.predict(X)
[10]:
array(['pass', 'pass', 'fail', 'fail'], dtype='<U4')
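String and integer labels carry the same information. If integer labels are needed elsewhere, one way to map between the two with numpy alone is sketched below; this is an illustration and not something aeon classifiers require.

```python
import numpy as np

y_str = np.array(["pass", "pass", "fail", "fail"])

# classes holds the sorted unique labels, y_int the index of each label
classes, y_int = np.unique(y_str, return_inverse=True)
print(classes)  # ['fail' 'pass']
print(y_int)    # [1 1 0 0]

# Indexing classes with the integer codes recovers the original labels
decoded = classes[y_int]
```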
For regression, the target variable should be stored as a np.ndarray of floats.
[11]:
y = np.array([1.5, 4.3, -2.0, 10])
from aeon.regression.distance_based import KNeighborsTimeSeriesRegressor
knn_r = KNeighborsTimeSeriesRegressor(distance="dtw")
knn_r.fit(X, y)
knn_r.predict(X)
[11]:
array([ 1.5, 4.3, -2. , 10. ])
If the time series are not all of equal length, they should be stored as a list of 2D numpy arrays, each of shape (n_channels, n_timepoints). Some estimators can deal with unequal length series. Those that cannot will raise an exception if passed unequal length series. Note that we assume all channels of any given series have the same length.
[12]:
x0 = np.array([[20, 40, 60, 55, 66], [10, 11, 12, 11, 66], [-4, 15, 6.6, 12, 44]])
x1 = np.array([[10, 90, 80], [70, 60, 22], [49, 66, 9]])
x2 = np.array([[22, 93, 18, 100], [34, 170, 0, 87], [49, 49, 33, 49]])
X_uneq = [x0, x1, x2]  # list of 2D arrays, one per case
y = np.array([0, 0, 1])
knn.fit(X_uneq, y)
knn.predict(X_uneq)
[12]:
array([0, 0, 1])
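For estimators that cannot handle unequal length series, one option is to check the lengths first and, if needed, pad the series to a common length. The sketch below uses numpy only; the helper names is_equal_length and pad_to_3d are hypothetical, and padding with NaN is an assumption made for illustration rather than a recommendation from aeon.

```python
import numpy as np


def is_equal_length(X):
    """Return True if every (n_channels, n_timepoints) array has the same length."""
    return len({x.shape[1] for x in X}) == 1


def pad_to_3d(X, fill_value=np.nan):
    """Right-pad each series with fill_value and stack into a 3D array."""
    n_channels = X[0].shape[0]
    max_len = max(x.shape[1] for x in X)
    out = np.full((len(X), n_channels, max_len), fill_value, dtype=float)
    for i, x in enumerate(X):
        out[i, :, : x.shape[1]] = x
    return out


# Three cases with 3 channels each and lengths 5, 3 and 4
X_list = [np.zeros((3, 5)), np.ones((3, 3)), np.ones((3, 4))]

print(is_equal_length(X_list))  # False
X_pad = pad_to_3d(X_list)
print(X_pad.shape)  # (3, 3, 5)
```

Whether padding is appropriate depends on the estimator, since many algorithms do not accept NaN values.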
aeon has several standard problems built in, as well as facilities for loading data from external sources. Please see the data loading notebook.