Feature-based Time Series Clustering in aeon
Feature-based time series clustering (TSCL) algorithms extract descriptive features that represent the characteristics of each time series and then cluster on those features. Various transformers can be used to derive features from the raw time series data. Bespoke feature-based TSCL algorithms can be constructed by combining aeon transformers and sklearn clusterers in a pipeline. The following feature-based time series clusterers are currently implemented in aeon:
Catch22Clusterer
TSFreshClusterer
SummaryClusterer
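The two-stage idea described above (features first, clustering second) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not aeon's implementation: `extract_features` and `kmeans` are hypothetical helpers, each series is reduced to just its mean and standard deviation, and the k-means is a plain Lloyd-style loop seeded with the first k points so that the result is deterministic.

```python
import math
import statistics


def extract_features(series):
    """Map one univariate series to a small feature vector (mean, std)."""
    return [statistics.fmean(series), statistics.pstdev(series)]


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [statistics.fmean(col) for col in zip(*members)]
    return labels


# Toy data (hypothetical): two flat series near 0, two oscillating near 10.
toy = [
    [0.0, 0.1, -0.1, 0.0],
    [10.0, 12.0, 8.0, 10.0],
    [0.1, 0.0, 0.0, -0.1],
    [9.0, 11.0, 9.0, 11.0],
]
features = [extract_features(s) for s in toy]
toy_labels = kmeans(features, k=2)
print(toy_labels)
```

The aeon clusterers below follow the same pattern, but replace the hand-written feature function with a richer transformer and the toy k-means with any sklearn-compatible clusterer.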
[1]:
# Imports and load data
from sklearn.cluster import KMeans
from aeon.clustering import TimeSeriesKMeans
from aeon.clustering.feature_based import (
    Catch22Clusterer,
    SummaryClusterer,
    TSFreshClusterer,
)
from aeon.datasets import load_basic_motions
# With no split argument, the train and test cases are returned together (80 cases)
X_train, y_train = load_basic_motions()
X_test, y_test = load_basic_motions(split="test")
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(80, 6, 100) (80,) (40, 6, 100) (40,)
1. Catch22Clusterer
The Catch22Clusterer transforms the data with the Catch22 transformer and then builds a sklearn estimator on the transformed data. The Catch22 transformer extracts the 22 CAnonical Time-series CHaracteristics from a d-dimensional time series. These 22 features were selected from the 4791 filtered features of the hctsa feature library. Catch22 is a diverse and interpretable set of time series features, including linear and non-linear autocorrelation, successive differences, and others.
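To give a feel for the kinds of feature mentioned above, the sketch below computes one autocorrelation feature and one successive-difference feature by hand. These are simplified stand-ins written for illustration, not the canonical catch22 definitions: the function names and the `threshold` parameter are assumptions.

```python
import statistics


def lag1_autocorrelation(series):
    """Linear autocorrelation at lag 1 of a mean-centred series."""
    mean = statistics.fmean(series)
    centred = [x - mean for x in series]
    denom = sum(c * c for c in centred)
    if denom == 0:
        return 0.0  # constant series: autocorrelation is undefined, use 0
    return sum(a * b for a, b in zip(centred, centred[1:])) / denom


def large_step_fraction(series, threshold=0.5):
    """Fraction of successive differences larger than threshold * std."""
    std = statistics.pstdev(series)
    if std == 0:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(d > threshold * std for d in diffs) / len(diffs)


smooth = [0, 1, 2, 3, 4, 5, 6, 7]  # slow trend
jagged = [0, 5, 0, 5, 0, 5, 0, 5]  # rapid alternation
for name, s in [("smooth", smooth), ("jagged", jagged)]:
    print(name, round(lag1_autocorrelation(s), 3), large_step_fraction(s))
```

The smooth trend has positive lag-1 autocorrelation and no large steps, while the alternating series has negative autocorrelation and only large steps, so even two such features separate the two behaviours.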
[2]:
catch = Catch22Clusterer(estimator=KMeans(n_clusters=4))
catch.fit(X_train)
[2]:
Catch22Clusterer(estimator=KMeans(n_clusters=4))
[3]:
preds = catch.predict(X_train)
preds
[3]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 3, 3, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 2, 1, 3, 1, 2, 3, 3, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 3, 0, 3, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 3, 1, 2, 2, 3, 2, 3, 2, 3, 3])
2. TSFreshClusterer
The TSFreshClusterer transforms the data using the TSFresh transformer and builds a sklearn estimator on the transformed data. The TSFresh transformer computes 794 time series features, automating feature extraction and selection with the FeatuRe Extraction based on Scalable Hypothesis tests (FRESH) algorithm. The algorithm is efficient and scales linearly with the number of features.
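As a rough picture of what such a feature extraction step produces, the sketch below re-implements three simple features by hand and concatenates them across channels into one flat row per case. The names `abs_energy`, `mean_abs_change` and `count_above_mean` mirror feature calculators found in tsfresh, but the code here is an independent re-implementation for illustration, and `featurise_case` is a hypothetical helper.

```python
import statistics


def abs_energy(x):
    """Sum of squared values."""
    return sum(v * v for v in x)


def mean_abs_change(x):
    """Mean absolute difference between consecutive values."""
    return statistics.fmean(abs(b - a) for a, b in zip(x, x[1:]))


def count_above_mean(x):
    """Number of values strictly greater than the series mean."""
    m = statistics.fmean(x)
    return sum(v > m for v in x)


def featurise_case(case):
    """Concatenate the features of every channel into one flat vector."""
    row = []
    for channel in case:
        row += [
            abs_energy(channel),
            mean_abs_change(channel),
            count_above_mean(channel),
        ]
    return row


# One hypothetical case with 2 channels and 4 time points.
case = [[1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 0.0, 2.0]]
print(featurise_case(case))
```

Each case becomes a fixed-length tabular row regardless of its original shape, which is what allows an ordinary sklearn clusterer to consume it.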
[4]:
tsfresh = TSFreshClusterer(estimator=KMeans(n_clusters=4))
tsfresh.fit(X_train)
[4]:
TSFreshClusterer(estimator=KMeans(n_clusters=4))
[5]:
preds = tsfresh.predict(X_train)
preds
[5]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
3. SummaryClusterer
Like the algorithms above, the SummaryClusterer transforms the input data, here with the SevenNumberSummary transformer, and builds an estimator on the transformed data.
The estimator must be a clusterer: any sklearn clusterer or aeon partition-based clusterer can be passed.
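A seven-number summary can be sketched as follows. The choice of statistics here (mean, standard deviation, minimum, maximum and the 0.25, 0.5 and 0.75 quantiles) is an assumption about a typical default, and the population standard deviation and "inclusive" quantile interpolation are likewise illustrative choices. `seven_number_summary` is a hypothetical helper, not the aeon transformer.

```python
import statistics


def seven_number_summary(series):
    """Return [mean, std, min, max, q25, q50, q75] for one series."""
    q25, q50, q75 = statistics.quantiles(series, n=4, method="inclusive")
    return [
        statistics.fmean(series),
        statistics.pstdev(series),  # population standard deviation
        min(series),
        max(series),
        q25,
        q50,
        q75,
    ]


summary = seven_number_summary([1.0, 2.0, 3.0, 4.0, 5.0])
print([round(v, 3) for v in summary])
```

Applied per channel, this turns every series into seven numbers, so a case with six channels would yield a 42-value feature row under these assumptions.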
[6]:
summaryclst = SummaryClusterer(estimator=TimeSeriesKMeans(n_clusters=4))
summaryclst.fit(X_train)
[6]:
SummaryClusterer(estimator=TimeSeriesKMeans(n_clusters=4))
[7]:
preds = summaryclst.predict(X_test)
preds
[7]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], dtype=int64)
References
[1] Holder, Christopher, Matthew Middlehurst, and Anthony Bagnall. “A review and evaluation of elastic distance functions for time series clustering.” Knowledge and Information Systems. In press (2023).
[2] Lubba, Carl H., et al. “catch22: Canonical time-series characteristics.” Data Mining and Knowledge Discovery 33.6 (2019): 1821-1852. https://link.springer.com/article/10.1007/s10618-019-00647-x
[3] Christ, Maximilian, et al. “Time series feature extraction on basis of scalable hypothesis tests (tsfresh–a python package).” Neurocomputing 307 (2018): 72-77. https://www.sciencedirect.com/science/article/pii/S0925231218304843
[4] Paparrizos, John, Fan Yang, and Haojun Li. “Bridging the gap: A decade review of time-series clustering methods.” arXiv preprint arXiv:2412.20582 (2024). https://arxiv.org/html/2412.20582v1