
Benchmarking: retrieving and comparing against reference results

You can access all the latest reference results for classification, clustering and regression directly with aeon. These results are stored on timeseriesclassification.com and were presented in three bake offs for classification [1], clustering [2] and regression [3].

We use three aeon classifiers for our examples. FreshPRINCE [4] (located in classification/feature_based) is a pipeline of the TSFresh transform followed by a rotation forest classifier. InceptionTimeClassifier [5] is a deep learning ensemble. HIVECOTEV2 [6] is a meta ensemble of four different ensembles built on different representations. See [1] for an overview of recent advances in time series classification.

[1]:
from aeon.benchmarking import plot_critical_difference
from aeon.benchmarking.results_loaders import (
    get_estimator_results,
    get_estimator_results_as_array,
)

classifiers = [
    "FreshPRINCEClassifier",
    "HIVECOTEV2",
    "InceptionTimeClassifier",
]
datasets = ["ACSF1", "ArrowHead", "GunPoint", "ItalyPowerDemand"]
# get results. To read locally, set the path variable.
# If you do not set path, results are loaded from
# https://timeseriesclassification.com/results/ReferenceResults.
# You can download the files directly from there
default_split_all, data_names = get_estimator_results_as_array(estimators=classifiers)
print(
    " Returns an array with each column an estimator, shape (data_names, classifiers)"
)
print(
    f" By default recovers the default test split results for {len(data_names)} "
    f"equal length UCR datasets from {data_names[0]} to {data_names[-1]}"
)
default_split_some, names = get_estimator_results_as_array(
    estimators=classifiers, datasets=datasets
)
print(
    f" Or specify data sets for result recovery. {len(names)} For example, "
    f"HIVECOTEV2 accuracy {names[3]} = {default_split_some[3][1]}"
)
 Returns an array with each column an estimator, shape (data_names, classifiers)
 By default recovers the default test split results for 112 equal length UCR datasets from SmallKitchenAppliances to ProximalPhalanxTW
 Or specify datasets for result recovery: 4 datasets returned. For example, HIVECOTEV2 accuracy on ItalyPowerDemand = 0.9698736637512148
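
The comment in the cell above mentions reading the results from a local copy instead of the website. A minimal sketch of that, assuming you have downloaded the reference result files from timeseriesclassification.com to a local directory with the same layout and that you pass it via the path argument the comment refers to (the directory name here is a hypothetical placeholder):

# Hedged sketch: load reference results from a local copy instead of the website.
# "/path/to/ReferenceResults" is a placeholder for wherever you saved the files.
local_path = "/path/to/ReferenceResults"
local_results, local_names = get_estimator_results_as_array(
    estimators=classifiers, path=local_path
)
print(local_results.shape)  # (n_datasets, n_classifiers), as with the web results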

If you have any questions about these results or the datasets, please raise an issue on the associated repo. You can also recover results as a dictionary, where each key is a classifier name and the value is a dictionary of problems/results.

[2]:
hash_table = get_estimator_results(estimators=classifiers)
print("Keys = ", hash_table.keys())
print(
    " Accuracy of HIVECOTEV2 on ItalyPowerDemand = ",
    hash_table["HIVECOTEV2"]["ItalyPowerDemand"],
)
Keys =  dict_keys(['FreshPRINCEClassifier', 'HIVECOTEV2', 'InceptionTimeClassifier'])
 Accuracy of HIVECOTEV2 on ItalyPowerDemand =  0.9698736637512148
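
If you prefer a tabular view of the same information, the nested dictionary can be flattened with plain pandas. This is just a convenience sketch, not an aeon helper:

import pandas as pd

# Flatten the {classifier: {dataset: accuracy}} dictionary into a DataFrame with
# one row per dataset and one column per classifier.
results_df = pd.DataFrame(hash_table)
print(results_df.loc["ItalyPowerDemand"])  # accuracy of each classifier on one problem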

The results recovered so far have all been on the default train/test split. If you merge the train and test data and resample, you can get very different results. To allow for this, we average results over 30 resamples. You can recover these averages by setting the default_only parameter to False.

[3]:
resamples_all, data_names = get_estimator_results_as_array(
    estimators=classifiers, default_only=False
)
print(" Results are averaged over 30 stratified resamples. ")
print(
    f" HIVECOTEV2 train test of  {data_names[3]} = "
    f"{default_split_all[3][1]} and averaged over 30 resamples = "
    f"{resamples_all[3][1]}"
)
 Results are averaged over 30 stratified resamples.
 HIVECOTEV2 train test of  Computers = 0.76 and averaged over 30 resamples = 0.8556000000000001
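
To get a feel for how much resampling changes the picture across the whole archive, you can summarise the two arrays directly. A small numpy sketch, assuming (as the comparison in the cell above does) that both calls return the same datasets in the same order:

import numpy as np

# Mean accuracy per classifier over all datasets: single default split vs the
# average over 30 resamples. Columns follow the order of the classifiers list.
default_arr = np.asarray(default_split_all)
resample_arr = np.asarray(resamples_all)
for i, name in enumerate(classifiers):
    print(
        f"{name}: default split {default_arr[:, i].mean():.4f}, "
        f"30 resamples {resample_arr[:, i].mean():.4f}"
    )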

Once you have the results you want, you can compare classifiers with built-in aeon tools. For example, you can draw a critical difference diagram [7]. This displays the average rank of each estimator over all datasets, then groups estimators for which there is no significant difference in rank into cliques, shown with a solid bar. So in the example below, on the default train/test splits, FreshPRINCEClassifier is not significantly different in rank from InceptionTimeClassifier, but HIVECOTEV2 is significantly better.

[4]:
plot_critical_difference(default_split_all, classifiers)
[Figure: critical difference diagram for the three classifiers on the default train/test splits.]

If we use the data averaged over resamples, we can detect differences more clearly. Now we see InceptionTimeClassifier is significantly better than the FreshPRINCEClassifier.

plot_critical_difference(resamples_all, classifiers)

timeseriesclassification.com has results for classification, clustering and regression. We are constantly updating the results as we generate them. To find out which estimators have results available, use get_available_estimators.

[5]:
from aeon.benchmarking import get_available_estimators

print(get_available_estimators(task="Classification"))
# print(get_available_estimators(task="Regression"))
# print(get_available_estimators(task="Clustering"))
  classification
0    FreshPRINCE
1            HC2
2       Hydra-MR
3     InceptionT
4             PF
5           RDST
6          RSTSF
7       WEASEL_D
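
The printed output suggests a pandas DataFrame with one column of estimator names per task. Assuming that return type, a small sketch for checking programmatically whether a particular estimator has results:

# Hedged sketch: treats the returned object as the DataFrame printed above.
available = get_available_estimators(task="Classification")
available_names = available.iloc[:, 0].tolist()  # the single column holds the names
print("HC2 results available:", "HC2" in available_names)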

Other tools are available for comparing classifiers:

1. Boxplots of deviations from the median
2. Pairwise scatter plots
3. Performing all pairwise tests (a generic sketch of this idea follows)
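
As a generic illustration of the pairwise-tests idea only (using scipy directly rather than any aeon helper, so nothing here is aeon's API), you could run a Wilcoxon signed-rank test on each pair of classifiers over the resampled results loaded earlier:

from itertools import combinations

from scipy.stats import wilcoxon

# Generic sketch, not an aeon built-in: Wilcoxon signed-rank test on per-dataset
# accuracies for every pair of classifiers in the resampled results.
for i, j in combinations(range(len(classifiers)), 2):
    stat, p_value = wilcoxon(resamples_all[:, i], resamples_all[:, j])
    print(f"{classifiers[i]} vs {classifiers[j]}: p = {p_value:.4f}")

In practice you would correct the p-values for multiple testing (for example with a Holm correction), similar in spirit to how the cliques in the critical difference diagram are formed.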

References

[1] Middlehurst et al., “Bake off redux: a review and experimental evaluation of recent time series classification algorithms”, 2023, arXiv.
[2] Holder et al., “A Review and Evaluation of Elastic Distance Functions for Time Series Clustering”, 2022, arXiv.
[3] Guijo-Rubio et al., “Unsupervised Feature Based Algorithms for Time Series Extrinsic Regression”, 2023, arXiv.
[4] Middlehurst and Bagnall, “The FreshPRINCE: A Simple Transformation Based Pipeline Time Series Classifier”, 2022, arXiv.
[5] Fawaz et al., “InceptionTime: Finding AlexNet for time series classification”, 2020, DAMI.
[6] Middlehurst et al., “HIVE-COTE 2.0: a new meta ensemble for time series classification”, MACH.
[7] Demsar, “Statistical Comparisons of Classifiers over Multiple Data Sets”, JMLR.
