
Benchmarking: retrieving and comparing against reference results

You can access the latest results for classification, clustering and regression directly with aeon. These results are stored on the website timeseriesclassification.com. This notebook shows how to recover those latest results, which are updated as new runs become available. Because of software changes, they may vary slightly from published results; if you want to recover published results, see the notebook on loading published results.

These are the current estimators with results available:

[1]:
from aeon.benchmarking.results_loaders import get_available_estimators

get_available_estimators(task="classification")
[1]:
classification
0 1NN-DTW
1 Arsenal
2 BOSS
3 CIF
4 CNN
5 Catch22
6 DrCIF
7 EE
8 FreshPRINCE
9 GRAIL
10 H-InceptionTime
11 HC1
12 HC2
13 Hydra
14 InceptionTime
15 LiteTime
16 MR
17 MR-Hydra
18 MiniROCKET
19 MrSQM
20 PF
21 QUANT
22 R-STSF
23 RDST
24 RISE
25 RIST
26 ROCKET
27 RSF
28 ResNet
29 STC
30 STSF
31 ShapeDTW
32 Signatures
33 TDE
34 TS-CHIEF
35 TSF
36 TSFresh
37 WEASEL-1.0
38 WEASEL-2.0
39 cBOSS
[2]:
get_available_estimators(task="regression")
[2]:
regression
0 1NN-DTW
1 1NN-ED
2 5NN-DTW
3 5NN-ED
4 CNN
5 DrCIF
6 FCN
7 FPCR
8 FPCR-b-spline
9 FreshPRINCE
10 GridSVR
11 InceptionTime
12 RandF
13 ResNet
14 Ridge
15 ROCKET
16 RotF
17 SingleInceptionTime
18 XGBoost
[3]:
get_available_estimators(task="clustering")
[3]:
clustering
0 dtw-dba
1 kmeans-ddtw
2 kmeans-dtw
3 kmeans-ed
4 kmeans-edr
5 kmeans-erp
6 kmeans-lcss
7 kmeans-msm
8 kmeans-twe
9 kmeans-wddtw
10 kmeans-wdtw
11 kmedoids-ddtw
12 kmedoids-dtw
13 kmedoids-ed
14 kmedoids-edr
15 kmedoids-erp
16 kmedoids-lcss
17 kmedoids-msm
18 kmedoids-twe
19 kmedoids-wddtw
20 kmedoids-wdtw

Loading results (classification example)

We will use the classification task as an example. We will recover the results for FreshPRINCE [4], a pipeline of the TSFresh transform followed by a rotation forest classifier; InceptionTimeClassifier [5], a deep learning ensemble; HIVECOTEV2 [6], a meta ensemble of four different ensembles built on different representations; and RDST [7], which extracts random shapelets with dilation to form a pipeline.

See [1] for an overview of recent advances in time series classification. We also store results for other learning tasks, such as regression [3] and clustering [2].

If you do not set the path parameter, results are loaded from https://timeseriesclassification.com/results/ReferenceResults. You can download the files directly from there. To read results from local files instead, set path to a local directory. While we do not show this here, the task parameter can be set to regression or clustering to recover results for those tasks.
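
For example, here is a minimal sketch of recovering regression results from a local copy of the reference files, using the task and path parameters just described. The directory path is hypothetical and is assumed to mirror the layout of the reference results website.

[ ]:
from aeon.benchmarking.results_loaders import get_estimator_results

# "/path/to/ReferenceResults" is a hypothetical local copy of the files from
# https://timeseriesclassification.com/results/ReferenceResults
regression_results = get_estimator_results(
    estimators=["FreshPRINCE", "ROCKET"],  # names from the regression list above
    task="regression",
    path="/path/to/ReferenceResults",
)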

[4]:
classifiers = [
    "FreshPRINCEClassifier",
    "HIVECOTEV2",
    "InceptionTimeClassifier",
    "RDSTClassifier",
]
datasets = ["ACSF1", "ArrowHead", "GunPoint", "ItalyPowerDemand"]

The get_estimator_results function returns the results as a dictionary of dictionaries, where the first key is the classifier name and the second key is the dataset name.

[5]:
from aeon.benchmarking.results_loaders import get_estimator_results

results_dict = get_estimator_results(estimators=classifiers, datasets=datasets)
results_dict["HIVECOTEV2"]["ItalyPowerDemand"]
[5]:
0.9698736637512148
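
To illustrate this structure, here is a small sketch that iterates over the nested dictionary returned above, using only the variables already defined in this notebook.

[ ]:
# The outer key is the classifier name, the inner key the dataset name.
for clf_name, dataset_scores in results_dict.items():
    for dataset_name, accuracy in dataset_scores.items():
        print(f"{clf_name} on {dataset_name}: {accuracy:.4f}")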

Most results files have multiple resamples. These can be returned as an array using the num_resamples parameter.

[6]:
results_dict = get_estimator_results(
    estimators=classifiers, datasets=datasets, num_resamples=30
)
results_dict["HIVECOTEV2"]["ItalyPowerDemand"]
[6]:
array([0.96987366, 0.96987366, 0.9494655 , 0.96793003, 0.96015549,
       0.96793003, 0.96793003, 0.95626822, 0.96695821, 0.96695821,
       0.96793003, 0.96695821, 0.95724004, 0.94557823, 0.96987366,
       0.96598639, 0.96501458, 0.96015549, 0.9718173 , 0.96793003,
       0.96598639, 0.95626822, 0.96112731, 0.96695821, 0.96209913,
       0.95918367, 0.96209913, 0.95918367, 0.95043732, 0.96598639])

Different measures, such as accuracy, F1, AUROC and logloss, can be recovered using the measure parameter. The default is accuracy.

[7]:
results_dict = get_estimator_results(
    estimators=classifiers, datasets=datasets, measure="logloss"
)
results_dict["HIVECOTEV2"]["ItalyPowerDemand"]
[7]:
0.1217826955959029

Results can also be returned as an array using the get_estimator_results_as_array function. This function shares the same parameters as get_estimator_results.

This function returns the results as a numpy array, where the first dimension is the dataset and the second dimension is the estimator. The datasets used in the array are returned as a list alongside the results.

Multiple resamples will be averaged instead of returned as separate arrays.

[8]:
from aeon.benchmarking.results_loaders import get_estimator_results_as_array

results_arr, datasets = get_estimator_results_as_array(
    estimators=classifiers, datasets=datasets
)
results_arr
[8]:
array([[0.89      , 0.91      , 0.91      , 0.9       ],
       [0.62857143, 0.86857143, 0.86285714, 0.85714286],
       [0.94      , 1.        , 1.        , 1.        ],
       [0.89795918, 0.96987366, 0.96598639, 0.93974733]])
[9]:
datasets
[9]:
['ACSF1', 'ArrowHead', 'GunPoint', 'ItalyPowerDemand']

By default, if a result is missing for any estimator, the dataset is removed from the results array and the list of datasets. If you want to keep such datasets, use the include_missing parameter; missing results will be filled with NaN values.

[10]:
from aeon.benchmarking.results_loaders import get_estimator_results_as_array

results_arr_miss, datasets = get_estimator_results_as_array(
    estimators=classifiers, datasets=datasets + ["invalid"], include_missing=True
)
results_arr_miss
[10]:
array([[0.89      , 0.91      , 0.91      , 0.9       ],
       [0.62857143, 0.86857143, 0.86285714, 0.85714286],
       [0.94      , 1.        , 1.        , 1.        ],
       [0.89795918, 0.96987366, 0.96598639, 0.93974733],
       [       nan,        nan,        nan,        nan]])

For both functions, the default value for datasets loads all datasets that have results available for every estimator. We will use this for the later examples.

[11]:
results_arr, datasets = get_estimator_results_as_array(
    estimators=classifiers, num_resamples=30
)
results_arr
[11]:
array([[0.72683761, 0.84384615, 0.85982906, 0.84709402],
       [0.93189143, 0.9551145 , 0.95808312, 0.92726039],
       [0.77272727, 0.74848485, 0.78095238, 0.72251082],
       [0.91486486, 0.93774775, 0.94297297, 0.94738739],
       [0.92333333, 0.94833333, 0.95166667, 0.91333333],
       [0.87177849, 0.92846113, 0.88056443, 0.93176251],
       [0.95555556, 0.99844444, 0.99511111, 0.99488889],
       [0.99388186, 0.99978903, 0.99831224, 0.99978903],
       [0.75083933, 0.74676259, 0.73117506, 0.73884892],
       [0.71570513, 0.92964744, 0.90528846, 0.90721154],
       [0.96077098, 0.96303855, 0.96034985, 0.95118238],
       [1.        , 1.        , 1.        , 1.        ],
       [0.83345411, 0.8281401 , 0.81533816, 0.81594203],
       [0.77762889, 0.80081891, 0.77321794, 0.79669645],
       [0.79111111, 0.72111111, 0.68222222, 0.74222222],
       [0.9952381 , 1.        , 0.9968254 , 1.        ],
       [0.76872325, 0.78963335, 0.77076121, 0.77374837],
       [0.89691358, 0.94135802, 0.88703704, 0.94753086],
       [0.97155556, 0.98022222, 0.98466667, 0.99222222],
       [0.99981481, 1.        , 0.92351852, 0.99703704],
       [0.84113821, 0.85707317, 0.82211382, 0.84552846],
       [0.35571378, 0.39694093, 0.33444093, 0.3410865 ],
       [0.79350649, 0.8030303 , 0.8025974 , 0.77835498],
       [0.9448855 , 0.96430874, 0.9621883 , 0.942324  ],
       [0.45384615, 0.95689103, 0.92211538, 0.78092949],
       [0.74746667, 0.8168    , 0.75893333, 0.75022222],
       [0.89843299, 0.95437801, 0.91333333, 0.95641237],
       [0.57987013, 0.56991342, 0.52748918, 0.5547619 ],
       [0.67244444, 0.75395556, 0.70604444, 0.64622222],
       [0.50400433, 0.55335498, 0.53701299, 0.570671  ],
       [0.82897778, 0.8424    , 0.77075556, 0.82435556],
       [0.94133333, 0.95503704, 0.76288889, 0.88674074],
       [0.99588889, 0.99866667, 0.99577778, 0.99244444],
       [0.80383693, 0.81342926, 0.76570743, 0.80431655],
       [0.97881253, 0.98193298, 0.97812854, 0.98103448],
       [0.73333333, 0.79315068, 0.82054795, 0.73561644],
       [0.73487179, 0.86683761, 0.86094017, 0.85555556],
       [0.99986613, 1.        , 1.        , 0.96519411],
       [0.84581901, 0.83058419, 0.83424971, 0.82646048],
       [0.94181919, 0.95013866, 0.95418747, 0.93716029],
       [0.64801347, 0.65671717, 0.6273569 , 0.65996633],
       [0.94982818, 0.97469416, 0.96637801, 0.97943643],
       [0.84770785, 0.84304584, 0.86134421, 0.82909868],
       [0.8       , 0.848     , 0.82666667, 0.83266667],
       [0.98766667, 0.99848148, 0.99611111, 0.99096296],
       [0.78838095, 0.8992381 , 0.88038095, 0.87828571],
       [0.99998378, 1.        , 0.99862643, 0.99991348],
       [0.77746032, 0.85968254, 0.85047619, 0.84825397],
       [0.87051282, 0.98237179, 0.93333333, 0.93237179],
       [0.87888889, 0.88777778, 0.87444444, 0.9       ],
       [0.93336831, 0.94039874, 0.95127667, 0.91514516],
       [0.80530702, 0.81105263, 0.79635965, 0.77434211],
       [0.96410803, 0.97663113, 0.96260128, 0.95515281],
       [0.85438596, 0.95540936, 0.95307018, 0.94912281],
       [0.76830601, 0.76994536, 0.81693989, 0.76120219],
       [0.96704712, 0.99818554, 0.99482002, 0.99514194],
       [0.97278912, 0.96890185, 0.96482021, 0.96841594],
       [0.99911111, 0.99866667, 0.99644444, 0.98222222],
       [0.71765568, 0.8221978 , 0.82673993, 0.8378022 ],
       [0.77500868, 0.76340278, 0.86361111, 0.76003472],
       [0.94311558, 0.97283082, 0.96958124, 0.96566164],
       [0.94319527, 0.98682446, 0.98319527, 0.98601578],
       [0.89297778, 0.93928889, 0.95253333, 0.91848889],
       [1.        , 0.95520202, 0.9590404 , 0.94924242],
       [0.94488889, 0.94844444, 0.94211852, 0.94685185],
       [0.89105556, 0.93094444, 0.93805556, 0.93966667],
       [0.95833333, 0.97574074, 0.9862963 , 0.98018519],
       [0.62910136, 0.76677116, 0.75208986, 0.77105538],
       [0.89846154, 0.96384615, 0.96358974, 0.94692308],
       [0.76850829, 0.81777164, 0.8835175 , 0.77946593],
       [0.94929972, 0.98403361, 0.95322129, 0.97422969],
       [0.81300813, 0.81317073, 0.78162602, 0.79707317],
       [0.94842667, 0.96154667, 0.96976   , 0.96346667],
       [0.98055556, 0.98444444, 0.98444444, 0.98166667],
       [0.96177182, 0.9752466 , 0.95119114, 0.98203983],
       [0.84733333, 0.87733333, 0.628     , 0.872     ],
       [0.92069959, 0.93386831, 0.94065844, 0.92860082],
       [0.92394444, 0.92925556, 0.91241111, 0.92954444],
       [0.53393939, 0.52624242, 0.53442424, 0.45884848],
       [0.99383626, 0.99935673, 0.99574269, 0.99707602],
       [0.8374651 , 0.86315838, 0.83353806, 0.85799367],
       [0.69856115, 0.70935252, 0.66546763, 0.70527578],
       [0.80835465, 0.79445865, 0.82208014, 0.7398977 ],
       [0.83068182, 0.97916667, 0.93863636, 0.99090909],
       [0.88614815, 0.90348148, 0.55155556, 0.69525926],
       [0.82333333, 0.86186667, 0.86586667, 0.8216    ],
       [0.91699346, 0.92908497, 0.95087146, 0.94498911],
       [0.67640693, 0.68939394, 0.59437229, 0.67207792],
       [0.97827004, 0.99651899, 0.98407173, 0.9914557 ],
       [0.57346667, 0.81553333, 0.87513333, 0.7384    ],
       [0.99706667, 1.        , 1.        , 0.99998333],
       [0.96594595, 0.98027027, 0.9754955 , 0.97927928],
       [0.86969697, 0.96859504, 0.95247934, 0.9707989 ],
       [0.89553265, 0.89175258, 0.90630011, 0.87938144],
       [0.6015625 , 0.61927083, 0.625     , 0.63802083],
       [0.79611111, 0.92055556, 0.90111111, 0.93777778],
       [0.9654152 , 0.99445614, 0.94883041, 0.97963743],
       [0.85333333, 0.89633333, 0.89666667, 0.88866667],
       [0.94937198, 0.99301932, 0.8321256 , 0.9821256 ],
       [0.69604052, 0.78093923, 0.81473297, 0.7267035 ],
       [0.90833874, 0.9017421 , 0.89016989, 0.896572  ],
       [0.88923577, 0.9742439 , 0.97695935, 0.97630894],
       [0.98928571, 1.        , 0.99880952, 0.9952381 ],
       [1.        , 1.        , 1.        , 1.        ],
       [0.88166667, 0.92333333, 0.89333333, 0.92666667],
       [0.99969028, 0.99179249, 0.99585753, 0.99330236],
       [0.88628571, 0.98209524, 0.9727619 , 0.98419048],
       [0.96022222, 0.97266667, 0.88466667, 0.92066667],
       [0.98240741, 0.98611111, 0.97986111, 0.97476852],
       [1.        , 0.99989418, 1.        , 1.        ],
       [0.76791468, 0.76196627, 0.79325595, 0.76810913],
       [0.72358974, 0.84111111, 0.85333333, 0.83529915]])
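
As a quick sanity check before plotting, here is a small sketch that averages each column of results_arr to get the mean accuracy of each classifier over all datasets; the columns follow the order of the classifiers list defined earlier.

[ ]:
# Columns of results_arr follow the order of the `classifiers` list.
for name, acc in zip(classifiers, results_arr.mean(axis=0)):
    print(f"{name}: {acc:.4f}")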

If you have any questions about these results or the datasets, the best place to raise an issue is the associated repository.

Plotting results

Once you have the results you want, you can compare classifiers with built-in aeon tools.

For example, you can draw a critical difference diagram [8]. This displays the average rank of each estimator over all datasets, then groups estimators for which there is no significant difference in rank into cliques, shown with a solid bar. The diagram below uses pairwise Wilcoxon signed-rank tests and forms cliques using the Holm correction for multiple testing, as described in [9, 10]. The alpha value is 0.05 (the default).

In the example below, using the results loaded above, InceptionTimeClassifier and RDSTClassifier are not significantly different in ranking. FreshPRINCEClassifier is significantly worse than the other three classifiers, while HIVECOTEV2 is significantly better than all others.

[12]:
from aeon.visualisation import plot_critical_difference

plot_critical_difference(results_arr, classifiers, test="wilcoxon", correction="holm")
[12]:
(<Figure size 600x230 with 1 Axes>, <Axes: >)
[Figure: critical difference diagram for the four classifiers]

Besides critical difference diagrams, different types of boxplot can be plotted. Boxplots graphically show the locality, spread and skewness of the results.

[13]:
from aeon.visualisation import plot_boxplot

plot_boxplot(
    results_arr,
    classifiers,
    plot_type="boxplot",
)
[13]:
(<Figure size 1000x600 with 1 Axes>, <Axes: >)
[Figure: boxplot of accuracies for the four classifiers]
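
Other styles can be selected with the plot_type parameter. As a sketch, assuming plot_type also accepts "violin" (check the plot_boxplot documentation for the supported values):

[ ]:
# Assumption: plot_type="violin" is a supported value of plot_boxplot.
plot_boxplot(
    results_arr,
    classifiers,
    plot_type="violin",
)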

The critical difference diagram above showed that InceptionTimeClassifier is not significantly better than RDSTClassifier. To compare the results of these two approaches directly, we can plot a scatter in which each point is the pair of accuracies of the two approaches on one dataset. The number of wins, ties and losses (W, T, L) for each approach is included in the legend.

[14]:
from aeon.visualisation import plot_pairwise_scatter

plot_pairwise_scatter(
    results_arr[:, 2],
    results_arr[:, 3],
    classifiers[2],
    classifiers[3],
)
[14]:
(<Figure size 800x800 with 1 Axes>,
 <Axes: xlabel='InceptionTimeClassifier accuracy\n(mean: 0.8743)', ylabel='RDSTClassifier accuracy\n(mean: 0.8763)'>)
[Figure: pairwise scatter of InceptionTimeClassifier vs RDSTClassifier accuracies]

References

[1] Middlehurst, M., Schäfer, P. and Bagnall, A., 2024. Bake off redux: a review and experimental evaluation of recent time series classification algorithms. Data Mining and Knowledge Discovery, pp.1-74.

[2] Holder, C., Middlehurst, M. and Bagnall, A., 2024. A review and evaluation of elastic distance functions for time series clustering. Knowledge and Information Systems, 66(2), pp.765-809.

[3] Guijo-Rubio, D., Middlehurst, M., Arcencio, G., Silva, D.F. and Bagnall, A., 2024. Unsupervised feature based algorithms for time series extrinsic regression. Data Mining and Knowledge Discovery, pp.1-45.

[4] Middlehurst, M. and Bagnall, A., 2022, May. The FreshPRINCE: A simple transformation based pipeline time series classifier. In International Conference on Pattern Recognition and Artificial Intelligence (pp. 150-161). Cham: Springer International Publishing.

[5] Ismail Fawaz, H., Lucas, B., Forestier, G., Pelletier, C., Schmidt, D.F., Weber, J., Webb, G.I., Idoumghar, L., Muller, P.A. and Petitjean, F., 2020. InceptionTime: Finding AlexNet for time series classification. Data Mining and Knowledge Discovery, 34(6), pp.1936-1962.

[6] Middlehurst, M., Large, J., Flynn, M., Lines, J., Bostrom, A. and Bagnall, A., 2021. HIVE-COTE 2.0: a new meta ensemble for time series classification. Machine Learning, 110(11), pp.3211-3243.

[7] Guillaume, A., Vrain, C. and Elloumi, W., 2022, June. Random dilated shapelet transform: A new approach for time series shapelets. In International Conference on Pattern Recognition and Artificial Intelligence (pp. 653-664). Cham: Springer International Publishing.

[8] Garcia, S. and Herrera, F., 2008. An extension on "Statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. Journal of Machine Learning Research, 9(12).

[9] Benavoli, A., Corani, G. and Mangili, F., 2016. Should we really use post-hoc tests based on mean-ranks? Journal of Machine Learning Research, 17(1), pp.152-161.

[10] Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, pp.1-30.

