plot_significance

plot_significance(scores, labels, alpha=0.1, lower_better=False, test='wilcoxon', correction='holm', fontsize=12, reverse=True, return_p_values=False)

Plot similar to critical difference diagrams (CDDs), but allows for the case where cliques can be deceiving.

It is a visual representation of the results of a statistical test that compares the performance of different classifiers, based on the work of Demsar [1]. The plot shows the average rank and the average score of each method, together with the cliques of classifiers that are not significantly different from each other.

The main difference from CDDs is that this plot handles the case where cliques can be deceiving, i.e. where one method inside a clique is significantly different from the rest of that clique. In a CDD, a clique stops as soon as there is a significant difference between methods. In this plot, the clique continues until the end of the list of methods and shows a gap at the method with the significant difference.
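As a rough illustration of the "average rank" shown in the plot, the sketch below ranks the estimators within each dataset (rank 1 is best, ties share the mean rank) and then averages over datasets. The helper name `average_ranks` and the toy scores are hypothetical, and this is only an assumed reading of how such ranks are usually computed, not the library's own code.

```python
import numpy as np


def average_ranks(scores, lower_better=False):
    """Rank estimators within each dataset (1 = best) and average over datasets."""
    scores = np.asarray(scores, dtype=float)
    # Flip the sign so that "smaller is better" holds for the ranking step.
    keyed = scores if lower_better else -scores
    n_datasets, n_estimators = keyed.shape
    ranks = np.empty_like(keyed)
    for i in range(n_datasets):
        row = keyed[i]
        order = np.argsort(row, kind="stable")
        sorted_row = row[order]
        row_ranks = np.empty(n_estimators)
        start = 0
        while start < n_estimators:
            # Find the run of tied values and give each the mean of its positions.
            end = start
            while end + 1 < n_estimators and sorted_row[end + 1] == sorted_row[start]:
                end += 1
            row_ranks[order[start:end + 1]] = (start + end) / 2 + 1
            start = end + 1
        ranks[i] = row_ranks
    return ranks.mean(axis=0)


# Hypothetical accuracies: 3 datasets (rows) x 3 estimators (columns).
toy_scores = np.array([
    [0.90, 0.80, 0.70],
    [0.85, 0.85, 0.60],
    [0.70, 0.90, 0.80],
])
toy_avg_ranks = average_ranks(toy_scores)
toy_avg_ranks_lower = average_ranks(toy_scores, lower_better=True)
```

The input layout matches the `scores` parameter, an array of shape (n_datasets, n_estimators). `scipy.stats.rankdata` with `method="average"` produces the same per-row ranks if SciPy is available.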

Parameters:
scores : np.array

Array of shape (n_datasets, n_estimators) with the performance of each estimator in each dataset.

labels : list of str

List of length n_estimators with the name of each estimator.

alpha : float, default=0.1

The significance level used in the statistical test.

lower_better : bool, default=False

Whether lower scores are better than higher scores.

test : str, default="wilcoxon"

The statistical test to use. Available tests are "nemenyi" and "wilcoxon".

correction : str, default="holm"

The correction to apply to the p-values. Available corrections are "bonferroni", "holm" and None.

fontsize : int, default=12

The fontsize of the text in the plot.

reverse : bool, default=True

Whether to reverse the order of the labels.

return_p_values : bool, default=False

Whether to return the pairwise matrix of p-values.
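To make the `correction` options more concrete, the sketch below contrasts the textbook Bonferroni and Holm step-down procedures on a flat list of p-values. The function names `holm_reject` and `bonferroni_reject` and the toy p-values are hypothetical; the library's internal implementation may differ in its details.

```python
import numpy as np


def holm_reject(p_values, alpha=0.1):
    """Return a boolean array: True where Holm's step-down procedure rejects."""
    p_values = np.asarray(p_values, dtype=float)
    m = p_values.size
    order = np.argsort(p_values)
    reject = np.zeros(m, dtype=bool)
    for step, idx in enumerate(order):
        # The k-th smallest p-value (k = step + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            # Stop at the first failure; all larger p-values are retained as well.
            break
    return reject


def bonferroni_reject(p_values, alpha=0.1):
    """Bonferroni: every p-value is compared with the same threshold alpha / m."""
    p_values = np.asarray(p_values, dtype=float)
    return p_values <= alpha / p_values.size


# Hypothetical unadjusted p-values for four pairwise comparisons.
toy_p = [0.03, 0.20, 0.01, 0.04]
holm_mask = holm_reject(toy_p, alpha=0.1)
bonf_mask = bonferroni_reject(toy_p, alpha=0.1)
```

On these toy values Holm rejects three of the four hypotheses while Bonferroni rejects only one, which shows why Holm is the less conservative of the two at the same `alpha`.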

Returns:
fig : matplotlib.figure

Figure created.

ax : matplotlib.axes

Axes of the figure.

p_values : np.ndarray (optional)

If return_p_values is True, returns an (n_estimators, n_estimators) matrix of unadjusted p-values for the pairwise Wilcoxon signed-rank test.
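Because the returned p-values are unadjusted, any multiple-testing correction has to be applied by the caller before reading them as significant. The sketch below shows one way to list the estimator pairs whose p-value falls below a chosen threshold. The helper `significant_pairs`, its `threshold` argument, and the toy matrix are hypothetical, and the code assumes that reading the entries above the diagonal is sufficient.

```python
import numpy as np


def significant_pairs(p_values, labels, threshold):
    """List (label_a, label_b, p) for pairs above the diagonal with p below threshold."""
    p_values = np.asarray(p_values, dtype=float)
    rows, cols = np.triu_indices(len(labels), k=1)
    return [
        (labels[i], labels[j], float(p_values[i, j]))
        for i, j in zip(rows, cols)
        if p_values[i, j] < threshold
    ]


# Hypothetical labels and p-values, standing in for the returned matrix.
toy_labels = ["A", "B", "C"]
toy_matrix = np.array([
    [0.0, 0.02, 0.40],
    [0.0, 0.00, 0.03],
    [0.0, 0.00, 0.00],
])
toy_pairs = significant_pairs(toy_matrix, toy_labels, threshold=0.05)
```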

References

[1] Demsar J., "Statistical comparisons of classifiers over multiple data sets." Journal of Machine Learning Research 7:1-30, 2006.

Examples

>>> from aeon.visualisation import plot_significance
>>> from aeon.benchmarking.results_loaders import get_estimator_results_as_array
>>> methods = ["IT", "WEASEL-Dilation", "HIVECOTE2", "FreshPRINCE"]
>>> results = get_estimator_results_as_array(estimators=methods)
>>> # plot_significance returns the figure and its axes (see Returns)
>>> fig, ax = plot_significance(results[0], methods, alpha=0.1)
>>> fig.show()
>>> fig.savefig("significance_plot.pdf", bbox_inches="tight")