create_multi_comparison_matrix

create_multi_comparison_matrix(results, save_path='./mcm', formats=None, statistic_name='Score', plot_1v1_comparisons=False, higher_stat_better=True, include_pvalue=True, pvalue_test='wilcoxon', pvalue_test_params=None, pvalue_correction='Holm', pvalue_threshold=0.05, order_stats='average-statistic', order_stats_increasing=False, precision=4, row_comparates=None, col_comparates=None, excluded_row_comparates=None, excluded_col_comparates=None, colormap='coolwarm', fig_size='auto', font_size='auto', colorbar_orientation='vertical', colorbar_value=None, win_tie_loss_labels=None, include_legend=True, show_symmetry=True)

Generate the Multi-Comparison Matrix (MCM) [1].

MCM summarises a set of results for multiple estimators evaluated on multiple datasets. The MCM is a heatmap that shows absolute performance and tests for significant differences. It is configurable in many ways.

Note: this implementation uses different pvalue parameters from the original by default. To use the original parameters, set pvalue_test_params to {"zero_method": "pratt", "alternative": "two-sided"} and pvalue_correction to None.
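As a minimal sketch of how the note above might be applied, the original p-value settings can be collected in a dictionary and unpacked into the call. The helper name `original_pvalue_settings` is hypothetical. The commented-out call assumes that `results` is a dataframe of scores and that the function has already been imported.

```python
def original_pvalue_settings():
    """Return keyword arguments for the original MCM p-value setup.

    Hypothetical helper for illustration only; the values are taken
    from the note above.
    """
    return {
        "pvalue_test": "wilcoxon",
        "pvalue_test_params": {"zero_method": "pratt", "alternative": "two-sided"},
        "pvalue_correction": None,
    }


kwargs = original_pvalue_settings()

# Assumed usage, with `results` a DataFrame of scores (one column per estimator):
# fig = create_multi_comparison_matrix(results, **kwargs)
```

Keeping the settings in one place makes it easier to switch between the default and the original configuration when comparing outputs.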

Parameters:
results: pd.DataFrame

A dataframe of scores. Columns are the names of the estimators and rows are the different problems. The estimator names present in the columns will be used as the comparate names in the MCM.

save_path: str, default = ‘./mcm’

The output directory for the results. If you want to save the results with a different filename, you must include the filename in the path. (e.g., ‘./your_filename’)

formats: str or list of str, default = None

File formats to save in the save_path.
- If None, no files are saved.
- Valid formats are ‘pdf’, ‘png’, ‘json’, ‘csv’, ‘tex’.

statistic_name: str, default = ‘Score’

Name of the metric being assessed (e.g. accuracy, error, mse). By default it is generically labelled ‘Score’.

plot_1v1_comparisons: bool, default = False

Whether to plot the 1v1 scatter results.

higher_stat_better: bool, default = True

Whether a higher value of the statistic is better. This determines what counts as a win or a loss for a given statistic.

include_pvalue: bool, default = True

Whether to include the pvalue stats.

pvalue_test: str, default = ‘wilcoxon’

The statistical test to produce the pvalue stats. Currently only wilcoxon is supported.

pvalue_test_params: dict, default = None

The parameters passed to the pvalue_test. If pvalue_test is ‘wilcoxon’, see the scipy.stats.wilcoxon parameters. If pvalue_test is ‘wilcoxon’ and this parameter is None, the default setup is {“zero_method”: “wilcox”, “alternative”: “greater”}.

pvalue_correction: str, default = ‘Holm’

Correction to use for the pvalue significance test, either None or “Holm”.

pvalue_threshold: float, default = 0.05

Threshold for deciding whether a comparison is significant. If pvalue < pvalue_threshold, the comparison is significant.
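To illustrate how a Holm correction and a threshold interact, here is a minimal sketch of the textbook Holm step-down adjustment in plain Python. It is not taken from this function's source, so the internal implementation may differ in detail. The names `holm_adjust` and `significant` are hypothetical.

```python
def holm_adjust(pvalues):
    """Return Holm step-down adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        scaled = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


def significant(pvalues, pvalue_threshold=0.05):
    """Flag each comparison whose adjusted p-value is below the threshold."""
    return [p < pvalue_threshold for p in holm_adjust(pvalues)]


raw = [0.01, 0.04, 0.03]  # made-up p-values for three comparisons
adjusted = holm_adjust(raw)
flags = significant(raw)
```

With these made-up values, only the first comparison stays below 0.05 after adjustment, although all three raw p-values are below it.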

order_stats: str, default = ‘average-statistic’

The way to order the used_statistic. The default setup orders by average statistic over all datasets. The options are:

================= ==========================================
method            what it does
================= ==========================================
average-statistic average used_statistic over all datasets
average-rank      average rank over all datasets
max-wins          maximum number of wins over all datasets
amean-amean       average over difference of use_mean
pvalue            average pvalue over all comparates
================= ==========================================
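The difference between the first two ordering methods can be sketched in plain Python. This is an illustration on made-up scores, not the function's own code. The helper names and the tie handling (tied scores share the average of their ranks) are assumptions.

```python
from statistics import mean


def average_statistic(scores):
    """Mean score of each comparate over all datasets."""
    return {name: mean(values) for name, values in scores.items()}


def average_rank(scores, higher_stat_better=True):
    """Mean per-dataset rank of each comparate (rank 1 is best)."""
    names = list(scores)
    n_datasets = len(scores[names[0]])
    totals = {name: 0.0 for name in names}
    for i in range(n_datasets):
        column = [scores[name][i] for name in names]
        for name in names:
            value = scores[name][i]
            if higher_stat_better:
                better = sum(1 for other in column if other > value)
            else:
                better = sum(1 for other in column if other < value)
            equal = sum(1 for other in column if other == value)
            # Tied values share the average of the ranks they occupy.
            totals[name] += better + (equal + 1) / 2
    return {name: totals[name] / n_datasets for name in names}


# Made-up scores: one list entry per dataset.
scores = {
    "A": [0.99, 0.70, 0.70],
    "B": [0.75, 0.75, 0.75],
    "C": [0.60, 0.60, 0.60],
}

# Decreasing average statistic, as with order_stats_increasing=False.
by_stat = sorted(scores, key=average_statistic(scores).get, reverse=True)
# Best (lowest) average rank first.
by_rank = sorted(scores, key=average_rank(scores).get)
```

Here "A" leads on average score because of one very high result, while "B" leads on average rank because it wins on two of the three datasets.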

order_stats_increasing: bool, default = False

If True, the order_stats will be ordered in increasing order, otherwise they are ordered in decreasing order.

precision: int, default = 4

The number of digits after the decimal point.

row_comparates: list of str, default = None

A list of included row comparates. If None, all the comparates in the study are placed in the rows.

col_comparates: list of str, default = None

A list of included col comparates. If None, all the comparates in the study are placed in the cols.

excluded_row_comparates: list of str, default = None

A list of excluded row comparates. If None, all comparates are included.

excluded_col_comparates: list of str, default = None

A list of excluded col comparates. If None, all comparates are included.

colormap: str, default = ‘coolwarm’

The matplotlib colormap to use. If set to None, no colormap is used and the heatmap is turned off, so no colors are shown.

fig_size: str or tuple of two int, default = ‘auto’

The height and width of the figure. If ‘auto’, the size is computed by the _get_fig_size function in utils.py. Note that the fig size values are in matplotlib units.

font_size: int or str, default = ‘auto’

The font size of text.

colorbar_orientation: str, default = ‘vertical’

The orientation in which to show the colorbar, either ‘horizontal’ or ‘vertical’.

colorbar_value: str, default = None

The values on which the heatmap colors are based.

win_tie_loss_labels: tuple of str or None, default = None

Custom labels for heatmap cells, in the form (win_label, tie_label, loss_label). If win_tie_loss_labels=None, default labels are chosen based on higher_stat_better:
- If higher_stat_better=True, defaults to (‘r>c’, ‘r=c’, ‘r<c’)
- If higher_stat_better=False, defaults to (‘r<c’, ‘r=c’, ‘r>c’)
The tuple must contain exactly three strings, representing win, tie, and loss outcomes for the row comparate (r) against the column comparate (c).
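As a rough sketch of what the three labels describe, the snippet below counts wins, ties, and losses of a row comparate against a column comparate on made-up scores. The function names are hypothetical and the counting logic is an assumption about the intended meaning, not the function's own code.

```python
def win_tie_loss(row_scores, col_scores, higher_stat_better=True):
    """Count wins, ties and losses of the row comparate against the column one."""
    wins = ties = losses = 0
    for r, c in zip(row_scores, col_scores):
        if r == c:
            ties += 1
        elif (r > c) == higher_stat_better:
            # r > c is a win when higher is better; r < c is a win otherwise.
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


def default_labels(higher_stat_better=True):
    """Default (win, tie, loss) labels as described above."""
    if higher_stat_better:
        return ("r>c", "r=c", "r<c")
    return ("r<c", "r=c", "r>c")


# Made-up scores on four datasets.
row = [0.90, 0.80, 0.70, 0.60]
col = [0.85, 0.80, 0.75, 0.65]

counts_higher = win_tie_loss(row, col, higher_stat_better=True)
counts_lower = win_tie_loss(row, col, higher_stat_better=False)
```

The same scores give one win and two losses when higher is better, and two wins and one loss when lower is better. This is why the default labels swap with higher_stat_better.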

include_legend: bool, default = True

Whether to show the legend on the MCM.

show_symmetry: bool, default = True

Whether to show the symmetrical part of the heatmap.

Returns:
fig: plt.Figure

The figure object of the heatmap.

Notes

Developed from the code in https://github.com/MSD-IRIMAS/Multi_Comparison_Matrix

References

[1] Ismail-Fawaz A. et al, An Approach To Multiple Comparison Benchmark Evaluations That Is Stable Under Manipulation Of The Comparate Set, arXiv preprint arXiv:2305.11921, 2023.