create_multi_comparison_matrix

create_multi_comparison_matrix(df_results, save_path='./mcm', formats=None, used_statistic='Accuracy', plot_1v1_comparisons=False, higher_stat_better=True, include_pvalue=True, pvalue_test='wilcoxon', pvalue_test_params=None, pvalue_correction=None, pvalue_threshold=0.05, use_mean='mean-difference', order_stats='average-statistic', order_stats_increasing=False, dataset_column=None, precision=4, load_analysis=False, row_comparates=None, col_comparates=None, excluded_row_comparates=None, excluded_col_comparates=None, colormap='coolwarm', fig_size='auto', font_size='auto', colorbar_orientation='vertical', colorbar_value=None, win_tie_loss_labels=None, include_legend=True, show_symetry=True)

Generate the Multi-Comparison Matrix (MCM) [1].

MCM summarises a set of results for multiple estimators evaluated on multiple datasets. The MCM is a heatmap that shows absolute performance and tests for significant differences. It is configurable in many ways.

Parameters:
df_results: str or pd.DataFrame

A path to a csv file, or a DataFrame, containing results in (n_problems, n_estimators) format. The first row should contain the names of the estimators, and the first column can contain the names of the problems if dataset_column is set.
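As a rough illustration of the expected layout, the sketch below builds a small results table in memory and parses it with the standard library only. The estimator names, problem names, scores, and the `dataset_name` header are made up for the example and are not part of the function's API.

```python
import csv
import io

# Hypothetical results: one row per problem, one column per estimator.
# The first row holds estimator names; the first column holds problem names.
raw = """dataset_name,clf_a,clf_b,clf_c
ds1,0.91,0.88,0.79
ds2,0.75,0.80,0.70
ds3,0.66,0.66,0.61
"""

rows = list(csv.DictReader(io.StringIO(raw)))
estimators = [name for name in rows[0] if name != "dataset_name"]

# results[estimator] -> list of scores, aligned with the order of `problems`
problems = [row["dataset_name"] for row in rows]
results = {est: [float(row[est]) for row in rows] for est in estimators}

print(problems)
print(results["clf_a"])
```

With a file laid out like this, the name of the first column would be the value to pass as dataset_column.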

save_path: str, default = ‘./mcm’

The output path for the results. To save the results under a different filename, include the filename in the path (e.g., ‘./your_filename’).

formats: str or list of str, default = None

File formats to save in the save_path.

- If None, no files are saved.
- Valid formats are ‘pdf’, ‘png’, ‘json’, ‘csv’, ‘tex’.

used_statistic: str, default = ‘Accuracy’

Name of the metric being assessed (e.g. accuracy, error, mse).

save_as_json: bool, default = True

Whether or not to save the Python analysis dict to a JSON file.

plot_1v1_comparisons: bool, default = False

Whether or not to plot the 1v1 scatter results.

higher_stat_better: bool, default = True

Whether a higher value of the statistic counts as a win. Set to False for statistics where lower is better (e.g. error).
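A minimal sketch of how this flag changes the meaning of a win, assuming wins, ties, and losses are counted per dataset for a row comparate against a column comparate. The helper `win_tie_loss` and the scores are hypothetical and only illustrate the idea; they are not the library's implementation.

```python
def win_tie_loss(row_scores, col_scores, higher_stat_better=True):
    """Count per-dataset wins, ties, and losses of the row comparate."""
    wins = ties = losses = 0
    for r, c in zip(row_scores, col_scores):
        if r == c:
            ties += 1
        elif (r > c) == higher_stat_better:
            # r > c is a win when higher is better; r < c is a win otherwise.
            wins += 1
        else:
            losses += 1
    return wins, ties, losses

scores_a = [0.91, 0.75, 0.66, 0.80]
scores_b = [0.88, 0.80, 0.66, 0.70]

print(win_tie_loss(scores_a, scores_b))                            # (2, 1, 1)
print(win_tie_loss(scores_a, scores_b, higher_stat_better=False))  # (1, 1, 2)
```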

include_pvalue: bool, default = True

Whether or not to include p-value statistics.

pvalue_test: str, default = ‘wilcoxon’

The statistical test used to produce the p-values. Currently only ‘wilcoxon’ is supported.

pvalue_test_params: dict, default = None

The parameters passed to the pvalue_test used. If pvalue_test is ‘wilcoxon’, see the scipy.stats.wilcoxon parameters. If pvalue_test is ‘wilcoxon’ and this parameter is None, the default setup is {"zero_method": "pratt", "alternative": "greater"}.

pvalue_correction: str, default = None

Correction to apply to the p-value significance test: None or “Holm”.

pvalue_threshold: float, default = 0.05

Threshold for deciding whether a comparison is significant. If pvalue < pvalue_threshold, the comparison is significant.
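To make the interaction between the correction and the threshold concrete, here is a sketch of Holm's step-down procedure in plain Python. It is a generic textbook version, not the library's code: the helper name `holm_significant` and the p-values are invented, the strict `<` mirrors the rule stated above, and which family of comparisons the library corrects over is not specified here.

```python
def holm_significant(pvalues, threshold=0.05):
    """Return one bool per p-value: True if significant under Holm's procedure."""
    m = len(pvalues)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is tested against
        # threshold / (m - k + 1), which equals threshold / (m - rank).
        if pvalues[idx] < threshold / (m - rank):
            significant[idx] = True
        else:
            break  # all remaining (larger) p-values are non-significant too
    return significant

pvals = [0.030, 0.001, 0.020, 0.400]
uncorrected = [p < 0.05 for p in pvals]
corrected = holm_significant(pvals, threshold=0.05)

print(uncorrected)  # [True, True, True, False]
print(corrected)    # [False, True, False, False]
```

In this made-up example, three comparisons pass the raw threshold, but only the smallest p-value survives the correction.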

use_mean: str, default = ‘mean-difference’

The mean used to compare two estimators. The only option available is ‘mean-difference’, which is the difference between the arithmetic means over all datasets.
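The quantity itself is simple; a sketch is given below. The sign convention (row comparate minus column comparate) is an assumption made for this example, and the helper name and scores are hypothetical.

```python
from statistics import fmean

def mean_difference(row_scores, col_scores):
    """Arithmetic mean of the row comparate minus that of the column comparate."""
    return fmean(row_scores) - fmean(col_scores)

scores_a = [0.91, 0.75, 0.66, 0.80]
scores_b = [0.88, 0.80, 0.66, 0.70]

diff = mean_difference(scores_a, scores_b)
print(round(diff, 4))
```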

order_stats: str, default = ‘average-statistic’

The way to order the used_statistic. The default setup orders by the average statistic over all datasets. The options are:

=================  ========================================
method             what it does
=================  ========================================
average-statistic  average used_statistic over all datasets
average-rank       average rank over all datasets
max-wins           maximum number of wins over all datasets
amean-amean        average over difference of use_mean
pvalue             average pvalue over all comparates
=================  ========================================

order_stats_increasing: bool, default = False

If True, the order_stats will be ordered in increasing order, otherwise they are ordered in decreasing order.
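The sketch below illustrates two of the ordering criteria, ‘average-statistic’ and ‘average-rank’, on invented data. It assumes rank 1 is the best score on a dataset and that tied estimators share the mean of the ranks they occupy; the library's tie handling may differ, and `average_ranks` is a hypothetical helper.

```python
from statistics import fmean

def average_ranks(results, higher_stat_better=True):
    """Average per-dataset rank of each estimator (rank 1 = best)."""
    names = list(results)
    n_datasets = len(next(iter(results.values())))
    totals = {name: 0.0 for name in names}
    for d in range(n_datasets):
        scores = {name: results[name][d] for name in names}
        for name in names:
            if higher_stat_better:
                better = sum(s > scores[name] for s in scores.values())
            else:
                better = sum(s < scores[name] for s in scores.values())
            equal = sum(s == scores[name] for s in scores.values())  # includes itself
            # Tied estimators share the mean of the ranks they occupy.
            totals[name] += better + (equal + 1) / 2
    return {name: totals[name] / n_datasets for name in names}

results = {
    "clf_a": [0.91, 0.75, 0.66],
    "clf_b": [0.88, 0.80, 0.66],
    "clf_c": [0.79, 0.70, 0.61],
}

avg_stat = {name: fmean(scores) for name, scores in results.items()}
avg_rank = average_ranks(results)

# Decreasing order of the average statistic, as with order_stats_increasing=False.
order_stats_increasing = False
by_avg_stat = sorted(avg_stat, key=avg_stat.get, reverse=not order_stats_increasing)

print(by_avg_stat)  # ['clf_b', 'clf_a', 'clf_c']
print(avg_rank)     # {'clf_a': 1.5, 'clf_b': 1.5, 'clf_c': 3.0}
```

Note that the two criteria can disagree: here clf_a and clf_b share the same average rank, while the average statistic places clf_b first.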

dataset_column: str, default = None

The name of the datasets column in the csv file.

precision: int, default = 4

The number of digits after the decimal point.

load_analysis: bool, default = False

If True, attempts to load the analysis JSON file.

row_comparates: list of str, default = None

A list of included row comparates. If None, all of the comparates in the study are placed in the rows.

col_comparates: list of str, default = None

A list of included col comparates. If None, all of the comparates in the study are placed in the cols.

excluded_row_comparates: list of str, default = None

A list of excluded row comparates. If None, all comparates are included.

excluded_col_comparates: list of str, default = None

A list of excluded col comparates. If None, all comparates are included.
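One plausible reading of how the inclusion and exclusion lists combine is sketched below. The helper `select_comparates` is hypothetical, and applying the exclusion list after the inclusion list is an assumption of this example, not documented behaviour.

```python
def select_comparates(all_comparates, included=None, excluded=None):
    """Pick the comparates shown on one axis of the matrix."""
    # None for `included` means "keep every comparate in the study".
    if included is None:
        selected = list(all_comparates)
    else:
        selected = [c for c in all_comparates if c in included]
    # Assumption: exclusions are applied after inclusions.
    if excluded is not None:
        selected = [c for c in selected if c not in excluded]
    return selected

all_comparates = ["clf_a", "clf_b", "clf_c", "clf_d"]

rows = select_comparates(all_comparates, excluded=["clf_d"])
cols = select_comparates(all_comparates, included=["clf_a", "clf_d"])

print(rows)  # ['clf_a', 'clf_b', 'clf_c']
print(cols)  # ['clf_a', 'clf_d']
```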

colormap: str, default = ‘coolwarm’

The matplotlib colormap to use. If set to None, no colormap is used and the heatmap is turned off, so no colors will be seen.

fig_size: str or tuple of two int, default = ‘auto’

The height and width of the figure. If ‘auto’, the size is computed by the _get_fig_size function in utils.py. Note that the figure size values are in matplotlib units.

font_size: str or int, default = ‘auto’

The font size of text.

colorbar_orientation: str, default = ‘vertical’

The orientation in which to show the colorbar, either ‘horizontal’ or ‘vertical’.

colorbar_value: str, default = None

The values on which the heatmap colors are based.

win_tie_loss_labels: tuple of str or None, default = None

Custom labels for heatmap cells, in the form (win_label, tie_label, loss_label). If win_tie_loss_labels=None, default labels are chosen based on higher_stat_better:

- If higher_stat_better=True, defaults to (‘r>c’, ‘r=c’, ‘r<c’).
- If higher_stat_better=False, defaults to (‘r<c’, ‘r=c’, ‘r>c’).

The tuple must contain exactly three strings, representing win, tie, and loss outcomes for the row comparate (r) against the column comparate (c).
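The default selection described above can be sketched as a small function. The name `resolve_win_tie_loss_labels` and the validation error are hypothetical; the default tuples are the ones listed in the description.

```python
def resolve_win_tie_loss_labels(win_tie_loss_labels=None, higher_stat_better=True):
    """Return the (win, tie, loss) labels to display in the heatmap cells."""
    if win_tie_loss_labels is None:
        # A win for the row comparate means r>c when higher is better,
        # and r<c when lower is better.
        if higher_stat_better:
            return ("r>c", "r=c", "r<c")
        return ("r<c", "r=c", "r>c")
    labels = tuple(win_tie_loss_labels)
    if len(labels) != 3 or not all(isinstance(label, str) for label in labels):
        raise ValueError("win_tie_loss_labels must contain exactly three strings")
    return labels

print(resolve_win_tie_loss_labels())                          # ('r>c', 'r=c', 'r<c')
print(resolve_win_tie_loss_labels(higher_stat_better=False))  # ('r<c', 'r=c', 'r>c')
print(resolve_win_tie_loss_labels(("win", "tie", "loss")))    # ('win', 'tie', 'loss')
```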

include_legend: bool, default = True

Whether or not to show the legend on the MCM.

show_symetry: bool, default = True

Whether or not to show the symmetrical part of the heatmap.

Returns:
fig: plt.Figure

The figure object of the heatmap.

Notes

Developed from the code in https://github.com/MSD-IRIMAS/Multi_Comparison_Matrix

References

[1] Ismail-Fawaz A. et al., An Approach To Multiple Comparison Benchmark Evaluations That Is Stable Under Manipulation Of The Comparate Set. arXiv preprint arXiv:2305.11921, 2023.