create_multi_comparison_matrix

create_multi_comparison_matrix(df_results, output_dir='./', pdf_savename=None, png_savename=None, csv_savename=None, tex_savename=None, used_statistic='Accuracy', save_as_json=False, plot_1v1_comparisons=False, order_win_tie_loss='higher', include_pvalue=True, pvalue_test='wilcoxon', pvalue_test_params=None, pvalue_correction=None, pvalue_threshold=0.05, use_mean='mean-difference', order_stats='average-statistic', order_better='decreasing', dataset_column=None, precision=4, load_analysis=False, row_comparates=None, col_comparates=None, excluded_row_comparates=None, excluded_col_comparates=None, colormap='coolwarm', fig_size='auto', font_size='auto', colorbar_orientation='vertical', colorbar_value=None, win_label='r>c', tie_label='r=c', loss_label='r<c', include_legend=True, show_symetry=True)[source]

Generate the Multi-Comparison Matrix (MCM) [1].

MCM summarises a set of results for multiple estimators evaluated on multiple datasets. The MCM is a heatmap that shows absolute performance and tests for significant differences. It is configurable in many ways.

Parameters:
df_results: str or pd.DataFrame

A csv file or DataFrame containing results in (n_problems, n_estimators) format. The first row should contain the names of the estimators and the first column can contain the names of the problems if dataset_column is set (see the input sketch after the parameter list).

output_dir: str, default = ‘./’

The output directory for the results.

pdf_savename: str, default = None

The name of the file to save in pdf format. If None, the figure is not saved in this format.

png_savename: str, default = None

The name of the file to save in png format. If None, the figure is not saved in this format.

csv_savename: str, default = None

The name of the file to save in csv format. If None, the results are not saved in this format.

tex_savename: str, default = None

The name of the file to save in tex format. If None, the results are not saved in this format.

used_statistic: str, default = ‘Accuracy’

Name of the metric being assessed (e.g. accuracy, error, mse).

save_as_json: bool, default = False

Whether or not to save the python analysis dict to a json file.

plot_1v1_comparisons: bool, default = False

Whether or not to plot the 1v1 scatter results.

order_win_tie_loss: str, default = ‘higher’

The ordering used when counting wins and losses for the given statistic; ‘higher’ means larger values are better.

include_pvalue: bool, default = True

Whether or not to include the p-value statistic.

pvalue_test: str, default = ‘wilcoxon’

The statistical test used to produce the p-value statistic. Currently only ‘wilcoxon’ is supported.

pvalue_test_params: dict, default = None,

The parameters passed to the pvalue_test used. If pvalue_test is ‘wilcoxon’, see the scipy.stats.wilcoxon parameters; if this argument is None, the default setup {“zero_method”: “pratt”, “alternative”: “greater”} is used (see the usage sketch at the end of this page).

pvalue_correction: str, default = None

Correction to apply to the p-value significance test; either None or “Holm”.

pvalue_threshold: float, default = 0.05

Threshold for considering a comparison significant or not. If pvalue < pvalue_threshold, the comparison is significant.

use_mean: str, default = ‘mean-difference’

The mean used to compare two estimators. The only option available is ‘mean-difference’, which is the difference between the arithmetic means over all datasets.

order_stats: str, default = ‘average-statistic’

The way to order the used_statistic; the default orders by the average statistic over all datasets. The options are:

average-statistic : average used_statistic over all datasets
average-rank : average rank over all datasets
max-wins : maximum number of wins over all datasets
amean-amean : average over the differences of use_mean
pvalue : average pvalue over all comparates

order_better: str, default = ‘decreasing’

The order in which to sort the statistics, from best to worst.

dataset_column: str, default = None

The name of the column containing the dataset names in the csv file, if one is present.

precision: int, default = 4

The number of digits shown after the decimal point.

load_analysis: bool, default = False

If True, attempts to load the analysis json file.

row_comparates: list of str, default = None

A list of row comparates to include; if None, all of the comparates in the study are placed in the rows.

col_comparates: list of str, default = None

A list of column comparates to include; if None, all of the comparates in the study are placed in the columns.

excluded_row_comparates: list of str, default = None

A list of excluded row comparates. If None, all comparates are included.

excluded_col_comparates: list of str, default = None

A list of excluded col comparates. If None, all comparates are included.

colormap: str, default = ‘coolwarm’

The matplotlib colormap to use. If None, no colormap is used, the heatmap is turned off and no colours are shown.

fig_size: str or tuple of two int, default = ‘auto’

The height and width of the figure. If ‘auto’, the _get_fig_size function in utils.py is used. Note that the figure size values are in matplotlib units.

font_size: int or str, default = ‘auto’

The font size of the text. If ‘auto’, the font size is chosen automatically.

colorbar_orientation: str, default = ‘vertical’

The orientation of the colorbar, either ‘horizontal’ or ‘vertical’.

colorbar_value: str, default = None

The values on which the heatmap colours are based.

win_label: str, default = “r>c”

The winning label to be set on the MCM.

tie_label: str, default = “r=c”

The tie label to be set on the MCM.

loss_label: str, default = “r<c”

The loss label to be set on the MCM.

include_legend: bool, default = True

Whether or not to show the legend on the MCM.

show_symetry: bool, default = True

Whether or not to show the symmetrical part of the heatmap.
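
The expected layout of df_results is one row per dataset and one column per estimator, holding the values of used_statistic. Below is a minimal sketch with made-up estimator and dataset names; the dataset-name column is optional and must be named via dataset_column:

>>> import pandas as pd
>>> df_results = pd.DataFrame(
...     {
...         "dataset_name": ["dataset_1", "dataset_2", "dataset_3"],
...         "estimator_a": [0.90, 0.85, 0.78],
...         "estimator_b": [0.88, 0.87, 0.80],
...         "estimator_c": [0.91, 0.83, 0.75],
...     }
... )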

Returns:
fig: plt.Figure

The figure object of the heatmap.

Notes

Developed from the code in https://github.com/MSD-IRIMAS/Multi_Comparison_Matrix

References

[1] Ismail-Fawaz, A. et al., “An Approach To Multiple Comparison Benchmark Evaluations That Is Stable Under Manipulation Of The Comparate Set”, arXiv preprint arXiv:2305.11921, 2023.
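
A usage sketch building on the df_results example above. The parameter choices and the save name are illustrative, not defaults, and the import location of create_multi_comparison_matrix depends on your installation:

>>> fig = create_multi_comparison_matrix(
...     df_results,
...     output_dir="./",
...     used_statistic="Accuracy",
...     dataset_column="dataset_name",
...     pvalue_test="wilcoxon",
...     pvalue_test_params={"zero_method": "pratt", "alternative": "greater"},
...     pvalue_threshold=0.05,
...     png_savename="mcm_accuracy",
... )

The returned matplotlib Figure can then be customised or saved further with standard matplotlib calls (e.g. fig.savefig).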