create_multi_comparison_matrix

create_multi_comparison_matrix(df_results, save_path='./mcm', formats=None, used_statistic='Accuracy', plot_1v1_comparisons=False, higher_stat_better=True, include_pvalue=True, pvalue_test='wilcoxon', pvalue_test_params=None, pvalue_correction=None, pvalue_threshold=0.05, use_mean='mean-difference', order_stats='average-statistic', order_stats_increasing=False, dataset_column=None, precision=4, load_analysis=False, row_comparates=None, col_comparates=None, excluded_row_comparates=None, excluded_col_comparates=None, colormap='coolwarm', fig_size='auto', font_size='auto', colorbar_orientation='vertical', colorbar_value=None, win_tie_loss_labels=None, include_legend=True, show_symetry=True)

Generate the Multi-Comparison Matrix (MCM) [1].

MCM summarises a set of results for multiple estimators evaluated on multiple datasets. The MCM is a heatmap that shows absolute performance and tests for significant differences between each pair of estimators. It is configurable in many ways.
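To make the content of a heatmap cell more concrete, the following is a minimal sketch of the pairwise summary for one row comparate against one column comparate: the difference of arithmetic means and the win/tie/loss counts. It is inferred from the parameter descriptions on this page, not taken from the library's code; the function name `compare_pair` and the scores are hypothetical.

```python
from statistics import mean


def compare_pair(row_scores, col_scores, higher_stat_better=True):
    """Summarise one row-vs-column pairing over the same datasets.

    Hypothetical helper: illustrates the idea only, not the library's code.
    """
    if len(row_scores) != len(col_scores):
        raise ValueError("both comparates need one score per dataset")

    # Difference between the arithmetic means over all datasets.
    mean_difference = mean(row_scores) - mean(col_scores)

    wins = ties = losses = 0
    for r, c in zip(row_scores, col_scores):
        if r == c:
            ties += 1
        elif (r > c) == higher_stat_better:
            # The row comparate wins when it is better on this dataset.
            wins += 1
        else:
            losses += 1
    return mean_difference, (wins, ties, losses)


# Made-up accuracies of two estimators on four datasets.
est_a = [0.90, 0.80, 0.70, 0.60]
est_b = [0.85, 0.80, 0.75, 0.50]
diff, wtl = compare_pair(est_a, est_b)
print(round(diff, 4), wtl)
```

With `higher_stat_better=False` (for example when the statistic is an error rate), the same scores yield the win and loss counts swapped, which is why the default cell labels also flip.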

Parameters:

df_results: str or pd.DataFrame: The path to a csv file, or a DataFrame, containing results in (n_problems, n_estimators) format. The first row should contain the names of the estimators, and the first column can contain the names of the problems if dataset_column is set.
save_path: str, default = ‘./mcm’: The output directory for the results. If you want to save the results with a different filename, you must include the filename in the path. (e.g., ‘./your_filename’)
formats: str or list of str, default = None: File formats to save in the save_path. If None, no files are saved. Valid formats are ‘pdf’, ‘png’, ‘json’, ‘csv’, ‘tex’.
used_statistic: str, default = ‘Accuracy’: Name of the metric being assessed (e.g. accuracy, error, mse).
save_as_json: bool, default = True: Whether or not to save the python analysis dict into a json file format.
plot_1v1_comparisons: bool, default = False: Whether or not to plot the 1v1 scatter results.
higher_stat_better: bool, default = True: Whether a higher value of the statistic counts as a win. If False, a lower value counts as a win.
include_pvalue: bool, default = True: Whether or not to include the pvalue stats.
pvalue_test: str, default = ‘wilcoxon’: The statistical test to produce the pvalue stats. Currently only wilcoxon is supported.
pvalue_test_params: dict, default = None: The parameter set for the pvalue_test used. If pvalue_test is set to Wilcoxon, check the scipy.stats.wilcoxon parameters. If Wilcoxon is set and this parameter is None, the default setup is {“zero_method”: “pratt”, “alternative”: “greater”}.
pvalue_correction: str, default = None: Correction to use for the pvalue significance test, None or “Holm”.
pvalue_threshold: float, default = 0.05: Threshold for deciding whether a comparison is significant. If pvalue < pvalue_threshold, the comparison is significant.
use_mean: str, default = ‘mean-difference’: The mean used to compare two estimators. The only option available is ‘mean-difference’, which is the difference between the arithmetic means over all datasets.
order_stats: str, default = ‘average-statistic’: The way to order the used_statistic. The default setup orders by average statistic over all datasets. The options are:

=================  ==========================================
method             what it does
=================  ==========================================
average-statistic  average used_statistic over all datasets
average-rank       average rank over all datasets
max-wins           maximum number of wins over all datasets
amean-amean        average over difference of use_mean
pvalue             average pvalue over all comparates
=================  ==========================================
order_stats_increasing: bool, default = False: If True, the order_stats will be ordered in increasing order, otherwise they are ordered in decreasing order.
dataset_column: str, default = None: The name of the datasets column in the csv file.
precision: int, default = 4: The number of digits after the decimal point.
load_analysis: bool, default = False: If True attempts to load the analysis json file.
row_comparates: list of str, default = None: A list of included row comparates, if None, all of the comparates in the study are placed in the rows.
col_comparates: list of str, default = None: A list of included col comparates, if None, all of the comparates in the study are placed in the cols.
excluded_row_comparates: list of str, default = None: A list of excluded row comparates. If None, all comparates are included.
excluded_col_comparates: list of str, default = None: A list of excluded col comparates. If None, all comparates are included.
colormap: str, default = ‘coolwarm’: The colormap used in matplotlib. If set to None, no colormap is used and the heatmap is turned off, so no colors are shown.
fig_size: str or tuple of two int, default = ‘auto’: The height and width of the figure. If ‘auto’, the _get_fig_size function in utils.py is used. Note that the fig size values are in matplotlib units.
font_size: str or int, default = ‘auto’: The font size of text.
colorbar_orientation: str, default = ‘vertical’: The orientation of the colorbar, either ‘horizontal’ or ‘vertical’.
colorbar_value: str, default = None: The values on which the heatmap colors are based.
win_tie_loss_labels: tuple of str or None, default = None: Custom labels for heatmap cells, in the form (win_label, tie_label, loss_label). The tuple must contain exactly three strings, representing win, tie, and loss outcomes for the row comparate (r) against the column comparate (c). If win_tie_loss_labels=None, default labels are chosen based on higher_stat_better:
- If higher_stat_better=True, defaults to (‘r>c’, ‘r=c’, ‘r<c’)
- If higher_stat_better=False, defaults to (‘r<c’, ‘r=c’, ‘r>c’)
include_legend: bool, default = True: Whether or not to show the legend on the MCM.
show_symetry: bool, default = True: Whether or not to show the symmetrical part of the heatmap.
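The interaction between pvalue_correction=“Holm” and pvalue_threshold can be illustrated with the textbook Holm step-down procedure. This is a sketch of the general technique under the strict `<` comparison stated for pvalue_threshold; it is not necessarily how the library implements the correction, and the function name `holm_reject` and the p-values are hypothetical.

```python
def holm_reject(pvalues, threshold=0.05):
    """Return one bool per p-value: True if significant after Holm's step-down.

    Hypothetical helper showing the standard procedure, not the library's code.
    """
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is tested against threshold / m, the next
        # against threshold / (m - 1), and so on.
        if pvalues[idx] < threshold / (m - rank):
            significant[idx] = True
        else:
            # Step-down: once one test fails, all larger p-values fail too.
            break
    return significant


# Made-up p-values from four pairwise comparisons.
raw = [0.010, 0.040, 0.030, 0.005]
uncorrected = [p < 0.05 for p in raw]
corrected = holm_reject(raw, threshold=0.05)
print(uncorrected)
print(corrected)
```

In this made-up case all four comparisons pass the uncorrected threshold, but only the two smallest p-values remain significant after the correction, so enabling the correction can change which cells are marked as significant.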

Returns:

fig: plt.Figure: The figure object of the heatmap.
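As a usage sketch, the snippet below writes a small results file in the layout described for df_results and wraps a call that returns the figure. The import path `aeon.visualisation`, the helper names `write_results_csv` and `build_mcm`, the column name `dataset_name`, and all scores are assumptions made for illustration; only the keyword arguments come from the signature above.

```python
import csv
import os
import tempfile


def write_results_csv(path):
    """Write made-up results: one row per problem, one column per estimator."""
    rows = [
        ["dataset_name", "EstimatorA", "EstimatorB", "EstimatorC"],
        ["problem_1", 0.91, 0.88, 0.79],
        ["problem_2", 0.75, 0.77, 0.70],
        ["problem_3", 0.62, 0.60, 0.66],
    ]
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows


def build_mcm(csv_path, save_path):
    """Call the function documented here and return its figure."""
    # Assumed import path; deferred so the csv helper runs without the library.
    from aeon.visualisation import create_multi_comparison_matrix

    return create_multi_comparison_matrix(
        csv_path,
        save_path=save_path,
        formats=["png"],                # nothing is saved if formats is None
        used_statistic="Accuracy",
        higher_stat_better=True,
        pvalue_correction="Holm",
        dataset_column="dataset_name",  # first column holds the problem names
    )


results_path = os.path.join(tempfile.mkdtemp(), "results.csv")
rows = write_results_csv(results_path)
# With the library installed, the figure could then be obtained with:
# fig = build_mcm(results_path, "./mcm")
```

The returned object is described above as a plt.Figure, so it can be adjusted or saved further with the usual matplotlib figure methods.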

Notes

Developed from the code in https://github.com/MSD-IRIMAS/Multi_Comparison_Matrix

References

[1] Ismail-Fawaz A. et al, An Approach To Multiple Comparison Benchmark Evaluations That Is Stable Under Manipulation Of The Comparate Set. arXiv preprint arXiv:2305.11921, 2023.