Running the benchmark

Configuration of data paths

The paths for storing data and results are configured through the tab_bench.data.paths.Paths class. The folders can be configured in several ways, all of which are automatically recognized by Paths.from_env_variables():

Through environment variables: The base folder can be configured by setting the environment variable TAB_BENCH_DATA_BASE_FOLDER. Optionally, some sub-folders can be set separately (e.g. for moving them to another partition). These are TAB_BENCH_DATA_TASKS_FOLDER, TAB_BENCH_DATA_RESULTS_FOLDER, TAB_BENCH_DATA_RESULT_SUMMARIES_FOLDER, and TAB_BENCH_DATA_UCI_DOWNLOAD_FOLDER.
Through a python file: If TAB_BENCH_DATA_BASE_FOLDER is not available, the code will try to get the base folder (as a string) from scripts.custom_paths.get_base_folder(). This can be implemented by copying scripts/custom_paths.py.default to scripts/custom_paths.py (ignored by git) and adjusting the path therein.
If neither of the two options above is used, all data will be stored in ./tab_bench_data.
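The lookup order described above can be sketched as follows. This is a minimal illustration, not the actual implementation of Paths.from_env_variables(): the helper names resolve_base_folder and resolve_sub_folder, their parameters, and the sub-folder name "results" are hypothetical and only mirror the three-step fallback.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_base_folder(
    env: Mapping[str, str],
    custom_getter: Optional[Callable[[], str]] = None,
    default: str = "./tab_bench_data",
) -> str:
    """Hypothetical helper that mirrors the documented lookup order."""
    # 1. The environment variable takes precedence.
    base = env.get("TAB_BENCH_DATA_BASE_FOLDER")
    if base:
        return base
    # 2. Otherwise use a getter in the style of
    #    scripts.custom_paths.get_base_folder(), if one is provided.
    if custom_getter is not None:
        return custom_getter()
    # 3. Finally fall back to the default folder.
    return default


def resolve_sub_folder(
    env: Mapping[str, str], base_folder: str, env_name: str, sub_folder_name: str
) -> str:
    """Hypothetical helper: a sub-folder variable overrides the base folder."""
    # sub_folder_name is an assumed default name, not taken from the library.
    return env.get(env_name) or os.path.join(base_folder, sub_folder_name)


if __name__ == "__main__":
    base = resolve_base_folder(os.environ)
    print(base)
    print(resolve_sub_folder(os.environ, base, "TAB_BENCH_DATA_RESULTS_FOLDER", "results"))
```

Passing the environment as an argument instead of reading os.environ inside the function keeps the sketch easy to test with plain dictionaries.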

Download datasets

To download all datasets for the meta-train and meta-test benchmarks, run the following command, optionally replacing openml_cache_dir with your desired OpenML cache directory:

python3 scripts/download_data.py openml_cache_dir --import_meta_train --import_meta_test --import_grinsztajn_medium

To run methods on the benchmarks, there are two options:

Run experiments with slurm

Our benchmarking code contains its own scheduling code that will start subprocesses for each algorithm-dataset-split combination. Therefore, it is in principle possible to run all experiments through a single slurm job, though experiments can be divided into smaller pieces by running them separately.

First, in scripts/ray_slurm_template.sh, adjust the line cd ~/git/pytabkit to match your folder location. Also, make sure that the data path is specified there if you want to set it via an environment variable. Then run the following command (replacing some of the parameters with your own values) on the login node:

python3 scripts/ray_slurm_launch.py --exp_name=my_exp_name --num_nodes=num_nodes --queue="queue_name" --time=24:00:00 --mail_user="my@address.edu" --log_folder=log_folder --command="python3 -u scripts/run_slurm.py"

This will submit a job to the configured queue that will run scripts/run_slurm.py and create log files. Your experiments then have to be configured in scripts/run_slurm.py, see below. Multi-node execution is supported: ray will start instances on each node and our benchmarking code will schedule the individual experiments on the nodes.

Run experiments without slurm

Run the file with the corresponding experiments directly. For example, many of our experiment configurations can be found in scripts/run_experiments.py. One possible way to run the experiments detached from the shell with log files is

systemd-run --scope --user python3 -u scripts/run_experiments.py > ./out.log 2> ./err.log &
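If systemd-run is not available, the output redirection can be approximated from Python. The following is a rough sketch, not part of the repository: the helper run_detached is hypothetical, it only starts the process in its own session and redirects stdout and stderr to log files, and it does not reproduce the scope that systemd-run creates. The start_new_session option is POSIX-only.

```python
import subprocess
import sys


def run_detached(python_args, out_path="out.log", err_path="err.log"):
    """Hypothetical helper: start Python in the background with log files."""
    with open(out_path, "w") as out_file, open(err_path, "w") as err_file:
        proc = subprocess.Popen(
            # -u disables output buffering, as in the command above.
            [sys.executable, "-u", *python_args],
            stdout=out_file,
            stderr=err_file,
            # Run in a new session so the process is not tied to this shell.
            start_new_session=True,
        )
    # The child keeps its own handles to the log files after we close ours.
    return proc


if __name__ == "__main__":
    process = run_detached(["scripts/run_experiments.py"])
    print(f"Started process with PID {process.pid}")
```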

Time measurements

For time measurements, simply run scripts/run_time_measurements.py (with or without slurm). Results can be printed using scripts/print_runtimes.py (but these are averaged total times, not averaged per 1K samples as in the paper).

Evaluating the benchmark results

Aggregated algorithm results can be printed using

python3 scripts/run_evaluation.py meta-train-class

where meta-train-class can be replaced by the name of any other task collection (that is stored in the task_collections folder in the configured data directory), or a single dataset such as openml-class/Higgs. This script also has many more command line options, see the python file. For example, one can print only those methods with a certain tag using the --tag option, print results on individual datasets, for different metrics, etc. The parameters are the same as the ones of the following method:

scripts.run_evaluation.show_eval(coll_name='meta-train-class', n_cv=1, show_alg_groups=True, val_metric_name=None, metric_name=None, split_type='random-split', use_task_weighting=None, shift_eps=0.01, data_path=None, alg_name=None, alg_name_2=None, tag=None, max_n_splits=None, max_n_algs=None, show_val_results=False, show_train_results=False, algs_prefix=None, algs_suffix=None, algs_contains=None, exclude_datasets=None)

Prints evaluation tables on the selected datasets/algorithms. The following aggregate statistics will be printed, all of which are based on the specified metric and validation metric:

Greedy portfolio: Log shifted geometric mean test metric when greedily creating an algorithm portfolio based on the validation results. The algorithms are sorted by order of inclusion into the portfolio. Each score is obtained by selecting, on every dataset separately, the best algorithm from the portfolio up to this point, based on the validation sets.
Win fraction: Fraction of datasets (may be weighted) on which this algorithm is the best one.
Arithmetic mean rank
Arithmetic mean normalized test metric: The best method is normalized to 0 and the worst one to 1.
Arithmetic mean test metric
Log shifted geometric mean test metric: mean(log(metric+shift_eps))
Shifted geometric mean test metric: exp(mean(log(metric+shift_eps)))
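To make the last formulas concrete, here is a small pure-Python sketch of the shifted geometric mean and of the per-dataset normalization. It is an illustration under stated assumptions, not the code used by run_evaluation.py: the function names are hypothetical, lower metric values are assumed to be better (as for classification error or RMSE), weights are uniform, and a dataset where all methods tie is assumed to normalize to 0.

```python
import math


def log_shifted_geometric_mean(metrics, shift_eps=0.01):
    """mean(log(metric + shift_eps)) over datasets, with uniform weights."""
    return sum(math.log(m + shift_eps) for m in metrics) / len(metrics)


def shifted_geometric_mean(metrics, shift_eps=0.01):
    """exp(mean(log(metric + shift_eps))) over datasets."""
    return math.exp(log_shifted_geometric_mean(metrics, shift_eps))


def normalize_on_dataset(metric_by_alg):
    """Map the best (lowest) metric to 0 and the worst (highest) to 1."""
    best = min(metric_by_alg.values())
    worst = max(metric_by_alg.values())
    if worst == best:
        # Assumption: if all methods tie, every method gets 0.
        return {alg: 0.0 for alg in metric_by_alg}
    return {alg: (m - best) / (worst - best) for alg, m in metric_by_alg.items()}


if __name__ == "__main__":
    errors = [0.09, 0.99]  # made-up test errors of one method on two datasets
    print(log_shifted_geometric_mean(errors))
    print(shifted_geometric_mean(errors))
    print(normalize_on_dataset({"alg_a": 0.1, "alg_b": 0.3, "alg_c": 0.2}))
```

The shift keeps the logarithm finite when a method reaches an error of exactly 0, which is why shift_eps appears inside the logarithm rather than being added afterwards.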

Parameters:

coll_name (str) – Name of the task collection, e.g., ‘meta-train-class’
n_cv (int) – Number of cross-validation folds. Will only print results for algorithms that have been evaluated with this number of cross-validation folds.
show_alg_groups (bool) – Whether to show aggregate algorithms, such as the one that picks the best method on the validation set out of the displayed methods.
val_metric_name (str | None) – Name of the validation metric, used for the algorithm groups. By default, the same value as metric_name will be used.
metric_name (str | None) – Name of the metric that should be displayed (default = classification error / RMSE).
split_type (str) – Type of the split, normally 'random-split'.
use_task_weighting (bool | None) – Whether to weight tasks for the evaluation. If False, uniform weights are used. If True, weights based on prefixes are used. By default, weights are used only for meta-train collections.
shift_eps (float) – Epsilon parameter used in the shifted geometric mean.
data_path (str | None) – Path to the data folder where results are saved. By default, this function will take the path from Paths.from_env_variables().
alg_name (str | None) – Algorithm for which results on individual datasets should be printed
alg_name_2 (str | None) – Second algorithm for which results on individual datasets should be printed.
tag (str | None) – If specified, only print algorithms whose tags include the given tag.
max_n_splits (int | None) – If specified, only evaluate the given number of train-test splits.
max_n_algs (int | None) – Maximum number of methods that should be processed and displayed. This does not contain groups of methods (e.g. “all algs”) that will be added on top later.
show_val_results (bool) – Whether to show validation errors instead of test errors.
show_train_results (bool) – Whether to show training errors instead of test errors.
algs_prefix (str | None) – If specified, only methods with this prefix will be displayed.
algs_suffix (str | None) – If specified, only methods with this suffix will be displayed.
algs_contains (str | None) – If specified, only methods containing this substring will be displayed.
exclude_datasets (str | None) – Optional comma-separated list of datasets that will be excluded from the analysis.

Creating plots and tables

Plots and tables can be created using

python3 scripts/create_plots_and_tables.py

The plots that exclude datasets with missing values require running

python3 scripts/check_missing_values.py

once beforehand.

Single-task experiments

You can also run a configuration on a single dataset, without saving the results, by adjusting and running scripts/run_single_task.py.

Other utilities

Use scripts/analyze_tasks.py to print some dataset statistics.
You can rename a method using python3 scripts/rename_alg.py old_name new_name.
We used some code in scripts/meta_hyperopt.py to optimize the default parameters for GBDTs.
The code in scripts/estimate_resource_params.py has been used to get more precise estimates for RAM usage etc. for running methods on the benchmark.
scripts/print_complete_results.py can be used to check which methods have results available on all splits for all tasks in a given collection.