Data format

This section describes how data is stored inside the main data folder configured in the tab_bench.data.paths.Paths object (see the documentation on running the benchmark).

As file formats, we mostly use .yaml (for small, human-readable files), .msgpack.gz (for efficiently storing dicts, lists, etc.), and .npy (standard format for storing numpy arrays).
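As a rough illustration of how these three formats could be read, here is a minimal sketch. The helper names detect_format and load_file are hypothetical and not part of tab_bench. The sketch also assumes that .msgpack.gz files are gzip-compressed msgpack payloads, and it relies on the third-party packages PyYAML, msgpack, and numpy, which are imported only when needed.

```python
import gzip
from pathlib import Path


def detect_format(path):
    """Return 'yaml', 'msgpack.gz', or 'npy' based on the file name (hypothetical helper)."""
    name = Path(path).name
    for suffix in (".msgpack.gz", ".yaml", ".npy"):
        if name.endswith(suffix):
            return suffix.lstrip(".")
    raise ValueError(f"unrecognized file format: {name}")


def load_file(path):
    """Load a file with a reader chosen from its suffix (hypothetical helper)."""
    fmt = detect_format(path)
    if fmt == "yaml":
        import yaml  # PyYAML, third-party

        with open(path, "r", encoding="utf-8") as f:
            return yaml.safe_load(f)
    if fmt == "msgpack.gz":
        import msgpack  # third-party

        # Assumption: the file is a gzip-compressed msgpack payload.
        with gzip.open(path, "rb") as f:
            return msgpack.unpackb(f.read(), raw=False)
    import numpy as np  # third-party

    # np.load refuses pickled object arrays by default (allow_pickle=False).
    return np.load(path)
```

The library's own loading utilities should be preferred where available; this sketch only shows which reader matches which suffix.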

Algs folder

The following files are stored in algs/<alg_name>. See tab_bench.run.task_execution.TabBenchJobManager.add_jobs() for details on how they are stored:

  • tags.yaml contains a list of tags, which can be used to only load results for algs with certain tags.

  • extended_config.yaml contains a dictionary with the wrapper parameters, as well as the alg_name and the wrapper class name.

  • wrapper.pkl optionally contains a pickled version (using dill) of the wrapper. Our code never loads this file, since unpickling data can execute arbitrary code.

  • src: A folder containing the source files at the time of execution, as a backup.

Tasks folder

We store datasets (tasks) in folders tasks/<source_name>/<task_name>, where source_name and task_name are derived from how the tasks are imported (see also the tab_bench.data.tasks.TaskDescription class). In each of these folders, we store the following files:

  • x_cont.npy, x_cat.npy, y.npy store the three relevant tensors for the DictDataset (see the tab_models documentation).

  • task_info.yaml stores the information of a TaskInfo object.
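The layout of a task folder can be checked with a few lines of standard-library code. The following is a minimal sketch; the names TASK_FILES, task_dir, and missing_task_files are hypothetical and not part of tab_bench, and base_dir stands for the main data folder.

```python
from pathlib import Path

# Files expected in every task folder, as listed above.
TASK_FILES = ("x_cont.npy", "x_cat.npy", "y.npy", "task_info.yaml")


def task_dir(base_dir, source_name, task_name):
    """Return the folder tasks/<source_name>/<task_name> inside base_dir."""
    return Path(base_dir) / "tasks" / source_name / task_name


def missing_task_files(base_dir, source_name, task_name):
    """Return the expected file names that are absent from the task folder."""
    folder = task_dir(base_dir, source_name, task_name)
    return [name for name in TASK_FILES if not (folder / name).is_file()]
```

An empty return value from missing_task_files indicates that all four files are present.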

Task collections folder

In task_collections/<coll_name>.yaml, we store the list of tasks that a task collection with name coll_name consists of.

Results folder

We store the results of experiments in the folder results/<alg_name>/<source_name>/<task_name>/<k>-fold/<split_type>/<split_idx>. Here,

  • alg_name is the name given to the method,

  • source_name and task_name identify a task,

  • k is the number of cross-validation folds used to split the training-validation data (the test set is not part of these folds),

  • split_type is either random-split (usually the case) or default-split (not used in our benchmark),

  • split_idx is the index (starting from zero) of the trainval-test-split.

The results are stored in the files metrics.yaml and other.msgpack.gz. The former contains only the errors in different metrics; the latter contains additional outputs such as predictions (if configured to be saved), the best stopping epoch, and possibly optimized hyperparameters. These files are stored by tab_bench.run.results.ResultManager. The involved dictionaries are generated by tab_models.alg_interfaces.alg_interfaces.AlgInterface.eval().
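To make the folder pattern concrete, here is a minimal sketch that assembles the path from its components. The function result_dir is a hypothetical helper, not the API of ResultManager, and the argument values in the usage example are made up.

```python
from pathlib import Path

# The two split types named in the description above.
SPLIT_TYPES = ("random-split", "default-split")


def result_dir(base_dir, alg_name, source_name, task_name, k, split_type, split_idx):
    """Build results/<alg_name>/<source_name>/<task_name>/<k>-fold/<split_type>/<split_idx>."""
    if split_type not in SPLIT_TYPES:
        raise ValueError(f"unknown split type: {split_type}")
    return (
        Path(base_dir)
        / "results"
        / alg_name
        / source_name
        / task_name
        / f"{k}-fold"
        / split_type
        / str(split_idx)
    )


# Usage with made-up names: the metrics file for the first random split.
metrics_file = result_dir("data", "my_alg", "my_source", "my_task", 1, "random-split", 0) / "metrics.yaml"
```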

Result summaries folder

Since loading the results directly can be slow, we store accumulated versions of them in a more efficient format. Specifically, tab_bench.run.task_execution.TabBenchJobManager.run_jobs() will call tab_bench.run.task_execution.results.save_summaries(), which will generate files result_summaries/<alg_name>/<source_name>/<task_name>/<k>-fold/metrics.msgpack.gz that contain the metrics results for all splits.
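The relationship between per-split result folders and the per-task summary file can be sketched as follows. Both functions are hypothetical helpers that only mirror the folder layout described here; they do not reproduce what save_summaries() actually does.

```python
from pathlib import Path


def list_split_indices(base_dir, alg_name, source_name, task_name, k, split_type="random-split"):
    """Return the split indices that have a results folder, in numeric order."""
    parent = (
        Path(base_dir) / "results" / alg_name / source_name / task_name
        / f"{k}-fold" / split_type
    )
    if not parent.is_dir():
        return []
    # Skip entries whose names are not plain integers.
    indices = [int(p.name) for p in parent.iterdir() if p.is_dir() and p.name.isdigit()]
    return sorted(indices)


def summary_path(base_dir, alg_name, source_name, task_name, k):
    """Return the summary file that accumulates the metrics of all splits."""
    return (
        Path(base_dir) / "result_summaries" / alg_name / source_name / task_name
        / f"{k}-fold" / "metrics.msgpack.gz"
    )
```

Sorting by integer value rather than by name keeps split 10 after split 2.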

Other folders

  • Plots and LaTeX tables will be saved in the plots folder.

  • Results of estimating resource prediction parameters are saved in the resources folder.

  • Results of time measurements are saved in the times folder.

  • Downloaded datasets from the UCI repository are saved in the uci_download folder. They can be deleted after the data import in download_data.py is completed.

  • The tmp folder can be used for storing temporary files. When running experiments, methods can store intermediate results in a temporary folder in their respective results folder.