Data format
Here, we describe how data is stored
inside the main data folder configured in the tab_bench.data.paths.Paths object
(see the documentation on running the benchmark).
As file formats, we mostly use .yaml (for small, human-readable files),
.msgpack.gz (for efficiently storing dicts, lists, etc.), and .npy
(standard format for storing numpy arrays).
Algs folder
The following files are stored in algs/<alg_name>,
see tab_bench.run.task_execution.TabBenchJobManager.add_jobs()
for details on how they are stored:
- tags.yaml contains a list of tags, which can be used to only load results for algs with certain tags.
- extended_config.yaml contains a dictionary with the wrapper parameters, as well as the alg_name and the wrapper class name.
- wrapper.pkl: Optionally, a pickled version (using dill) of the wrapper. (However, our code does not load these as pickle is an unsafe format.)
- src: A folder containing the source files at the time of execution, as a backup.
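As an illustration of how tags can be used to select algs, here is a minimal sketch. It assumes the tag lists from each tags.yaml have already been parsed into a plain dictionary; the function name filter_algs_by_tags and the example alg names and tags are hypothetical and not part of tab_bench.

```python
from typing import Dict, Iterable, List


def filter_algs_by_tags(alg_tags: Dict[str, List[str]], required: Iterable[str]) -> List[str]:
    """Return the sorted names of all algs that carry every tag in `required`."""
    required_set = set(required)
    return sorted(
        name for name, tags in alg_tags.items() if required_set.issubset(tags)
    )


# Hypothetical contents of algs/<alg_name>/tags.yaml, already parsed into lists.
alg_tags = {
    "alg_a": ["paper", "default"],
    "alg_b": ["paper"],
    "alg_c": ["debug"],
}

# Keep only the algs tagged with "paper".
paper_algs = filter_algs_by_tags(alg_tags, ["paper"])
```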
Tasks folder
We store datasets (tasks) in folders tasks/<source_name>/<task_name>,
where source_name and task_name are derived from how the tasks are imported
(see also the tab_bench.data.tasks.TaskDescription class).
In each of these folders, we store the following files:
- x_cont.npy, x_cat.npy, y.npy store the three relevant tensors for the DictDataset (see the tab_models documentation).
- task_info.yaml stores the information of a TaskInfo object.
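The following sketch shows one way to check that a task folder is complete before using it. It relies only on the folder layout and file names described above; the helper names task_dir and missing_task_files are hypothetical and not part of tab_bench.

```python
from pathlib import Path
from typing import List

# File names as listed for each task folder.
TASK_FILES = ("x_cont.npy", "x_cat.npy", "y.npy", "task_info.yaml")


def task_dir(base: Path, source_name: str, task_name: str) -> Path:
    """Build the path tasks/<source_name>/<task_name> below the main data folder."""
    return Path(base) / "tasks" / source_name / task_name


def missing_task_files(base: Path, source_name: str, task_name: str) -> List[str]:
    """Return the expected file names that do not exist in the task folder."""
    folder = task_dir(base, source_name, task_name)
    return [name for name in TASK_FILES if not (folder / name).is_file()]
```

An empty return value means that all four files are present.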
Task collections folder
In task_collections/<coll_name>.yaml,
we store the list of tasks that a task collection with name coll_name consists of.
Results folder
We store the results of experiments in the folder
results/<alg_name>/<source_name>/<task_name>/<k>-fold/<split_type>/<split_idx>.
Here,
- alg_name is the name given to the method,
- source_name and task_name identify a task,
- k refers to the number of cross-validation folds (training-validation, not test),
- split_type is either random-split (usually the case) or default-split (not used in our benchmark),
- split_idx is the index (starting from zero) of the trainval-test-split.
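To make the layout concrete, here is a minimal sketch that assembles such a results path from its components. The function name results_dir and the example values are hypothetical; tab_bench has its own code for this, which may differ.

```python
from pathlib import Path

# The two split types named in the description above.
SPLIT_TYPES = ("random-split", "default-split")


def results_dir(base, alg_name: str, source_name: str, task_name: str,
                k: int, split_type: str, split_idx: int) -> Path:
    """Build results/<alg_name>/<source_name>/<task_name>/<k>-fold/<split_type>/<split_idx>."""
    if split_type not in SPLIT_TYPES:
        raise ValueError(f"split_type must be one of {SPLIT_TYPES}, got {split_type!r}")
    if split_idx < 0:
        raise ValueError("split_idx starts from zero and must not be negative")
    return (Path(base) / "results" / alg_name / source_name / task_name
            / f"{k}-fold" / split_type / str(split_idx))


# Hypothetical example: first random split of a task, one training-validation fold.
example = results_dir("data", "my_alg", "my_source", "my_task", 1, "random-split", 0)
```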
The results are stored in files metrics.yaml and other.msgpack.gz.
The former contains only the errors for the different metrics;
the latter contains additional information such as predictions (if configured to be saved),
the best stopping epoch, and possibly optimized hyperparameters.
These files are stored by tab_bench.run.results.ResultManager.
The involved dictionaries are generated by
tab_models.alg_interfaces.alg_interfaces.AlgInterface.eval().
Result summaries folder
Since loading the results directly can be slow,
we store accumulated versions of them in a more efficient format. Specifically,
tab_bench.run.task_execution.TabBenchJobManager.run_jobs() will call
tab_bench.run.task_execution.results.save_summaries(), which will generate files
result_summaries/<alg_name>/<source_name>/<task_name>/<k>-fold/metrics.msgpack.gz
that contain the metrics results for all splits.
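The relation between the two layouts can be sketched as follows: the summary path has no split_type or split_idx component, so every split folder of a given alg, task, and fold count maps to the same summary file. The function name summary_file_for_split_dir is hypothetical and only illustrates the path structure; it does not read or write any results.

```python
from pathlib import Path


def summary_file_for_split_dir(base, split_dir) -> Path:
    """Map a results split folder to the corresponding summary file path."""
    rel = Path(split_dir).relative_to(Path(base) / "results")
    # Expected parts: alg_name, source_name, task_name, <k>-fold, split_type, split_idx
    if len(rel.parts) != 6:
        raise ValueError(f"unexpected results path layout: {split_dir}")
    alg_name, source_name, task_name, fold_dir = rel.parts[:4]
    return (Path(base) / "result_summaries" / alg_name / source_name / task_name
            / fold_dir / "metrics.msgpack.gz")


# Hypothetical example: two different splits share one summary file.
base = Path("data")
task_root = base / "results" / "my_alg" / "my_source" / "my_task" / "1-fold"
first = summary_file_for_split_dir(base, task_root / "random-split" / "0")
second = summary_file_for_split_dir(base, task_root / "random-split" / "7")
```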
Other folders
- Plots and LaTeX tables will be saved in the plots folder.
- Results of estimating resource prediction parameters are saved in the resources folder.
- Results of time measurements are saved in the times folder.
- Downloaded datasets from the UCI repository are saved in the uci_download folder. They can be deleted after the data import in download_data.py is completed.
- The tmp folder can be used for storing temporary files. When running experiments, methods can store intermediate results in a temporary folder in their respective results folder.