# Data format

This document describes how data is stored 
inside the main data folder configured in the `tab_bench.data.paths.Paths` object
(see the documentation on running the benchmark).

As file formats, we mostly use `.yaml` (for small, human-readable files),
`.msgpack.gz` (for efficiently storing dicts, lists, etc.), and `.npy` 
(standard format for storing numpy arrays).
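
As a rough illustration of how these three formats can be read, the following sketch picks a reader based on the file suffix. The helper names `detect_format` and `load_file` are hypothetical and not part of `tab_bench`. The sketch assumes that the third-party packages PyYAML, `msgpack`, and NumPy are installed, and the serialization options actually used by the benchmark code may differ.

```python
import gzip
from pathlib import Path


def detect_format(path):
    """Return 'msgpack.gz', 'yaml', or 'npy' based on the file name."""
    name = Path(path).name
    for suffix in ("msgpack.gz", "yaml", "npy"):
        if name.endswith("." + suffix):
            return suffix
    raise ValueError(f"Unsupported file type: {name}")


def load_file(path):
    """Load a file with the reader matching its suffix (hypothetical helper)."""
    fmt = detect_format(path)
    if fmt == "yaml":
        import yaml  # PyYAML, imported lazily

        with open(path, "r") as f:
            return yaml.safe_load(f)
    if fmt == "msgpack.gz":
        import msgpack  # imported lazily

        # Decompress the gzip stream, then decode the msgpack payload.
        with gzip.open(path, "rb") as f:
            return msgpack.unpackb(f.read(), raw=False)
    import numpy as np  # imported lazily

    return np.load(path)
```

Files with other suffixes, such as `wrapper.pkl`, are deliberately rejected by this sketch.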

## Algs folder

The following files are stored in `algs/<alg_name>`. 
See `tab_bench.run.task_execution.TabBenchJobManager.add_jobs()`
for details on how they are written:

- `tags.yaml` contains a list of tags, 
which can be used to only load results for algs with certain tags.
- `extended_config.yaml` contains a dictionary with the wrapper parameters, 
as well as the alg_name and the wrapper class name.
- `wrapper.pkl` optionally contains a pickled version (using `dill`) of the wrapper. 
Our code does not load these files, since unpickling is unsafe.
- `src`: A folder containing the source files at the time of execution, as a backup.

## Tasks folder

We store datasets (tasks) in folders `tasks/<source_name>/<task_name>`, 
where source_name and task_name are derived from how the tasks are imported
(see also the `tab_bench.data.tasks.TaskDescription` class).
In each of these folders, we store the following files:

- `x_cont.npy`, `x_cat.npy`, `y.npy` store the three relevant tensors 
for the DictDataset
(see the `tab_models` documentation).
- `task_info.yaml` stores the information of a `TaskInfo` object.
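
For inspecting a task outside of the benchmark code, the three arrays can be read directly with NumPy. The following is a minimal sketch: `load_task_arrays` is a hypothetical helper, `base_dir` stands for the main data folder, and nothing is assumed here about the shapes or dtypes of the arrays.

```python
from pathlib import Path

import numpy as np


def load_task_arrays(base_dir, source_name, task_name):
    """Load x_cont, x_cat, and y for one task as a dict of numpy arrays."""
    task_dir = Path(base_dir) / "tasks" / source_name / task_name
    return {
        name: np.load(task_dir / f"{name}.npy")
        for name in ("x_cont", "x_cat", "y")
    }
```

This only reads the raw arrays. It does not construct a DictDataset and does not read `task_info.yaml`.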

## Task collections folder

In `task_collections/<coll_name>.yaml`, 
we store the list of tasks that a task collection with name `coll_name` consists of.

## Results folder

We store the results of experiments in the folder
`results/<alg_name>/<source_name>/<task_name>/<k>-fold/<split_type>/<split_idx>`, where

- alg_name is the name given to the method, 
- source_name and task_name identify a task, 
- k refers to the number of cross-validation folds (training-validation, not test),
- split_type is either `random-split` (usually the case) 
or `default-split` (not used in our benchmark),
- split_idx is the index (starting from zero) of the trainval-test-split.

The results are stored in the files `metrics.yaml` and `other.msgpack.gz`. 
The former contains only the errors in different metrics, 
while the latter contains additional data such as predictions (if configured to be saved), 
the best stopping epoch, and possibly optimized hyperparameters.
These files are written by `tab_bench.run.results.ResultManager`.
The dictionaries they contain are generated by 
`tab_models.alg_interfaces.alg_interfaces.AlgInterface.eval()`.
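
The folder layout described in this section can be assembled from its components as in the following sketch. The helper `result_dir` is hypothetical and not part of `tab_bench`, and the method, source, and task names in the usage example are made up for illustration.

```python
from pathlib import Path


def result_dir(base_dir, alg_name, source_name, task_name, k, split_type, split_idx):
    """Build the folder that holds metrics.yaml and other.msgpack.gz for one split."""
    if split_type not in ("random-split", "default-split"):
        raise ValueError(f"Unknown split type: {split_type}")
    return (
        Path(base_dir) / "results" / alg_name / source_name / task_name
        / f"{k}-fold" / split_type / str(split_idx)
    )


# Usage with made-up names: the first random split of one task.
folder = result_dir("data", "MyAlg", "my_source", "my_task", 1, "random-split", 0)
metrics_file = folder / "metrics.yaml"
```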

## Result summaries folder

Since loading the results directly can be slow, 
we store accumulated versions of them in a more efficient format. Specifically,
`tab_bench.run.task_execution.TabBenchJobManager.run_jobs()` will call
`tab_bench.run.task_execution.results.save_summaries()`, which will generate files
`result_summaries/<alg_name>/<source_name>/<task_name>/<k>-fold/metrics.msgpack.gz`
that contain the metrics results for all splits.
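
To see which per-split results would feed into one summary file, the results folder can be scanned as in the sketch below. This is not the implementation of `save_summaries()`. The function `find_splits` is a hypothetical helper that only lists, for each `<source_name>/<task_name>/<k>-fold` combination of one method, the splits for which a `metrics.yaml` file exists.

```python
from collections import defaultdict
from pathlib import Path


def find_splits(base_dir, alg_name):
    """Map (source_name, task_name, fold_dir) to a sorted list of (split_type, split_idx)."""
    alg_dir = Path(base_dir) / "results" / alg_name
    groups = defaultdict(list)
    # One metrics.yaml is expected per split folder, at a fixed depth below alg_dir.
    for metrics_file in alg_dir.glob("*/*/*-fold/*/*/metrics.yaml"):
        parts = metrics_file.parent.relative_to(alg_dir).parts
        source_name, task_name, fold_dir, split_type, split_idx = parts
        groups[(source_name, task_name, fold_dir)].append((split_type, int(split_idx)))
    return {key: sorted(splits) for key, splits in groups.items()}
```

Each key of the returned dictionary corresponds to one `metrics.msgpack.gz` file in the `result_summaries` folder.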

## Other folders

- Plots and LaTeX tables will be saved in the `plots` folder.
- Results of estimating resource prediction parameters 
are saved in the `resources` folder.
- Results of time measurements are saved in the `times` folder.
- Datasets downloaded from the UCI repository are saved in the `uci_download` folder.
They can be deleted once the data import in `download_data.py` has completed.
- The `tmp` folder can be used for storing temporary files.
When running experiments, methods can store intermediate results 
in a temporary folder in their respective results folder.