# Overview of the `models` part

## Scikit-learn interfaces

We provide scikit-learn interfaces for various methods in `sklearn/sklearn_interfaces.py`. These use the default parameter dictionaries defined in `sklearn/default_params.py`.

## AlgInterface: more fine-grained control

We implement all our methods by subclassing `AlgInterface` in `alg_interfaces/alg_interfaces.py`. `AlgInterface` provides more functionality than scikit-learn interfaces, which is crucial for our benchmarking in `pytabkit.bench`. All our scikit-learn interfaces are wrappers around `AlgInterface` classes, using the `sklearn.sklearn_base.AlgInterfaceEstimator` base class. Compared to scikit-learn interfaces, `AlgInterface` provides the following additional features:

- Vectorized evaluation on multiple train-validation-test splits (used by RealMLP-TD and RealMLP-TD-S)
- Specification of train-validation-test splits, random seeds, temporary folder, and custom loggers
- Inclusion of required resource estimates (CPU RAM, GPU RAM, GPU usage, n_threads, time)
- Evaluation on a list of metrics
- Refitting with the best found parameters

## Hyperparameter handling

Hyperparameters are explicitly defined in scikit-learn constructors. Elsewhere, we generally pass all configuration parameters as `**kwargs`; each function picks out the parameters it needs and passes the rest on to nested function calls. This makes for very convenient coding, but one has to watch out for typos in parameter names, which will often not be caught: a misspelled name is silently absorbed by `**kwargs` and the default value is used instead. For example, one could have the following structure:

```python
def fit(**kwargs):
    # Every function receives the full configuration.
    model = build_model(**kwargs)
    train_model(model, **kwargs)


def build_model(n_layers=4, **kwargs):
    # Uses n_layers; all other parameters are ignored here.
    ...


def train_model(model, lr=4e-2, batch_size=256, **kwargs):
    # Uses lr and batch_size; all other parameters are ignored here.
    ...
```

We usually write `**config` instead of `**kwargs`. We also generally try to give unique names to parameters.
For example, the epsilon parameter of the optimizer is called `opt_eps` and the epsilon parameter of label smoothing is called `ls_eps`.

## Internal data representation

We represent datasets internally using the `DictDataset` class. It contains a dictionary of PyTorch tensors. In our case, there are usually three tensors: `'x_cont'` for continuous features, `'x_cat'` for categorical features (`dtype=torch.long`), and `'y'` for labels. A `DictDataset` also contains a dictionary `tensor_infos`, which holds a `TensorInfo` object for each of these keys. The latter describes the number of features and, if applicable, the number of categories for each feature (for categorical variables or classification labels). We reserve the category `0` for missing values and for values that were not seen at train time. Missing numerical values are currently not handled by the NN code, so they need to be encoded beforehand.

## Data preprocessing (also available for other models)

Most models allow customizing the data preprocessing through the `tfms` parameter. This is done using the NN preprocessing code in `nn_models.models.PreprocessingFactory` (see the corresponding documentation page for an explanation of the Factory classes).

## NN implementation

For the implementation of RealMLP, we extend and alter the typical PyTorch structure; see the documentation page on NN classes.
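
To make the reserved category `0` from the internal data representation more concrete, here is a minimal pure-Python sketch of such an encoding. It is not the library's code: the helper names `fit_category_codes` and `encode_categories` are hypothetical, and plain lists stand in for the `torch.long` tensor that would be stored under `'x_cat'`.

```python
from typing import Dict, Hashable, List, Optional, Sequence


def fit_category_codes(train_values: Sequence[Optional[Hashable]]) -> Dict[Hashable, int]:
    """Assign codes 1, 2, ... to categories seen at train time (hypothetical helper).

    Code 0 is never assigned here, so it stays reserved for missing/unknown values.
    """
    codes: Dict[Hashable, int] = {}
    for value in train_values:
        if value is not None and value not in codes:
            codes[value] = len(codes) + 1
    return codes


def encode_categories(values: Sequence[Optional[Hashable]],
                      codes: Dict[Hashable, int]) -> List[int]:
    """Encode values; missing (None) and unseen categories both map to 0."""
    return [0 if value is None else codes.get(value, 0) for value in values]


train_column = ["red", "blue", None, "red"]
codes = fit_category_codes(train_column)

train_codes = encode_categories(train_column, codes)             # [1, 2, 0, 1]
test_codes = encode_categories(["blue", "green", None], codes)   # [2, 0, 0]

# Number of categories including the reserved category 0.
n_categories = len(codes) + 1
```

Under this assumed scheme, the category count for a feature includes the reserved slot, so a column with two observed categories has three category indices in total.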
## Vectorization

Due to the vectorization of NN models, we use different terms for similar things:

- `n_cv` refers to the number of training-validation splits in cross-validation (bagging)
- `n_refit` refers to the number of models that are refitted on training+validation data after the CV stage
- `n_tv_splits` (or `n_models`) refers to the number of training-validation splits used in the current training (could be `n_cv` or `n_refit`)
- `n_tt_splits` (or `n_parallel`) refers to the number of trainval-test splits used (this is normally 1 when used through the scikit-learn interface, but can be larger when using RealMLP through the benchmark)
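
The relationship between these terms can be sketched as follows. This is an illustration only: the function `splits_per_stage` and the key `n_fitted` are hypothetical, and the sketch assumes that every trainval-test split uses the same number of training-validation splits, so that the number of models fitted together in one stage is the product of the two counts.

```python
from typing import Dict


def splits_per_stage(n_cv: int, n_refit: int, n_tt_splits: int = 1) -> Dict[str, Dict[str, int]]:
    """Show which value n_tv_splits takes in each training stage (hypothetical helper)."""
    stages: Dict[str, Dict[str, int]] = {}
    # In the CV stage n_tv_splits equals n_cv; in the refit stage it equals n_refit.
    for stage, n_tv_splits in (("cv", n_cv), ("refit", n_refit)):
        stages[stage] = {
            "n_tt_splits": n_tt_splits,  # also called n_parallel
            "n_tv_splits": n_tv_splits,  # also called n_models
            # Assumption: one model per (trainval-test split, training-validation split) pair.
            "n_fitted": n_tt_splits * n_tv_splits,
        }
    return stages


# Scikit-learn style usage: a single trainval-test split.
sklearn_like = splits_per_stage(n_cv=5, n_refit=1)

# Benchmark style usage: several trainval-test splits handled together.
bench_like = splits_per_stage(n_cv=5, n_refit=1, n_tt_splits=3)
```

With these inputs, the CV stage of `sklearn_like` has `n_tv_splits == 5` and `n_tt_splits == 1`, while `bench_like` uses the same `n_tv_splits` with `n_tt_splits == 3`.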