Overview of the `models` part

Scikit-learn interfaces

We provide scikit-learn interfaces for various methods in sklearn/sklearn_interfaces.py. These use the default parameter dictionaries defined in sklearn/default_params.py.

AlgInterface: more fine-grained control

We implement all our methods through subclassing AlgInterface in alg_interfaces/alg_interfaces.py. AlgInterface provides more functionality than scikit-learn interfaces, which is crucial for our benchmarking in pytabkit.bench. All our scikit-learn interfaces are wrappers around AlgInterface classes, using the sklearn.sklearn_base.AlgInterfaceEstimator base class. Compared to scikit-learn interfaces, AlgInterface provides the following additional features:

Vectorized evaluation on multiple train-validation-test splits (used by RealMLP-TD and RealMLP-TD-S).
Specification of train-validation-test splits, random seeds, temporary folder, and custom loggers.
Inclusion of required resource estimates (CPU RAM, GPU RAM, GPU usage, n_threads, time).
Evaluation on a list of metrics.
Refitting with the best parameters found.

Hyperparameter handling

Hyperparameters are explicitly defined in scikit-learn constructors.

Elsewhere, we generally pass all configuration parameters as **kwargs. Each function then picks out the parameters it needs and passes the rest on to nested function calls. This makes coding very convenient, but one has to watch out for typos in parameter names, which will often go unnoticed because a misspelled parameter is simply passed along and ignored. For example, one could have the following structure:

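def fit(**kwargs):
    # forward the full configuration to every nested call
    model = build_model(**kwargs)
    train_model(model, **kwargs)

def build_model(n_layers=4, **kwargs):
    # picks out n_layers, ignores all other parameters
    ...

def train_model(model, lr=4e-2, batch_size=256, **kwargs):
    # picks out lr and batch_size, ignores all other parameters
    ...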

We usually write **config instead of **kwargs. We also generally try to give unique names to parameters. For example, the epsilon parameter of the optimizer is called opt_eps and the epsilon parameter of label smoothing is called ls_eps.
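The typo pitfall described above can be made concrete with a minimal sketch. The function bodies and the check_config helper are hypothetical and only serve as an illustration; they are not part of the library.

```python
# Minimal sketch of the **config pattern; the bodies are hypothetical stand-ins.
def build_model(n_layers=4, **config):
    return {"n_layers": n_layers}

def train_model(model, lr=4e-2, batch_size=256, **config):
    return {**model, "lr": lr, "batch_size": batch_size}

def fit(**config):
    model = build_model(**config)
    return train_model(model, **config)

# Correct spelling: the value reaches build_model.
ok = fit(n_layers=8, lr=1e-3)

# Typo ("n_layer"): no error is raised, the default n_layers=4 is used.
typo = fit(n_layer=8, lr=1e-3)

# Hypothetical guard (an assumption, not a library function):
# reject keys that no function is known to consume.
KNOWN_KEYS = {"n_layers", "lr", "batch_size"}

def check_config(**config):
    unknown = set(config) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
```

Calling check_config(**config) at the top of fit would turn the silent typo into an error, at the cost of maintaining the list of known keys by hand.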

Internal data representation

We represent datasets internally using the DictDataset class. It contains a dictionary of PyTorch tensors. In our case, there are usually three tensors: 'x_cont' for continuous features, 'x_cat' for categorical features (dtype=torch.long), and 'y' for labels. A DictDataset also contains a dictionary tensor_infos, which for each of these keys contains a TensorInfo object. The latter describes the number of features and, if applicable, the number of categories for each feature (for categorical variables or classification labels).

We reserve category 0 for missing values and for categories that were not seen at training time. Missing numerical values are currently not handled by the NN code, so they need to be encoded beforehand.
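As an illustration of the category-0 convention, here is a minimal sketch of an encoding that follows it. The helper names fit_category_encoding and encode are hypothetical, and the sketch uses plain Python lists instead of the library's tensors.

```python
# Sketch of the "category 0 is reserved" convention; helper names are hypothetical.
def fit_category_encoding(train_values):
    """Map each category seen at training time to an integer code >= 1."""
    categories = sorted({v for v in train_values if v is not None})
    return {cat: i + 1 for i, cat in enumerate(categories)}

def encode(values, mapping):
    """Encode values; missing (None) and unseen categories both become 0."""
    return [mapping.get(v, 0) for v in values]

mapping = fit_category_encoding(["red", "blue", None, "red"])
codes = encode(["blue", "green", None, "red"], mapping)
# "green" was not seen at training time and None is missing, so both map to 0.
```

With this convention, a feature with K known categories takes K + 1 distinct integer values.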

Data preprocessing (also available for other models)

Most models allow customizing the data preprocessing through the tfms parameter. This is done using the NN preprocessing code in nn_models.models.PreprocessingFactory (see the corresponding documentation page for an explanation of the Factory classes).

NN implementation

For the implementation of RealMLP, we extend and alter the typical PyTorch structure; see the documentation page on NN classes.

Vectorization

Due to the vectorization of NN models, we use different terms for similar things:

n_cv refers to the number of training-validation splits in cross-validation (bagging)
n_refit refers to the number of models that are refitted on training+validation data after the CV stage
n_tv_splits (or n_models) refers to the number of training-validation splits used in the current training (could be n_cv or n_refit)
n_tt_splits (or n_parallel) refers to the number of trainval-test splits used (this is normally 1 when used through the scikit-learn interface, but can be larger when using RealMLP through the benchmark)
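To illustrate the terminology above, the following sketch builds nested index splits in plain Python. It is not the library's splitting code: the make_splits function, the k-fold scheme, and the test fraction are assumptions chosen only to show how n_tt_splits and n_tv_splits relate.

```python
import random

def make_splits(n_samples, n_tt_splits=1, n_cv=5, test_fraction=0.2, seed=0):
    """Return one entry per trainval-test split (hypothetical helper).

    Each entry holds the test indices and n_cv (train, val) index pairs.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_tt_splits):
        idxs = list(range(n_samples))
        rng.shuffle(idxs)
        n_test = int(round(test_fraction * n_samples))
        test, trainval = idxs[:n_test], idxs[n_test:]
        # Partition the trainval part into n_cv folds (assumed k-fold scheme).
        folds = [trainval[i::n_cv] for i in range(n_cv)]
        tv_splits = []
        for i in range(n_cv):
            val = folds[i]
            train = [j for k, fold in enumerate(folds) if k != i for j in fold]
            tv_splits.append((train, val))
        splits.append({"test": test, "tv_splits": tv_splits})
    return splits

splits = make_splits(n_samples=100, n_tt_splits=2, n_cv=5)
n_tt_splits = len(splits)                  # number of trainval-test splits
n_tv_splits = len(splits[0]["tv_splits"])  # equals n_cv during the CV stage
```

In the refitting stage there is no validation part, so n_tv_splits would equal n_refit instead of n_cv.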
