Hyperparameter optimization

This guide explains how to perform hyperparameter optimization (HPO) to get the best results out of RealMLP. We consider RealMLP for classification here, but most of the guide also applies to regression and to other baselines.

Option 1: Using the HPO interface

The easiest option is to use the direct HPO interface:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_HPO_Classifier

X, y = make_classification(random_state=42, n_samples=200, n_features=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

clf = RealMLP_HPO_Classifier(n_hyperopt_steps=10, n_cv=1, verbosity=2, val_metric_name='brier')
clf.fit(X_train, y_train)
clf.predict(X_test)

The code above

runs random search with 10 configurations from the HPO space in the paper (should be increased to, say, 50 for better results)
only uses one training-validation split (should be increased to, say, 5 for better results)
prints validation results of each epoch and best found parameters thanks to verbosity=2
selects the best model and best epoch based on the Brier score (default would be classification error)
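
To make the last point more concrete, the following is a minimal NumPy sketch of a multiclass Brier score. It assumes the common definition (squared distance between the predicted probability vector and the one-hot label, averaged over samples); the metric inside pytabkit may use a different normalization. The helper name `brier_score` is hypothetical and not part of pytabkit.

```python
import numpy as np


def brier_score(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean over samples of the squared distance between the predicted
    probability vector and the one-hot encoded label (assumed definition)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    one_hot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1)))


# Two samples, two classes: one confident correct prediction, one undecided one
probs = np.array([[0.9, 0.1], [0.5, 0.5]])
labels = np.array([0, 1])
score = brier_score(probs, labels)
print(score)
```

Unlike classification error, this quantity depends on the predicted probabilities themselves and not only on the most likely class, so model selection with it can favor different configurations.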

While using the interface directly is convenient, it has certain drawbacks:

It is not possible to change the search space, e.g. to reduce label smoothing for metrics other than classification error.
It is not possible to save and resume from an intermediate state.
It is not possible to use an HPO method other than random search.
It is not (easily) possible to access intermediate results.

Therefore, we now look at a more manual approach.
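
Before the full example, the following sketch illustrates how a manual loop could save and resume its state, which the direct interface does not support. It uses only the standard library and a toy objective. Everything in it is an assumption made for illustration: `evaluate_config`, `run_hpo`, the state file name, and the one-parameter search space are hypothetical and not part of pytabkit.

```python
import json
import os
import random
import tempfile


def evaluate_config(params: dict) -> float:
    """Hypothetical stand-in for fitting a model and returning its validation loss."""
    return (params["lr"] - 0.1) ** 2


def run_hpo(state_path: str, n_steps: int) -> dict:
    """Random search that saves its state after every step and resumes from it if present."""
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "best_loss": None, "best_params": None, "history": []}

    for step in range(state["next_step"], n_steps):
        # seed by step index so that a resumed run samples the same configurations
        rng = random.Random(step)
        params = {"lr": rng.uniform(0.02, 0.3)}
        loss = evaluate_config(params)
        state["history"].append({"step": step, "params": params, "loss": loss})
        if state["best_loss"] is None or loss < state["best_loss"]:
            state["best_loss"] = loss
            state["best_params"] = params
        state["next_step"] = step + 1
        # write to a temporary file first so an interruption cannot corrupt the saved state
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, state_path)
    return state


state_path = os.path.join(tempfile.mkdtemp(), "hpo_state.json")
partial = run_hpo(state_path, n_steps=3)  # first session: steps 0 to 2
resumed = run_hpo(state_path, n_steps=6)  # second session continues at step 3
print(len(resumed["history"]), resumed["best_params"])
```

Only the parameters and losses are stored here, not the fitted model. With this design, the best model would have to be refit from `best_params` or serialized separately.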

Option 2: Performing your own HPO

The following code provides an example of how to do HPO manually.

import numpy as np
import torch
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold

from pytabkit.models.alg_interfaces.nn_interfaces import RealMLPParamSampler
from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_TD_Classifier
from pytabkit.models.training.metrics import Metrics

n_hyperopt_steps = 10
n_cv = 1
is_classification = True

X, y = make_classification(random_state=42, n_samples=200, n_features=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# We compute train-validation splits here instead of letting the sklearn interface do it
# such that we can compute the validation error ourselves
if n_cv == 1:
    # we cannot do 1-fold CV, so we do an 80%-20% train-validation split
    _, val_idxs = train_test_split(np.arange(X_train.shape[0]), test_size=0.2, random_state=0)
    val_idxs = val_idxs[None, :]
else:
    skf = StratifiedKFold(n_splits=n_cv, shuffle=True, random_state=0)
    val_idxs_list = [val_idxs for train_idxs, val_idxs in skf.split(X_train, y_train)]

    # make sure that each validation set has the same length, so we can exploit vectorization:
    # truncate every fold to the length of the shortest one
    min_len = min([len(val_idxs) for val_idxs in val_idxs_list])
    val_idxs_list = [val_idxs[:min_len] for val_idxs in val_idxs_list]
    val_idxs = np.asarray(val_idxs_list)

best_val_loss = np.inf  # np.Inf was removed in NumPy 2.0
best_clf = None
best_params = None

for hpo_step in range(n_hyperopt_steps):
    # sample random params according to the proposed search space, but this can be replaced by a custom HPO method
    params = RealMLPParamSampler(is_classification=is_classification).sample_params(seed=hpo_step)

    # we only use one classifier that will fit n_cv sub-models, since RealMLP can vectorize the fitting,
    # but it would also be possible to use one classifier per cross-validation split.
    clf = RealMLP_TD_Classifier(**params, n_cv=n_cv, verbosity=2, val_metric_name='brier')
    clf.fit(X_train, y_train, val_idxs=val_idxs)

    # evaluate validation loss
    # for n_cv >= 2, predict_proba() only outputs averaged predictions of the cross-validation models,
    # but we need separate predictions of each of the cross-validation members to extract the out-of-bag ones,
    # so we use predict_proba_ensemble().
    # There is also predict_ensemble() which replaces predict().
    y_pred_prob = clf.predict_proba_ensemble(X_train)
    val_predictions = np.concatenate([y_pred_prob[i, val_idxs[i, :]] for i in range(n_cv)], axis=0)
    val_labels = np.concatenate([y_train[val_idxs[i, :]] for i in range(n_cv)], axis=0)
    val_logits = np.log(val_predictions + 1e-30)

    val_loss = Metrics.apply(torch.as_tensor(val_logits, dtype=torch.float32), torch.as_tensor(val_labels),
                             metric_name='brier').item()

    # update best model if loss improved
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_clf = clf
        best_params = params

best_clf.predict(X_test)
print(f'best params: {best_params}')
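
The index handling in the example can be checked in isolation with synthetic arrays. The sketch below assumes that the ensemble prediction has shape (n_cv, n_samples, n_classes), as the indexing `y_pred_prob[i, val_idxs[i, :]]` suggests. A random array stands in for the output of `predict_proba_ensemble()`, and `np.array_split` stands in for the cross-validation splitter.

```python
import numpy as np

n_cv, n_samples, n_classes = 3, 10, 2
rng = np.random.default_rng(0)

# Uneven folds, as K-fold splitting produces when n_samples is not divisible by n_cv
folds = np.array_split(rng.permutation(n_samples), n_cv)  # lengths 4, 3, 3

# Truncate to the shortest fold so the indices fit into one rectangular array
min_len = min(len(fold) for fold in folds)
val_idxs = np.asarray([fold[:min_len] for fold in folds])  # shape (n_cv, min_len)

# Stand-in for predict_proba_ensemble(X_train): one probability matrix per CV member
y_pred_prob = rng.dirichlet(np.ones(n_classes), size=(n_cv, n_samples))
y_train = rng.integers(0, n_classes, size=n_samples)

# For member i, keep only its predictions on its own validation samples
val_predictions = np.concatenate([y_pred_prob[i, val_idxs[i]] for i in range(n_cv)], axis=0)
val_labels = np.concatenate([y_train[val_idxs[i]] for i in range(n_cv)], axis=0)

print(val_idxs.shape, val_predictions.shape, val_labels.shape)
```

Truncating the folds drops a few validation samples from the longer folds. That is the price for stacking all index sets into a single array.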

Here is the equivalent search space for hyperopt:

from hyperopt import hp
import numpy as np

space = {
    'num_emb_type': hp.choice('num_emb_type', ['none', 'pbld', 'pl', 'plr']),
    'add_front_scale': hp.pchoice('add_front_scale', [(0.6, True), (0.4, False)]),
    'lr': hp.loguniform('lr', np.log(2e-2), np.log(3e-1)),
    'p_drop': hp.pchoice('p_drop', [(0.3, 0.0), (0.5, 0.15), (0.2, 0.3)]),
    'wd': hp.choice('wd', [0.0, 2e-2]),
    'plr_sigma': hp.loguniform('plr_sigma', np.log(0.05), np.log(0.5)),
    'hidden_sizes': hp.pchoice('hidden_sizes', [(0.6, [256] * 3), (0.2, [64] * 5), (0.2, [512])]),
    'act': hp.choice('act', ['selu', 'mish', 'relu']),
    'ls_eps': hp.pchoice('ls_eps', [(0.3, 0.0), (0.7, 0.1)])
}
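
For readers who want to inspect these distributions without installing hyperopt, the following standard-library sketch draws one configuration from the same choices, probabilities, and ranges as the space above. The function `sample_config` is a hypothetical helper and not pytabkit's `RealMLPParamSampler`, so it will not reproduce that sampler's draws for a given seed.

```python
import math
import random


def sample_config(seed: int) -> dict:
    """Draw one configuration mirroring the hyperopt space above (illustrative only)."""
    rng = random.Random(seed)

    def weighted(options):
        # options is a list of (probability, value) pairs, as in hp.pchoice
        probs, values = zip(*options)
        return rng.choices(values, weights=probs, k=1)[0]

    def log_uniform(low, high):
        # uniform in log-space, as in hp.loguniform
        return math.exp(rng.uniform(math.log(low), math.log(high)))

    return {
        'num_emb_type': rng.choice(['none', 'pbld', 'pl', 'plr']),
        'add_front_scale': weighted([(0.6, True), (0.4, False)]),
        'lr': log_uniform(2e-2, 3e-1),
        'p_drop': weighted([(0.3, 0.0), (0.5, 0.15), (0.2, 0.3)]),
        'wd': rng.choice([0.0, 2e-2]),
        'plr_sigma': log_uniform(0.05, 0.5),
        'hidden_sizes': weighted([(0.6, [256] * 3), (0.2, [64] * 5), (0.2, [512])]),
        'act': rng.choice(['selu', 'mish', 'relu']),
        'ls_eps': weighted([(0.3, 0.0), (0.7, 0.1)]),
    }


config = sample_config(seed=0)
print(config)
```

Seeding the generator per call keeps each draw reproducible, which makes it easier to rerun a single configuration later.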