Scikit-learn interfaces

We provide scikit-learn interfaces for numerous methods in pytabkit.models.sklearn.sklearn_interfaces; an overview follows. All of our interfaces allow specifying the validation set(s) and categorical features in the fit method:

pytabkit.models.sklearn.sklearn_base.AlgInterfaceEstimator.fit(self, X, y, X_val=None, y_val=None, val_idxs=None, cat_indicator=None, cat_col_names=None, time_to_fit_in_seconds=None)

Fit the estimator.

Parameters:

X – Inputs (covariates). pandas DataFrame, numpy array, or similar array-like.
y – Labels (targets, variates). pandas DataFrame/Series, numpy array, or similar array-like.
X_val (Optional) – Inputs for the validation set. Can only be used if n_cv is left at 1 and val_idxs is not used. If X_val is given, X is used for training only, instead of a validation split being taken from X.
y_val (Optional) – Labels for the validation set.
val_idxs (ndarray | None) – Indices of validation set elements within X and y (optional). Can be an array of shape (n_val_samples,) or (n_val_splits,n_val_samples_per_split). In the latter case, the results of the models on the validation splits will be ensembled.
cat_indicator (List[bool] | ndarray | None) – Which features/columns are categorical, specified as a list or array of booleans. If this is not specified, all columns with category/string/object dtypes are interpreted as categorical and all others as numerical.
cat_col_names (List[str] | None) – List of column names that should be treated as categorical (if X is a pd.DataFrame). Can be specified instead of cat_indicator.
time_to_fit_in_seconds (int | None) – Time limit in seconds for fitting. Currently only implemented for RealMLP (default=None). If None, no time limit will be applied.

Returns:

Returns self.

Return type:

BaseEstimator
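
As a rough illustration of the fit arguments above, the following sketch builds a one-dimensional val_idxs list and a boolean cat_indicator from column names. The helper names (make_val_idxs, cat_indicator_from_names, fit_with_custom_split) are hypothetical and not part of pytabkit; only the fit keyword arguments are taken from the signature above, and X is assumed to be a pandas DataFrame.

```python
import random


def make_val_idxs(n_samples, val_fraction=0.2, seed=0):
    """Pick a random subset of row indices to serve as the validation set."""
    rng = random.Random(seed)
    n_val = int(round(val_fraction * n_samples))
    return sorted(rng.sample(range(n_samples), n_val))


def cat_indicator_from_names(columns, cat_col_names):
    """Turn a list of categorical column names into one boolean per column."""
    cat_set = set(cat_col_names)
    return [col in cat_set for col in columns]


def fit_with_custom_split(X, y, cat_col_names):
    """Fit RealMLP-TD on a DataFrame X using a custom validation split."""
    # Imported lazily so the two helpers above work without these packages.
    import numpy as np
    from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_TD_Classifier

    val_idxs = np.asarray(make_val_idxs(len(X)))  # shape (n_val_samples,)
    cat_indicator = cat_indicator_from_names(list(X.columns), cat_col_names)
    model = RealMLP_TD_Classifier()
    return model.fit(X, y, val_idxs=val_idxs, cat_indicator=cat_indicator)
```

For several validation splits, stacking equally sized index lists row-wise gives the (n_val_splits, n_val_samples_per_split) shape described above. Passing cat_col_names directly to fit is an alternative to building cat_indicator by hand.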

Important: For HPO and ensemble interfaces, it is recommended to set tmp_folder so that these methods can store fitted models on disk instead of holding them in RAM. This means that tmp_folder must not be deleted while the associated interface still exists (even when it is pickled).
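
A minimal sketch of this recommendation, assuming a persistent directory is acceptable on your system: the folder is created once and then handed to the HPO interface. The helper names and the value n_hyperopt_steps=10 are illustrative choices, not defaults of the library. A self-deleting temporary directory is deliberately avoided here, since the folder has to outlive the fitted estimator.

```python
from pathlib import Path


def make_model_store(root, name):
    """Create (if needed) and return a persistent folder for fitted models."""
    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)
    return folder


def build_hpo_classifier(tmp_folder):
    """Construct a RealMLP HPO classifier that stores its models in tmp_folder."""
    # Imported lazily so make_model_store can be used without pytabkit installed.
    from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_HPO_Classifier

    # n_hyperopt_steps=10 is an arbitrary small value for illustration.
    return RealMLP_HPO_Classifier(tmp_folder=tmp_folder, n_hyperopt_steps=10)
```

Keep the folder for as long as the estimator, or a pickled copy of it, is still in use.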

RealMLP

For RealMLP, we provide TD (tuned default), HPO (hyperparameter optimization with random search), and Ensemble (weighted ensembling of random search configurations) variants:

RealMLP_TD_Classifier
RealMLP_TD_Regressor
RealMLP_HPO_Classifier
RealMLP_HPO_Regressor
RealMLP_Ensemble_Classifier
RealMLP_Ensemble_Regressor

While the TD variants have good defaults, they allow overriding any hyperparameter. The classifier and regressor share the same hyperparameters, so we only show the constructor of the classifier here. The first parameters, up to and including verbosity, are provided for every scikit-learn interface, although random_state, n_threads, tmp_folder, and verbosity may be ignored by some of the methods.

pytabkit.models.sklearn.sklearn_interfaces.RealMLP_TD_Classifier.__init__(self, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, train_metric_name=None, val_metric_name=None, n_epochs=None, batch_size=None, predict_batch_size=None, hidden_sizes=None, n_hidden_layers=None, hidden_width=None, tfms=None, num_emb_type=None, use_plr_embeddings=None, plr_sigma=None, plr_hidden_1=None, plr_hidden_2=None, plr_act_name=None, plr_use_densenet=None, plr_use_cos_bias=None, plr_lr_factor=None, max_one_hot_cat_size=None, embedding_size=None, act=None, use_parametric_act=None, act_lr_factor=None, weight_param=None, weight_init_mode=None, weight_init_gain=None, weight_lr_factor=None, bias_init_mode=None, bias_lr_factor=None, bias_wd_factor=None, add_front_scale=None, scale_lr_factor=None, first_layer_lr_factor=None, block_str=None, first_layer_config=None, last_layer_config=None, middle_layer_config=None, p_drop=None, p_drop_sched=None, wd=None, wd_sched=None, opt=None, lr=None, lr_sched=None, mom=None, mom_sched=None, sq_mom=None, sq_mom_sched=None, opt_eps=None, opt_eps_sched=None, normalize_output=None, clamp_output=None, use_ls=None, ls_eps=None, ls_eps_sched=None, use_early_stopping=None, early_stopping_additive_patience=None, early_stopping_multiplicative_patience=None, calibration_method=None, sort_quantile_predictions=None, stop_epoch=None, use_best_mean_epoch_for_cv=None, n_ens=None, ens_av_before_softmax=None)

Constructor for RealMLP, using the default parameters from RealMLP-TD. For lists of default parameters, we refer to pytabkit.models.sklearn.default_params.DefaultParams. RealMLP-TD does automatic preprocessing, so no manual preprocessing is necessary except for imputing missing numerical values.
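
Since missing numerical values still have to be imputed by hand, here is a small sketch of one common option, median imputation of a single numerical column. The choice of the median and the helper name impute_median are assumptions for illustration; the source does not prescribe a particular imputation method.

```python
import math
from statistics import median


def _is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or math.isnan(value)


def impute_median(column):
    """Replace missing entries of one numerical column by the observed median."""
    observed = [v for v in column if not _is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values to compute a median from")
    fill = median(observed)
    return [fill if _is_missing(v) else v for v in column]
```

In practice the fill value would usually be computed on the training rows only and then reused for validation and test rows.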

Tips for modifications:

For faster training: For large datasets (say >50K samples), especially on GPUs, increase batch_size. It can also help to decrease n_epochs, set use_plr_embeddings=False (in case of many numerical features), increase max_one_hot_cat_size (in case of large-cardinality categories), or set use_parametric_act=False.
For more accuracy: You can try increasing n_epochs or hidden_sizes while also decreasing lr.
For classification, if you care about metrics like cross-entropy or AUC instead of accuracy, we recommend setting val_metric_name='cross_entropy' and use_ls=False.
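
The tips above can be collected as constructor overrides, for example as in the following sketch. The function names and the concrete numbers (batch_size=1024, n_epochs=128) are illustrative assumptions, not recommended values from the source. Numerical embeddings are switched off via num_emb_type='none' here, because the num_emb_type parameter description states that it overrides use_plr_embeddings unless it is set to 'ignore'.

```python
def realmlp_td_overrides(goal):
    """Return RealMLP-TD constructor keyword arguments for a given goal."""
    if goal == "faster":
        return {
            "batch_size": 1024,  # illustrative: larger batches for large datasets
            "n_epochs": 128,  # illustrative: fewer epochs than the default
            "num_emb_type": "none",  # no numerical embeddings
            "use_parametric_act": False,
        }
    if goal == "probabilistic":
        # Select the best epoch by cross-entropy and disable label smoothing.
        return {"val_metric_name": "cross_entropy", "use_ls": False}
    raise ValueError(f"unknown goal: {goal!r}")


def build_classifier(goal):
    """Construct a RealMLP-TD classifier with the overrides for this goal."""
    # Imported lazily so realmlp_td_overrides works without pytabkit installed.
    from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_TD_Classifier

    return RealMLP_TD_Classifier(**realmlp_td_overrides(goal))
```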

Parameters:

device (str | None) – PyTorch device name like ‘cpu’, ‘cuda’, ‘cuda:0’, ‘mps’ (default=None). If None, ‘cuda’ will be used if available, otherwise ‘cpu’.
random_state (int | RandomState | None) – Random state to use for random number generation (splitting, initialization, batch shuffling). If None, the behavior is not deterministic.
n_cv (int) – Number of cross-validation splits to use (default=1). If validation set indices or an explicit validation set are given in fit(), n_cv models will be fitted using different random seeds. Otherwise, n_cv-fold cross-validation will be used (stratified for classification). For n_cv=1, a single train-validation split will be used, where val_fraction controls the fraction of validation samples. If n_refit=0 is set, the prediction will use the average of the models fitted during cross-validation. (Averaging is over probabilities for classification, and over outputs for regression.) Otherwise, refitted models will be used.
n_refit (int) – Number of models that should be refitted on the training+validation dataset (default=0). If zero, only the models from the cross-validation stage are used. If positive, n_refit models will be fitted on the training+validation dataset (all data given in fit()) and their predictions will be averaged during predict().
n_repeats (int) – Number of times that the (cross-)validation split should be repeated (default=1). Values != 1 are only allowed when no custom validation split is provided. More repeats make fitting slower but reduce the potential for validation set overfitting, especially on smaller datasets.
val_fraction (float) – Fraction of samples used for validation (default=0.2). Has to be in [0, 1). Only used if n_cv==1 and no validation split is provided in fit().
n_threads (int | None) – Number of threads that the method is allowed to use (default=number of physical cores).
tmp_folder (str | Path | None) – Temporary folder in which data can be stored during fit(). (Currently unused for RealMLP-TD and variants.) If None, methods generally try to not store intermediate data. Note that HPO and ensemble methods can use this to reduce RAM usage by storing fitted models, and will need this folder to be available whenever they are used.
verbosity (int) – Verbosity level (default=0, higher means more verbose). Set to 2 to see logs from intermediate epochs.
train_metric_name (str | None) – Name of the training metric (default=’cross_entropy’ for classification and ‘mse’ for regression). Currently most other metrics are not available for training.
val_metric_name (str | None) – Name of the validation metric (used for selecting the best epoch). Defaults are ‘class_error’ for classification and ‘rmse’ for regression. Main available classification metrics (all to be minimized): ‘class_error’, ‘cross_entropy’, ‘1-auc_ovo’, ‘1-auc_ovr’, ‘1-auc_mu’, ‘brier’, ‘1-balanced_accuracy’, ‘1-mcc’, ‘ece’. Main available regression metrics: ‘rmse’, ‘mae’, ‘max_error’, ‘pinball(0.95)’ (also works with other quantiles specified directly in the string). For more metrics, we refer to models.training.metrics.Metrics.apply().
n_epochs (int | None) – Number of epochs to train the model for (default=256)
batch_size (int | None) – Batch size to be used for fit(), default=256.
predict_batch_size (int | None) – Batch size to be used for predict(), default=1024.
hidden_sizes (List[int] | Literal['rectangular'] | None) – List of numbers of neurons for each hidden layer, default=[256, 256, 256]. If this is set to ‘rectangular’, then [hidden_width] * n_hidden_layers will be used instead.
n_hidden_layers (int | None) – Number of hidden layers, default=3. Only used if hidden_sizes==’rectangular’.
hidden_width (int | None) – Width of each hidden layer, default=256. Only used if hidden_sizes==’rectangular’.
tfms (List[str] | None) – List of preprocessing transformations, default=`[‘one_hot’, ‘median_center’, ‘robust_scale’, ‘smooth_clip’, ‘embedding’]`. Other possible transformations include: ‘median_center’, ‘l2_normalize’, ‘l1_normalize’, ‘quantile’, ‘kdi’.
num_emb_type (str | None) – Type of numerical embeddings used (default=’pbld’). If not set to ‘ignore’, it overrides the parameters use_plr_embeddings, plr_act_name, plr_use_densenet, plr_use_cos_bias. Possible values: ‘ignore’, ‘none’ (no numerical embeddings), ‘pl’, ‘plr’, ‘pbld’, ‘pblrd’.
use_plr_embeddings (bool | None) – Whether PLR (or PL) numerical embeddings should be used (default=True).
plr_sigma (float | None) – Initialization standard deviation for first PLR embedding layer (default=0.1).
plr_hidden_1 (int | None) – (Half of the) number of hidden neurons in the first PLR hidden layer (default=8). This number will be doubled since there are sin() and cos() versions for each hidden neuron.
plr_hidden_2 (int | None) – Number of output neurons of the PLR hidden layer, excluding the optional densenet connection (default=7).
plr_act_name (str | None) – Name of PLR activation function (default=’linear’). Use ‘relu’ for the PLR version and ‘linear’ for the PL version.
plr_use_densenet (bool | None) – Whether to append the original feature to the numerical embeddings (default=True).
plr_use_cos_bias (bool | None) – Whether to use the cos(wx+b) version for the periodic embeddings instead of the (sin(wx), cos(wx)) version (default=True).
plr_lr_factor (float | None) – Learning rate factor for PLR embeddings (default=0.1). Gets multiplied with lr and with the value of the schedule.
max_one_hot_cat_size (int | None) – Maximum category size that one-hot encoding should be applied to, including the category for missing/unknown values (default=9).
embedding_size (int | None) – Number of output features of categorical embedding layers (default=8).
act (str | None) – Activation function (default=’selu’ for classification and ‘mish’ for regression). Can also be ‘relu’ or ‘silu’.
use_parametric_act (bool | None) – Whether to use a parametric activation as described in the paper (default=True).
act_lr_factor (float | None) – Learning rate factor for parametric activation (default=0.1).
weight_param (str | None) – Weight parametrization (default=’ntk’). See models.nn.WeightFitter() for more options.
weight_init_mode (str | None) – Weight initialization mode (default=’std’). See models.nn.WeightFitter() for more options.
weight_init_gain (str | None) – Multiplier for the weight initialization standard deviation. (Does not apply to ‘std’ initialization mode.)
weight_lr_factor (float | None) – Learning rate factor for weights.
bias_init_mode (str | None) – Bias initialization mode (default=’he+5’). See models.nn.BiasFitter() for more options.
bias_lr_factor (float | None) – Bias learning rate factor.
bias_wd_factor (float | None) – Bias weight decay factor.
add_front_scale (bool | None) – Whether to add a scaling layer (diagonal weight matrix) before the linear layers (default=True). If set to true and a scaling layer is already configured in the block_str, this will create an additional scaling layer.
scale_lr_factor (float | None) – Scaling layer learning rate factor (default=1.0 but will be overridden by default for the first layer in first_layer_config).
first_layer_lr_factor (float | None) – First layer learning rate factor (default=1.0).
block_str (str | None) – String describing the default hidden layer components. The default is ‘w-b-a-d’ for weight, bias, activation, dropout. By default, the last layer config will override it with ‘w-b’ and the first layer config will override it with ‘s-w-b-a-d’, where the ‘s’ stands for the scaling layer.
first_layer_config (Dict[str, Any] | None) – Dictionary with more options that can override the other options for the construction of the first MLP layer specifically. The default is dict(block_str=’s-w-b-a-d’, scale_lr_factor=6.0), using a scaling layer at the beginning of the first layer with lr factor 6.0.
last_layer_config (Dict[str, Any] | None) – Dictionary with more options that can override the other options for the construction of the last MLP layer specifically. The default is an empty dict, in which case the block_str will still be overridden by ‘w-b’.
middle_layer_config (Dict[str, Any] | None) – Dictionary with more options that can override the other options for the construction of the layers except first and last MLP layer. The default is an empty dict.
p_drop (float | None) – Dropout probability (default=0.15). Needs to be in [0, 1).
p_drop_sched (str | None) – Dropout schedule (default=’flat_cos’).
wd (float | None) – Weight decay implemented as in the PyTorch AdamW but works with all optimizers (default=0.0 for regression and 1e-2 for classification). Weight decay is implemented as param -= current_lr_value * current_wd_value * param where the current lr and wd values are determined using the base values (lr and wd), factors for the given parameter if available, and the respective schedule. Note that this is not identical to the original AdamW paper, where the lr base value is not included in the update equation.
wd_sched (str | None) – Weight decay schedule.
opt (str | None) – Optimizer (default=’adam’). See optim.optimizers.get_opt_class().
lr (float | Dict[str, float] | None) – Learning rate base value (default=0.04 for classification and 0.14 for regression).
lr_sched (str | None) – Learning rate schedule (default=’coslog4’). See training.scheduling.get_schedule().
mom (float | None) – Momentum parameter, aka $\beta_1$ for Adam (default=0.9).
mom_sched (str | None) – Momentum schedule (default=’constant’).
sq_mom (float | None) – Momentum of squared gradients, aka $\beta_2$ for Adam (default=0.95).
sq_mom_sched (str | None) – Schedule for sq_mom (default=’constant’).
opt_eps (float | None) – Epsilon parameter of the optimizer (default=1e-8 for Adam).
opt_eps_sched (str | None) – Schedule for opt_eps (default=’constant’).
normalize_output (bool | None) – Whether to standardize the target for regression (default=True for regression).
clamp_output (bool | None) – Whether to clamp the output for predict() for regression to the min/max range seen during training (default=True for regression).
use_ls (bool | None) – Whether to use label smoothing for classification (default=True for classification).
ls_eps (float | None) – Epsilon parameter for label smoothing (default=0.1 for classification)
ls_eps_sched (str | None) – Schedule for ls_eps (default=’constant’).
use_early_stopping (bool | None) – Whether to use early stopping (default=False). Note that even without early stopping, the best epoch on the validation set is selected if there is a validation set. Training is stopped if the epoch exceeds early_stopping_multiplicative_patience * best_epoch + early_stopping_additive_patience.
early_stopping_additive_patience (int | None) – See use_early_stopping (default=20).
early_stopping_multiplicative_patience (float | None) – See use_early_stopping (default=2). We recommend setting it to 1 for monotone learning rate schedules and keeping it at 2 for the default schedule.
calibration_method (str | None) – Post-hoc calibration method (only for classification). We recommend ‘ts-mix’ for fast temperature scaling with Laplace smoothing. For other methods, see the get_calibrator method in https://github.com/dholzmueller/probmetrics.
sort_quantile_predictions (bool | None) – If val_metric_name==’multi_pinball(…)’, decides whether the predicted quantiles will be sorted to avoid quantile crossover. Default is True.
stop_epoch (int | None) – Epoch at which training should be stopped (for refitting). The total length of training used for the schedules will be determined by n_epochs, but the stopping epoch will be min(stop_epoch, n_epochs).
use_best_mean_epoch_for_cv (bool | None) – If training an ensemble, whether they should all use a checkpoint from the same epoch with the best average loss, instead of using the best individual epochs (default=False).
n_ens (int | None) – Number of ensemble members that should be used per train-validation split (default=1). For best-epoch selection, the validation scores of averaged predictions will be used.
ens_av_before_softmax (int | None) – When using classification with n_ens>1, whether to average the ensemble predictions on each train-val split before taking the softmax (default=False). We recommend using False as it is representative of the averaging of models across train-val splits.
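
Several of the parameter descriptions above state small formulas (hidden_sizes, use_early_stopping, stop_epoch, wd). The sketch below restates them as plain Python functions to make the arithmetic explicit. The function names are hypothetical, this is not the library's internal code, and whether epochs are counted from 0 or 1 is not specified here.

```python
def resolve_hidden_sizes(hidden_sizes, n_hidden_layers=3, hidden_width=256):
    """Expand hidden_sizes='rectangular' into an explicit list of layer widths."""
    if hidden_sizes == "rectangular":
        return [hidden_width] * n_hidden_layers
    return list(hidden_sizes)


def should_stop_early(epoch, best_epoch, additive_patience=20,
                      multiplicative_patience=2):
    """True once epoch exceeds multiplicative * best_epoch + additive patience."""
    return epoch > multiplicative_patience * best_epoch + additive_patience


def effective_stop_epoch(stop_epoch, n_epochs):
    """Training ends at min(stop_epoch, n_epochs); schedules still span n_epochs."""
    return min(stop_epoch, n_epochs)


def weight_decay_step(param, lr_value, wd_value):
    """One weight-decay update: param -= lr_value * wd_value * param."""
    return param - lr_value * wd_value * param
```

With the default patience values, a best epoch of 20 means training continues through epoch 60 and stops once the epoch count exceeds 60.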

For the HPO and Ensemble variants, we currently provide only a few options:

pytabkit.models.sklearn.sklearn_interfaces.RealMLP_HPO_Classifier.__init__(self, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, n_hyperopt_steps=None, val_metric_name=None, calibration_method=None, hpo_space_name=None, n_caruana_steps=None, n_epochs=None, use_caruana_ensembling=None, train_metric_name=None, time_limit_s=None)

Parameters:

device (str | None) – PyTorch device name like ‘cpu’, ‘cuda’, ‘cuda:0’, ‘mps’ (default=None). If None, ‘cuda’ will be used if available, otherwise ‘cpu’.
random_state (int | RandomState | None) – Random state to use for random number generation (splitting, initialization, batch shuffling). If None, the behavior is not deterministic.
n_cv (int) – Number of cross-validation splits to use (default=1). If validation set indices or an explicit validation set are given in fit(), n_cv models will be fitted using different random seeds. Otherwise, n_cv-fold cross-validation will be used (stratified for classification). For n_cv=1, a single train-validation split will be used, where val_fraction controls the fraction of validation samples. If n_refit=0 is set, the prediction will use the average of the models fitted during cross-validation. (Averaging is over probabilities for classification, and over outputs for regression.) Otherwise, refitted models will be used.
n_refit (int) – Number of models that should be refitted on the training+validation dataset (default=0). If zero, only the models from the cross-validation stage are used. If positive, n_refit models will be fitted on the training+validation dataset (all data given in fit()) and their predictions will be averaged during predict().
n_repeats (int) – Number of times that the (cross-)validation split should be repeated (default=1). Values != 1 are only allowed when no custom validation split is provided. More repeats make fitting slower but reduce the potential for validation set overfitting, especially on smaller datasets.
val_fraction (float) – Fraction of samples used for validation (default=0.2). Has to be in [0, 1). Only used if n_cv==1 and no validation split is provided in fit().
n_threads (int | None) – Number of threads that the method is allowed to use (default=number of physical cores).
tmp_folder (str | Path | None) – Folder in which models can be stored. Setting this allows reducing RAM/VRAM usage by not having all models in RAM at the same time. In this case, the folder needs to be preserved as long as the model exists (including when the model is pickled to disk).
verbosity (int) – Verbosity level (default=0, higher means more verbose). Set to 2 to see logs from intermediate epochs.
n_hyperopt_steps (int | None) – Number of random hyperparameter configs that should be used to train models (default=50).
val_metric_name (str | None) – Name of the validation metric (used for selecting the best epoch). Not used for all models but at least for RealMLP and probably TabM. Defaults are ‘class_error’ for classification and ‘rmse’ for regression. Main available classification metrics (all to be minimized): ‘class_error’, ‘cross_entropy’, ‘1-auc_ovo’, ‘1-auc_ovr’, ‘1-auc_mu’, ‘brier’, ‘1-balanced_accuracy’, ‘1-mcc’, ‘ece’. Main available regression metrics: ‘rmse’, ‘mae’, ‘max_error’, ‘pinball(0.95)’ (also works with other quantiles specified directly in the string). For more metrics, we refer to models.training.metrics.Metrics.apply().
calibration_method (str | None) – Post-hoc calibration method (only for classification) (default=None). We recommend ‘ts-mix’ for fast temperature scaling with Laplace smoothing. For other methods, see the get_calibrator method in https://github.com/dholzmueller/probmetrics.
hpo_space_name (str | None) – Name of the HPO space (default=’default’). The search space used in the paper for RealMLP is ‘default’. However, we recommend using ‘tabarena’ for the best results.
n_caruana_steps (int | None) – Number of weight update iterations for Caruana et al. weighted ensembling (default=40). This parameter is only used when use_caruana_ensembling=True.
n_epochs (int | None) – Number of epochs to train for each NN (default=None). If set, it will override the values from the search space. (Might be ignored for non-RealMLP methods.)
use_caruana_ensembling (bool | None) – Whether to use the algorithm by Caruana et al. (2004) to select a weighted ensemble of models instead of only selecting the best model (default=False).
train_metric_name (str | None) – Name of the training metric (default is cross_entropy for classification and mse for regression). For regression, pinball/multi_pinball can be used instead. (Might be ignored for non-RealMLP methods.)
time_limit_s (float | None) – Time limit in seconds (default=None).
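
To give an idea of what use_caruana_ensembling and n_caruana_steps refer to, here is a generic sketch of greedy ensemble selection with replacement in the style of Caruana et al. (2004). It is not pytabkit's implementation: the function name, the use of mean squared error as the validation loss, and the plain-list data layout are assumptions made for illustration.

```python
def caruana_weights(val_preds, y_val, n_steps=40):
    """Greedy ensemble selection with replacement.

    val_preds holds one list of validation predictions per model.
    Returns one weight per model; the weights sum to 1.
    """
    n_models, n_val = len(val_preds), len(y_val)
    counts = [0] * n_models
    running_sum = [0.0] * n_val  # summed predictions of the models picked so far

    def mse_if_added(m, step):
        # Validation MSE of the averaged ensemble if model m were added now.
        return sum(
            ((running_sum[i] + val_preds[m][i]) / step - y_val[i]) ** 2
            for i in range(n_val)
        ) / n_val

    for step in range(1, n_steps + 1):
        best = min(range(n_models), key=lambda m: mse_if_added(m, step))
        counts[best] += 1
        running_sum = [s + p for s, p in zip(running_sum, val_preds[best])]
    return [c / n_steps for c in counts]
```

Each step adds the model that most improves the averaged validation prediction, so a model can be picked several times and its weight is its share of the picks.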

Boosted Trees

For boosted trees, we provide interfaces analogous to those for RealMLP (TD, D, and HPO variants), but do not wrap the full parameter space of the respective libraries. Here are some representative examples:

pytabkit.models.sklearn.sklearn_interfaces.XGB_TD_Classifier.__init__(self, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, train_metric_name=None, val_metric_name=None, n_estimators=None, max_depth=None, lr=None, subsample=None, colsample_bytree=None, colsample_bylevel=None, colsample_bynode=None, min_child_weight=None, alpha=None, reg_lambda=None, gamma=None, tree_method=None, max_delta_step=None, max_cat_to_onehot=None, num_parallel_tree=None, max_bin=None, multi_strategy=None, calibration_method=None)

Parameters:

device (str | None)
random_state (int | RandomState | None)
n_cv (int)
n_refit (int)
n_repeats (int)
val_fraction (float)
n_threads (int | None)
tmp_folder (str | Path | None)
verbosity (int)
train_metric_name (str | None)
val_metric_name (str | None)
n_estimators (int | None)
max_depth (int | None)
lr (float | None)
subsample (float | None)
colsample_bytree (float | None)
colsample_bylevel (float | None)
colsample_bynode (float | None)
min_child_weight (float | None)
alpha (float | None)
reg_lambda (float | None)
gamma (float | None)
tree_method (str | None)
max_delta_step (float | None)
max_cat_to_onehot (int | None)
num_parallel_tree (int | None)
max_bin (int | None)
multi_strategy (str | None)
calibration_method (str | None)

pytabkit.models.sklearn.sklearn_interfaces.LGBM_TD_Classifier.__init__(self, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, n_estimators=None, max_depth=None, num_leaves=None, lr=None, subsample=None, colsample_bytree=None, bagging_freq=None, min_data_in_leaf=None, min_sum_hessian_in_leaf=None, lambda_l1=None, lambda_l2=None, boosting=None, max_bin=None, cat_smooth=None, cat_l2=None, val_metric_name=None, calibration_method=None)

Parameters:

device (str | None)
random_state (int | RandomState | None)
n_cv (int)
n_refit (int)
n_repeats (int)
val_fraction (float)
n_threads (int | None)
tmp_folder (str | Path | None)
verbosity (int)
n_estimators (int | None)
max_depth (int | None)
num_leaves (int | None)
lr (float | None)
subsample (float | None)
colsample_bytree (float | None)
bagging_freq (float | None)
min_data_in_leaf (int | None)
min_sum_hessian_in_leaf (int | None)
lambda_l1 (float | None)
lambda_l2 (float | None)
boosting (str | None)
max_bin (int | None)
cat_smooth (float | None)
cat_l2 (float | None)
val_metric_name (str | None)
calibration_method (str | None)

pytabkit.models.sklearn.sklearn_interfaces.CatBoost_TD_Classifier.__init__(self, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, n_estimators=None, max_depth=None, lr=None, subsample=None, colsample_bylevel=None, random_strength=None, bagging_temperature=None, leaf_estimation_iterations=None, bootstrap_type=None, boosting_type=None, min_data_in_leaf=None, grow_policy=None, num_leaves=None, max_bin=None, l2_leaf_reg=None, one_hot_max_size=None, val_metric_name=None, train_metric_name=None, calibration_method=None)

Parameters:

device (str | None)
random_state (int | RandomState | None)
n_cv (int)
n_refit (int)
n_repeats (int)
val_fraction (float)
n_threads (int | None)
tmp_folder (str | Path | None)
verbosity (int)
n_estimators (int | None)
max_depth (int | None)
lr (float | None)
subsample (float | None)
colsample_bylevel (float | None)
random_strength (float | None)
bagging_temperature (float | None)
leaf_estimation_iterations (int | None)
bootstrap_type (str | None)
boosting_type (str | None)
min_data_in_leaf (int | None)
grow_policy (str | None)
num_leaves (int | None)
max_bin (int | None)
l2_leaf_reg (float | None)
one_hot_max_size (int | None)
val_metric_name (str | None)
train_metric_name (str | None)
calibration_method (str | None)

Other NN baselines

We offer interfaces (D and HPO variants) for:

MLP (from the RTDL code)
ResNet (from the RTDL code)
FTT (FT-Transformer from the RTDL code)
MLP-PLR (from the RTDL code)
TabR (requires installing faiss)
TabM

pytabkit.models.sklearn.sklearn_interfaces.MLP_RTDL_D_Classifier.__init__(self, module_d_embedding=None, module_d_layers=None, module_d_first_layer=None, module_d_last_layer=None, module_n_layers=None, module_dropout=None, verbose=None, max_epochs=None, batch_size=None, optimizer=None, es_patience=None, lr=None, lr_scheduler=None, lr_patience=None, optimizer_weight_decay=None, use_checkpoints=None, transformed_target=None, tfms=None, quantile_output_distribution=None, val_metric_name=None, module_num_emb_type=None, module_num_emb_dim=None, module_num_emb_hidden_dim=None, module_num_emb_sigma=None, module_num_emb_lite=None, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, calibration_method=None)

Parameters:

module_d_embedding (int | None)
module_d_layers (int | None)
module_d_first_layer (int | None)
module_d_last_layer (int | None)
module_n_layers (int | None)
module_dropout (float | None)
verbose (int | None)
max_epochs (int | None)
batch_size (int | None)
optimizer (str | None)
es_patience (int | None)
lr (float | None)
lr_scheduler (bool | None)
lr_patience (int | None)
optimizer_weight_decay (float | None)
use_checkpoints (bool | None)
transformed_target (bool | None)
tfms (List[str] | None)
quantile_output_distribution (str | None)
val_metric_name (str | None)
module_num_emb_type (str | None)
module_num_emb_dim (int | None)
module_num_emb_hidden_dim (int | None)
module_num_emb_sigma (float | None)
module_num_emb_lite (bool | None)
device (str | None)
random_state (int | RandomState | None)
n_cv (int)
n_refit (int)
n_repeats (int)
val_fraction (float)
n_threads (int | None)
tmp_folder (str | Path | None)
verbosity (int)
calibration_method (str | None)

pytabkit.models.sklearn.sklearn_interfaces.Resnet_RTDL_D_Classifier.__init__(self, module_d_embedding=None, module_d=None, module_d_hidden_factor=None, module_n_layers=None, module_activation=None, module_normalization=None, module_hidden_dropout=None, module_residual_dropout=None, verbose=None, max_epochs=None, batch_size=None, optimizer=None, es_patience=None, lr=None, lr_scheduler=None, lr_patience=None, optimizer_weight_decay=None, use_checkpoints=None, transformed_target=None, tfms=None, quantile_output_distribution=None, val_metric_name=None, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, calibration_method=None)

Parameters:

module_d_embedding (int | None)
module_d (int | None)
module_d_hidden_factor (float | None)
module_n_layers (int | None)
module_activation (str | None)
module_normalization (str | None)
module_hidden_dropout (float | None)
module_residual_dropout (float | None)
verbose (int | None)
max_epochs (int | None)
batch_size (int | None)
optimizer (str | None)
es_patience (int | None)
lr (float | None)
lr_scheduler (bool | None)
lr_patience (int | None)
optimizer_weight_decay (float | None)
use_checkpoints (bool | None)
transformed_target (bool | None)
tfms (List[str] | None)
quantile_output_distribution (str | None)
val_metric_name (str | None)
device (str | None)
random_state (int | RandomState | None)
n_cv (int)
n_refit (int)
n_repeats (int)
val_fraction (float)
n_threads (int | None)
tmp_folder (str | Path | None)
verbosity (int)
calibration_method (str | None)

pytabkit.models.sklearn.sklearn_interfaces.FTT_D_Classifier.__init__(self, module_d_token=None, module_d_ffn_factor=None, module_n_layers=None, module_n_heads=None, module_token_bias=None, module_attention_dropout=None, module_ffn_dropout=None, module_residual_dropout=None, module_activation=None, module_prenormalization=None, module_initialization=None, module_kv_compression=None, module_kv_compression_sharing=None, verbose=None, max_epochs=None, batch_size=None, optimizer=None, es_patience=None, lr=None, lr_scheduler=None, lr_patience=None, optimizer_weight_decay=None, use_checkpoints=None, transformed_target=None, tfms=None, quantile_output_distribution=None, val_metric_name=None, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, calibration_method=None)

Parameters:

module_d_token (int | None)
module_d_ffn_factor (float | None)
module_n_layers (int | None)
module_n_heads (int | None)
module_token_bias (bool | None)
module_attention_dropout (float | None)
module_ffn_dropout (float | None)
module_residual_dropout (float | None)
module_activation (str | None)
module_prenormalization (bool | None)
module_initialization (str | None)
module_kv_compression (str | None)
module_kv_compression_sharing (str | None)
verbose (int | None)
max_epochs (int | None)
batch_size (int | None)
optimizer (str | None)
es_patience (int | None)
lr (float | None)
lr_scheduler (bool | None)
lr_patience (int | None)
optimizer_weight_decay (float | None)
use_checkpoints (bool | None)
transformed_target (bool | None)
tfms (List[str] | None)
quantile_output_distribution (str | None)
val_metric_name (str | None)
device (str | None)
random_state (int | RandomState | None)
n_cv (int)
n_refit (int)
n_repeats (int)
val_fraction (float)
n_threads (int | None)
tmp_folder (str | Path | None)
verbosity (int)
calibration_method (str | None)

pytabkit.models.sklearn.sklearn_interfaces.MLP_PLR_D_Classifier.__init__(self, module_d_embedding=None, module_d_layers=None, module_d_first_layer=None, module_d_last_layer=None, module_n_layers=None, module_dropout=None, verbose=None, max_epochs=None, batch_size=None, optimizer=None, es_patience=None, lr=None, lr_scheduler=None, lr_patience=None, optimizer_weight_decay=None, use_checkpoints=None, transformed_target=None, tfms=None, quantile_output_distribution=None, val_metric_name=None, module_num_emb_type=None, module_num_emb_dim=None, module_num_emb_hidden_dim=None, module_num_emb_sigma=None, module_num_emb_lite=None, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, calibration_method=None)

Parameters:

module_d_embedding (int | None)
module_d_layers (int | None)
module_d_first_layer (int | None)
module_d_last_layer (int | None)
module_n_layers (int | None)
module_dropout (float | None)
verbose (int | None)
max_epochs (int | None)
batch_size (int | None)
optimizer (str | None)
es_patience (int | None)
lr (float | None)
lr_scheduler (bool | None)
lr_patience (int | None)
optimizer_weight_decay (float | None)
use_checkpoints (bool | None)
transformed_target (bool | None)
tfms (List[str] | None)
quantile_output_distribution (str | None)
val_metric_name (str | None)
module_num_emb_type (str | None)
module_num_emb_dim (int | None)
module_num_emb_hidden_dim (int | None)
module_num_emb_sigma (float | None)
module_num_emb_lite (bool | None)
device (str | None)
random_state (int | RandomState | None)
n_cv (int)
n_refit (int)
n_repeats (int)
val_fraction (float)
n_threads (int | None)
tmp_folder (str | Path | None)
verbosity (int)
calibration_method (str | None)

pytabkit.models.sklearn.sklearn_interfaces.TabR_S_D_Classifier.__init__(self, num_embeddings=None, d_main=None, d_multiplier=None, encoder_n_blocks=None, predictor_n_blocks=None, mixer_normalization=None, context_dropout=None, dropout0=None, dropout1=None, normalization=None, activation=None, memory_efficient=None, candidate_encoding_batch_size=None, n_epochs=None, batch_size=None, eval_batch_size=None, context_size=None, freeze_contexts_after_n_epochs=None, optimizer=None, patience=None, transformed_target=None, tfms=None, quantile_output_distribution=None, val_metric_name=None, add_scaling_layer=None, scale_lr_factor=None, use_ntp_linear=None, linear_init_type=None, use_ntp_encoder=None, ls_eps=None, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, calibration_method=None)

Parameters:

num_embeddings (int | None)
d_main (int | None)
d_multiplier (int | None)
encoder_n_blocks (int | None)
predictor_n_blocks (int | None)
mixer_normalization (bool | Literal['auto'] | None)
context_dropout (float | None)
dropout0 (float | None)
dropout1 (float | None)
normalization (str | None)
activation (str | None)
memory_efficient (bool | None)
candidate_encoding_batch_size (int | None)
n_epochs (int | None)
batch_size (int | None)
eval_batch_size (int | None)
context_size (int | None)
freeze_contexts_after_n_epochs (int | None)
optimizer (Dict | None)
patience (int | None)
transformed_target (bool | None)
tfms (List[str] | None)
quantile_output_distribution (str | None)
val_metric_name (str | None)
add_scaling_layer (bool | None)
scale_lr_factor (float | None)
use_ntp_linear (bool | None)
linear_init_type (str | None)
use_ntp_encoder (bool | None)
ls_eps (float | None)
device (str | None)
random_state (int | RandomState | None)
n_cv (int)
n_refit (int)
n_repeats (int)
val_fraction (float)
n_threads (int | None)
tmp_folder (str | Path | None)
verbosity (int)
calibration_method (str | None)

pytabkit.models.sklearn.sklearn_interfaces.TabM_D_Classifier.__init__(self, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, arch_type=None, tabm_k=None, num_emb_type=None, num_emb_n_bins=None, batch_size=None, lr=None, weight_decay=None, n_epochs=None, patience=None, d_embedding=None, d_block=None, n_blocks=None, dropout=None, compile_model=None, allow_amp=None, tfms=None, gradient_clipping_norm=None, calibration_method=None, share_training_batches=None, val_metric_name=None, train_metric_name=None)

Parameters:

device (str | None) – PyTorch device name like ‘cpu’, ‘cuda’, ‘cuda:0’, ‘mps’ (default=None). If None, ‘cuda’ will be used if available, otherwise ‘cpu’.
random_state (int | RandomState | None) – Random state to use for random number generation (splitting, initialization, batch shuffling). If None, the behavior is not deterministic.
n_cv (int) – Number of cross-validation splits to use (default=1). If validation set indices are given in fit(), n_cv models will be fitted using different random seeds. Otherwise, n_cv-fold cross-validation will be used (stratified for classification). If n_refit=0 is set, the prediction will use the average of the models fitted during cross-validation. (Averaging is over probabilities for classification, and over outputs for regression.) Otherwise, refitted models will be used.
n_refit (int) – Number of models that should be refitted on the training+validation dataset (default=0). If zero, only the models from the cross-validation stage are used. If positive, n_refit models will be fitted on the training+validation dataset (all data given in fit()) and their predictions will be averaged during predict().
n_repeats (int) – Number of times that the (cross-)validation split should be repeated (default=1). Values != 1 are only allowed when no custom validation split is provided. A larger number of repeats makes fitting slower but reduces the potential for validation set overfitting, especially on smaller datasets.
val_fraction (float) – Fraction of samples used for validation (default=0.2). Has to be in [0, 1). Only used if n_cv==1 and no validation split is provided in fit().
n_threads (int | None) – Number of threads that the method is allowed to use (default=number of physical cores).
tmp_folder (str | Path | None) – Temporary folder in which data can be stored during fit(). (Currently unused for TabM and variants.) If None, methods generally try to not store intermediate data.
verbosity (int) – Verbosity level (default=0, higher means more verbose). Set to 2 to see logs from intermediate epochs.
arch_type (str | None) – Architecture type for TabM, one of [‘tabm’, ‘tabm-mini’, ‘tabm-normal’, ‘tabm-mini-normal’, ‘plain’].
tabm_k (int | None) – Value of $k$ (number of memory-efficient ensemble members). Default is 32.
num_emb_type (str | None) – Type of numerical embedding, one of [‘none’, ‘pwl’]. Default is ‘none’. ‘pwl’ stands for piecewise linear embeddings.
num_emb_n_bins (int | None) – Number of bins for piecewise linear embeddings (default=48). Only used when piecewise linear numerical embeddings are used. Must be at most the number of training samples, but >1.
batch_size (int | None) – Batch size, default is 256.
lr (float | None) – Learning rate, default is 2e-3.
weight_decay (float | None) – Weight decay, default is 0.
n_epochs (int | None) – Maximum number of epochs (if early stopping doesn’t apply). Default is 1 billion.
patience (int | None) – Patience for early stopping. Default is 16.
d_embedding (int | None) – Embedding dimension for numerical embeddings.
d_block (int | None) – Hidden layer size.
n_blocks (str | int | None) – Number of linear layers, or ‘auto’. Default is ‘auto’, which will use 3 when num_emb_type==‘none’ and 2 otherwise.
dropout (float | None) – Dropout probability. Default is 0.1.
compile_model (bool | None) – Whether torch.compile should be applied to the model (default=False).
allow_amp (bool | None) – Whether automatic mixed precision should be used if the device is a GPU (default=False).
tfms (List[str] | None) – Preprocessing transformations, see models.nn_models.models.PreprocessingFactory. Default is [‘quantile_tabr’]. Categorical values will be one-hot encoded by the model. Note that in the original experiments, it seems that when cat_policy=‘ordinal’, the ordinal-encoded categorical values will later be one-hot encoded by the model.
gradient_clipping_norm (float | Literal['none'] | None) – Norm for gradient clipping. Default is None from the example code (no gradient clipping), but the experiments from the paper use 1.0.
calibration_method (str | None) – Post-hoc calibration method (only for classification). We recommend ‘ts-mix’ for fast temperature scaling with Laplace smoothing. For other methods, see the get_calibrator method in https://github.com/dholzmueller/probmetrics.
share_training_batches (bool | None) – New in v1.4.1: Whether TabM should use the same training samples for each model in the batch (default=False). We adopt the default value False from the newer version of TabM, while the old code (prior to 1.4.1) was equivalent to share_training_batches=True, except that the new code also excludes certain parameters from weight decay.
val_metric_name (str | None) – Name of the validation metric used for early stopping. For classification, the default is ‘class_error’ but could be ‘cross_entropy’, ‘brier’, ‘1-auc_ovr’ etc. For regression, the default is ‘rmse’ but could be ‘mae’.
train_metric_name (str | None) – Name of the metric (loss) used for training. For classification, the default is ‘cross_entropy’. For regression, it is ‘mse’ but could be set to something like ‘multi_pinball(0.05,0.95)’.
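
The n_cv and n_refit descriptions above state that predictions are averaged over the fitted models, over probabilities for classification. The following is a minimal sketch of that averaging in plain Python. It is not pytabkit's implementation; the function names average_probabilities and predict_labels and the example numbers are hypothetical.

```python
def average_probabilities(member_probas):
    """Average class probabilities over ensemble members.

    member_probas[m][i][c] is the probability that member m assigns
    to class c for sample i (hypothetical layout for this sketch).
    """
    n_members = len(member_probas)
    n_samples = len(member_probas[0])
    n_classes = len(member_probas[0][0])
    return [
        [
            sum(member_probas[m][i][c] for m in range(n_members)) / n_members
            for c in range(n_classes)
        ]
        for i in range(n_samples)
    ]


def predict_labels(avg_probas):
    """Pick the class with the highest averaged probability per sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in avg_probas]


# Three models (e.g. from n_cv=3), two samples, two classes.
member_probas = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.7, 0.3], [0.6, 0.4]],
    [[0.8, 0.2], [0.2, 0.8]],
]
avg_probas = average_probabilities(member_probas)
labels = predict_labels(avg_probas)  # [0, 1]
```

Note that the second model alone would predict class 0 for the second sample, while the averaged probabilities (about 0.4 vs. 0.6) yield class 1. For regression, the same idea applies to the model outputs instead of probabilities.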

xRFM

We offer D and HPO variants for xRFM.

pytabkit.models.sklearn.sklearn_interfaces.XRFM_D_Classifier.__init__(self, device=None, random_state=None, n_cv=1, n_refit=0, n_repeats=1, val_fraction=0.2, n_threads=None, tmp_folder=None, verbosity=0, bandwidth=None, p_interp=None, exponent=None, reg=None, iters=None, diag=None, bandwidth_mode=None, kernel_type=None, max_leaf_samples=None, val_metric_name=None, early_stop_rfm=None, early_stop_multiplier=None, classification_mode=None, calibration_method=None, time_limit_s=None, M_batch_size=None)

xRFM. In case of out-of-memory, try reducing M_batch_size and/or max_leaf_samples. Some parameters generally benefit a lot from tuning, such as the regularization (reg).

Parameters:

device (str | None) – PyTorch device name like ‘cpu’, ‘cuda’, ‘cuda:0’, ‘mps’ (default=None). If None, ‘cuda’ will be used if available, otherwise ‘cpu’.
random_state (int | RandomState | None) – Random state to use for random number generation (splitting, initialization, batch shuffling). If None, the behavior is not deterministic.
n_cv (int) – Number of cross-validation splits to use (default=1). If validation set indices are given in fit(), n_cv models will be fitted using different random seeds. Otherwise, n_cv-fold cross-validation will be used (stratified for classification). If n_refit=0 is set, the prediction will use the average of the models fitted during cross-validation. (Averaging is over probabilities for classification, and over outputs for regression.) Otherwise, refitted models will be used.
n_refit (int) – Number of models that should be refitted on the training+validation dataset (default=0). If zero, only the models from the cross-validation stage are used. If positive, n_refit models will be fitted on the training+validation dataset (all data given in fit()) and their predictions will be averaged during predict().
n_repeats (int) – Number of times that the (cross-)validation split should be repeated (default=1). Values != 1 are only allowed when no custom validation split is provided. A larger number of repeats makes fitting slower but reduces the potential for validation set overfitting, especially on smaller datasets.
val_fraction (float) – Fraction of samples used for validation (default=0.2). Has to be in [0, 1). Only used if n_cv==1 and no validation split is provided in fit().
n_threads (int | None) – Number of threads that the method is allowed to use (default=number of physical cores).
tmp_folder (str | Path | None) – Temporary folder in which data can be stored during fit(). (Currently unused for xRFM and variants.) If None, methods generally try to not store intermediate data.
verbosity (int) – Verbosity level (default=0, higher means more verbose).
bandwidth (float | None) – Bandwidth of the kernel, i.e., how wide the kernel is (default=10).
p_interp (float | None) – For kernel_type=’lpq’, this parameter controls the parameter p of the L_p norm in the exponent of the kernel. Specifically, we set p = 2 * p_interp + exponent * (1 - p_interp). Should be in [0, 1].
exponent (float | None) – Exponent of the norm inside the kernel (default=1). Should be in (0, 2]. Recommended values are in [0.7, 1.4].
reg (float | None) – Regularization parameter lambda in the kernel ridge regression (default=1e-3).
iters (int | None) – How many iterations (fitting the regressor, updating the AGOP matrix) should be done (default=5). The default should be good for most cases.
diag (bool | None) – Whether to only fit a diagonal AGOP matrix (default=True).
bandwidth_mode (str | None) – How to set the bandwidth (default=’constant’). For ‘constant’, the specified bandwidth will be used directly. For ‘adaptive’, it will be scaled relative to the median distance between samples. We recommend ‘constant’ for smaller datasets (< max_leaf_samples) where only a single RFM is fit. For larger datasets, ‘adaptive’ may be more suited since it can adapt the bandwidth to the data in the leaf.
kernel_type (str | None) – Type of kernel (default=’l2’). For ‘l2’, the L_2-norm will be used in the generalized Laplace kernel exp(-||x - x’||_2^q), where q is the exponent. This is the fastest kernel and a good default. For ‘lpq’, the slower exp(-||x - x’||_p^q) will be used, where p is determined from q and p_interp. It will use the kermac implementation if kermac is installed.
max_leaf_samples (int | None) – Maximum number of samples in a leaf of xRFM (default=60_000, which is optimized for GPUs with ~40 GB of VRAM). For datasets with more than max_leaf_samples samples, the memory usage is O(max_leaf_samples**2) and the time complexity is roughly O(n_samples * max_leaf_samples**2). Reduce this number to reduce the memory usage. On GPUs with less VRAM, this number can be automatically lowered to avoid running out of memory.
val_metric_name (str | None) – Name of the validation metric (used for selecting the best iteration). Defaults are ‘class_error’ for classification and ‘rmse’ for regression. Available classification metrics (all to be minimized): ‘class_error’, ‘cross_entropy’, ‘1-auroc-ovr’, ‘brier’. Available regression metrics: ‘rmse’.
early_stop_rfm (bool | None) – Whether to stop the iterations early if the error stops decreasing (default=False).
early_stop_multiplier (float | None) – Tolerance for early stopping, should be larger than one (default=1.1). Larger values will early-stop less aggressively.
classification_mode (str | None) – How to convert classification problems to regression problems internally (default=’zero_one’). ‘zero_one’ uses a one-hot encoding, while ‘prevalence’ uses a simplex encoding with zero corresponding to the marginal class ratio.
calibration_method (str | None) – Post-hoc calibration method (only for classification) (default=None). We recommend ‘ts-mix’ for fast temperature scaling with Laplace smoothing. For other methods, see the get_calibrator method in https://github.com/dholzmueller/probmetrics.
time_limit_s (float | None) – Time limit in seconds (default=None).
M_batch_size (int | None) – Batch size used to construct the AGOP matrix M (default=8000). Higher values can speed up the computation but may lead to out-of-memory (esp. for the ‘lpq’ kernel).
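
To make the interplay of exponent, p_interp and kernel_type above concrete, here is a minimal pure-Python sketch of the formulas given in the parameter descriptions. It is an illustration, not the xRFM implementation: the function names are hypothetical, and the way bandwidth enters the kernel (dividing the distance) is an assumption, since the formulas above omit it. With bandwidth=1.0 the code reduces to exp(-||x - x’||_p^q) as written above.

```python
import math


def interpolate_p(p_interp, exponent):
    """p = 2 * p_interp + exponent * (1 - p_interp), as in the p_interp description."""
    if not 0.0 <= p_interp <= 1.0:
        raise ValueError("p_interp should be in [0, 1]")
    if not 0.0 < exponent <= 2.0:
        raise ValueError("exponent should be in (0, 2]")
    return 2.0 * p_interp + exponent * (1.0 - p_interp)


def generalized_laplace_kernel(x, y, exponent=1.0, p=2.0, bandwidth=1.0):
    """exp(-(||x - y||_p / bandwidth) ** exponent).

    Assumption: the bandwidth rescales the distance; the documented
    formula does not show where the bandwidth enters.
    """
    dist = sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
    return math.exp(-((dist / bandwidth) ** exponent))


# p_interp=0 gives p = exponent, p_interp=1 gives p = 2 (the 'l2' case).
p_low = interpolate_p(0.0, 1.0)   # 1.0
p_mid = interpolate_p(0.5, 1.0)   # 1.5
p_high = interpolate_p(1.0, 1.0)  # 2.0

# 'l2'-style kernel: ||(0, 0) - (3, 4)||_2 = 5, so the value is exp(-5).
k_l2 = generalized_laplace_kernel([0.0, 0.0], [3.0, 4.0], exponent=1.0, p=2.0)

# 'lpq'-style kernel with p = 1: ||(0, 0) - (3, 4)||_1 = 7, so exp(-7).
k_l1 = generalized_laplace_kernel([0.0, 0.0], [3.0, 4.0], exponent=1.0, p=p_low)
```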

Other methods

For convenience, we wrap the scikit-learn RF and MLP interfaces with our scikit-learn interfaces, although the validation sets are not used in this case. The respective classes are called RF_SKL_Classifier, MLP_SKL_Classifier, etc. We also provide Ensemble_TD_Classifier and Ensemble_HPO_Classifier, which are weighted ensembles of our TD / HPO models (and similarly for regression).

Saving and loading

RealMLP and possibly other models (except probably TabR) can be saved using pickle-like modules. With standard pickling, a model trained on a GPU will be restored to use the same GPU, and fail to load if the GPU is not present. (Note that dill fails to save torch models in newer torch versions, while pickle can still save them.)
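
As a minimal sketch of the standard pickling route mentioned above, the following helpers write an object to disk and read it back. The helper names save_model and load_model are hypothetical, and a plain dict stands in for a fitted estimator so that the example runs without pytabkit or a GPU. With a fitted estimator, the same two calls would be used, subject to the GPU caveat above.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a picklable object (e.g. a fitted estimator) to `path`."""
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load an object written by save_model. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted estimator (assumption for this sketch).
stand_in_model = {"name": "stand-in", "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)
```

If the estimator was created with tmp_folder set, that folder has to remain available for the restored object as well, as noted at the top of this page.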

The following code loads GPU-trained models onto the CPU, but calling predict() afterwards fails due to pytorch-lightning device issues.