pytabkit.models.alg_interfaces package

Submodules

pytabkit.models.alg_interfaces.alg_interfaces module

class pytabkit.models.alg_interfaces.alg_interfaces.AlgInterface

Bases: object

AlgInterface is an abstract base class for tabular ML methods with an interface that offers more possibilities than a standard scikit-learn interface.

In particular, it allows for parallelized fitting of multiple models, bagging, and refitting. The idea is as follows:

  • The dataset can be split into a test set and the remaining data. (We call this a trainval-test split.)

    The fit() method accepts multiple such splits, and some AlgInterface implementations (NNAlgInterface) can vectorize computations across these splits. However, vectorization may require that the test set sizes are identical in all splits.

  • The remaining data can further be split into training and validation data. (We call this a train-val split.)

    AlgInterface can fit with one or multiple train-val splits, which can also be vectorized in NNAlgInterface. Optionally, get_refit_interface() returns an AlgInterface that can be used for fitting the model on the combined training and validation set with the best settings found on the validation set in the cross-validation stage (represented by self.fit_params). These “best settings” could be an early stopping epoch, a number of trees, or the best hyperparameters found by hyperparameter optimization. We call this refitting.

Another feature of AlgInterface is that it provides methods to get (an estimate of) required resources and to evaluate metrics on training, validation, and test set.
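The two levels of splitting described above can be illustrated with a minimal sketch in plain Python. It does not use pytabkit; the helper name make_splits, the k-fold construction, and the use of plain lists instead of tensors are assumptions made only for illustration.

```python
import random


def make_splits(n_samples, test_fraction, n_cv, seed):
    """Return (trainval_idxs, test_idxs, folds).

    folds is a list of (train_idxs, val_idxs) pairs, i.e. the train-val
    splits of the trainval part (here built as a simple k-fold scheme).
    """
    rng = random.Random(seed)
    perm = list(range(n_samples))
    rng.shuffle(perm)

    # Level 1: trainval-test split.
    n_test = int(round(test_fraction * n_samples))
    test_idxs = perm[:n_test]
    trainval_idxs = perm[n_test:]

    # Level 2: train-val splits of the remaining data.
    folds = []
    for i in range(n_cv):
        val_idxs = trainval_idxs[i::n_cv]  # every n_cv-th element, offset i
        val_set = set(val_idxs)
        train_idxs = [j for j in trainval_idxs if j not in val_set]
        folds.append((train_idxs, val_idxs))
    return trainval_idxs, test_idxs, folds


trainval, test, folds = make_splits(n_samples=100, test_fraction=0.2, n_cv=4, seed=0)
```

With these numbers, every fold has the same number of training samples, which matches the equal-size requirement that vectorized implementations may impose.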

__init__(fit_params=None, **config)
Parameters:
  • fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

eval(ds, idxs_list, metrics, return_preds)

Evaluates the (already fitted) method using various metrics on training, validation, and test sets. The results will also contain the found fit_params and optionally the predictions on the dataset. This method should normally not be overridden in subclasses.

Parameters:
  • ds (DictDataset) – Dataset.

  • idxs_list (List[SplitIdxs]) – List of indices for the training-validation-test splits, one per trainval-test split as in fit().

  • metrics (Metrics | None) – Metrics object that defines which metrics should be evaluated. If metrics is None, an empty list will be returned (which might avoid unnecessary computation when implementing fit() through fit_and_eval()).

  • return_preds (bool) – Whether the predictions on the dataset should be included in the returned results.

Returns:

Returns a list with one NestedDict for every trainval-test split. Denote by results such a NestedDict object. Then, results will contain the following contents:

  • results['metrics', 'train'/'val'/'test', str(n_models), str(start_idx), metric_name] = metric_value. Here, an ensemble of the predictions of models [start_idx:start_idx+n_models] will be used.

  • results['y_preds'] = a list (converted from a tensor) with predictions on the whole dataset, included only if return_preds==True.

  • results['fit_params'] = self.fit_params

Return type:

List[NestedDict]
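The key layout of the returned metrics can be sketched as follows. This is an illustration only: a plain dict with tuple keys stands in for NestedDict, the helper ensemble_mse and the metric name 'mse' are hypothetical, and which (n_models, start_idx) combinations are actually evaluated is an assumption.

```python
from statistics import mean


def ensemble_mse(per_model_preds, y, start_idx, n_models):
    """MSE of the averaged predictions of models [start_idx:start_idx+n_models]."""
    members = per_model_preds[start_idx:start_idx + n_models]
    avg = [mean(col) for col in zip(*members)]
    return mean((p - t) ** 2 for p, t in zip(avg, y))


y_val = [1.0, 2.0, 3.0]
# One row of validation predictions per fitted model.
val_preds = [[1.0, 2.0, 2.0], [1.0, 2.0, 4.0]]

results = {}  # stand-in for a NestedDict
for n_models in (1, 2):
    for start_idx in range(len(val_preds) - n_models + 1):
        key = ('metrics', 'val', str(n_models), str(start_idx), 'mse')
        results[key] = ensemble_mse(val_preds, y_val, start_idx, n_models)
```

In this toy data, each single model is off by one on the last sample, while the two-model ensemble averages the errors away.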

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None
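As a sketch of how the returned structure could be consumed, the following selects, for each trainval-test split, the params with the best loss averaged over tv_split_idx. The function name select_fit_params is hypothetical, and the code assumes that the candidate lists are aligned across train-val splits (the same params appear at the same position in each list); the source does not state this.

```python
def select_fit_params(results):
    """Pick, per trainval-test split, the params with the lowest average loss.

    results[tt_split_idx][tv_split_idx] is a list of (params, loss) tuples.
    Assumption: candidate c refers to the same params in every train-val split.
    """
    fit_params = []
    for tv_results in results:  # one entry per trainval-test split
        n_candidates = len(tv_results[0])
        avg_losses = [
            sum(tv[c][1] for tv in tv_results) / len(tv_results)
            for c in range(n_candidates)
        ]
        best = min(range(n_candidates), key=avg_losses.__getitem__)
        fit_params.append(tv_results[0][best][0])
    return fit_params


# One trainval-test split with two train-val splits and two candidates.
results = [[
    [({'n_estimators': 100}, 0.30), ({'n_estimators': 200}, 0.25)],
    [({'n_estimators': 100}, 0.20), ({'n_estimators': 200}, 0.27)],
]]
best_params = select_fit_params(results)
```

The result has one dictionary per trainval-test split, which is the shape expected for fit_params.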

fit_and_eval(ds, idxs_list, interface_resources, logger, tmp_folders, name, metrics, return_preds)

Run fit() with the given parameters and then return the result of eval() with the given metrics. This method can be overridden instead of fit() if it is more convenient. The idea is that for hyperparameter optimization, one has to evaluate each hyperparameter combination anyway after training it, so it is more efficient to implement fit_and_eval() and return the evaluation of the best method at the end. See the documentation of fit() and eval() for the meaning of the parameters and returned values.

Return type:

List[NestedDict]

get_available_predict_params()
Return type:

Dict[str, Dict[str, Any]]

get_current_predict_params_dict()
get_current_predict_params_name()
get_fit_params()
Returns:

Return self.fit_params.

Return type:

List[Dict] | None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface
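The fit_params and refitting pattern can be mimicked with a toy class. ToyInterface below is a hypothetical stand-in, not part of pytabkit: its fit() takes precomputed validation losses per epoch instead of a dataset, and 'stop_epoch' is an invented key. It only shows how settings found in validation mode are carried over to a refit interface.

```python
class ToyInterface:
    """Hypothetical stand-in that mimics the fit_params / refit pattern only."""

    def __init__(self, fit_params=None, n_models=1):
        self.fit_params = fit_params
        self.n_models = n_models

    def fit(self, val_losses_per_split):
        # val_losses_per_split[i][epoch] is the validation loss on
        # trainval-test split i after the given epoch.
        if self.fit_params is None:
            # (Cross-)validation mode: remember the best epoch per split.
            self.fit_params = [
                {'stop_epoch': min(range(len(losses)), key=losses.__getitem__)}
                for losses in val_losses_per_split
            ]
        # In refitting mode, a real implementation would instead train for
        # exactly fit_params[i]['stop_epoch'] epochs on training+validation data.

    def get_refit_interface(self, n_refit, fit_params=None):
        params = fit_params if fit_params is not None else self.fit_params
        return ToyInterface(fit_params=params, n_models=n_refit)


cv_interface = ToyInterface()
cv_interface.fit([[0.9, 0.5, 0.6], [0.8, 0.7, 0.4]])
refit_interface = cv_interface.get_refit_interface(n_refit=3)
```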

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).

Return type:

Tensor
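For classification, the logits still need a softmax to become probabilities. The sketch below uses nested lists in place of a tensor with layout [n_models, n_samples, n_classes]; averaging the per-model probabilities is one possible way to ensemble and is an assumption, not necessarily what the library does.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(logits):
    """logits[m][i][k] -> probs[i][k], averaged over the model axis m."""
    n_models = len(logits)
    n_samples = len(logits[0])
    probs = []
    for i in range(n_samples):
        per_model = [softmax(logits[m][i]) for m in range(n_models)]
        probs.append([sum(col) / n_models for col in zip(*per_model)])
    return probs


# Two models, two samples, two classes (binary case still has two outputs).
logits = [
    [[0.0, 0.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
probs = ensemble_probs(logits)
```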

set_current_predict_params(name)
Parameters:

name (str)

Return type:

None

to(device)
Parameters:

device (str)

Return type:

None

class pytabkit.models.alg_interfaces.alg_interfaces.MultiSplitWrapperAlgInterface

Bases: AlgInterface

__init__(single_split_interfaces, **config)
Parameters:
  • fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • single_split_interfaces (List[AlgInterface])

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

fit_and_eval(ds, idxs_list, interface_resources, logger, tmp_folders, name, metrics, return_preds)

Run fit() with the given parameters and then return the result of eval() with the given metrics. This method can be overridden instead of fit() if it is more convenient. The idea is that for hyperparameter optimization, one has to evaluate each hyperparameter combination anyway after training it, so it is more efficient to implement fit_and_eval() and return the evaluation of the best method at the end. See the documentation of fit() and eval() for the meaning of the parameters and returned values.

Return type:

List[NestedDict]

get_available_predict_params()
Return type:

Dict[str, Dict[str, Any]]

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).

Return type:

Tensor

set_current_predict_params(name)
Parameters:

name (str)

Return type:

None

class pytabkit.models.alg_interfaces.alg_interfaces.OptAlgInterface

Bases: SingleSplitAlgInterface

__init__(hyper_optimizer, max_resource_config, **config)
Parameters:
  • fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • hyper_optimizer (HyperOptimizer)

  • max_resource_config (Dict)

create_alg_interface(n_sub_splits, **config)
Parameters:

n_sub_splits (int)

Return type:

AlgInterface

fit_and_eval(ds, idxs_list, interface_resources, logger, tmp_folders, name, metrics, return_preds)

Run fit() with the given parameters and then return the result of eval() with the given metrics. This method can be overridden instead of fit() if it is more convenient. The idea is that for hyperparameter optimization, one has to evaluate each hyperparameter combination anyway after training it, so it is more efficient to implement fit_and_eval() and return the evaluation of the best method at the end. See the documentation of fit() and eval() for the meaning of the parameters and returned values.

Return type:

List[NestedDict]

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

objective(params, ds, idxs_list, interface_resources, logger, tmp_folder, name, metrics, return_preds)
Return type:

Tuple[float, Tuple[List[NestedDict], AlgInterface]]

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).

Return type:

Tensor

class pytabkit.models.alg_interfaces.alg_interfaces.RandomParamsAlgInterface

Bases: SingleSplitAlgInterface

__init__(model_idx, fit_params=None, **config)
Parameters:
  • model_idx (int) – Used for seeding along with the seed given in fit(), so we can do random search HPO by combining multiple RandomParamsNNAlgInterface objects with different model_idx values.

  • fit_params (List[Dict[str, Any]] | None) – Fit parameters (stopping epoch for refitting).

  • config – Configuration parameters.
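The role of model_idx in random search can be sketched as follows. How the library actually combines the two seeds is not stated in this reference, so the string-based seed, the function name sample_params, and the search space below are assumptions for illustration.

```python
import random


def sample_params(split_seed, model_idx):
    """Draw one hyperparameter configuration, reproducible per (seed, model_idx).

    Assumption: the two integers are combined into a string seed; the
    search space is invented for this example.
    """
    rng = random.Random(f"{split_seed}:{model_idx}")
    return {
        'learning_rate': 10 ** rng.uniform(-3, -1),  # log-uniform in [1e-3, 1e-1]
        'max_depth': rng.randint(2, 10),
    }


# Random search: the same split seed with different model_idx values gives
# independent, reproducible candidates.
candidates = [sample_params(split_seed=0, model_idx=i) for i in range(5)]
```

Because each candidate depends only on the pair (split_seed, model_idx), candidates can be fitted independently and recreated later without storing the sampled values.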

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).

Return type:

Tensor

class pytabkit.models.alg_interfaces.alg_interfaces.SingleSplitAlgInterface

Bases: AlgInterface

pytabkit.models.alg_interfaces.autogluon_model_interfaces module

class pytabkit.models.alg_interfaces.autogluon_model_interfaces.AutoGluonModelAlgInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

pytabkit.models.alg_interfaces.base module

class pytabkit.models.alg_interfaces.base.InterfaceResources

Bases: object

Simple class representing resources that a method is allowed to use (number of threads and GPUs).

__init__(n_threads, gpu_devices, time_in_seconds=None)
Parameters:
  • n_threads (int)

  • gpu_devices (List[str])

  • time_in_seconds (int | None)

class pytabkit.models.alg_interfaces.base.RequiredResources

Bases: object

Represents estimated/requested resources by a method.

__init__(time_s, n_threads, cpu_ram_gb, n_gpus=0, gpu_usage=1.0, gpu_ram_gb=0.0, n_explicit_physical_cores=0)
Parameters:
  • time_s (float)

  • n_threads (float)

  • cpu_ram_gb (float)

  • n_gpus (int)

  • gpu_usage (float)

  • gpu_ram_gb (float)

  • n_explicit_physical_cores (int)

static combine_sequential(resources_list)
Parameters:

resources_list (List[RequiredResources])

get_resource_vector(fixed_resource_vector)
Parameters:

fixed_resource_vector (ndarray)

should_add_fixed_resources()
Return type:

bool

class pytabkit.models.alg_interfaces.base.SplitIdxs

Bases: object

Represents multiple train-validation-test splits for AlgInterface.

__init__(train_idxs, val_idxs, test_idxs, split_seed, sub_split_seeds, split_id)
Parameters:
  • train_idxs (Tensor) – Tensor of shape (n_trainval_splits, n_train_idxs). Each of the train-val splits needs to have the same number of training samples. The elements of the tensor should index the training set elements in a larger dataset.

  • val_idxs (Tensor | None) – Tensor of shape (n_trainval_splits, n_val_idxs), or None if no validation set should be used.

  • test_idxs (Tensor | None) – Tensor of shape (n_test_idxs,). The same test set will be used for all train-val splits.

  • split_seed (int) – Random seed for algorithms on this split.

  • sub_split_seeds (List[int]) – Separate random seeds for algorithms on each train-val split (length should be n_trainval_splits).

  • split_id (int) – ID of this split (for logging/saving purposes).
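The shape conventions listed above can be checked with a small helper. This sketch uses nested Python lists instead of tensors, and the function split_shapes is hypothetical; it only encodes the constraints stated in the parameter descriptions.

```python
def split_shapes(train_idxs, val_idxs, test_idxs, sub_split_seeds):
    """Validate the layout and return
    (n_trainval_splits, n_train_idxs, n_val_idxs, n_test_idxs)."""
    n_trainval_splits = len(train_idxs)
    train_sizes = {len(row) for row in train_idxs}
    if len(train_sizes) != 1:
        raise ValueError("every train-val split needs the same number of training samples")
    if len(sub_split_seeds) != n_trainval_splits:
        raise ValueError("need one sub-split seed per train-val split")

    n_val_idxs = None
    if val_idxs is not None:
        val_sizes = {len(row) for row in val_idxs}
        if len(val_idxs) != n_trainval_splits or len(val_sizes) != 1:
            raise ValueError("val_idxs must have shape (n_trainval_splits, n_val_idxs)")
        n_val_idxs = val_sizes.pop()

    # The same test set is shared by all train-val splits.
    n_test_idxs = None if test_idxs is None else len(test_idxs)
    return n_trainval_splits, train_sizes.pop(), n_val_idxs, n_test_idxs


shapes = split_shapes(
    train_idxs=[[0, 1, 2, 3], [2, 3, 4, 5]],
    val_idxs=[[4, 5], [0, 1]],
    test_idxs=[6, 7, 8],
    sub_split_seeds=[11, 12],
)
```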

get_sub_split_idxs(i)
Parameters:

i (int)

Return type:

SubSplitIdxs

get_sub_split_idxs_alt(i)
Parameters:

i (int)

Return type:

SplitIdxs

class pytabkit.models.alg_interfaces.base.SubSplitIdxs

Bases: object

Represents a single trainval-test split with multiple train-val splits

__init__(train_idxs, val_idxs, test_idxs, alg_seed)
Parameters:
  • train_idxs (Tensor)

  • val_idxs (Tensor | None)

  • test_idxs (Tensor | None)

  • alg_seed (int)

pytabkit.models.alg_interfaces.calibration module

class pytabkit.models.alg_interfaces.calibration.PostHocCalibrationAlgInterface

Bases: AlgInterface

__init__(alg_interface, fit_params=None, **config)
Parameters:
  • fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • alg_interface (AlgInterface)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).

Return type:

Tensor

to(device)
Parameters:

device (str)

Return type:

None

pytabkit.models.alg_interfaces.catboost_interfaces module

class pytabkit.models.alg_interfaces.catboost_interfaces.CatBoostCustomMetric

Bases: object

__init__(metric_name, is_classification, is_higher_better=False, select_pred_col=None)
Parameters:
  • metric_name (str)

  • is_classification (bool)

  • is_higher_better (bool)

  • select_pred_col (int | None)

evaluate(approxes, target, weight)
get_final_error(error, weight)
is_max_optimal()
class pytabkit.models.alg_interfaces.catboost_interfaces.CatBoostHyperoptAlgInterface

Bases: OptAlgInterface

__init__(space=None, n_hyperopt_steps=50, **config)
Parameters:
  • fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • n_hyperopt_steps (int)

create_alg_interface(n_sub_splits, **config)
Parameters:

n_sub_splits (int)

Return type:

AlgInterface

class pytabkit.models.alg_interfaces.catboost_interfaces.CatBoostSklearnSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.catboost_interfaces.CatBoostSubSplitInterface

Bases: TreeBasedSubSplitInterface

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.catboost_interfaces.RandomParamsCatBoostAlgInterface

Bases: RandomParamsAlgInterface

pytabkit.models.alg_interfaces.ensemble_interfaces module

class pytabkit.models.alg_interfaces.ensemble_interfaces.AlgorithmSelectionAlgInterface

Bases: SingleSplitAlgInterface

Picks the best model out of a list of candidates.

__init__(alg_interfaces, fit_params=None, **config)
Parameters:
  • fit_params (List[Dict] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • alg_interfaces (List[AlgInterface])

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).

Return type:

Tensor
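
To make the note about logits concrete, here is a minimal sketch that applies softmax to predictions laid out as [n_models, n_samples, n_classes] and averages the resulting probabilities over the model axis. Nested Python lists stand in for the tensor, and averaging probabilities is only one common way to combine models; it is an assumption here, not something the documentation prescribes.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_probabilities(y_pred: List[List[List[float]]]) -> List[List[float]]:
    """Average class probabilities over the model axis.

    y_pred has layout [n_models, n_samples, n_classes], mirroring the
    documented output shape; nested lists stand in for a tensor.
    """
    n_models = len(y_pred)
    n_samples = len(y_pred[0])
    n_classes = len(y_pred[0][0])
    probs = [[softmax(sample) for sample in model] for model in y_pred]
    return [
        [sum(probs[m][i][k] for m in range(n_models)) / n_models
         for k in range(n_classes)]
        for i in range(n_samples)
    ]


logits = [
    [[0.0, 0.0], [2.0, 0.0]],  # model 0
    [[0.0, 0.0], [0.0, 2.0]],  # model 1
]
avg = mean_probabilities(logits)
```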

to(device)
Parameters:

device (str)

Return type:

None

class pytabkit.models.alg_interfaces.ensemble_interfaces.CaruanaEnsembleAlgInterface

Bases: SingleSplitAlgInterface

Ensemble selection following a simple variant of Caruana et al. (2004), “Ensemble selection from libraries of models”, without pre-selection of candidates.
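
For orientation, the sketch below shows a generic form of greedy ensemble selection with replacement in the spirit of Caruana et al. (2004). It is not pytabkit's implementation: the function name, the use of mean squared error, and the toy regression predictions are all assumptions made for illustration.

```python
from typing import List


def mse(pred: List[float], y: List[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)


def greedy_ensemble_selection(preds: List[List[float]], y: List[float],
                              n_rounds: int) -> List[float]:
    """Return ensemble weights for the models in preds.

    preds[m] holds the validation predictions of model m. In every round,
    the model whose addition gives the lowest validation loss of the
    averaged ensemble is added; models may be picked repeatedly.
    """
    n_models = len(preds)
    counts = [0] * n_models
    ens_sum = [0.0] * len(y)  # running sum of selected predictions
    for round_idx in range(n_rounds):
        best_idx, best_loss = 0, float("inf")
        for m in range(n_models):
            candidate = [(s + p) / (round_idx + 1)
                         for s, p in zip(ens_sum, preds[m])]
            loss = mse(candidate, y)
            if loss < best_loss:
                best_idx, best_loss = m, loss
        counts[best_idx] += 1
        ens_sum = [s + p for s, p in zip(ens_sum, preds[best_idx])]
    return [c / n_rounds for c in counts]


val_preds = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0], [5.0, 5.0, 5.0]]
y_val = [1.0, 1.0, 1.0]
weights = greedy_ensemble_selection(val_preds, y_val, n_rounds=2)
```

In this toy case, the first two models are each selected once, because their average matches the targets exactly, while the third model is never selected.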

__init__(alg_interfaces, fit_params=None, **config)
Parameters:
  • fit_params (List[Dict] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • alg_interfaces (List[AlgInterface])

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).

Return type:

Tensor

to(device)
Parameters:

device (str)

Return type:

None

class pytabkit.models.alg_interfaces.ensemble_interfaces.PrecomputedPredictionsAlgInterface

Bases: SingleSplitAlgInterface

__init__(y_preds_cv, y_preds_refit, fit_params_cv, fit_params_refit)
Parameters:
  • fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • y_preds_cv (Tensor)

  • y_preds_refit (Tensor | None)

  • fit_params_cv (Dict)

  • fit_params_refit (Dict | None)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).

Return type:

Tensor

class pytabkit.models.alg_interfaces.ensemble_interfaces.WeightedPrediction

Bases: object

__init__(y_pred_list, task_type)
Parameters:
  • y_pred_list (List[Tensor])

  • task_type (TaskType)

predict_for_weights(weights)
Parameters:

weights (ndarray)

pytabkit.models.alg_interfaces.lightgbm_interfaces module

class pytabkit.models.alg_interfaces.lightgbm_interfaces.LGBMCustomMetric

Bases: object

__init__(metric_name, is_classification, is_higher_better=False)
Parameters:
  • metric_name (str)

  • is_classification (bool)

  • is_higher_better (bool)

class pytabkit.models.alg_interfaces.lightgbm_interfaces.LGBMHyperoptAlgInterface

Bases: OptAlgInterface

__init__(space=None, n_hyperopt_steps=50, opt_method='hyperopt', **config)
Parameters:
  • fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • n_hyperopt_steps (int)

  • opt_method (str)

create_alg_interface(n_sub_splits, **config)
Parameters:

n_sub_splits (int)

Return type:

AlgInterface

class pytabkit.models.alg_interfaces.lightgbm_interfaces.LGBMSklearnSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.lightgbm_interfaces.LGBMSubSplitInterface

Bases: TreeBasedSubSplitInterface

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.lightgbm_interfaces.RandomParamsLGBMAlgInterface

Bases: RandomParamsAlgInterface

pytabkit.models.alg_interfaces.nn_interfaces module

class pytabkit.models.alg_interfaces.nn_interfaces.NNAlgInterface

Bases: AlgInterface

__init__(fit_params=None, **config)
Parameters:
  • fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

get_available_predict_params()
Return type:

Dict[str, Dict[str, Any]]

get_first_layer_weights(with_scale)
Parameters:

with_scale (bool)

Return type:

Tensor

get_importances()
Return type:

Tensor

get_model_ram_gb(ds, n_cv, n_refit, n_splits, split_seeds)
Parameters:
  • ds (DictDataset)

  • n_cv (int)

  • n_refit (int)

  • n_splits (int)

  • split_seeds (List[int])

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).

Return type:

Tensor

to(device)
Parameters:

device (str)

Return type:

None

class pytabkit.models.alg_interfaces.nn_interfaces.NNHyperoptAlgInterface

Bases: OptAlgInterface

__init__(space=None, n_hyperopt_steps=50, opt_method='hyperopt', **config)
Parameters:
  • fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • space (str | Dict[str, Any] | None)

  • n_hyperopt_steps (int)

  • opt_method (str)

create_alg_interface(n_sub_splits, **config)
Parameters:

n_sub_splits (int)

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.nn_interfaces.RandomParamsNNAlgInterface

Bases: SingleSplitAlgInterface

__init__(model_idx, fit_params=None, **config)
Parameters:
  • fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • model_idx (int)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_available_predict_params()
Return type:

Dict[str, Dict[str, Any]]

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).

Return type:

Tensor

to(device)
Parameters:

device (str)

Return type:

None

class pytabkit.models.alg_interfaces.nn_interfaces.RealMLPParamSampler

Bases: object

__init__(is_classification, hpo_space_name='default', **config)
Parameters:
  • is_classification (bool)

  • hpo_space_name (str)

sample_params(seed)
Parameters:

seed (int)

Return type:

Dict[str, Any]

pytabkit.models.alg_interfaces.nn_interfaces.get_lignting_accel_and_devices(device)
Parameters:

device (str)

pytabkit.models.alg_interfaces.other_interfaces module

class pytabkit.models.alg_interfaces.other_interfaces.ExtraTreesSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.GBTSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.GrandeSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.GrandeWrapper

Bases: object

Wrapper class for GRANDE that allows passing cat_features to fit() instead of the constructor.
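
The general pattern can be sketched as follows: the wrapper stores its configuration and only constructs the underlying estimator once fit() receives cat_features. ToyEstimator and DeferredCatFeaturesWrapper are hypothetical stand-ins written for this illustration; they do not reproduce GRANDE or the actual GrandeWrapper code.

```python
from typing import Any, List, Optional


class ToyEstimator:
    """Hypothetical estimator that needs cat_features in its constructor."""

    def __init__(self, cat_features: Optional[List[str]] = None, **config: Any):
        self.cat_features = cat_features or []
        self.config = config
        self.mean_ = None

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)  # trivial "model": predict the mean
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class DeferredCatFeaturesWrapper:
    """Stores the config and builds the estimator only once fit() is called."""

    def __init__(self, **config: Any):
        self.config = config
        self.model = None

    def fit(self, X, y, cat_features: Optional[List[str]] = None):
        # The estimator is created here, where cat_features is finally known.
        self.model = ToyEstimator(cat_features=cat_features, **self.config)
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)


wrapper = DeferredCatFeaturesWrapper(depth=3).fit(
    X=[[1.0, "a"], [2.0, "b"]], y=[1.0, 3.0], cat_features=["color"]
)
preds = wrapper.predict([[0.0, "a"]])
```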

__init__(**config)

fit(X, y, X_val, y_val, cat_features=None)
Parameters:

cat_features (List[str] | None)

predict(X)

predict_proba(X)

class pytabkit.models.alg_interfaces.other_interfaces.KANSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.KNNSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.LinearModelSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.RFSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.RandomParamsExtraTreesAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.other_interfaces.RandomParamsKNNAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.other_interfaces.RandomParamsLinearModelAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.other_interfaces.RandomParamsRFAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.other_interfaces.SklearnMLPSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.TabICLSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.TabPFN2SubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

pytabkit.models.alg_interfaces.resource_computation module

class pytabkit.models.alg_interfaces.resource_computation.FeatureSpec

Bases: object

Allows creating a list of product feature names from product and powerset operations, etc.

static concat(*feature_specs)

static powerset_products(*feature_specs)

static product(*feature_specs)

class pytabkit.models.alg_interfaces.resource_computation.LogLinearModule

Bases: Module

__init__(n_features)

Initialize internal Module state, shared by both nn.Module and ScriptModule.

Parameters:

n_features (int)

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters:

x (Tensor)

Return type:

Tensor

class pytabkit.models.alg_interfaces.resource_computation.LogLinearRegressor

Bases: object

__init__(pessimistic)
Parameters:

pessimistic (bool)

fit(X, y)
Parameters:
  • X (ndarray)

  • y (ndarray)

get_coefs()
Return type:

ndarray

class pytabkit.models.alg_interfaces.resource_computation.NormalizedDataRegressor

Bases: object

__init__(sub_regressor)

fit(X, y)
Parameters:
  • X (ndarray)

  • y (ndarray)

get_coefs()
Return type:

ndarray

predict(X)
Parameters:

X (ndarray)

Return type:

ndarray

class pytabkit.models.alg_interfaces.resource_computation.ResourcePredictor

Bases: object

Predicts resource usages based on a linear model on raw and product features.

__init__(config, time_params, cpu_ram_params, gpu_ram_params=None, n_gpus=0, gpu_usage=1.0)
Parameters:
  • config (Dict[str, Any]) – Configuration parameters.

  • time_params (Dict[str, float]) – Coefficients for the linear model for time prediction.

  • cpu_ram_params (Dict[str, float]) – Coefficients for the linear model for CPU RAM prediction.

  • gpu_ram_params (Dict[str, float] | None) – Coefficients for the linear model for GPU RAM prediction.

  • n_gpus (int) – Number of GPUs that should be used.

  • gpu_usage (float) – Usage level of each GPU (between 0 and 1).

get_required_resources(ds, **extra_params)

Provides an estimate of the required resources.

Parameters:

ds (DictDataset) – Dataset (does not need to contain the tensors, just the n_samples and tensor_infos).

Returns:

RequiredResources estimate.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.resource_computation.Sampler

Bases: object

sample()
Return type:

int | float

class pytabkit.models.alg_interfaces.resource_computation.TimeWrapper

Bases: object

__init__(f)
Parameters:

f (Callable)

class pytabkit.models.alg_interfaces.resource_computation.UniformSampler

Bases: Sampler

__init__(low, high, log=False, is_int=False)
Parameters:
  • low (int | float)

  • high (int | float)

sample()
Return type:

int | float
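
The sampler classes above carry no description. As a rough illustration only, the sketch below shows one plausible reading of the constructor arguments: log=True samples uniformly in log space and is_int=True rounds the result to an integer. The class name UniformSamplerSketch and the rng argument are hypothetical and not part of pytabkit; the real UniformSampler may behave differently.

```python
import math
import random


class UniformSamplerSketch:
    """Hypothetical stand-in, not the pytabkit implementation."""

    def __init__(self, low, high, log=False, is_int=False, rng=None):
        self.low, self.high = low, high
        self.log, self.is_int = log, is_int
        # rng is an assumption added here to make sampling reproducible.
        self.rng = rng or random.Random()

    def sample(self):
        if self.log:
            # Draw uniformly between log(low) and log(high), then map back.
            value = math.exp(self.rng.uniform(math.log(self.low), math.log(self.high)))
        else:
            value = self.rng.uniform(self.low, self.high)
        # Rounding is one possible way to obtain integers.
        return round(value) if self.is_int else value


sampler = UniformSamplerSketch(1e-3, 1e1, log=True, rng=random.Random(0))
print(sampler.sample())
```
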

pytabkit.models.alg_interfaces.resource_computation.create_ds(n_samples, n_cont, n_cat, cat_size, n_classes)
Parameters:
  • n_samples (int)

  • n_cont (int)

  • n_cat (int)

  • cat_size (int)

  • n_classes (int)

Return type:

DictDataset

pytabkit.models.alg_interfaces.resource_computation.ds_to_xy(ds)
Parameters:

ds (DictDataset)

Return type:

Tuple[DataFrame, ndarray]

pytabkit.models.alg_interfaces.resource_computation.eval_linear_product_model(raw_features, params)

Computes the “inner product” between the feature dictionaries (obtained from raw features and products according to the keys in params).

Parameters:
  • raw_features (Dict[str, Any])

  • params (Dict[str, float])
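
To illustrate the inner product described above, the following minimal sketch assumes that each key in params is a '*'-separated list of raw feature names, that names such as '1/n_threads' are themselves raw feature names, and that the empty key '' stands for the constant term. These assumptions are suggested by the coefficient dictionaries in ResourceParams but are not confirmed here. The function name eval_linear_product_model_sketch and the example values are hypothetical; this is not the library's implementation.

```python
import math
from typing import Any, Dict


def eval_linear_product_model_sketch(raw_features: Dict[str, Any],
                                     params: Dict[str, float]) -> float:
    total = 0.0
    for key, coef in params.items():
        # The empty key is treated as the empty product, i.e. the intercept.
        factors = key.split("*") if key else []
        # math.prod over no factors is 1, so the intercept contributes coef.
        total += coef * math.prod(float(raw_features[name]) for name in factors)
    return total


raw = {"n_samples": 1000, "n_features": 20}
params = {"": 0.5, "n_samples": 1e-3, "n_features*n_samples": 1e-5}
# 0.5 + 1e-3 * 1000 + 1e-5 * (20 * 1000), roughly 1.7
print(eval_linear_product_model_sketch(raw, params))
```
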

pytabkit.models.alg_interfaces.resource_computation.fit_resource_factors(data, pessimistic, coef_factor=1.0)
Parameters:
  • data (List[Tuple[Dict[str, float], float]])

  • pessimistic (bool)

  • coef_factor (float)

pytabkit.models.alg_interfaces.resource_computation.get_resource_features(config, ds, n_cv, n_refit, n_splits, **extra_params)

Extracts features that can be used in a linear model for predicting resource usage.

Parameters:
  • config (Dict)

  • ds (DictDataset)

  • n_cv (int)

  • n_refit (int)

  • n_splits (int)

Return type:

Dict[str, float]

pytabkit.models.alg_interfaces.resource_computation.process_resource_features(raw_features, feature_spec)

Adds product features to raw features.

Parameters:
  • raw_features (Dict[str, Any]) – Raw feature values.

  • feature_spec (List[str]) – List of strings. Each string should be of the form ‘feature_1*…*feature_n’, using the names of the features whose products should be added.

Returns:

Returns a dictionary of the raw features along with the newly computed product features.
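
A small sketch of the behavior described for process_resource_features, assuming numeric raw feature values and specs whose factors are all present in raw_features. The name process_resource_features_sketch and the sample values are hypothetical; the actual function may handle missing or non-numeric features differently.

```python
import math
from typing import Any, Dict, List


def process_resource_features_sketch(raw_features: Dict[str, Any],
                                     feature_spec: List[str]) -> Dict[str, Any]:
    # Start from a copy so the caller's dictionary is left untouched.
    features = dict(raw_features)
    for spec in feature_spec:
        # An empty spec is treated as the empty product (value 1).
        names = spec.split("*") if spec else []
        features[spec] = math.prod(raw_features[name] for name in names)
    return features


out = process_resource_features_sketch(
    {"n_samples": 100, "n_features": 5},
    ["n_features*n_samples"],
)
print(out)
# {'n_samples': 100, 'n_features': 5, 'n_features*n_samples': 500}
```
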

pytabkit.models.alg_interfaces.resource_params module

class pytabkit.models.alg_interfaces.resource_params.ResourceParams

Bases: object

cb_class_ram = {'': 0.9345478156433287, '2_power_maxdepth': 2.576133502607949e-09, '2_power_maxdepth*n_features': 7.810833280259485e-12, '2_power_maxdepth*n_features*n_samples': 1.5863977594541182e-13, '2_power_maxdepth*n_features*n_samples*n_tree_repeats': 2.3171956595374328e-17, '2_power_maxdepth*n_features*n_tree_repeats': 6.14544078331367e-15, '2_power_maxdepth*n_samples': 1.3036510550142841e-15, '2_power_maxdepth*n_samples*n_tree_repeats': 1.9523394732422347e-09, '2_power_maxdepth*n_tree_repeats': 2.356086562374563e-05, 'ds_onehot_size_gb': 0.012758554137232066, 'ds_prep_size_gb': 1.804116547565268e-05, 'ds_size_gb': 1.804116547565268e-05, 'max_depth': 0.004088255941858752, 'max_depth*n_features': 0.0006014917997388746, 'max_depth*n_features*n_samples': 4.241634070711833e-09, 'max_depth*n_features*n_samples*n_tree_repeats': 1.197601653926371e-16, 'max_depth*n_features*n_tree_repeats': 1.834250929757216e-13, 'max_depth*n_samples': 1.4477032736637855e-13, 'max_depth*n_samples*n_tree_repeats': 3.3706497906893135e-13, 'max_depth*n_tree_repeats': 1.1590969030724202e-09, 'n_features': 3.8863715875356e-09, 'n_features*n_samples': 3.767039504566679e-08, 'n_features*n_samples*n_tree_repeats': 7.361290583089635e-16, 'n_features*n_tree_repeats': 1.1947420344843242e-12, 'n_samples': 7.243808011863237e-07, 'n_samples*n_tree_repeats': 1.2285638949747794e-07, 'n_tree_repeats': 4.077606761367131e-09}
cb_class_time = {'': 1.1074866100217955, 'ds_onehot_size_gb': 2.0150542417790342e-07, 'ds_prep_size_gb': 6.2276292117813865, 'ds_size_gb': 6.2276292117813865, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads': 2.651274595052903e-10, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features': 2.3903321610037346e-05, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples': 2.3930248376103085e-16, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 8.531748659348444e-11, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_tree_repeats': 4.589892590504275e-14, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples': 3.673856471950424e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 6.267867148099078e-16, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_tree_repeats': 3.5098969397077584e-11, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads': 1.7778533486675952e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features': 1.285253358050953e-10, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples': 2.627359007275516e-15, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.133320942151551e-15, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_tree_repeats': 6.629510161784679e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples': 4.732937240944653e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 5.508439525827261e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_tree_repeats': 8.378247017832774e-10, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 2.214973220043591, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 0.000849954711796066, 
'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 2.3531597535778573e-14, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 4.2994223618739465e-15, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 3.964226717465322e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 3.035559075362487e-06, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 7.13999461225352e-07, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 5.1876881836135774e-09}
lgbm_class_ram = {'': 0.8604627263253337, 'ds_onehot_size_gb': 3.622669179301401e-06, 'ds_prep_size_gb': 2.0214168208781946, 'ds_size_gb': 2.0214168208781946, 'log_num_leaves': 1.573053922451339e-08, 'log_num_leaves*n_features': 2.930068871528871e-11, 'log_num_leaves*n_features*n_samples': 3.939554526330466e-15, 'log_num_leaves*n_features*n_samples*n_tree_repeats': 3.851475872271092e-15, 'log_num_leaves*n_features*n_tree_repeats': 2.7540140942935337e-13, 'log_num_leaves*n_samples': 1.617414150367892e-13, 'log_num_leaves*n_samples*n_tree_repeats': 6.161688826595097e-13, 'log_num_leaves*n_tree_repeats': 1.626145985707e-06, 'n_features': 3.1028960780988996e-10, 'n_features*n_samples': 2.5173717397818705e-08, 'n_features*n_samples*n_tree_repeats': 6.656160609292717e-11, 'n_features*n_tree_repeats': 1.4858440058980697e-12, 'n_samples': 3.856682701344501e-07, 'n_samples*n_tree_repeats': 1.544688671627044e-10, 'n_tree_repeats': 0.0015219464100389682, 'num_leaves': 7.114807543594747e-11, 'num_leaves*n_features': 6.127161836179573e-06, 'num_leaves*n_features*n_samples': 5.682583426130539e-17, 'num_leaves*n_features*n_samples*n_tree_repeats': 2.820814699620109e-14, 'num_leaves*n_features*n_tree_repeats': 4.723694325860319e-15, 'num_leaves*n_samples': 6.063719974576439e-16, 'num_leaves*n_samples*n_tree_repeats': 1.1825948996367154e-14, 'num_leaves*n_tree_repeats': 7.004349205794621e-07}
lgbm_class_time = {'': 0.07952271409861912, 'ds_onehot_size_gb': 0.6707498854892533, 'ds_prep_size_gb': 24.914198992356777, 'ds_size_gb': 24.914198992356777, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads': 1.6421556695965297e-07, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features': 0.001802775666445253, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_samples': 3.376112165195102e-07, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 8.92885930282138e-09, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_tree_repeats': 6.072475113612503e-12, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_samples': 2.330829367448416e-12, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_samples*n_tree_repeats': 1.2170171882409568e-13, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_tree_repeats': 0.015956943711852814, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 0.15904542819723824, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 0.015836831101031235, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 2.320710370608533e-08, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 4.006248880421662e-14, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 2.885892548234532e-11, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 3.995934332919547e-09, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 4.51061814549484e-13, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 3.75292585133515e-07, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads': 7.505014868911757e-10, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features': 2.152594512387446e-12, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_samples': 9.221334002333759e-16, 
'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 4.8809384428115866e-11, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_tree_repeats': 6.26406208478857e-14, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_samples': 9.05403593468941e-15, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_samples*n_tree_repeats': 2.3824258787970722e-15, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_tree_repeats': 0.00041603300901854167}
xgb_class_ram = {'': 0.899804501497566, '2_power_maxdepth': 3.26910486762921e-11, '2_power_maxdepth*n_features': 1.140492447521818e-08, '2_power_maxdepth*n_features*n_samples': 3.6325731146686714e-13, '2_power_maxdepth*n_features*n_samples*n_tree_repeats': 3.723108372490702e-19, '2_power_maxdepth*n_features*n_tree_repeats': 2.404137742885295e-15, '2_power_maxdepth*n_samples': 2.64316777243899e-16, '2_power_maxdepth*n_samples*n_tree_repeats': 1.4901204061072977e-17, '2_power_maxdepth*n_tree_repeats': 1.4676442049665057e-12, 'ds_onehot_size_gb': 7.280007472890875e-06, 'ds_prep_size_gb': 0.41986843027802623, 'ds_size_gb': 0.41986843027802623, 'max_depth': 3.280529943711475e-08, 'max_depth*n_features': 6.35648749681192e-05, 'max_depth*n_features*n_samples': 1.28838675675802e-08, 'max_depth*n_features*n_samples*n_tree_repeats': 1.69854661852343e-16, 'max_depth*n_features*n_tree_repeats': 1.935402530195678e-13, 'max_depth*n_samples': 6.291962320207664e-14, 'max_depth*n_samples*n_tree_repeats': 5.126839919323976e-15, 'max_depth*n_tree_repeats': 5.768929558524772e-10, 'n_features': 1.6375678219943912e-10, 'n_features*n_samples': 3.488627499883473e-11, 'n_features*n_samples*n_tree_repeats': 4.2124781789579334e-11, 'n_features*n_tree_repeats': 1.302388952570238e-12, 'n_samples': 8.808932580897527e-08, 'n_samples*n_tree_repeats': 8.625259564591089e-10, 'n_tree_repeats': 0.0012854309387287798}
xgb_class_time = {'': 1.5850150119193643e-06, 'ds_onehot_size_gb': 7.555892653328937e-06, 'ds_prep_size_gb': 67.40780781613621, 'ds_size_gb': 67.40780781613621, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads': 6.35528424560118e-10, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features': 3.4755127308109863e-05, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples': 2.652000680981318e-10, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.1214153087760665e-11, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_tree_repeats': 1.1585222842499338e-13, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples': 7.369774923827121e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 6.186297360838691e-16, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_tree_repeats': 8.810550042257941e-11, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads': 9.578781115632407e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features': 0.007922594727428374, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples': 6.758297160216264e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.4232541896951673e-10, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_tree_repeats': 8.113108001263881e-12, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples': 1.7180121037111673e-12, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 7.916471324379998e-14, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_tree_repeats': 1.2099510988434818e-08, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 3.1700300238387285e-06, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 4.361726529019224e-09, 
'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 3.348195651528877e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 3.4142887744033714e-13, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 4.433229074601185e-11, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 1.7981743709586172e-06, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 3.1379386919643983e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 0.416152219367654}
class pytabkit.models.alg_interfaces.resource_params.ResourceParamsOld

Bases: object

cb_class_ram = {'': 0.8683295939412378, '2_power_maxdepth': 0.0001056123359157812, '2_power_maxdepth*n_features': 1.0080022114889349e-10, '2_power_maxdepth*n_features*n_samples': 2.3070275489115195e-12, '2_power_maxdepth*n_features*n_samples*n_tree_repeats': 2.7850591221080067e-17, '2_power_maxdepth*n_features*n_tree_repeats': 6.15051597263584e-15, '2_power_maxdepth*n_samples': 1.3780270956209364e-15, '2_power_maxdepth*n_samples*n_tree_repeats': 2.064100170958034e-09, '2_power_maxdepth*n_tree_repeats': 2.694024798514516e-06, 'ds_onehot_size_gb': 0.054809311336043706, 'ds_prep_size_gb': 2.1956796547330758e-05, 'ds_size_gb': 2.1956796547330758e-05, 'max_depth': 0.00023942254928693192, 'max_depth*n_features': 0.0006188384463276942, 'max_depth*n_features*n_samples': 4.017104578325911e-09, 'max_depth*n_features*n_samples*n_tree_repeats': 1.2652983818045863e-16, 'max_depth*n_features*n_tree_repeats': 1.825891231551508e-13, 'max_depth*n_samples': 2.0135633249657367e-13, 'max_depth*n_samples*n_tree_repeats': 1.9065381412052897e-13, 'max_depth*n_tree_repeats': 7.662207891804141e-10, 'n_features': 1.728902260462638e-09, 'n_features*n_samples': 3.2106346545767416e-08, 'n_features*n_samples*n_tree_repeats': 8.080444898120663e-16, 'n_features*n_tree_repeats': 1.1883754249270118e-12, 'n_samples': 5.359259624964122e-07, 'n_samples*n_tree_repeats': 1.817237502556807e-07, 'n_tree_repeats': 3.16259450440823e-09}
cb_class_time = {'': 0.060695272326207535, 'ds_onehot_size_gb': 0.040427221672569374, 'ds_prep_size_gb': 2.4268955178538847, 'ds_size_gb': 2.4268955178538847, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads': 1.99445077397377e-10, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features': 1.2644593910088394e-05, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples': 1.1517663973680398e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 2.4847067022145893e-11, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_tree_repeats': 2.235731644015564e-14, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples': 3.0511461549128756e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 2.873281614024595e-16, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_tree_repeats': 1.2520160532307873e-11, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads': 1.374338752023958e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features': 7.126063129715731e-11, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples': 2.631878772648314e-15, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.4077434831895832e-15, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_tree_repeats': 3.344879400790812e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples': 1.242824030666801e-12, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 9.32433742185293e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_tree_repeats': 4.062768369148915e-10, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 0.12139054465241507, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 0.002034550389178136, 
'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 1.590097554595333e-14, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 2.280000915439824e-15, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 1.972850747965341e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 5.259225293072914e-06, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 1.1159977413280863e-07, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 3.0362927572255956e-09}
lgbm_class_ram = {'': 0.8545661661490145, 'ds_onehot_size_gb': 4.0697094447404033e-07, 'ds_prep_size_gb': 2.3080037837801175, 'ds_size_gb': 2.3080037837801175, 'log_num_leaves': 1.8470627691115034e-08, 'log_num_leaves*n_features': 4.90256931677757e-11, 'log_num_leaves*n_features*n_samples': 3.020317664222622e-15, 'log_num_leaves*n_features*n_samples*n_tree_repeats': 2.1876975907194365e-15, 'log_num_leaves*n_features*n_tree_repeats': 2.6408516124748747e-13, 'log_num_leaves*n_samples': 1.4244297306885883e-13, 'log_num_leaves*n_samples*n_tree_repeats': 7.582204707419711e-13, 'log_num_leaves*n_tree_repeats': 4.350203928522753e-07, 'n_features': 4.08148741723376e-07, 'n_features*n_samples': 2.3506833903706615e-08, 'n_features*n_samples*n_tree_repeats': 8.047116933926301e-12, 'n_features*n_tree_repeats': 1.4109066020140611e-12, 'n_samples': 2.994431799612211e-07, 'n_samples*n_tree_repeats': 1.1377985339470745e-09, 'n_tree_repeats': 0.0018080853926450316, 'num_leaves': 1.0490359582375276e-10, 'num_leaves*n_features': 6.105483514684091e-06, 'num_leaves*n_features*n_samples': 3.668665655364504e-17, 'num_leaves*n_features*n_samples*n_tree_repeats': 1.2053037667373442e-13, 'num_leaves*n_features*n_tree_repeats': 4.533114041820276e-15, 'num_leaves*n_samples': 5.943342181332617e-16, 'num_leaves*n_samples*n_tree_repeats': 1.9123390691308356e-14, 'num_leaves*n_tree_repeats': 1.0650528506541837e-07}
lgbm_class_time = {'': 0.028063263911210914, 'ds_onehot_size_gb': 0.09163862856656434, 'ds_prep_size_gb': 2.970270224525262, 'ds_size_gb': 2.970270224525262, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads': 6.47442904885375e-08, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features': 0.0001926020481234091, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_samples': 1.3986995179321424e-08, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 6.208468162170729e-10, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_tree_repeats': 4.598542008079632e-13, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_samples': 9.964309915135878e-13, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_samples*n_tree_repeats': 2.608150056678177e-14, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_tree_repeats': 0.0011608214817588585, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 0.05612652782242183, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 0.0018753906815885733, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 8.471355616223231e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 3.3001370294885434e-15, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 2.1257067882553722e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 3.057993467818764e-07, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 6.264643485181751e-14, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 3.7651417047281056e-08, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads': 1.1569746986292633e-09, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features': 2.0127433109741758e-13, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_samples': 2.39530599680757e-16, 
'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.8233627245552183e-12, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_tree_repeats': 5.291223606102416e-15, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_samples': 4.6777144377244544e-14, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_samples*n_tree_repeats': 1.075739698121751e-15, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_tree_repeats': 7.442820019642213e-05}
xgb_class_ram = {'': 0.89800664010472, '2_power_maxdepth': 3.500391185762912e-11, '2_power_maxdepth*n_features': 8.730859656468559e-07, '2_power_maxdepth*n_features*n_samples': 5.586329461516387e-11, '2_power_maxdepth*n_features*n_samples*n_tree_repeats': 3.406456640909277e-19, '2_power_maxdepth*n_features*n_tree_repeats': 2.253274531849529e-15, '2_power_maxdepth*n_samples': 2.6046111134557463e-16, '2_power_maxdepth*n_samples*n_tree_repeats': 1.4647083952656776e-17, '2_power_maxdepth*n_tree_repeats': 1.446703161897511e-12, 'ds_onehot_size_gb': 1.2775211008166364e-05, 'ds_prep_size_gb': 0.8958165176491728, 'ds_size_gb': 0.8958165176491728, 'max_depth': 4.602455291339385e-08, 'max_depth*n_features': 8.276969896399465e-05, 'max_depth*n_features*n_samples': 1.1188204977077247e-08, 'max_depth*n_features*n_samples*n_tree_repeats': 1.2101329730965103e-16, 'max_depth*n_features*n_tree_repeats': 1.73562626225241e-13, 'max_depth*n_samples': 6.003527146823594e-14, 'max_depth*n_samples*n_tree_repeats': 5.458849368989926e-15, 'max_depth*n_tree_repeats': 5.846802665464209e-10, 'n_features': 1.419262523195433e-10, 'n_features*n_samples': 2.1948939540241107e-11, 'n_features*n_samples*n_tree_repeats': 6.761378006837745e-13, 'n_features*n_tree_repeats': 1.189619404783309e-12, 'n_samples': 7.445989056176149e-08, 'n_samples*n_tree_repeats': 1.1095360093190593e-08, 'n_tree_repeats': 0.0005355693710144896}
xgb_class_time = {'': 0.04616911535729873, 'ds_onehot_size_gb': 0.0698867127341342, 'ds_prep_size_gb': 3.47457744189382, 'ds_size_gb': 3.47457744189382, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads': 9.064818572421352e-11, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features': 2.802431219594177e-06, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples': 5.094046852454207e-14, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 4.515896055082407e-12, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_tree_repeats': 9.943166031719296e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples': 2.9578963011700153e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 1.991428507510768e-16, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_tree_repeats': 6.993000397349683e-07, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads': 1.68587043083397e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features': 0.0007712724349247164, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples': 1.7162683220472862e-09, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.226904474214378e-10, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_tree_repeats': 6.967156404769764e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples': 3.601942853784541e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 1.5052320282512473e-14, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_tree_repeats': 0.0026046534716614215, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 0.09233823071459746, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 3.291166164590293e-10, 
'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 1.914319987041818e-13, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 2.926688203905133e-15, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 3.670077849317217e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 6.154537890478014e-07, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 8.63288843709104e-14, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 3.035228262559771e-08}

pytabkit.models.alg_interfaces.rtdl_interfaces module

class pytabkit.models.alg_interfaces.rtdl_interfaces.FTTransformerSubSplitInterface

Bases: SkorchSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.rtdl_interfaces.RTDL_MLPSubSplitInterface

Bases: SkorchSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.rtdl_interfaces.RTDL_MLP_ParamSamplerNew

Bases: object

__init__(is_classification, train_size, num_emb_type='none')
Parameters:
  • is_classification (bool)

  • train_size (int)

  • num_emb_type (str)

sample_params(seed)
Parameters:

seed (int)

Return type:

Dict[str, Any]

class pytabkit.models.alg_interfaces.rtdl_interfaces.RTDL_ResNet_ParamSampler

Bases: object

__init__(is_classification, train_size)
Parameters:
  • is_classification (bool)

  • train_size (int)

sample_params(seed)
Parameters:

seed (int)

Return type:

Dict[str, Any]

class pytabkit.models.alg_interfaces.rtdl_interfaces.RTDL_ResNet_ParamSamplerNew

Bases: object

__init__(is_classification, train_size)
Parameters:
  • is_classification (bool)

  • train_size (int)

sample_params(seed)
Parameters:

seed (int)

Return type:

Dict[str, Any]

class pytabkit.models.alg_interfaces.rtdl_interfaces.RandomParamsFTTransformerAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.rtdl_interfaces.RandomParamsRTDLMLPAlgInterface

Bases: SingleSplitAlgInterface

__init__(model_idx, fit_params=None, **config)
Parameters:
  • fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • model_idx (int)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overloaded. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None
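
The selection rule described for the return value of fit() can be sketched as follows. The sketch assumes that every train-val split reports the same candidates in the same order and that lower loss is better. The function name select_fit_params and the n_epochs parameter are hypothetical and not part of pytabkit.

```python
from typing import Any, Dict, List, Tuple

# results[tt_split_idx][tv_split_idx] is a list of (params, loss) tuples.
Results = List[List[List[Tuple[Dict[str, Any], float]]]]


def select_fit_params(results: Results) -> List[Dict[str, Any]]:
    """Return one params dict per trainval-test split (lowest mean loss)."""
    fit_params = []
    for tv_results in results:  # one entry per trainval-test split
        n_candidates = len(tv_results[0])
        # Average each candidate's loss over the train-val splits.
        mean_losses = [
            sum(split[i][1] for split in tv_results) / len(tv_results)
            for i in range(n_candidates)
        ]
        best = min(range(n_candidates), key=mean_losses.__getitem__)
        fit_params.append(tv_results[0][best][0])
    return fit_params


results = [[
    [({"n_epochs": 10}, 0.50), ({"n_epochs": 20}, 0.40)],  # train-val split 0
    [({"n_epochs": 10}, 0.30), ({"n_epochs": 20}, 0.45)],  # train-val split 1
]]
print(select_fit_params(results))
# [{'n_epochs': 10}]
```

The returned list has one dictionary per trainval-test split, which matches the shape documented for the fit_params constructor argument.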

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor
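Because classification outputs are logits, they need a softmax over the last axis before they can be read as probabilities. The following sketch uses nested Python lists instead of a Tensor so that it runs without dependencies; with a real tensor, torch.softmax(logits, dim=-1) does the same job. Averaging the probabilities over the first (model) axis is one common ensembling choice and is an assumption here, not something the library prescribes.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_class_probabilities(preds: List[List[List[float]]]) -> List[List[float]]:
    """Average softmax probabilities over the first (model) axis.

    preds is nested as [n_models][n_samples][n_classes], mirroring the
    documented tensor shape.
    """
    n_models = len(preds)
    n_samples = len(preds[0])
    n_classes = len(preds[0][0])
    probs = [[softmax(row) for row in model] for model in preds]
    return [
        [
            sum(probs[m][s][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        for s in range(n_samples)
    ]


# Toy logits: 2 models, 1 sample, 2 classes (the binary case still has 2 outputs).
toy_logits = [[[0.0, 0.0]], [[math.log(3.0), 0.0]]]
avg_probs = mean_class_probabilities(toy_logits)
```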

class pytabkit.models.alg_interfaces.rtdl_interfaces.RandomParamsResnetAlgInterface

Bases: SingleSplitAlgInterface

__init__(model_idx, fit_params=None, **config)
Parameters:
  • fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • model_idx (int)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation results.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. A path can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about the candidate fit_params settings. If the returned value results is not None, then results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best loss averaged over tv_split_idx can be selected as fit_params.

Return type:

None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

class pytabkit.models.alg_interfaces.rtdl_interfaces.ResnetSubSplitInterface

Bases: SkorchSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.rtdl_interfaces.SkorchSubSplitInterface

Bases: SklearnSubSplitInterface

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

pytabkit.models.alg_interfaces.rtdl_interfaces.allow_single_underscore(params_config)
Parameters:

params_config (List[Tuple])

Return type:

List[Tuple]

pytabkit.models.alg_interfaces.rtdl_interfaces.choose_batch_size_rtdl(train_size)
Return type:

int

pytabkit.models.alg_interfaces.rtdl_interfaces.choose_batch_size_rtdl_new(train_size)
Parameters:

train_size (int)

Return type:

int

pytabkit.models.alg_interfaces.sub_split_interfaces module

class pytabkit.models.alg_interfaces.sub_split_interfaces.SingleSplitWrapperAlgInterface

Bases: SingleSplitAlgInterface

AlgInterface that takes multiple AlgInterfaces, each of which can only handle a single train-val-test split, and wraps them to handle a trainval-test split (possibly with multiple train-val splits).

__init__(sub_split_interfaces, fit_params=None, **config)
Parameters:
  • sub_split_interfaces (List[AlgInterface]) – Interfaces for each sub-split (train-val split).

  • fit_params (List[Dict[str, Any]] | None)
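The wrapping idea can be shown with a toy, dependency-free sketch: one sub-model per train-val split, with the wrapper delegating each split to its own sub-model. All class names here (SubSplitModel, ConstantModel, ToySplitWrapper) and their simplified signatures are hypothetical and do not reflect the actual pytabkit API.

```python
from typing import List, Protocol, Sequence, Tuple


class SubSplitModel(Protocol):
    """Hypothetical stand-in for an interface that handles one train-val split."""

    def fit(self, train_idxs: Sequence[int], val_idxs: Sequence[int]) -> None: ...

    def predict(self, n_samples: int) -> List[float]: ...


class ConstantModel:
    """Toy model that predicts the size of its training split for every sample."""

    def __init__(self) -> None:
        self.value = 0.0

    def fit(self, train_idxs: Sequence[int], val_idxs: Sequence[int]) -> None:
        self.value = float(len(train_idxs))

    def predict(self, n_samples: int) -> List[float]:
        return [self.value] * n_samples


class ToySplitWrapper:
    """Delegates each train-val split to its own sub-model."""

    def __init__(self, sub_models: List[SubSplitModel]) -> None:
        self.sub_models = sub_models

    def fit(self, splits: List[Tuple[Sequence[int], Sequence[int]]]) -> None:
        if len(splits) != len(self.sub_models):
            raise ValueError("need exactly one sub-model per train-val split")
        for model, (train_idxs, val_idxs) in zip(self.sub_models, splits):
            model.fit(train_idxs, val_idxs)

    def predict(self, n_samples: int) -> List[List[float]]:
        # One row of predictions per sub-model, i.e. per train-val split.
        return [model.predict(n_samples) for model in self.sub_models]


wrapper = ToySplitWrapper([ConstantModel(), ConstantModel()])
wrapper.fit([([0, 1, 2], [3]), ([0, 3], [1, 2])])
stacked = wrapper.predict(n_samples=2)
```

The stacked result has one leading entry per train-val split, analogous to the first axis of the tensor that predict() is documented to return.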

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation results.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. A path can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about the candidate fit_params settings. If the returned value results is not None, then results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best loss averaged over tv_split_idx can be selected as fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

get_available_predict_params()
Return type:

Dict[str, Dict[str, Any]]

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

set_current_predict_params(name)
Parameters:

name (str)

Return type:

None

class pytabkit.models.alg_interfaces.sub_split_interfaces.SklearnSubSplitInterface

Bases: SingleSplitAlgInterface

Base class for AlgInterfaces based on scikit-learn methods.

__init__(fit_params=None, **config)
Parameters:
  • fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation results.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. A path can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about the candidate fit_params settings. If the returned value results is not None, then results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best loss averaged over tv_split_idx can be selected as fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

class pytabkit.models.alg_interfaces.sub_split_interfaces.TreeBasedSubSplitInterface

Bases: SingleSplitAlgInterface

Base class for tree-based ML models (XGB, LGBM, CatBoost).

__init__(fit_params=None, **config)
Parameters:
  • fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation results.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. A path can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about the candidate fit_params settings. If the returned value results is not None, then results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best loss averaged over tv_split_idx can be selected as fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

get_available_predict_params()
Return type:

Dict[str, Dict[str, Any]]

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

pytabkit.models.alg_interfaces.tabm_interface module

class pytabkit.models.alg_interfaces.tabm_interface.RandomParamsTabMAlgInterface

Bases: RandomParamsAlgInterface

get_available_predict_params()
Return type:

Dict[str, Dict[str, Any]]

set_current_predict_params(name)
Parameters:

name (str)

Return type:

None

class pytabkit.models.alg_interfaces.tabm_interface.TabMSubSplitInterface

Bases: SingleSplitAlgInterface

__init__(fit_params=None, **config)
Parameters:
  • fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation results.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. A path can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about the candidate fit_params settings. If the returned value results is not None, then results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best loss averaged over tv_split_idx can be selected as fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None
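The tmp_folders argument expects one entry per trainval-test split, where each entry is either a path or None. A minimal sketch of preparing such a list is shown below; the helper make_tmp_folders and the split_{i} naming scheme are hypothetical conventions chosen for illustration.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def make_tmp_folders(base: Optional[Path], n_tt_splits: int) -> List[Optional[Path]]:
    """Create one folder per trainval-test split, or return a list of None.

    Passing base=None signals that no intermediate results should be stored.
    """
    if base is None:
        return [None] * n_tt_splits
    folders: List[Optional[Path]] = []
    for i in range(n_tt_splits):
        folder = base / f"split_{i}"  # hypothetical naming scheme
        folder.mkdir(parents=True, exist_ok=True)
        folders.append(folder)
    return folders


with tempfile.TemporaryDirectory() as tmp:
    folders = make_tmp_folders(Path(tmp), n_tt_splits=3)
    folder_names = [f.name for f in folders if f is not None]
no_folders = make_tmp_folders(None, n_tt_splits=2)
```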

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

pytabkit.models.alg_interfaces.tabm_interface.get_tabm_auto_batch_size(n_train)
Parameters:

n_train (int)

Return type:

int

pytabkit.models.alg_interfaces.tabr_interface module

class pytabkit.models.alg_interfaces.tabr_interface.ExceptionPrintingCallback

Bases: Callback

on_exception(trainer, pl_module, exception)

Called when any trainer execution is interrupted by an exception.

class pytabkit.models.alg_interfaces.tabr_interface.RandomParamsTabRAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.tabr_interface.TabRSubSplitInterface

Bases: AlgInterface

__init__(**config)
Parameters:
  • fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

create_model(n_num_features, n_bin_features, cat_cardinalities, n_classes, freeze_contexts_after_n_epochs)
Parameters:

freeze_contexts_after_n_epochs (int | None)

Return type:

Any

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation results.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. A path can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about the candidate fit_params settings. If the returned value results is not None, then results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best loss averaged over tv_split_idx can be selected as fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

infer_batch_size(n_samples_train)
Parameters:

n_samples_train (int)

Return type:

int

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

pytabkit.models.alg_interfaces.xgboost_interfaces module

class pytabkit.models.alg_interfaces.xgboost_interfaces.RandomParamsXGBAlgInterface

Bases: RandomParamsAlgInterface

get_available_predict_params()
Return type:

Dict[str, Dict[str, Any]]

set_current_predict_params(name)
Parameters:

name (str)

Return type:

None

class pytabkit.models.alg_interfaces.xgboost_interfaces.XGBCustomMetric

Bases: object

__init__(metric_names, is_classification, is_higher_better=False)
Parameters:
  • metric_names (str | List[str])

  • is_classification (bool)

  • is_higher_better (bool)

class pytabkit.models.alg_interfaces.xgboost_interfaces.XGBHyperoptAlgInterface

Bases: OptAlgInterface

__init__(space=None, n_hyperopt_steps=50, **config)
Parameters:
  • fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

  • n_hyperopt_steps (int)

create_alg_interface(n_sub_splits, **config)
Parameters:

n_sub_splits (int)

Return type:

AlgInterface

class pytabkit.models.alg_interfaces.xgboost_interfaces.XGBSklearnSubSplitInterface

Bases: SklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.xgboost_interfaces.XGBSubSplitInterface

Bases: TreeBasedSubSplitInterface

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

pytabkit.models.alg_interfaces.xrfm_interfaces module

class pytabkit.models.alg_interfaces.xrfm_interfaces.RandomParamsxRFMAlgInterface

Bases: RandomParamsAlgInterface

pytabkit.models.alg_interfaces.xrfm_interfaces.sample_xrfm_params(seed, hpo_space_name='default')
Parameters:
  • seed (int)

  • hpo_space_name (str)

class pytabkit.models.alg_interfaces.xrfm_interfaces.xRFMSubSplitInterface

Bases: SingleSplitAlgInterface

__init__(fit_params=None, **config)
Parameters:
  • fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.

  • config – Other parameters.

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Subclasses should override this method unless they override fit_and_eval() instead. In that case, this method by default calls fit_and_eval() and discards the evaluation results.

Parameters:
  • ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.

  • idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.

  • interface_resources (InterfaceResources) – Resources assigned to fit().

  • logger (Logger) – Logger that can be used for logging.

  • tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. A path can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).

  • name (str) – Name of the algorithm (for logging).

Returns:

May return information about the candidate fit_params settings. If the returned value results is not None, then results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best loss averaged over tv_split_idx can be selected as fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:
  • n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.

  • fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:
  • ds (DictDataset) – Dataset. Does not have to contain tensors.

  • n_cv (int) – Number of train-val splits per trainval-test split.

  • n_refit (int) – Number of refitted models per trainval-test split.

  • n_splits (int) – Number of trainval-test splits.

  • split_seeds (List[int]) – Seeds for every trainval-test split.

  • n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

Module contents