pytabkit.models.alg_interfaces package

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:: ds (DictDataset) – Dataset on which to predict labels.
Returns:: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type:: Tensor
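The following is a minimal sketch of turning classification logits of shape [n_models, n_samples, n_classes] into probabilities by applying softmax over the last axis. To stay dependency-free it uses nested Python lists instead of tensors; the helper names softmax and logits_to_probs are hypothetical and not part of pytabkit. With an actual PyTorch tensor, torch.softmax(y_pred, dim=-1) would be the usual equivalent.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)  # subtract the maximum to avoid overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_probs(y_pred: List[List[List[float]]]) -> List[List[List[float]]]:
    """Apply softmax over the last axis of a [n_models, n_samples, n_classes] nested list."""
    return [[softmax(sample) for sample in model] for model in y_pred]


# Toy logits: 1 model, 2 samples, 2 classes (the binary case still has two outputs).
y_pred = [[[0.0, 0.0], [2.0, 0.0]]]
probs = logits_to_probs(y_pred)
```

Each innermost list in probs sums to one, and the class with the larger logit receives the larger probability.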

set_current_predict_params(name)

Parameters:: name (str)
Return type:: None

to(device)

Parameters:: device (str)
Return type:: None

class pytabkit.models.alg_interfaces.alg_interfaces.MultiSplitWrapperAlgInterface

Bases: AlgInterface

__init__(single_split_interfaces, **config)

Parameters:

fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
single_split_interfaces (List[AlgInterface])

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden instead. In the latter case, this method will by default call fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None
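As an illustration of how the returned structure could be consumed, the sketch below picks, for each trainval-test split, the params with the lowest loss averaged over the train-val splits. The function select_best_params is hypothetical, and the sketch assumes that candidate k refers to the same params in every train-val split of a given trainval-test split; the source does not state this.

```python
from typing import Dict, List, Tuple

Results = List[List[List[Tuple[Dict, float]]]]


def select_best_params(results: Results) -> List[Dict]:
    """For each trainval-test split, pick the candidate with the lowest mean loss.

    Assumption (not confirmed by the source): candidate k is the same
    params dict in every train-val split of one trainval-test split.
    """
    best_per_split = []
    for tv_results in results:  # one entry per trainval-test split
        n_candidates = len(tv_results[0])
        mean_losses = [
            sum(tv[k][1] for tv in tv_results) / len(tv_results)
            for k in range(n_candidates)
        ]
        best_k = min(range(n_candidates), key=mean_losses.__getitem__)
        best_per_split.append(tv_results[0][best_k][0])
    return best_per_split


# One trainval-test split, two train-val splits, two candidates each.
results = [[
    [({"n_epochs": 10}, 0.50), ({"n_epochs": 20}, 0.40)],
    [({"n_epochs": 10}, 0.30), ({"n_epochs": 20}, 0.45)],
]]
fit_params = select_best_params(results)
```

The resulting list has one dictionary per trainval-test split, which matches the layout described for fit_params in the constructor.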

fit_and_eval(ds, idxs_list, interface_resources, logger, tmp_folders, name, metrics, return_preds)

Run fit() with the given parameters and then return the result of eval() with the given metrics. This method can be overridden instead of fit() if it is more convenient. The idea is that for hyperparameter optimization, one has to evaluate each hyperparameter combination anyway after training it, so it is more efficient to implement fit_and_eval() and return the evaluation of the best method at the end. See the documentation of fit() and eval() for the meaning of the parameters and returned values.

Parameters:

ds (DictDataset)
idxs_list (List[SplitIdxs])
interface_resources (InterfaceResources)
logger (Logger)
tmp_folders (List[Path | None])
name (str)
metrics (Metrics | None)
return_preds (bool)

Return type:

List[NestedDict]
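The relationship between fit() and fit_and_eval() described above can be sketched with two toy classes. SketchInterface and GridSearchSketch are hypothetical stand-ins with simplified signatures, not pytabkit's AlgInterface; the point is only that a subclass doing hyperparameter search can override fit_and_eval() and inherit a fit() that discards the evaluation.

```python
from typing import Any, Dict, List, Optional


class SketchInterface:
    """Hypothetical stand-in for a base class; not pytabkit's AlgInterface."""

    def fit(self, data: List[float]) -> None:
        # Default behavior: delegate to fit_and_eval() and discard the evaluation.
        self.fit_and_eval(data)
        return None

    def fit_and_eval(self, data: List[float]) -> Dict[str, Any]:
        raise NotImplementedError


class GridSearchSketch(SketchInterface):
    """Overrides only fit_and_eval(): every candidate is evaluated anyway."""

    def __init__(self, candidates: List[float]):
        self.candidates = candidates
        self.best: Optional[float] = None

    def fit_and_eval(self, data: List[float]) -> Dict[str, Any]:
        # "Training" is just scoring a constant predictor by mean squared error.
        losses = {
            c: sum((x - c) ** 2 for x in data) / len(data)
            for c in self.candidates
        }
        self.best = min(losses, key=losses.get)
        return {"best_param": self.best, "loss": losses[self.best]}


model = GridSearchSketch(candidates=[0.0, 1.0, 2.0])
fit_result = model.fit([0.9, 1.1, 1.0])  # evaluation is computed, then dropped
```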

get_available_predict_params()

Return type:: Dict[str, Dict[str, Any]]

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:: ds (DictDataset) – Dataset on which to predict labels
Returns:: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type:: Tensor

set_current_predict_params(name)

Parameters:: name (str)
Return type:: None

class pytabkit.models.alg_interfaces.alg_interfaces.OptAlgInterface

__init__(hyper_optimizer, max_resource_config, **config)

Parameters:

fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
hyper_optimizer (HyperOptimizer)
max_resource_config (Dict)

create_alg_interface(n_sub_splits, **config)

Parameters:: n_sub_splits (int)
Return type:: AlgInterface

fit_and_eval(ds, idxs_list, interface_resources, logger, tmp_folders, name, metrics, return_preds)

Run fit() with the given parameters and then return the result of eval() with the given metrics. This method can be overridden instead of fit() if it is more convenient. The idea is that for hyperparameter optimization, one has to evaluate each hyperparameter combination anyway after training it, so it is more efficient to implement fit_and_eval() and return the evaluation of the best method at the end. See the documentation of fit() and eval() for the meaning of the parameters and returned values.

Parameters:

ds (DictDataset)
idxs_list (List[SplitIdxs])
interface_resources (InterfaceResources)
logger (Logger)
tmp_folders (List[Path | None])
name (str)
metrics (Metrics | None)
return_preds (bool)

Return type:

List[NestedDict]

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

objective(params, ds, idxs_list, interface_resources, logger, tmp_folder, name, metrics, return_preds)

Parameters:

ds (DictDataset)
idxs_list (List[SplitIdxs])
interface_resources (InterfaceResources)
logger (Logger)
tmp_folder (Path | None)
name (str)
metrics (Metrics | None)
return_preds (bool)

Return type:

Tuple[float, Tuple[List[NestedDict], AlgInterface]]

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:: ds (DictDataset) – Dataset on which to predict labels
Returns:: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type:: Tensor

class pytabkit.models.alg_interfaces.alg_interfaces.RandomParamsAlgInterface

__init__(model_idx, fit_params=None, **config)

Parameters:

model_idx (int) – Used for seeding along with the seed given in fit(), so that random-search HPO can be done by combining multiple RandomParamsNNAlgInterface objects with different model_idx values.
fit_params (List[Dict[str, Any]] | None) – Fit parameters (stopping epoch for refitting).
config – Configuration parameters.
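The seeding idea can be sketched as follows: a random configuration is drawn deterministically from the pair (split seed, model_idx), so that different model_idx values yield different random-search candidates. The function sample_params, the way the two integers are combined, and the search space are all assumptions made for illustration and do not reflect how pytabkit derives its seeds.

```python
import random
from typing import Dict


def sample_params(split_seed: int, model_idx: int) -> Dict[str, float]:
    """Draw one random configuration, deterministic in (split_seed, model_idx)."""
    # A string seed mixes both integers; random.Random accepts str seeds.
    rng = random.Random(f"{split_seed}-{model_idx}")
    return {
        "lr": 10 ** rng.uniform(-4, -1),   # log-uniform learning rate (illustrative)
        "dropout": rng.uniform(0.0, 0.5),  # illustrative second hyperparameter
    }


# Three random-search candidates for the same split seed.
candidates = [sample_params(split_seed=0, model_idx=i) for i in range(3)]
```

Calling sample_params again with the same arguments reproduces the same configuration, which is what makes a later refit with stored parameters repeatable.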

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden instead. In the latter case, this method will by default call fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:: ds (DictDataset) – Dataset on which to predict labels
Returns:: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type:: Tensor

class pytabkit.models.alg_interfaces.alg_interfaces.SingleSplitAlgInterface

Bases: AlgInterface

pytabkit.models.alg_interfaces.autogluon_model_interfaces module

class pytabkit.models.alg_interfaces.autogluon_model_interfaces.AutoGluonModelAlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

pytabkit.models.alg_interfaces.base module

class pytabkit.models.alg_interfaces.base.InterfaceResources

Bases: object

Simple class representing resources that a method is allowed to use (number of threads and GPUs).

__init__(n_threads, gpu_devices, time_in_seconds=None)

Parameters:

n_threads (int)
gpu_devices (List[str])
time_in_seconds (int | None)

class pytabkit.models.alg_interfaces.base.RequiredResources

Bases: object

Represents the resources estimated/requested by a method.

__init__(time_s, n_threads, cpu_ram_gb, n_gpus=0, gpu_usage=1.0, gpu_ram_gb=0.0, n_explicit_physical_cores=0)

Parameters:

time_s (float)
n_threads (float)
cpu_ram_gb (float)
n_gpus (int)
gpu_usage (float)
gpu_ram_gb (float)
n_explicit_physical_cores (int)

static combine_sequential(resources_list)

Parameters:: resources_list (List[RequiredResources])

get_resource_vector(fixed_resource_vector)

Parameters:: fixed_resource_vector (ndarray)

should_add_fixed_resources()

Return type:: bool

class pytabkit.models.alg_interfaces.base.SplitIdxs

Bases: object

Represents multiple train-validation-test splits for AlgInterface.

__init__(train_idxs, val_idxs, test_idxs, split_seed, sub_split_seeds, split_id)

Parameters:

train_idxs (Tensor) – Tensor of shape (n_trainval_splits, n_train_idxs). Each of the train-val splits needs to have the same number of training samples. The elements of the tensor should index the training set elements in a larger dataset.
val_idxs (Tensor | None) – Tensor of shape (n_trainval_splits, n_val_idxs), or None if no validation set should be used.
test_idxs (Tensor | None) – Tensor of shape (n_test_idxs,). The same test set will be used for all train-val splits.
split_seed (int) – Random seed for algorithms on this split.
sub_split_seeds (List[int]) – Separate random seeds for algorithms on each train-val split (length should be n_trainval_splits).
split_id (int) – ID of this split (for logging/saving purposes).
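To make the shape requirements concrete, the sketch below builds k-fold index rows in which every train-val split has the same number of training samples, plus one shared test set. It uses plain Python lists; make_kfold_idxs is a hypothetical helper and not part of pytabkit. Wrapping the rows with torch.tensor(...) would give tensors of shape (n_trainval_splits, n_train_idxs) and (n_trainval_splits, n_val_idxs).

```python
from typing import List, Tuple


def make_kfold_idxs(trainval: List[int], k: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (train_idxs, val_idxs) as k equal-length rows of dataset indices."""
    fold_size = len(trainval) // k        # drop the remainder so all rows have equal length
    usable = trainval[: fold_size * k]
    train_rows, val_rows = [], []
    for i in range(k):
        val = usable[i * fold_size:(i + 1) * fold_size]
        train = usable[: i * fold_size] + usable[(i + 1) * fold_size:]
        val_rows.append(val)
        train_rows.append(train)
    return train_rows, val_rows


# Indices 0..7 form the trainval part; 8 and 9 are the test set shared by all splits.
train_idxs, val_idxs = make_kfold_idxs(list(range(8)), k=4)
test_idxs = [8, 9]
```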

get_sub_split_idxs(i)

Parameters:: i (int)
Return type:: SubSplitIdxs

get_sub_split_idxs_alt(i)

Parameters:: i (int)
Return type:: SplitIdxs

class pytabkit.models.alg_interfaces.base.SubSplitIdxs

Bases: object

Represents a single trainval-test split with multiple train-val splits.

__init__(train_idxs, val_idxs, test_idxs, alg_seed)

Parameters:

train_idxs (Tensor)
val_idxs (Tensor | None)
test_idxs (Tensor | None)
alg_seed (int)

pytabkit.models.alg_interfaces.calibration module

class pytabkit.models.alg_interfaces.calibration.PostHocCalibrationAlgInterface

Bases: AlgInterface

__init__(alg_interface, fit_params=None, **config)

Parameters:

fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
alg_interface (AlgInterface)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden instead. In the latter case, this method will by default call fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:: ds (DictDataset) – Dataset on which to predict labels
Returns:: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type:: Tensor

to(device)

Parameters:: device (str)
Return type:: None

pytabkit.models.alg_interfaces.catboost_interfaces module

class pytabkit.models.alg_interfaces.catboost_interfaces.CatBoostCustomMetric

Bases: object

__init__(metric_name, is_classification, is_higher_better=False, select_pred_col=None)

Parameters:

metric_name (str)
is_classification (bool)
is_higher_better (bool)
select_pred_col (int | None)

evaluate(approxes, target, weight)

get_final_error(error, weight)

is_max_optimal()
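The three method names above match the protocol that CatBoost documents for user-defined evaluation metrics: evaluate() returns an (error sum, weight sum) pair, get_final_error() turns that pair into the final value, and is_max_optimal() says whether larger is better. The sketch below is a weighted RMSE written against that protocol. RmseMetricSketch is a hypothetical class, not pytabkit's CatBoostCustomMetric, and it does not model metric_name, is_classification, is_higher_better, or select_pred_col.

```python
import math


class RmseMetricSketch:
    """Weighted RMSE following CatBoost's custom-metric method names."""

    def is_max_optimal(self) -> bool:
        return False  # lower RMSE is better

    def evaluate(self, approxes, target, weight):
        # approxes holds one indexable container per prediction dimension.
        approx = approxes[0]
        error_sum, weight_sum = 0.0, 0.0
        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            weight_sum += w
            error_sum += w * (approx[i] - target[i]) ** 2
        return error_sum, weight_sum

    def get_final_error(self, error, weight):
        # Small constant guards against division by zero for empty inputs.
        return math.sqrt(error / (weight + 1e-38))


metric = RmseMetricSketch()
err, w = metric.evaluate([[1.0, 3.0]], [0.0, 1.0], None)
rmse = metric.get_final_error(err, w)
```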

class pytabkit.models.alg_interfaces.catboost_interfaces.CatBoostHyperoptAlgInterface

__init__(space=None, n_hyperopt_steps=50, **config)

Parameters:

fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
n_hyperopt_steps (int)

create_alg_interface(n_sub_splits, **config)

Parameters:: n_sub_splits (int)
Return type:: AlgInterface

class pytabkit.models.alg_interfaces.catboost_interfaces.CatBoostSklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

class pytabkit.models.alg_interfaces.catboost_interfaces.CatBoostSubSplitInterface

Bases: TreeBasedSubSplitInterface

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

class pytabkit.models.alg_interfaces.catboost_interfaces.RandomParamsCatBoostAlgInterface

Bases: RandomParamsAlgInterface

pytabkit.models.alg_interfaces.ensemble_interfaces module

class pytabkit.models.alg_interfaces.ensemble_interfaces.AlgorithmSelectionAlgInterface

Picks the best model out of a list of candidates.

__init__(alg_interfaces, fit_params=None, **config)

Parameters:

fit_params (List[Dict] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
alg_interfaces (List[AlgInterface])

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden instead. In the latter case, this method will by default call fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:: ds (DictDataset) – Dataset on which to predict labels
Returns:: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type:: Tensor

to(device)

Parameters:: device (str)
Return type:: None

class pytabkit.models.alg_interfaces.ensemble_interfaces.CaruanaEnsembleAlgInterface

Builds an ensemble following a simple variant of Caruana et al. (2004), “Ensemble selection from libraries of models”, without pre-selection of candidates.
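For orientation, the greedy core of Caruana-style ensemble selection can be sketched as below: starting from an empty ensemble, the model whose addition gives the lowest validation loss is added repeatedly (with replacement), and the selection counts become the ensemble weights. This is a generic regression sketch using mean squared error on plain lists. The names caruana_weights and mse, the loss, and the number of rounds are assumptions; this is not pytabkit's implementation.

```python
from typing import List, Sequence


def mse(pred: Sequence[float], y: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)


def caruana_weights(val_preds: List[List[float]], y_val: List[float],
                    n_rounds: int = 10) -> List[float]:
    """Greedy forward selection with replacement; returns normalized model weights."""
    n_models = len(val_preds)
    counts = [0] * n_models
    ensemble_sum = [0.0] * len(y_val)  # running sum of the selected models' predictions
    for round_idx in range(1, n_rounds + 1):
        # Try adding each model once more; keep the one that minimizes validation loss.
        def loss_if_added(m: int) -> float:
            avg = [(s + p) / round_idx for s, p in zip(ensemble_sum, val_preds[m])]
            return mse(avg, y_val)

        best = min(range(n_models), key=loss_if_added)
        counts[best] += 1
        ensemble_sum = [s + p for s, p in zip(ensemble_sum, val_preds[best])]
    return [c / n_rounds for c in counts]


y_val = [0.0, 1.0, 2.0]
val_preds = [
    [1.0, 2.0, 3.0],   # biased upward by 1
    [-1.0, 0.0, 1.0],  # biased downward by 1
    [5.0, 5.0, 5.0],   # poor model
]
weights = caruana_weights(val_preds, y_val, n_rounds=4)
```

In this toy case the two oppositely biased models cancel each other, so they share the weight equally and the poor model is never selected.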

__init__(alg_interfaces, fit_params=None, **config)

Parameters:

fit_params (List[Dict] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
alg_interfaces (List[AlgInterface])

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden instead. In the latter case, this method will by default call fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:: ds (DictDataset) – Dataset on which to predict labels
Returns:: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type:: Tensor

to(device)

Parameters:: device (str)
Return type:: None

class pytabkit.models.alg_interfaces.ensemble_interfaces.PrecomputedPredictionsAlgInterface

__init__(y_preds_cv, y_preds_refit, fit_params_cv, fit_params_refit)

Parameters:

fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
y_preds_cv (Tensor)
y_preds_refit (Tensor | None)
fit_params_cv (Dict)
fit_params_refit (Dict | None)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden instead. In the latter case, this method will by default call fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test split (i.e., only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None
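
The selection rule described under Returns can be sketched as follows. This is a minimal illustration rather than pytabkit's own code: the helper name select_best_params is hypothetical, and it assumes that every train-val split reports the same candidate params in the same order.

```python
from typing import Any, Dict, List, Tuple

# results[tt_split_idx][tv_split_idx] is a list of (params, loss) tuples
Results = List[List[List[Tuple[Dict[str, Any], float]]]]


def select_best_params(results: Results) -> List[Dict[str, Any]]:
    """Hypothetical helper: for each trainval-test split, pick the candidate
    with the lowest loss averaged over the train-val splits.

    Assumes each train-val split lists the same candidates in the same order.
    """
    fit_params = []
    for tv_results in results:  # one entry per trainval-test split
        n_candidates = len(tv_results[0])
        mean_losses = [
            sum(tv[i][1] for tv in tv_results) / len(tv_results)
            for i in range(n_candidates)
        ]
        best_idx = min(range(n_candidates), key=mean_losses.__getitem__)
        fit_params.append(tv_results[0][best_idx][0])
    return fit_params


example = [  # one trainval-test split, two train-val splits, two candidates
    [
        [({"lr": 0.1}, 0.50), ({"lr": 0.01}, 0.40)],
        [({"lr": 0.1}, 0.30), ({"lr": 0.01}, 0.45)],
    ]
]
best = select_best_params(example)
```

The result has one dictionary per trainval-test split, which matches the layout that fit_params is documented to have.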

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters: ds (DictDataset) – Dataset on which to predict labels.
Returns: Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).
Return type: Tensor
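
Since classification outputs are logits, converting them to probabilities means applying a softmax over the last axis. The sketch below uses nested lists as a stand-in for the returned tensor so that it runs without any dependencies; the values are made up. With a real Tensor, the equivalent would typically be torch.softmax(y_pred, dim=-1).

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Stand-in for predict() output with shape [n_models, n_samples, n_classes];
# a real call would return a Tensor rather than nested lists.
y_pred = [
    [[2.0, 0.0], [0.0, 0.0]],   # model 0
    [[1.0, 1.0], [-1.0, 3.0]],  # model 1
]
probs = [[softmax(sample) for sample in model] for model in y_pred]
```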

class pytabkit.models.alg_interfaces.ensemble_interfaces.WeightedPrediction

Bases: object

__init__(y_pred_list, task_type)

Parameters:

y_pred_list (List[Tensor])
task_type (TaskType)

predict_for_weights(weights)

Parameters: weights (ndarray)

pytabkit.models.alg_interfaces.lightgbm_interfaces module

class pytabkit.models.alg_interfaces.lightgbm_interfaces.LGBMCustomMetric

Bases: object

__init__(metric_name, is_classification, is_higher_better=False)

Parameters:

metric_name (str)
is_classification (bool)
is_higher_better (bool)

class pytabkit.models.alg_interfaces.lightgbm_interfaces.LGBMHyperoptAlgInterface

__init__(space=None, n_hyperopt_steps=50, opt_method='hyperopt', **config)

Parameters:

fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
n_hyperopt_steps (int)
opt_method (str)

create_alg_interface(n_sub_splits, **config)

Parameters: n_sub_splits (int)
Return type: AlgInterface

class pytabkit.models.alg_interfaces.lightgbm_interfaces.LGBMSklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.lightgbm_interfaces.LGBMSubSplitInterface

Bases: TreeBasedSubSplitInterface

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.lightgbm_interfaces.RandomParamsLGBMAlgInterface

Bases: RandomParamsAlgInterface

pytabkit.models.alg_interfaces.nn_interfaces module

class pytabkit.models.alg_interfaces.nn_interfaces.NNAlgInterface

Bases: AlgInterface

__init__(fit_params=None, **config)

Parameters:

fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overloaded. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

get_available_predict_params()

Return type: Dict[str, Dict[str, Any]]

get_first_layer_weights(with_scale)

Parameters: with_scale (bool)
Return type: Tensor

get_importances()

Return type: Tensor

get_model_ram_gb(ds, n_cv, n_refit, n_splits, split_seeds)

Parameters:

ds (DictDataset)
n_cv (int)
n_refit (int)
n_splits (int)
split_seeds (List[int])

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters: ds (DictDataset) – Dataset on which to predict labels.
Returns: Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).
Return type: Tensor

to(device)

Parameters: device (str)
Return type: None

class pytabkit.models.alg_interfaces.nn_interfaces.NNHyperoptAlgInterface

__init__(space=None, n_hyperopt_steps=50, opt_method='hyperopt', **config)

Parameters:

fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
space (str | Dict[str, Any] | None)
n_hyperopt_steps (int)
opt_method (str)

create_alg_interface(n_sub_splits, **config)

Parameters: n_sub_splits (int)
Return type: AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.nn_interfaces.RandomParamsNNAlgInterface

__init__(model_idx, fit_params=None, **config)

Parameters:

fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
model_idx (int)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overloaded. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_available_predict_params()

Return type: Dict[str, Dict[str, Any]]

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters: ds (DictDataset) – Dataset on which to predict labels.
Returns: Returns a tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape will be the number of classes (even in the binary case) and the outputs will be logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape will be the target dimension (often 1).
Return type: Tensor

to(device)

Parameters: device (str)
Return type: None

class pytabkit.models.alg_interfaces.nn_interfaces.RealMLPParamSampler

Bases: object

__init__(is_classification, hpo_space_name='default', **config)

Parameters:

is_classification (bool)
hpo_space_name (str)

sample_params(seed)

Parameters: seed (int)
Return type: Dict[str, Any]

pytabkit.models.alg_interfaces.nn_interfaces.get_lignting_accel_and_devices(device)

Parameters: device (str)

pytabkit.models.alg_interfaces.other_interfaces module

class pytabkit.models.alg_interfaces.other_interfaces.ExtraTreesSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.GBTSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.GrandeSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.GrandeWrapper

Bases: object

Wrapper class for GRANDE that allows cat_features to be passed to fit() instead of the constructor.

__init__(**config)

fit(X, y, X_val, y_val, cat_features=None)

Parameters: cat_features (List[str] | None)

predict(X)

predict_proba(X)
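
The general pattern behind such a wrapper can be sketched as below: the wrapper stores only the configuration and builds the underlying model once fit() supplies cat_features. Both classes here are hypothetical stand-ins; DummyModel is not GRANDE, and nothing in this sketch reflects GrandeWrapper's actual internals.

```python
from typing import Any, List, Optional


class DummyModel:
    """Hypothetical stand-in for an estimator that wants cat_features at construction."""

    def __init__(self, cat_features: List[str], **config: Any):
        self.cat_features = cat_features
        self.config = config

    def fit(self, X, y, X_val, y_val):
        self.n_train_ = len(X)
        return self

    def predict(self, X):
        return [0 for _ in X]  # constant prediction, for illustration only


class DeferredConstructionWrapper:
    """Stores the config and builds the wrapped model only once fit() is called."""

    def __init__(self, **config: Any):
        self.config = config
        self.model: Optional[DummyModel] = None

    def fit(self, X, y, X_val, y_val, cat_features: Optional[List[str]] = None):
        # cat_features becomes known here, so the model is constructed here
        self.model = DummyModel(cat_features=cat_features or [], **self.config)
        self.model.fit(X, y, X_val, y_val)
        return self

    def predict(self, X):
        if self.model is None:
            raise RuntimeError("fit() must be called before predict()")
        return self.model.predict(X)


wrapper = DeferredConstructionWrapper(depth=3)
wrapper.fit([[1, "a"], [2, "b"]], [0, 1], [[3, "a"]], [1], cat_features=["col_1"])
preds = wrapper.predict([[4, "b"]])
```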

class pytabkit.models.alg_interfaces.other_interfaces.KANSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.KNNSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.LinearModelSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.RFSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.RandomParamsExtraTreesAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.other_interfaces.RandomParamsKNNAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.other_interfaces.RandomParamsLinearModelAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.other_interfaces.RandomParamsRFAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.other_interfaces.SklearnMLPSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.TabICLSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

class pytabkit.models.alg_interfaces.other_interfaces.TabPFN2SubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

RequiredResources

pytabkit.models.alg_interfaces.resource_computation module

class pytabkit.models.alg_interfaces.resource_computation.FeatureSpec

Bases: object

Allows creating a list of product feature names from product and powerset operations, among others.

static concat(*feature_specs)

static powerset_products(*feature_specs)

static product(*feature_specs)
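
One way such name lists could be built is sketched below. This is an assumption about the semantics, not FeatureSpec's actual implementation: the function names are hypothetical, factors are joined with '*' as in the ResourceParams keys, and the empty string stands for the constant term.

```python
from itertools import combinations, product
from typing import List


def concat(*specs: List[str]) -> List[str]:
    """Concatenate several lists of feature names."""
    return [name for spec in specs for name in spec]


def product_names(*specs: List[str]) -> List[str]:
    """All names formed by picking one factor from each spec, joined with '*'.

    Empty-string factors are dropped so that '' acts as the constant term.
    """
    return ["*".join(f for f in combo if f) for combo in product(*specs)]


def powerset_products(*specs: List[str]) -> List[str]:
    """Products over every subset of the given specs (the empty subset gives '')."""
    names: List[str] = []
    for r in range(len(specs) + 1):
        for subset in combinations(specs, r):
            names.extend(product_names(*subset))
    return names


inner = powerset_products(["n_features"], ["n_samples"])
names = product_names(["", "max_depth"], inner)
```

Under these assumptions, names contains entries such as 'max_depth*n_features*n_samples', which has the same shape as the dictionary keys listed under ResourceParams.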

class pytabkit.models.alg_interfaces.resource_computation.LogLinearModule

Bases: Module

__init__(n_features)

Initialize internal Module state, shared by both nn.Module and ScriptModule.

Parameters: n_features (int)

forward(x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Parameters: x (Tensor)
Return type: Tensor

class pytabkit.models.alg_interfaces.resource_computation.LogLinearRegressor

Bases: object

__init__(pessimistic)

Parameters: pessimistic (bool)

fit(X, y)

Parameters:

X (ndarray)
y (ndarray)

get_coefs()

Return type: ndarray

class pytabkit.models.alg_interfaces.resource_computation.NormalizedDataRegressor

Bases: object

__init__(sub_regressor)

fit(X, y)

Parameters:

X (ndarray)
y (ndarray)

get_coefs()

Return type: ndarray

predict(X)

Parameters: X (ndarray)
Return type: ndarray

class pytabkit.models.alg_interfaces.resource_computation.ResourcePredictor

Bases: object

Predicts resource usage with a linear model on raw and product features.

__init__(config, time_params, cpu_ram_params, gpu_ram_params=None, n_gpus=0, gpu_usage=1.0)

Parameters:

config (Dict[str, Any]) – Configuration parameters.
time_params (Dict[str, float]) – Coefficients for the linear model for time prediction.
cpu_ram_params (Dict[str, float]) – Coefficients for the linear model for CPU RAM prediction.
gpu_ram_params (Dict[str, float] | None) – Coefficients for the linear model for GPU RAM prediction.
n_gpus (int) – Number of GPUs that should be used.
gpu_usage (float) – Usage level of each GPU (between 0 and 1).

get_required_resources(ds, **extra_params)

Function that provides an estimate of the required resources.

Parameters: ds (DictDataset) – Dataset (does not need to contain the tensors, just the n_samples and tensor_infos).
Returns: RequiredResources estimate.
Return type: RequiredResources

class pytabkit.models.alg_interfaces.resource_computation.Sampler

Bases: object

sample()

Return type: int | float

class pytabkit.models.alg_interfaces.resource_computation.TimeWrapper

Bases: object

__init__(f)

Parameters: f (Callable)

class pytabkit.models.alg_interfaces.resource_computation.UniformSampler

Bases: Sampler

__init__(low, high, log=False, is_int=False)

Parameters:

low (int | float)
high (int | float)

sample()

Return type: int | float

pytabkit.models.alg_interfaces.resource_computation.create_ds(n_samples, n_cont, n_cat, cat_size, n_classes)

Parameters:

n_samples (int)
n_cont (int)
n_cat (int)
cat_size (int)
n_classes (int)

Return type:

DictDataset

pytabkit.models.alg_interfaces.resource_computation.ds_to_xy(ds)

Parameters: ds (DictDataset)
Return type: Tuple[DataFrame, ndarray]

pytabkit.models.alg_interfaces.resource_computation.eval_linear_product_model(raw_features, params)

Computes the “inner product” between the feature dictionaries (obtained from raw features and products according to the keys in params).

Parameters:

raw_features (Dict[str, Any])
params (Dict[str, float])
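
The computation described above can be sketched as follows. This is an illustrative reading, not the library's code: the function name is hypothetical, keys are assumed to be '*'-joined raw feature names, and the empty key '' is assumed to be the constant term. The feature values and coefficients are made up.

```python
from typing import Any, Dict


def eval_linear_product_model_sketch(raw_features: Dict[str, Any],
                                     params: Dict[str, float]) -> float:
    """Sum over keys of coefficient * product of the raw features named in the key.

    Assumes keys are '*'-joined feature names and '' is the constant term.
    """
    total = 0.0
    for key, coef in params.items():
        term = coef
        for name in key.split("*"):
            if name:  # '' contributes an empty product, i.e. a factor of 1
                term *= raw_features[name]
        total += term
    return total


raw = {"n_samples": 1000, "n_features": 20}
coefs = {"": 0.5, "n_samples": 1e-3, "n_features*n_samples": 1e-4}
estimate = eval_linear_product_model_sketch(raw, coefs)
```

Here the estimate is the sum of the constant 0.5, the term 1e-3 * 1000, and the term 1e-4 * 20 * 1000, up to floating-point rounding.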

pytabkit.models.alg_interfaces.resource_computation.fit_resource_factors(data, pessimistic, coef_factor=1.0)

Parameters:

data (List[Tuple[Dict[str, float], float]])
pessimistic (bool)
coef_factor (float)

pytabkit.models.alg_interfaces.resource_computation.get_resource_features(config, ds, n_cv, n_refit, n_splits, **extra_params)

Extracts features that can be used in a linear model for predicting resource usage.

Parameters:

config (Dict)
ds (DictDataset)
n_cv (int)
n_refit (int)
n_splits (int)

Return type:

Dict[str, float]

pytabkit.models.alg_interfaces.resource_computation.process_resource_features(raw_features, feature_spec)

Adds product features to raw features.

Parameters:

raw_features (Dict[str, Any]) – Raw feature values.
feature_spec (List[str]) – List of strings. Each string should be of the form ‘feature_1*…*feature_n’, using the names of the features whose products should be added.

Returns:

Returns a dictionary of the raw features along with the newly computed product features.
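
A minimal sketch of this behavior, assuming each spec string is split on '*' and the named raw features are multiplied together. The function name add_product_features and the example values are hypothetical.

```python
from typing import Any, Dict, List


def add_product_features(raw_features: Dict[str, Any],
                         feature_spec: List[str]) -> Dict[str, Any]:
    """Return the raw features plus one product feature per spec string."""
    features = dict(raw_features)  # copy so the input is left untouched
    for spec in feature_spec:
        value = 1.0
        for name in spec.split("*"):
            if name:  # an empty spec yields the constant feature 1.0
                value *= raw_features[name]
        features[spec] = value
    return features


raw = {"n_samples": 500, "max_depth": 6}
features = add_product_features(raw, ["max_depth*n_samples"])
```

The returned dictionary keeps the original entries and adds the key 'max_depth*n_samples' holding the product of the two raw values.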

pytabkit.models.alg_interfaces.resource_params module

class pytabkit.models.alg_interfaces.resource_params.ResourceParams

Bases: object

cb_class_ram = {'': 0.9345478156433287, '2_power_maxdepth': 2.576133502607949e-09, '2_power_maxdepth*n_features': 7.810833280259485e-12, '2_power_maxdepth*n_features*n_samples': 1.5863977594541182e-13, '2_power_maxdepth*n_features*n_samples*n_tree_repeats': 2.3171956595374328e-17, '2_power_maxdepth*n_features*n_tree_repeats': 6.14544078331367e-15, '2_power_maxdepth*n_samples': 1.3036510550142841e-15, '2_power_maxdepth*n_samples*n_tree_repeats': 1.9523394732422347e-09, '2_power_maxdepth*n_tree_repeats': 2.356086562374563e-05, 'ds_onehot_size_gb': 0.012758554137232066, 'ds_prep_size_gb': 1.804116547565268e-05, 'ds_size_gb': 1.804116547565268e-05, 'max_depth': 0.004088255941858752, 'max_depth*n_features': 0.0006014917997388746, 'max_depth*n_features*n_samples': 4.241634070711833e-09, 'max_depth*n_features*n_samples*n_tree_repeats': 1.197601653926371e-16, 'max_depth*n_features*n_tree_repeats': 1.834250929757216e-13, 'max_depth*n_samples': 1.4477032736637855e-13, 'max_depth*n_samples*n_tree_repeats': 3.3706497906893135e-13, 'max_depth*n_tree_repeats': 1.1590969030724202e-09, 'n_features': 3.8863715875356e-09, 'n_features*n_samples': 3.767039504566679e-08, 'n_features*n_samples*n_tree_repeats': 7.361290583089635e-16, 'n_features*n_tree_repeats': 1.1947420344843242e-12, 'n_samples': 7.243808011863237e-07, 'n_samples*n_tree_repeats': 1.2285638949747794e-07, 'n_tree_repeats': 4.077606761367131e-09}

cb_class_time = {'': 1.1074866100217955, 'ds_onehot_size_gb': 2.0150542417790342e-07, 'ds_prep_size_gb': 6.2276292117813865, 'ds_size_gb': 6.2276292117813865, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads': 2.651274595052903e-10, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features': 2.3903321610037346e-05, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples': 2.3930248376103085e-16, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 8.531748659348444e-11, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_tree_repeats': 4.589892590504275e-14, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples': 3.673856471950424e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 6.267867148099078e-16, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_tree_repeats': 3.5098969397077584e-11, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads': 1.7778533486675952e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features': 1.285253358050953e-10, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples': 2.627359007275516e-15, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.133320942151551e-15, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_tree_repeats': 6.629510161784679e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples': 4.732937240944653e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 5.508439525827261e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_tree_repeats': 8.378247017832774e-10, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 2.214973220043591, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 0.000849954711796066, 
'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 2.3531597535778573e-14, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 4.2994223618739465e-15, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 3.964226717465322e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 3.035559075362487e-06, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 7.13999461225352e-07, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 5.1876881836135774e-09}

lgbm_class_ram = {'': 0.8604627263253337, 'ds_onehot_size_gb': 3.622669179301401e-06, 'ds_prep_size_gb': 2.0214168208781946, 'ds_size_gb': 2.0214168208781946, 'log_num_leaves': 1.573053922451339e-08, 'log_num_leaves*n_features': 2.930068871528871e-11, 'log_num_leaves*n_features*n_samples': 3.939554526330466e-15, 'log_num_leaves*n_features*n_samples*n_tree_repeats': 3.851475872271092e-15, 'log_num_leaves*n_features*n_tree_repeats': 2.7540140942935337e-13, 'log_num_leaves*n_samples': 1.617414150367892e-13, 'log_num_leaves*n_samples*n_tree_repeats': 6.161688826595097e-13, 'log_num_leaves*n_tree_repeats': 1.626145985707e-06, 'n_features': 3.1028960780988996e-10, 'n_features*n_samples': 2.5173717397818705e-08, 'n_features*n_samples*n_tree_repeats': 6.656160609292717e-11, 'n_features*n_tree_repeats': 1.4858440058980697e-12, 'n_samples': 3.856682701344501e-07, 'n_samples*n_tree_repeats': 1.544688671627044e-10, 'n_tree_repeats': 0.0015219464100389682, 'num_leaves': 7.114807543594747e-11, 'num_leaves*n_features': 6.127161836179573e-06, 'num_leaves*n_features*n_samples': 5.682583426130539e-17, 'num_leaves*n_features*n_samples*n_tree_repeats': 2.820814699620109e-14, 'num_leaves*n_features*n_tree_repeats': 4.723694325860319e-15, 'num_leaves*n_samples': 6.063719974576439e-16, 'num_leaves*n_samples*n_tree_repeats': 1.1825948996367154e-14, 'num_leaves*n_tree_repeats': 7.004349205794621e-07}

lgbm_class_time = {'': 0.07952271409861912, 'ds_onehot_size_gb': 0.6707498854892533, 'ds_prep_size_gb': 24.914198992356777, 'ds_size_gb': 24.914198992356777, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads': 1.6421556695965297e-07, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features': 0.001802775666445253, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_samples': 3.376112165195102e-07, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 8.92885930282138e-09, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_tree_repeats': 6.072475113612503e-12, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_samples': 2.330829367448416e-12, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_samples*n_tree_repeats': 1.2170171882409568e-13, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_tree_repeats': 0.015956943711852814, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 0.15904542819723824, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 0.015836831101031235, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 2.320710370608533e-08, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 4.006248880421662e-14, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 2.885892548234532e-11, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 3.995934332919547e-09, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 4.51061814549484e-13, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 3.75292585133515e-07, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads': 7.505014868911757e-10, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features': 2.152594512387446e-12, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_samples': 9.221334002333759e-16, 
'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 4.8809384428115866e-11, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_tree_repeats': 6.26406208478857e-14, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_samples': 9.05403593468941e-15, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_samples*n_tree_repeats': 2.3824258787970722e-15, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_tree_repeats': 0.00041603300901854167}

xgb_class_ram = {'': 0.899804501497566, '2_power_maxdepth': 3.26910486762921e-11, '2_power_maxdepth*n_features': 1.140492447521818e-08, '2_power_maxdepth*n_features*n_samples': 3.6325731146686714e-13, '2_power_maxdepth*n_features*n_samples*n_tree_repeats': 3.723108372490702e-19, '2_power_maxdepth*n_features*n_tree_repeats': 2.404137742885295e-15, '2_power_maxdepth*n_samples': 2.64316777243899e-16, '2_power_maxdepth*n_samples*n_tree_repeats': 1.4901204061072977e-17, '2_power_maxdepth*n_tree_repeats': 1.4676442049665057e-12, 'ds_onehot_size_gb': 7.280007472890875e-06, 'ds_prep_size_gb': 0.41986843027802623, 'ds_size_gb': 0.41986843027802623, 'max_depth': 3.280529943711475e-08, 'max_depth*n_features': 6.35648749681192e-05, 'max_depth*n_features*n_samples': 1.28838675675802e-08, 'max_depth*n_features*n_samples*n_tree_repeats': 1.69854661852343e-16, 'max_depth*n_features*n_tree_repeats': 1.935402530195678e-13, 'max_depth*n_samples': 6.291962320207664e-14, 'max_depth*n_samples*n_tree_repeats': 5.126839919323976e-15, 'max_depth*n_tree_repeats': 5.768929558524772e-10, 'n_features': 1.6375678219943912e-10, 'n_features*n_samples': 3.488627499883473e-11, 'n_features*n_samples*n_tree_repeats': 4.2124781789579334e-11, 'n_features*n_tree_repeats': 1.302388952570238e-12, 'n_samples': 8.808932580897527e-08, 'n_samples*n_tree_repeats': 8.625259564591089e-10, 'n_tree_repeats': 0.0012854309387287798}

xgb_class_time = {'': 1.5850150119193643e-06, 'ds_onehot_size_gb': 7.555892653328937e-06, 'ds_prep_size_gb': 67.40780781613621, 'ds_size_gb': 67.40780781613621, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads': 6.35528424560118e-10, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features': 3.4755127308109863e-05, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples': 2.652000680981318e-10, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.1214153087760665e-11, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_tree_repeats': 1.1585222842499338e-13, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples': 7.369774923827121e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 6.186297360838691e-16, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_tree_repeats': 8.810550042257941e-11, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads': 9.578781115632407e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features': 0.007922594727428374, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples': 6.758297160216264e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.4232541896951673e-10, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_tree_repeats': 8.113108001263881e-12, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples': 1.7180121037111673e-12, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 7.916471324379998e-14, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_tree_repeats': 1.2099510988434818e-08, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 3.1700300238387285e-06, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 4.361726529019224e-09, 
'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 3.348195651528877e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 3.4142887744033714e-13, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 4.433229074601185e-11, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 1.7981743709586172e-06, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 3.1379386919643983e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 0.416152219367654}

class pytabkit.models.alg_interfaces.resource_params.ResourceParamsOld

Bases: object

cb_class_ram = {'': 0.8683295939412378, '2_power_maxdepth': 0.0001056123359157812, '2_power_maxdepth*n_features': 1.0080022114889349e-10, '2_power_maxdepth*n_features*n_samples': 2.3070275489115195e-12, '2_power_maxdepth*n_features*n_samples*n_tree_repeats': 2.7850591221080067e-17, '2_power_maxdepth*n_features*n_tree_repeats': 6.15051597263584e-15, '2_power_maxdepth*n_samples': 1.3780270956209364e-15, '2_power_maxdepth*n_samples*n_tree_repeats': 2.064100170958034e-09, '2_power_maxdepth*n_tree_repeats': 2.694024798514516e-06, 'ds_onehot_size_gb': 0.054809311336043706, 'ds_prep_size_gb': 2.1956796547330758e-05, 'ds_size_gb': 2.1956796547330758e-05, 'max_depth': 0.00023942254928693192, 'max_depth*n_features': 0.0006188384463276942, 'max_depth*n_features*n_samples': 4.017104578325911e-09, 'max_depth*n_features*n_samples*n_tree_repeats': 1.2652983818045863e-16, 'max_depth*n_features*n_tree_repeats': 1.825891231551508e-13, 'max_depth*n_samples': 2.0135633249657367e-13, 'max_depth*n_samples*n_tree_repeats': 1.9065381412052897e-13, 'max_depth*n_tree_repeats': 7.662207891804141e-10, 'n_features': 1.728902260462638e-09, 'n_features*n_samples': 3.2106346545767416e-08, 'n_features*n_samples*n_tree_repeats': 8.080444898120663e-16, 'n_features*n_tree_repeats': 1.1883754249270118e-12, 'n_samples': 5.359259624964122e-07, 'n_samples*n_tree_repeats': 1.817237502556807e-07, 'n_tree_repeats': 3.16259450440823e-09}

cb_class_time = {'': 0.060695272326207535, 'ds_onehot_size_gb': 0.040427221672569374, 'ds_prep_size_gb': 2.4268955178538847, 'ds_size_gb': 2.4268955178538847, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads': 1.99445077397377e-10, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features': 1.2644593910088394e-05, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples': 1.1517663973680398e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 2.4847067022145893e-11, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_tree_repeats': 2.235731644015564e-14, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples': 3.0511461549128756e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 2.873281614024595e-16, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_tree_repeats': 1.2520160532307873e-11, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads': 1.374338752023958e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features': 7.126063129715731e-11, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples': 2.631878772648314e-15, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.4077434831895832e-15, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_tree_repeats': 3.344879400790812e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples': 1.242824030666801e-12, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 9.32433742185293e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_tree_repeats': 4.062768369148915e-10, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 0.12139054465241507, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 0.002034550389178136, 
'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 1.590097554595333e-14, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 2.280000915439824e-15, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 1.972850747965341e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 5.259225293072914e-06, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 1.1159977413280863e-07, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 3.0362927572255956e-09}

lgbm_class_ram = {'': 0.8545661661490145, 'ds_onehot_size_gb': 4.0697094447404033e-07, 'ds_prep_size_gb': 2.3080037837801175, 'ds_size_gb': 2.3080037837801175, 'log_num_leaves': 1.8470627691115034e-08, 'log_num_leaves*n_features': 4.90256931677757e-11, 'log_num_leaves*n_features*n_samples': 3.020317664222622e-15, 'log_num_leaves*n_features*n_samples*n_tree_repeats': 2.1876975907194365e-15, 'log_num_leaves*n_features*n_tree_repeats': 2.6408516124748747e-13, 'log_num_leaves*n_samples': 1.4244297306885883e-13, 'log_num_leaves*n_samples*n_tree_repeats': 7.582204707419711e-13, 'log_num_leaves*n_tree_repeats': 4.350203928522753e-07, 'n_features': 4.08148741723376e-07, 'n_features*n_samples': 2.3506833903706615e-08, 'n_features*n_samples*n_tree_repeats': 8.047116933926301e-12, 'n_features*n_tree_repeats': 1.4109066020140611e-12, 'n_samples': 2.994431799612211e-07, 'n_samples*n_tree_repeats': 1.1377985339470745e-09, 'n_tree_repeats': 0.0018080853926450316, 'num_leaves': 1.0490359582375276e-10, 'num_leaves*n_features': 6.105483514684091e-06, 'num_leaves*n_features*n_samples': 3.668665655364504e-17, 'num_leaves*n_features*n_samples*n_tree_repeats': 1.2053037667373442e-13, 'num_leaves*n_features*n_tree_repeats': 4.533114041820276e-15, 'num_leaves*n_samples': 5.943342181332617e-16, 'num_leaves*n_samples*n_tree_repeats': 1.9123390691308356e-14, 'num_leaves*n_tree_repeats': 1.0650528506541837e-07}

lgbm_class_time = {'': 0.028063263911210914, 'ds_onehot_size_gb': 0.09163862856656434, 'ds_prep_size_gb': 2.970270224525262, 'ds_size_gb': 2.970270224525262, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads': 6.47442904885375e-08, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features': 0.0001926020481234091, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_samples': 1.3986995179321424e-08, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 6.208468162170729e-10, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_features*n_tree_repeats': 4.598542008079632e-13, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_samples': 9.964309915135878e-13, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_samples*n_tree_repeats': 2.608150056678177e-14, 'n_cv_refit*n_splits*log_num_leaves*n_estimators*1/n_threads*n_tree_repeats': 0.0011608214817588585, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 0.05612652782242183, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 0.0018753906815885733, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 8.471355616223231e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 3.3001370294885434e-15, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 2.1257067882553722e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 3.057993467818764e-07, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 6.264643485181751e-14, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 3.7651417047281056e-08, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads': 1.1569746986292633e-09, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features': 2.0127433109741758e-13, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_samples': 2.39530599680757e-16, 
'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.8233627245552183e-12, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_features*n_tree_repeats': 5.291223606102416e-15, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_samples': 4.6777144377244544e-14, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_samples*n_tree_repeats': 1.075739698121751e-15, 'n_cv_refit*n_splits*num_leaves*n_estimators*1/n_threads*n_tree_repeats': 7.442820019642213e-05}

xgb_class_ram = {'': 0.89800664010472, '2_power_maxdepth': 3.500391185762912e-11, '2_power_maxdepth*n_features': 8.730859656468559e-07, '2_power_maxdepth*n_features*n_samples': 5.586329461516387e-11, '2_power_maxdepth*n_features*n_samples*n_tree_repeats': 3.406456640909277e-19, '2_power_maxdepth*n_features*n_tree_repeats': 2.253274531849529e-15, '2_power_maxdepth*n_samples': 2.6046111134557463e-16, '2_power_maxdepth*n_samples*n_tree_repeats': 1.4647083952656776e-17, '2_power_maxdepth*n_tree_repeats': 1.446703161897511e-12, 'ds_onehot_size_gb': 1.2775211008166364e-05, 'ds_prep_size_gb': 0.8958165176491728, 'ds_size_gb': 0.8958165176491728, 'max_depth': 4.602455291339385e-08, 'max_depth*n_features': 8.276969896399465e-05, 'max_depth*n_features*n_samples': 1.1188204977077247e-08, 'max_depth*n_features*n_samples*n_tree_repeats': 1.2101329730965103e-16, 'max_depth*n_features*n_tree_repeats': 1.73562626225241e-13, 'max_depth*n_samples': 6.003527146823594e-14, 'max_depth*n_samples*n_tree_repeats': 5.458849368989926e-15, 'max_depth*n_tree_repeats': 5.846802665464209e-10, 'n_features': 1.419262523195433e-10, 'n_features*n_samples': 2.1948939540241107e-11, 'n_features*n_samples*n_tree_repeats': 6.761378006837745e-13, 'n_features*n_tree_repeats': 1.189619404783309e-12, 'n_samples': 7.445989056176149e-08, 'n_samples*n_tree_repeats': 1.1095360093190593e-08, 'n_tree_repeats': 0.0005355693710144896}

xgb_class_time = {'': 0.04616911535729873, 'ds_onehot_size_gb': 0.0698867127341342, 'ds_prep_size_gb': 3.47457744189382, 'ds_size_gb': 3.47457744189382, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads': 9.064818572421352e-11, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features': 2.802431219594177e-06, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples': 5.094046852454207e-14, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 4.515896055082407e-12, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_features*n_tree_repeats': 9.943166031719296e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples': 2.9578963011700153e-15, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 1.991428507510768e-16, 'n_cv_refit*n_splits*2_power_maxdepth*n_estimators*1/n_threads*n_tree_repeats': 6.993000397349683e-07, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads': 1.68587043083397e-08, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features': 0.0007712724349247164, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples': 1.7162683220472862e-09, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 1.226904474214378e-10, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_features*n_tree_repeats': 6.967156404769764e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples': 3.601942853784541e-13, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_samples*n_tree_repeats': 1.5052320282512473e-14, 'n_cv_refit*n_splits*max_depth*n_estimators*1/n_threads*n_tree_repeats': 0.0026046534716614215, 'n_cv_refit*n_splits*n_estimators*1/n_threads': 0.09233823071459746, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features': 3.291166164590293e-10, 
'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples': 1.914319987041818e-13, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_samples*n_tree_repeats': 2.926688203905133e-15, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_features*n_tree_repeats': 3.670077849317217e-12, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples': 6.154537890478014e-07, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_samples*n_tree_repeats': 8.63288843709104e-14, 'n_cv_refit*n_splits*n_estimators*1/n_threads*n_tree_repeats': 3.035228262559771e-08}
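
The keys of these dictionaries look like '*'-separated products of feature names, with the empty key '' acting as an intercept. Assuming each dictionary is the coefficient table of a linear model over such product features (an interpretation that this page does not confirm), an estimate could be computed roughly as in the sketch below. The function name evaluate_linear_model and the toy coefficients are hypothetical and not part of pytabkit.

```python
from math import prod


def evaluate_linear_model(coefs, features):
    """Sum coef * product of the features named in each key.

    Assumption: a key such as 'n_features*n_samples' denotes the product
    features['n_features'] * features['n_samples'], and '' is the intercept.
    """
    total = 0.0
    for key, coef in coefs.items():
        factors = key.split('*') if key else []
        # prod() of an empty sequence is 1, so the intercept contributes coef.
        total += coef * prod(features[name] for name in factors)
    return total


# Toy coefficients (hypothetical values, not taken from the library).
toy_coefs = {'': 0.5, 'n_samples': 1e-6, 'n_features*n_samples': 1e-8}
toy_features = {'n_samples': 10_000, 'n_features': 20}
estimate = evaluate_linear_model(toy_coefs, toy_features)
```

Under this reading, a factor such as '1/n_threads' would have to be supplied as its own entry in the features dictionary, since the sketch only splits keys on '*'.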

pytabkit.models.alg_interfaces.rtdl_interfaces module

class pytabkit.models.alg_interfaces.rtdl_interfaces.FTTransformerSubSplitInterface

Bases: SkorchSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

class pytabkit.models.alg_interfaces.rtdl_interfaces.RTDL_MLPSubSplitInterface

Bases: SkorchSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

class pytabkit.models.alg_interfaces.rtdl_interfaces.RTDL_MLP_ParamSamplerNew

Bases: object

__init__(is_classification, train_size, num_emb_type='none')

Parameters:

is_classification (bool)
train_size (int)
num_emb_type (str)

sample_params(seed)

Parameters: seed (int)
Return type: Dict[str, Any]

class pytabkit.models.alg_interfaces.rtdl_interfaces.RTDL_ResNet_ParamSampler

Bases: object

__init__(is_classification, train_size)

Parameters:

is_classification (bool)
train_size (int)

sample_params(seed)

Parameters: seed (int)
Return type: Dict[str, Any]

class pytabkit.models.alg_interfaces.rtdl_interfaces.RTDL_ResNet_ParamSamplerNew

Bases: object

__init__(is_classification, train_size)

Parameters:

is_classification (bool)
train_size (int)

sample_params(seed)

Parameters: seed (int)
Return type: Dict[str, Any]

class pytabkit.models.alg_interfaces.rtdl_interfaces.RandomParamsFTTransformerAlgInterface: Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.rtdl_interfaces.RandomParamsRTDLMLPAlgInterface

__init__(model_idx, fit_params=None, **config)

Parameters:

fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
model_idx (int)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden instead. In that case, this method by default calls fit_and_eval() and discards the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters: ds (DictDataset) – Dataset on which to predict labels.
Returns: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type: Tensor
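
Since classification outputs are logits, a softmax over the last axis is needed to obtain probabilities. The sketch below illustrates this on plain nested lists so that it runs without extra dependencies. The helper names softmax and logits_to_probs and the toy values are illustrative only. On an actual Tensor, torch.softmax(logits, dim=-1) performs the same operation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def logits_to_probs(preds):
    """Apply softmax along the last axis of a nested list shaped
    [n_models, n_samples, n_classes]."""
    return [[softmax(row) for row in model] for model in preds]


# Toy logits: 2 models, 2 samples, 2 classes (binary case still has 2 outputs).
preds = [[[0.0, 0.0], [2.0, 0.0]],
         [[1.0, 1.0], [0.0, 2.0]]]
probs = logits_to_probs(preds)
```

Each innermost list of probs sums to 1, and equal logits map to equal probabilities.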

class pytabkit.models.alg_interfaces.rtdl_interfaces.RandomParamsResnetAlgInterface

__init__(model_idx, fit_params=None, **config)

Parameters:

fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
model_idx (int)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden instead. In that case, this method by default calls fit_and_eval() and discards the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters: ds (DictDataset) – Dataset on which to predict labels.
Returns: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type: Tensor

class pytabkit.models.alg_interfaces.rtdl_interfaces.ResnetSubSplitInterface

Bases: SkorchSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

class pytabkit.models.alg_interfaces.rtdl_interfaces.SkorchSubSplitInterface

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters: ds (DictDataset) – Dataset on which to predict labels.
Returns: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type: Tensor

pytabkit.models.alg_interfaces.rtdl_interfaces.allow_single_underscore(params_config)

Parameters: params_config (List[Tuple])
Return type: List[Tuple]

pytabkit.models.alg_interfaces.rtdl_interfaces.choose_batch_size_rtdl(train_size)

Return type: int

pytabkit.models.alg_interfaces.rtdl_interfaces.choose_batch_size_rtdl_new(train_size)

Parameters: train_size (int)
Return type: int

pytabkit.models.alg_interfaces.sub_split_interfaces module

class pytabkit.models.alg_interfaces.sub_split_interfaces.SingleSplitWrapperAlgInterface

AlgInterface that wraps multiple AlgInterfaces, each of which can only handle a single train-val-test split, so that together they handle one trainval-test split (possibly with multiple train-val splits).

__init__(sub_split_interfaces, fit_params=None, **config)

Parameters:

sub_split_interfaces (List[AlgInterface]) – Interfaces for each sub-split (train-val split).
fit_params (List[Dict[str, Any]] | None)

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overridden instead. In that case, this method by default calls fit_and_eval() and discards the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None
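
The selection rule described above (pick the params with the best average loss over tv_split_idx) could be sketched as follows. This is a minimal illustration, not the library's implementation: the function select_best_params and the 'lr' hyperparameter are hypothetical, and the sketch assumes that every train-val split lists the same candidate params in the same order.

```python
from typing import Dict, List, Tuple

# results[tt_split_idx][tv_split_idx] is a list of (params, loss) tuples.
Results = List[List[List[Tuple[Dict, float]]]]


def select_best_params(results: Results) -> List[Dict]:
    """Return one params dict per trainval-test split, chosen by the
    lowest loss averaged over the train-val splits.

    Assumption: candidate i refers to the same params in every train-val split.
    """
    best = []
    for tv_results in results:  # one entry per trainval-test split
        n_candidates = len(tv_results[0])
        avg_losses = [
            sum(tv[i][1] for tv in tv_results) / len(tv_results)
            for i in range(n_candidates)
        ]
        best_idx = min(range(n_candidates), key=avg_losses.__getitem__)
        best.append(tv_results[0][best_idx][0])
    return best


toy_results = [
    [  # trainval-test split 0
        [({'lr': 0.1}, 0.40), ({'lr': 0.01}, 0.30)],  # train-val split 0
        [({'lr': 0.1}, 0.20), ({'lr': 0.01}, 0.35)],  # train-val split 1
    ],
]
fit_params = select_best_params(toy_results)
```

The result has the shape expected for fit_params: a list with one dictionary per trainval-test split.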

get_available_predict_params()

Return type: Dict[str, Dict[str, Any]]

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters: ds (DictDataset) – Dataset on which to predict labels.
Returns: A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).
Return type: Tensor

set_current_predict_params(name)

Parameters: name (str)
Return type: None

class pytabkit.models.alg_interfaces.sub_split_interfaces.SklearnSubSplitInterface

Base class for AlgInterfaces based on scikit-learn methods.

__init__(fit_params=None, **config)

Parameters:

fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overloaded. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None
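The selection step described for the return value can be sketched as follows. This is a minimal illustration, not pytabkit's own implementation: it assumes that, within one trainval-test split, every train-val split lists the same candidates in the same order, and that a lower loss is better. The function name select_best_fit_params and the n_estimators key are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# results[tt_split_idx][tv_split_idx] is a list of (params, loss) tuples.
Results = List[List[List[Tuple[Dict[str, Any], float]]]]


def select_best_fit_params(results: Results) -> List[Dict[str, Any]]:
    """Return one params dict per trainval-test split, chosen by lowest mean loss.

    Assumption: all train-val splits of a trainval-test split list the same
    candidates in the same order.
    """
    best = []
    for tv_results in results:  # one entry per trainval-test split
        n_candidates = len(tv_results[0])
        # Average each candidate's loss over the train-val splits.
        mean_losses = [
            sum(tv[i][1] for tv in tv_results) / len(tv_results)
            for i in range(n_candidates)
        ]
        best_idx = min(range(n_candidates), key=mean_losses.__getitem__)
        best.append(tv_results[0][best_idx][0])
    return best


# One trainval-test split, two train-val splits, two candidates.
results = [[
    [({"n_estimators": 100}, 0.30), ({"n_estimators": 200}, 0.25)],
    [({"n_estimators": 100}, 0.20), ({"n_estimators": 200}, 0.27)],
]]
fit_params = select_best_fit_params(results)
```

The result has the layout that fit_params expects: a list with one dictionary per trainval-test split.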

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

class pytabkit.models.alg_interfaces.sub_split_interfaces.TreeBasedSubSplitInterface

Base class for tree-based ML models (XGB, LGBM, CatBoost).

__init__(fit_params=None, **config)

Parameters:

fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overloaded. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

get_available_predict_params()

Return type:

Dict[str, Dict[str, Any]]

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

pytabkit.models.alg_interfaces.tabm_interface module

class pytabkit.models.alg_interfaces.tabm_interface.RandomParamsTabMAlgInterface

Bases: RandomParamsAlgInterface

get_available_predict_params()

Return type:

Dict[str, Dict[str, Any]]

set_current_predict_params(name)

Parameters:

name (str)

Return type:

None

class pytabkit.models.alg_interfaces.tabm_interface.TabMSubSplitInterface

__init__(fit_params=None, **config)

Parameters:

fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overloaded. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface
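The fallback rule for fit_params can be pictured with a toy class. ToyInterface below is a hypothetical stand-in written for this example and is not a pytabkit class; it only mirrors the documented behavior that self.fit_params is used when fit_params is None. The lr key is likewise an assumption.

```python
from typing import Any, Dict, List, Optional


class ToyInterface:
    """Hypothetical stand-in for an AlgInterface; not part of pytabkit."""

    def __init__(self, fit_params: Optional[List[Dict[str, Any]]] = None,
                 n_refit: int = 0):
        self.fit_params = fit_params
        self.n_refit = n_refit

    def get_refit_interface(
        self, n_refit: int, fit_params: Optional[List[Dict[str, Any]]] = None
    ) -> "ToyInterface":
        # Fall back to the parameters stored on this object when none are passed.
        chosen = fit_params if fit_params is not None else self.fit_params
        return ToyInterface(fit_params=chosen, n_refit=n_refit)


# Parameters found in (cross-)validation mode, one dict per trainval-test split.
cv_interface = ToyInterface(fit_params=[{"lr": 0.01}])
refit_default = cv_interface.get_refit_interface(n_refit=2)
refit_custom = cv_interface.get_refit_interface(n_refit=2, fit_params=[{"lr": 0.1}])
```

Here refit_default reuses the stored parameters, while refit_custom takes the explicitly passed ones.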

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

pytabkit.models.alg_interfaces.tabm_interface.get_tabm_auto_batch_size(n_train)

Parameters:

n_train (int)

Return type:

int

pytabkit.models.alg_interfaces.tabr_interface module

class pytabkit.models.alg_interfaces.tabr_interface.ExceptionPrintingCallback

Bases: Callback

on_exception(trainer, pl_module, exception)

Called when any trainer execution is interrupted by an exception.

class pytabkit.models.alg_interfaces.tabr_interface.RandomParamsTabRAlgInterface

Bases: RandomParamsAlgInterface

class pytabkit.models.alg_interfaces.tabr_interface.TabRSubSplitInterface

Bases: AlgInterface

__init__(**config)

Parameters:

fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.

create_model(n_num_features, n_bin_features, cat_cardinalities, n_classes, freeze_contexts_after_n_epochs)

Parameters:

freeze_contexts_after_n_epochs (int | None)

Return type:

Any

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overloaded. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None
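To make the tmp_folders convention concrete, the sketch below builds such a list: one entry per trainval-test split, or None entries when nothing should be written to disk. The helper make_tmp_folders and the split_{i} folder naming are assumptions chosen for illustration; pytabkit may organize its folders differently.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def make_tmp_folders(n_splits: int, base: Optional[Path]) -> List[Optional[Path]]:
    """Return one temporary folder (or None) per trainval-test split."""
    if base is None:
        # No storage available: one None per split signals "do not save".
        return [None] * n_splits
    folders: List[Optional[Path]] = []
    for i in range(n_splits):
        folder = base / f"split_{i}"  # hypothetical naming scheme
        folder.mkdir(parents=True, exist_ok=True)
        folders.append(folder)
    return folders


with tempfile.TemporaryDirectory() as tmp:
    on_disk = make_tmp_folders(3, Path(tmp))
    created = [p.is_dir() for p in on_disk]

no_disk = make_tmp_folders(3, None)
```

In both cases the list length equals the number of trainval-test splits, not the number of train-val splits inside a k-fold CV.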

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

infer_batch_size(n_samples_train)

Parameters:

n_samples_train (int)

Return type:

int

predict(ds)

Method to predict labels on the given dataset. Override in subclasses.

Parameters:

ds (DictDataset) – Dataset on which to predict labels.

Returns:

A tensor of shape [n_trainval_splits * n_splits, ds.n_samples, output_shape]. In the classification case, output_shape is the number of classes (even in the binary case) and the outputs are logits (i.e., softmax should be applied to get probabilities). In the regression case, output_shape is the target dimension (often 1).

Return type:

Tensor

pytabkit.models.alg_interfaces.xgboost_interfaces module

class pytabkit.models.alg_interfaces.xgboost_interfaces.RandomParamsXGBAlgInterface

Bases: RandomParamsAlgInterface

get_available_predict_params()

Return type:

Dict[str, Dict[str, Any]]

set_current_predict_params(name)

Parameters:

name (str)

Return type:

None

class pytabkit.models.alg_interfaces.xgboost_interfaces.XGBCustomMetric

Bases: object

__init__(metric_names, is_classification, is_higher_better=False)

Parameters:

metric_names (str | List[str])
is_classification (bool)
is_higher_better (bool)

class pytabkit.models.alg_interfaces.xgboost_interfaces.XGBHyperoptAlgInterface

__init__(space=None, n_hyperopt_steps=50, **config)

Parameters:

fit_params – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.
n_hyperopt_steps (int)

create_alg_interface(n_sub_splits, **config)

Parameters:

n_sub_splits (int)

Return type:

AlgInterface

class pytabkit.models.alg_interfaces.xgboost_interfaces.XGBSklearnSubSplitInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

class pytabkit.models.alg_interfaces.xgboost_interfaces.XGBSubSplitInterface

Bases: TreeBasedSubSplitInterface

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type:

pytabkit.models.alg_interfaces.xrfm_interfaces module

class pytabkit.models.alg_interfaces.xrfm_interfaces.RandomParamsxRFMAlgInterface

Bases: RandomParamsAlgInterface

pytabkit.models.alg_interfaces.xrfm_interfaces.sample_xrfm_params(seed, hpo_space_name='default')

Parameters:

seed (int)
hpo_space_name (str)

class pytabkit.models.alg_interfaces.xrfm_interfaces.xRFMSubSplitInterface

__init__(fit_params=None, **config)

Parameters:

fit_params (List[Dict[str, Any]] | None) – This parameter can be used to store the best hyperparameters found during fit() in (cross-)validation mode. These can then be used for fit() in refitting mode. If fit_params is not None, it should be a list with one dictionary per trainval-test split. The dictionaries then contain the obtained hyperparameters for each of the trainval-test splits. Normally, there are no best parameters per train-val split as we might not have the same number of refitted models as train-val splits.
config – Other parameters.

fit(ds, idxs_list, interface_resources, logger, tmp_folders, name)

Fit the models on the given data and splits. Should be overridden by subclasses unless fit_and_eval() is overloaded. In the latter case, this method will by default use fit_and_eval() and discard the evaluation.

Parameters:

ds (DictDataset) – DictDataset representing the dataset. Should be on the CPU.
idxs_list (List[SplitIdxs]) – List containing one SplitIdxs object per trainval-test split. Indices should be on the CPU.
interface_resources (InterfaceResources) – Resources assigned to fit().
logger (Logger) – Logger that can be used for logging.
tmp_folders (List[Path | None]) – List of paths that can be used for storing intermediate data. The paths can be None, in which case methods will try not to save intermediate results. There should be one folder per trainval-test-split (i.e. only one per k-fold CV).
name (str) – Name of the algorithm (for logging).

Returns:

May return information about different possible fit_params settings that can be used. Say a variable results is returned that is not None. Then, results[tt_split_idx][tv_split_idx] should be a list of tuples (params, loss). This is useful for k-fold cross-validation, where the params with the best average loss (averaged over tv_split_idx) can be selected for fit_params.

Return type:

List[List[List[Tuple[Dict, float]]]] | None

get_refit_interface(n_refit, fit_params=None)

Returns another AlgInterface that is configured for refitting on the training and validation data. Override in subclasses.

Parameters:

n_refit (int) – Number of models that should be refitted (with different seeds) per trainval-test split.
fit_params (List[Dict] | None) – Fit parameters (see the constructor) that should be used for refitting. If fit_params is None, self.fit_params will be used instead.

Returns:

Returns the AlgInterface object for refitting.

Return type:

AlgInterface

get_required_resources(ds, n_cv, n_refit, n_splits, split_seeds, n_train)

Estimate the required resources for fit().

Parameters:

ds (DictDataset) – Dataset. Does not have to contain tensors.
n_cv (int) – Number of train-val splits per trainval-test split.
n_refit (int) – Number of refitted models per trainval-test split.
n_splits (int) – Number of trainval-test splits.
split_seeds (List[int]) – Seeds for every trainval-test split.
n_train (int)

Returns:

Returns estimated required resources.

Return type: