pytabkit.models.data package

Submodules

pytabkit.models.data.conversion module

class pytabkit.models.data.conversion.ToDictDatasetConverter

Bases: object

__init__(cat_features=None, verbosity=0)
Parameters:
  • cat_features (List[bool] | ndarray | None)

  • verbosity (int)

fit_transform(x)
Parameters:

x (ndarray | DataFrame | Series | DictDataset)

Return type:

DictDataset

transform(x)
Parameters:

x (ndarray | DataFrame | Series | DictDataset)

Return type:

DictDataset

pytabkit.models.data.data module

class pytabkit.models.data.data.DictDataset

Bases: object

__init__(tensors, tensor_infos, device=None, n_samples=None)
Parameters:
  • tensors (Dict[str, Tensor] | None) – Can be None, but then device and n_samples must be specified.

  • tensor_infos (Dict[str, TensorInfo]) – Information (shape, category sizes) for each tensor.

  • device (str | device | None) – Device that tensors is on. If tensors is specified, this will be computed automatically.

  • n_samples (int | None) – Number of samples. If tensors is specified, this will be computed automatically.

get_batch(idxs)
Return type:

Dict[str, Tensor]

get_n_classes()
Returns:

The number of classes, given by the category size of the first feature of the y tensor.

This only makes sense if there is a y tensor. The method does not check whether y has more than one feature.

get_shuffled(seed)
Return type:

DictDataset

get_size_gb()
Returns:

RAM usage in gigabytes.

Return type:

float

get_sub_dataset(idxs)
Return type:

DictDataset

static join(*datasets)

split_xy()
Return type:

Tuple[DictDataset, DictDataset]

to(device)

to_df()
Return type:

DataFrame

without_labels()
Return type:

DictDataset

class pytabkit.models.data.data.ParallelDictDataLoader

Bases: object

__init__(ds, idxs, batch_size, shuffle=False, adjust_bs=False, drop_last=False, output_device=None)
Parameters:
  • ds (DictDataset) – The dataset to draw batches from (the docstring refers to this parameter as dataset, "a TaskData instance").

  • idxs (Tensor)

  • batch_size (int) – Default batch size; might be adjusted automatically.

  • shuffle (bool) – Whether the dataset should be shuffled before each epoch.

  • adjust_bs (bool) – Whether batch_size may be lowered so that the batches are of more equal size while keeping the number of batches the same.

  • drop_last (bool) – Whether the last batch should be omitted if it is smaller than the other ones.

  • output_device (str | device | None) – The device that the returned data should be on (if None, the device where the data already is).
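The adjust_bs description (lowering the batch size so that batches are of more equal size while the number of batches stays the same) can be illustrated with a small standalone sketch. The helper names adjusted_batch_size and batch_sizes are hypothetical and not part of pytabkit, and the rounding rule shown is an assumption, not necessarily what ParallelDictDataLoader does internally.

```python
import math
from typing import List


def adjusted_batch_size(n_samples: int, batch_size: int) -> int:
    """Hypothetical helper: lower batch_size so that batches are more equal.

    The number of batches is the same as with the original batch_size.
    """
    if n_samples <= 0 or batch_size <= 0:
        raise ValueError("n_samples and batch_size must be positive")
    # Number of batches needed with the requested batch size.
    n_batches = math.ceil(n_samples / batch_size)
    # Smallest batch size that still covers all samples in n_batches batches.
    return math.ceil(n_samples / n_batches)


def batch_sizes(n_samples: int, batch_size: int) -> List[int]:
    """Sizes of consecutive batches when the last batch is not dropped."""
    full, rest = divmod(n_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


if __name__ == "__main__":
    print(batch_sizes(1000, 256))                             # unadjusted
    print(batch_sizes(1000, adjusted_batch_size(1000, 256)))  # adjusted
```

With 1000 samples and a requested batch size of 256, the unadjusted loader in this sketch would yield three batches of 256 and one of 232, whereas the adjusted batch size of 250 yields four equal batches.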

get_num_iterated_samples()

get_num_samples()

class pytabkit.models.data.data.TaskType

Bases: object

CLASSIFICATION = 'classification'
REGRESSION = 'regression'

class pytabkit.models.data.data.TensorInfo

Bases: object

__init__(feat_shape=None, cat_sizes=None)
Parameters:
  • feat_shape (List | ndarray | Tensor | None)

  • cat_sizes (List | ndarray | Tensor | None)

static concat(tensor_infos)

Create the TensorInfo that corresponds to concatenating the tensors.

Parameters:

tensor_infos (List[TensorInfo])

Return type:

TensorInfo

static from_dict(data)
Parameters:

data (Dict)

Return type:

TensorInfo

get_cat_size_product()
Return type:

int

get_cat_sizes()
Return type:

Tensor

get_feat_shape()
Return type:

ndarray

get_n_features()
Return type:

int

is_cat()
Return type:

bool

is_cont()
Return type:

bool

is_empty()
Return type:

bool

to_dict()
Return type:

Dict

class pytabkit.models.data.data.ValDictDataLoader

Bases: object

__init__(ds, val_idxs, val_batch_size=256)

Create a prediction data loader from a dataset and validation indices.

Parameters:

pytabkit.models.data.nested_dict module

class pytabkit.models.data.nested_dict.NestedDict

Bases: object

Dictionary that can be used with multiple indices. Instead of

    d = dict()
    d['first'] = dict()
    d['first']['second'] = 1.0

we can use

    d = NestedDict()
    d['first', 'second'] = 1.0
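The multi-index access described above relies on Python passing d['first', 'second'] to __setitem__ as a tuple key. The following is a minimal standalone sketch of that idea; SimpleNestedDict is a hypothetical stand-in written for illustration and is not the pytabkit implementation.

```python
from typing import Any, Dict, Tuple


class SimpleNestedDict:
    """Hypothetical stand-in illustrating tuple-indexed nested dictionaries."""

    def __init__(self) -> None:
        self.data: Dict[Any, Any] = {}

    @staticmethod
    def _as_tuple(idxs: Any) -> Tuple[Any, ...]:
        # d['a'] passes a single key, d['a', 'b'] passes a tuple of keys.
        return idxs if isinstance(idxs, tuple) else (idxs,)

    def __setitem__(self, idxs: Any, value: Any) -> None:
        *parents, last = self._as_tuple(idxs)
        node = self.data
        for key in parents:
            # Create intermediate dictionaries on demand.
            node = node.setdefault(key, {})
        node[last] = value

    def __getitem__(self, idxs: Any) -> Any:
        node = self.data
        for key in self._as_tuple(idxs):
            node = node[key]
        return node


if __name__ == "__main__":
    d = SimpleNestedDict()
    d["first", "second"] = 1.0
    print(d.data)
```

One consequence of this design is that tuples themselves cannot be used as single-level keys, since a tuple is always interpreted as a path of indices.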

__init__(data_dict=None)

static from_kwargs(**kwargs)

get(idxs, default=None)

get_dict()
Return type:

Dict

update(other)
Parameters:

other (NestedDict)

pytabkit.models.data.splits module

class pytabkit.models.data.splits.AllNothingSplitter

Bases: Splitter

get_idxs(ds)
Parameters:

ds (DictDataset)

Return type:

Tuple[Tensor, Tensor]

get_split_sizes(n_samples)
Parameters:

n_samples (int)

Return type:

Tuple

split_ds(ds)
Parameters:

ds (DictDataset)

Return type:

Split

class pytabkit.models.data.splits.IndexSplitter

Bases: Splitter

__init__(index)

get_idxs(ds)
Parameters:

ds (DictDataset)

Return type:

Tuple[Tensor, Tensor]

get_split_sizes(n_samples)
Parameters:

n_samples (int)

Return type:

Tuple

class pytabkit.models.data.splits.KFoldSplitter

Bases: MultiSplitter

__init__(k, seed, stratified=False)
Parameters:
  • k (int)

  • seed (int)

get_idxs(ds)
Parameters:

ds (DictDataset)

Return type:

List[Tuple[Tensor, Tensor]]

get_split_sizes(n_samples)
Parameters:

n_samples (int)

Return type:

Tuple
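To make the shape of the KFoldSplitter result (a list of index pairs, one per fold) more concrete, here is a standalone sketch of unstratified k-fold index generation. It uses plain Python lists instead of Tensors; the function k_fold_indices is hypothetical, and the way samples are assigned to folds is an assumption, not the library's algorithm.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int) -> List[Tuple[List[int], List[int]]]:
    """Hypothetical helper: k (train, validation) index pairs from one seeded shuffle."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    perm = list(range(n_samples))
    random.Random(seed).shuffle(perm)  # reproducible permutation for a given seed
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [perm[i::k] for i in range(k)]
    splits = []
    for i, val_idxs in enumerate(folds):
        # Every fold except the i-th one contributes to the training indices.
        train_idxs = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idxs, val_idxs))
    return splits


if __name__ == "__main__":
    for train_idxs, val_idxs in k_fold_indices(n_samples=10, k=5, seed=0):
        print(len(train_idxs), len(val_idxs))
```

Each sample appears in exactly one validation fold. The sketch does not cover the stratified=True case.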

class pytabkit.models.data.splits.MultiSplitter

Bases: object

get_idxs(ds)
Parameters:

ds (DictDataset)

Return type:

List[Tuple[Tensor, Tensor]]

split_ds(ds)
Parameters:

ds (DictDataset)

Return type:

List[Split]

class pytabkit.models.data.splits.RandomSplitter

Bases: Splitter

__init__(seed, first_fraction=0.8, max_n_first=None)
Parameters:

max_n_first (int | None)

get_idxs(ds)
Parameters:

ds (DictDataset)

Return type:

Tuple[Tensor, Tensor]

get_split_sizes(n_samples)
Parameters:

n_samples (int)

Return type:

Tuple
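The interplay of first_fraction and max_n_first can be sketched as follows. This is a standalone illustration under the assumption that the first part gets first_fraction of the samples, rounded down and capped at max_n_first; the function random_split_sizes is hypothetical, and the exact rounding used by RandomSplitter.get_split_sizes is not stated in this reference.

```python
from typing import Optional, Tuple


def random_split_sizes(
    n_samples: int,
    first_fraction: float = 0.8,
    max_n_first: Optional[int] = None,
) -> Tuple[int, int]:
    """Hypothetical helper: sizes of the first and second part of a split."""
    # Assumption: round the first part down to a whole number of samples.
    n_first = int(first_fraction * n_samples)
    if max_n_first is not None:
        # Assumption: max_n_first is an upper bound on the first part.
        n_first = min(n_first, max_n_first)
    return n_first, n_samples - n_first


if __name__ == "__main__":
    print(random_split_sizes(1000))
    print(random_split_sizes(1000, max_n_first=500))
```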

class pytabkit.models.data.splits.Split

Bases: object

__init__(ds, idxs)
Parameters:
  • ds (DictDataset) – The dataset that is split into parts

  • idxs (Tuple[Tensor, Tensor]) – Tuple of Tensors containing indices of the different parts of ds

get_sub_ds(i)

get_sub_idxs(i)

class pytabkit.models.data.splits.SplitInfo

Bases: object

__init__(splitter, split_type, id, alg_seed, train_fraction=0.75)
Parameters:
  • splitter (Splitter)

  • split_type (str)

  • id (int)

  • alg_seed (int)

  • train_fraction (float)

get_sub_seed(split_idx, is_cv)
Parameters:
  • split_idx (int)

  • is_cv (bool)

get_sub_splits(ds, n_splits, is_cv)
Parameters:
Return type:

List[Split]

get_train_and_val_size(n_samples, n_splits, is_cv)
Parameters:
  • n_samples (int)

  • n_splits (int)

  • is_cv (bool)

Return type:

Tuple[int, int]

class pytabkit.models.data.splits.Splitter

Bases: object

get_idxs(ds)
Parameters:

ds (DictDataset)

Return type:

Tuple[Tensor, Tensor]

get_split_sizes(n_samples)
Parameters:

n_samples (int)

Return type:

Tuple

split_ds(ds)
Parameters:

ds (DictDataset)

Return type:

Split

Module contents