pytabkit.models.data package
Submodules
pytabkit.models.data.conversion module
- class pytabkit.models.data.conversion.ToDictDatasetConverter
Bases:
object
- __init__(cat_features=None, verbosity=0)
- Parameters:
cat_features (List[bool] | ndarray | None)
verbosity (int)
- fit_transform(x)
- Parameters:
x (ndarray | DataFrame | Series | DictDataset)
- Return type:
- transform(x)
- Parameters:
x (ndarray | DataFrame | Series | DictDataset)
- Return type:
pytabkit.models.data.data module
- class pytabkit.models.data.data.DictDataset
Bases:
object
- __init__(tensors, tensor_infos, device=None, n_samples=None)
- Parameters:
tensors (Dict[str, Tensor] | None) – Can be None, but then device and n_samples must be specified.
tensor_infos (Dict[str, TensorInfo]) – Information (shape, category sizes) for each tensor.
device (str | device | None) – Device that tensors is on. If tensors is specified, this will be computed automatically.
n_samples (int | None) – Number of samples. If tensors is specified, this will be computed automatically.
- get_batch(idxs)
- Return type:
Dict[str, Tensor]
- get_n_classes()
- Returns:
The number of classes, given by the category size of the first feature of the y tensor.
This is only meaningful if the dataset has a y tensor; the method does not check whether y has more than one feature.
- get_shuffled(seed)
- Return type:
- get_size_gb()
- Returns:
RAM usage in Gigabytes
- Return type:
float
- get_sub_dataset(idxs)
- Return type:
- static join(*datasets)
- split_xy()
- Return type:
Tuple[DictDataset, DictDataset]
- to(device)
- to_df()
- Return type:
DataFrame
- without_labels()
- Return type:
- class pytabkit.models.data.data.ParallelDictDataLoader
Bases:
object
- __init__(ds, idxs, batch_size, shuffle=False, adjust_bs=False, drop_last=False, output_device=None)
- Parameters:
dataset – A TaskData instance
batch_size (int) – default batch size, might be automatically adjusted
shuffle (bool) – whether the dataset should be shuffled before each epoch
adjust_bs (bool) – whether the batch_size may be lowered so that the batches are of more equal size while keeping the number of batches the same
drop_last (bool) – whether the last batch should be omitted if it is smaller than the other ones
output_device (str | device | None) – the device that the returned data should be on (if None, take the device where the data already is)
ds (DictDataset)
idxs (Tensor)
- get_num_iterated_samples()
- get_num_samples()
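The interaction of batch_size, adjust_bs and drop_last described above can be illustrated with a small stand-alone sketch. It does not call pytabkit; the function name batch_sizes and the rounding rule (ceiling division) are assumptions made for illustration, and the library may compute the adjusted size differently.

```python
import math


def batch_sizes(n_samples, batch_size, adjust_bs=False, drop_last=False):
    """Return the list of batch sizes for one pass over n_samples items.

    Hypothetical helper, not part of pytabkit.
    """
    if n_samples <= 0:
        return []
    # Number of batches implied by the requested batch size.
    n_batches = math.ceil(n_samples / batch_size)
    if adjust_bs:
        # Lower the batch size so that the same number of batches
        # covers the data more evenly (assumed rounding: ceiling).
        batch_size = math.ceil(n_samples / n_batches)
    sizes = [min(batch_size, n_samples - start)
             for start in range(0, n_samples, batch_size)]
    if drop_last and sizes[-1] < batch_size:
        # Omit a trailing batch that is smaller than the others.
        sizes.pop()
    return sizes


plain = batch_sizes(1000, 256)
adjusted = batch_sizes(1000, 256, adjust_bs=True)
dropped = batch_sizes(1000, 256, drop_last=True)
```

With 1000 samples and a requested batch size of 256, the unadjusted split ends in one short batch, the adjusted split uses four equal batches, and drop_last discards the short one.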
- class pytabkit.models.data.data.TaskType
Bases:
object
- CLASSIFICATION = 'classification'
- REGRESSION = 'regression'
- class pytabkit.models.data.data.TensorInfo
Bases:
object
- __init__(feat_shape=None, cat_sizes=None)
- Parameters:
feat_shape (List | ndarray | Tensor | None)
cat_sizes (List | ndarray | Tensor | None)
- static concat(tensor_infos)
Create the TensorInfo that corresponds to concatenating the tensors.
- Parameters:
tensor_infos (List[TensorInfo])
- Return type:
TensorInfo
- static from_dict(data)
- Parameters:
data (Dict)
- Return type:
- get_cat_size_product()
- Return type:
int
- get_cat_sizes()
- Return type:
Tensor
- get_feat_shape()
- Return type:
ndarray
- get_n_features()
- Return type:
int
- is_cat()
- Return type:
bool
- is_cont()
- Return type:
bool
- is_empty()
- Return type:
bool
- to_dict()
- Return type:
Dict
- class pytabkit.models.data.data.ValDictDataLoader
Bases:
object
- __init__(ds, val_idxs, val_batch_size=256)
Create a prediction dataloader from a dataset and validation indices.
- Parameters:
ds (DictDataset)
val_idxs (Tensor)
pytabkit.models.data.nested_dict module
- class pytabkit.models.data.nested_dict.NestedDict
Bases:
object
Dictionary that can be used with multiple indices. Instead of
    d = dict()
    d['first'] = dict()
    d['first']['second'] = 1.0
we can use
    d = NestedDict()
    d['first', 'second'] = 1.0
- __init__(data_dict=None)
- static from_kwargs(**kwargs)
- get(idxs, default=None)
- get_dict()
- Return type:
Dict
- update(other)
- Parameters:
other (NestedDict)
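The multi-index access described for NestedDict can be sketched in plain Python. The class below, TupleKeyDict, is a hypothetical stand-in written for illustration; it is not pytabkit's implementation and only mirrors the documented idea that a tuple of keys addresses a value in nested dictionaries.

```python
class TupleKeyDict:
    """Hypothetical stand-in for illustration, not pytabkit's NestedDict."""

    def __init__(self, data_dict=None):
        self.data = {} if data_dict is None else data_dict

    @staticmethod
    def _as_tuple(idxs):
        # Allow both d['a'] and d['a', 'b'].
        return idxs if isinstance(idxs, tuple) else (idxs,)

    def __setitem__(self, idxs, value):
        idxs = self._as_tuple(idxs)
        node = self.data
        # Create intermediate dictionaries on the way down.
        for key in idxs[:-1]:
            node = node.setdefault(key, {})
        node[idxs[-1]] = value

    def __getitem__(self, idxs):
        node = self.data
        for key in self._as_tuple(idxs):
            node = node[key]
        return node

    def get(self, idxs, default=None):
        # Return default if any key along the path is missing.
        try:
            return self[idxs]
        except (KeyError, TypeError):
            return default


d = TupleKeyDict()
d['first', 'second'] = 1.0
```

After the assignment, the underlying data is the same nested structure that the manual dict() version would build.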
pytabkit.models.data.splits module
- class pytabkit.models.data.splits.AllNothingSplitter
Bases:
Splitter
- get_idxs(ds)
- Parameters:
ds (DictDataset)
- Return type:
Tuple[Tensor, Tensor]
- get_split_sizes(n_samples)
- Parameters:
n_samples (int)
- Return type:
Tuple
- split_ds(ds)
- Parameters:
ds (DictDataset)
- Return type:
- class pytabkit.models.data.splits.IndexSplitter
Bases:
Splitter
- __init__(index)
- get_idxs(ds)
- Parameters:
ds (DictDataset)
- Return type:
Tuple[Tensor, Tensor]
- get_split_sizes(n_samples)
- Parameters:
n_samples (int)
- Return type:
Tuple
- class pytabkit.models.data.splits.KFoldSplitter
Bases:
MultiSplitter
- __init__(k, seed, stratified=False)
- Parameters:
k (int)
seed (int)
- get_idxs(ds)
- Parameters:
ds (DictDataset)
- Return type:
List[Tuple[Tensor, Tensor]]
- get_split_sizes(n_samples)
- Parameters:
n_samples (int)
- Return type:
Tuple
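As a rough illustration of the List[Tuple[Tensor, Tensor]] structure that get_idxs returns for a k-fold splitter, the following stand-alone sketch builds k (train, validation) index pairs from a seeded permutation. It uses plain Python lists instead of tensors, covers only the non-stratified case, and the function name k_fold_indices is hypothetical; pytabkit's actual fold assignment may differ.

```python
import random


def k_fold_indices(n_samples, k, seed):
    """Return k (train_idxs, val_idxs) pairs; hypothetical helper."""
    perm = list(range(n_samples))
    # Seeded shuffle so that the folds are reproducible.
    random.Random(seed).shuffle(perm)
    # Fold i takes every k-th element of the permutation, starting at i.
    folds = [perm[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        splits.append((train, val))
    return splits


splits = k_fold_indices(10, 5, seed=0)
```

Each sample appears in exactly one validation part, and every pair together covers all indices.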
- class pytabkit.models.data.splits.MultiSplitter
Bases:
object
- get_idxs(ds)
- Parameters:
ds (DictDataset)
- Return type:
List[Tuple[Tensor, Tensor]]
- split_ds(ds)
- Parameters:
ds (DictDataset)
- Return type:
List[Split]
- class pytabkit.models.data.splits.RandomSplitter
Bases:
Splitter
- __init__(seed, first_fraction=0.8, max_n_first=None)
- Parameters:
max_n_first (int | None)
- get_idxs(ds)
- Parameters:
ds (DictDataset)
- Return type:
Tuple[Tensor, Tensor]
- get_split_sizes(n_samples)
- Parameters:
n_samples (int)
- Return type:
Tuple
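The parameters seed, first_fraction and max_n_first suggest a seeded two-way random split whose first part can be capped. The sketch below shows one plausible reading under stated assumptions: the first part holds int(first_fraction * n_samples) indices, limited to max_n_first if given. The function random_split is hypothetical, works on plain Python lists, and the rounding rule is a guess rather than pytabkit's documented behavior.

```python
import random


def random_split(n_samples, seed, first_fraction=0.8, max_n_first=None):
    """Return (first_idxs, second_idxs); hypothetical helper."""
    # Assumed rounding: truncate towards zero.
    n_first = int(first_fraction * n_samples)
    if max_n_first is not None:
        # Cap the size of the first part.
        n_first = min(n_first, max_n_first)
    perm = list(range(n_samples))
    # Seeded shuffle so that the split is reproducible.
    random.Random(seed).shuffle(perm)
    return perm[:n_first], perm[n_first:]


first, second = random_split(10, seed=0)
capped_first, capped_second = random_split(10, seed=0, max_n_first=5)
```

The two returned lists play the role of the Tuple[Tensor, Tensor] from get_idxs, and their lengths correspond to what get_split_sizes would report for the same n_samples.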
- class pytabkit.models.data.splits.Split
Bases:
object
- __init__(ds, idxs)
- Parameters:
ds (DictDataset) – The dataset that is split into parts
idxs (Tuple[Tensor, Tensor]) – Tuple of Tensors containing indices of the different parts of ds
- get_sub_ds(i)
- get_sub_idxs(i)
- class pytabkit.models.data.splits.SplitInfo
Bases:
object
- __init__(splitter, split_type, id, alg_seed, train_fraction=0.75)
- Parameters:
splitter (Splitter)
split_type (str)
id (int)
alg_seed (int)
train_fraction (float)
- get_sub_seed(split_idx, is_cv)
- Parameters:
split_idx (int)
is_cv (bool)
- get_sub_splits(ds, n_splits, is_cv)
- Parameters:
ds (DictDataset)
n_splits (int)
is_cv (bool)
- Return type:
List[Split]
- get_train_and_val_size(n_samples, n_splits, is_cv)
- Parameters:
n_samples (int)
n_splits (int)
is_cv (bool)
- Return type:
Tuple[int, int]
- class pytabkit.models.data.splits.Splitter
Bases:
object
- get_idxs(ds)
- Parameters:
ds (DictDataset)
- Return type:
Tuple[Tensor, Tensor]
- get_split_sizes(n_samples)
- Parameters:
n_samples (int)
- Return type:
Tuple
- split_ds(ds)
- Parameters:
ds (DictDataset)
- Return type: