NN implementation
While RealMLP is implemented in PyTorch,
we extend the conventional nn.Module logic.
Traditionally, one writes some PyTorch code to assemble a NN model,
which is a nn.Module composed of building blocks
that are also nn.Module objects (Composite design pattern).
The nn.Module classes initialize the parameters in the constructor
and are then callable objects providing the forward() transformation.
Data preprocessing is done separately via different code/classes.
We use a different structure of classes that unifies preprocessing and NN layers,
which is useful for vectorized NNs:
The vectorized NNs can share a single non-preprocessed data set,
loaded into GPU RAM,
while having different preprocessing parameters
(fitted on different training sets since different splits are used).
Individual preprocessed data sets are never fully instantiated in GPU RAM;
instead, the vectorized NN models do preprocessing on batches individually,
which saves GPU RAM (we’re talking e.g.
about having 50-100 NNs on the same GPU at the same time).
The class structure uses three base classes:
Layerclasses are similar to nn.Module, but they do not perform random initialization in the constructor. Instead, they simply take the already initialized parameters as input. There are some additional features: Layer objects of the same type can be combined into a vectorized Layer. The vectorized NN is not built directly, but first NNs are built and initialized sequentially for better reproducibility (random seed etc.) and RAM saving, and then they are vectorized after initialization using the Layer.stack() function. Additionally, Layer classes work with the DictDataset class, which usually contains ‘x_cont’ and ‘x_cat’ tensors for continuous and categorical variables. Moreover, during training, we also pass the labels ‘y’ through the Layer, which allows to implement mixup, label smoothing, and output standardization as Layer objects.Fitterclasses initialize the NN based on a single forward pass on the (subsampled) training (and possibly validation) set. This is done using thefit()orfit_transform()functions similar to scikit-learn preprocessing classes, which return aLayerobject (and, in case offit_transform(), the transformed dataset). Initialization can be random or depending on the so far transformed training set. Typically, parameters of preprocessing layers such as standardization depend on the training set, while NN parameters do not depend on the training set. However, we also use weight and bias initializations that depend on the training set, and the unification of NN and preprocessing makes this much more convenient.FitterFactory(could also be called ArchitectureBuilder) classes build the NN structure based on the input and output shape and type. Specifically,FitterFactoryobjects can buildFitterobjects given the corresponding ‘tensor_infos’ of the data set, which specifies the number of continuous variables, the number of categorical variables and the category sizes, and the same for the labels. For example, aFitterFactorycan decide to use one-hot encoding for categorical variables with small category sizes, and Embedding layers for larger category sizes.
The Layer, Fitter, and FitterFactory classes are defined in model/base.py.
Other subclasses are also defined in model folder. There are some more features:
We introduce a class called
Variablethat inherits fromtorch.nn.Parameter. Variable has a parametertrainable: bool, and in the casetrainable==False, theLayerclass will register it usingregister_buffer(). One might also be able to just usenn.Parameter(..., requires_grad=False)for this, though we did not check whether it has the same effect (will it be saved when usingmodel.state_dict()?). There is also the convenience functionVariable.stack()used byLayer.stack(). Moreover, Variables can have names (to assign individual hyperparameter values to them), and they can have custom hyperparameter factors (e.g. to specify that the lr should be multiplied by a certain value for this Variable).The classes above can be given scope names, which are then prepended to variable names. For example, using scope names, the weight of the first linear layer in a NN could be called ‘net/first_layer/layer-0/weight’, where 0 is the layer index and ‘first_layer’ is redundant information that can be useful when regex matching variable names. One can assign an individual lr to this layer by using
lr={'': global_lr, '.*first_layer.*weight': first_layer_weight_lr}in**kwargsto theNNAlgInterface. This works as follows: TheHyperparamManager, which is available through a global context managed by theTrainContextclass, stores the hyperparameter configurations obtained through **kwargs. Different classes can require getters for specific hyperparameters for specific variables. If multiple lr values are specified above, the one from the last matching regex is taken. The scope names are passed on from FitterFactory to Fitter and then to Layer and Variable by a somewhat complicated context manager system, for which I didn’t find a more elegant solution.Fitter objects can be split up in three parts using the
split_off_dynamic()andsplit_off_individual()functions. The static part would typically be the one-hot encoding, since it does not depend on the data and is not trainable, which means that even in a vectorized context, it can be applied once to the single shared data set since it does not depend on the train/val/test split.
Then, there is the dynamic but not individual part, which can depend on the fitting data but is not trained or randomized, and can therefore be shared by models with the same trainval-test split. Finally, there is the individual (trainable/randomized) part, which is usually the NN part.Fitterclasses should implement methods that allow to estimate the RAM usage of the parameters and a forward pass, which allows to decide how many NNs fit onto a GPU when running the benchmark.