S2 Fig -
(a) Architecture of the variational autoencoder. The encoder, shown on the left, maps an input image to the parameters of a Gaussian distribution in the latent space. The decoder, shown on the right, maps points in the latent space back into the image space. (b) VoxNet architecture used in the classification tasks. The input images are of size 32 × 32 × 32. The notation r × Conv3D-k (3 × 3 × 3) denotes r consecutive 3D convolutional layers, each feeding into the next and each with k filters of size 3 × 3 × 3. MaxPool3D(2 × 2 × 2) indicates a 3D max pooling layer with pooling size 2 × 2 × 2. FC-k indicates a fully connected layer with k neurons. The PReLU activation function is used in every convolutional layer, while ReLU activation functions are used in the fully connected layers. Finally, every convolutional layer is followed by batch normalization.
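As a reading aid for the notation in (b), the following minimal Python sketch expands a layer specification into a flat list of layers and tracks the output shape of each. The specification `SPEC` is hypothetical: the actual values of r and k are given only in the figure, not in this caption. The sketch also assumes a single-channel input, "same" padding in the convolutions (so that only pooling changes the spatial size), and batch normalization applied before the PReLU activation; none of these is stated in the caption.

```python
# Hypothetical layer spec; the real values of r and k appear only in the figure.
SPEC = [
    ("conv", 2, 32),  # 2 × Conv3D-32 (3 × 3 × 3)
    ("pool",),        # MaxPool3D(2 × 2 × 2)
    ("conv", 2, 64),  # 2 × Conv3D-64 (3 × 3 × 3)
    ("pool",),        # MaxPool3D(2 × 2 × 2)
    ("fc", 128),      # FC-128
]


def expand(spec, side=32, channels=1):
    """Expand the caption's notation into (layer name, output shape) pairs.

    Assumptions (not stated in the caption): single-channel input, 'same'
    padding in every convolution, and batch normalization before PReLU.
    """
    layers = []
    for item in spec:
        kind = item[0]
        if kind == "conv":
            _, r, k = item
            # r convolutional layers in sequence, each with k filters.
            for _ in range(r):
                channels = k
                shape = (channels, side, side, side)
                layers.append(("Conv3D-%d (3x3x3)" % k, shape))
                layers.append(("BatchNorm", shape))
                layers.append(("PReLU", shape))
        elif kind == "pool":
            # Pooling size 2 halves each spatial dimension.
            side //= 2
            layers.append(("MaxPool3D (2x2x2)", (channels, side, side, side)))
        elif kind == "fc":
            _, k = item
            # Fully connected layer with k neurons, followed by ReLU.
            layers.append(("FC-%d" % k, (k,)))
            layers.append(("ReLU", (k,)))
        else:
            raise ValueError("unknown layer kind: %r" % (kind,))
    return layers


if __name__ == "__main__":
    for name, shape in expand(SPEC):
        print(name, shape)
```

Running the script prints one line per layer, which makes it easy to check how a 32 × 32 × 32 input shrinks through the pooling stages before reaching the fully connected layers.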