
Capsule Neural Networks

Author: Matteo Alberti



CNNs perform well when the validation set is very close to the training data, but their performance degrades when the images contain rotation, tilt, or more generally any kind of affine transformation. Until now, the main remedy has been to add more layers (to improve generalization), where the lower layers extract simple features such as small curves and edges, while the higher layers combine these simple features into complex ones.


To carry out this computation in a reasonable time, a max-pooling operator is used after each convolutional layer. However, this operator discards positional data, creating a sort of positional invariance.



So the recent advancement of Capsule Networks is to achieve not invariance (a small tolerance to changes) but equivariance (the representation changes in step with the transformation of the input), replacing the max-pooling operator with a new one: dynamic routing.





A capsule is a group of neurons. The activity vector of a capsule represents the instantiation parameters when an object (or a part of it) is detected by the network.

The length of this activity vector represents the probability that the object exists, while the orientation of the vector encodes all the Pose-Matrix information (rotation, translation).



The Pose Matrix draws inspiration from rendering (computer graphics), that is, the construction of an image starting from a hierarchical representation of geometric data. In this case the goal is to build a sort of inverse rendering:


The image is deconstructed on the basis of a hierarchical representation, which is matched with the extracted features.


When a capsule at a lower level is active, it makes predictions for the higher-level capsules via transformation matrices. If multiple predictions agree, a higher-level capsule becomes active. This result is achieved by replacing max-pooling with an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to the higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule.






The simple Capsule Network architecture is very shallow: only two convolutional layers and one fully connected layer.


The first convolutional layer has 256 kernels (9×9, stride 1) with a ReLU activation. It performs a spatial reduction and converts pixel intensities into the activities of local feature detectors, which are used as inputs to the primary capsules.

Primary capsules are the lowest level of multidimensional entities and correspond to the inverse-rendering process.

The second layer, the primary capsules, is a convolutional one: 32 channels of 8-dimensional capsules, where each primary capsule has 8 convolutional units (9×9 kernels with stride 2). Each capsule corresponds to a feature. The primary capsule layer outputs 32×6×6 capsules, each an 8D vector, and the capsules inside each 6×6 grid share their weights.

The final layer, called DigitCaps, produces one 16D capsule for each target class (like a softmax layer, but with neurons replaced by capsules).
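
The layer sizes quoted above can be checked with a little arithmetic. The sketch below assumes a 28×28 input (MNIST-sized) and convolutions without padding; neither assumption is stated in the description of the layers, and the helper name conv_output_size is only illustrative.

```python
def conv_output_size(input_size: int, kernel: int, stride: int) -> int:
    """Spatial size after a convolution with no padding."""
    return (input_size - kernel) // stride + 1


# Assumption: 28x28 single-channel input (MNIST-sized), no padding.
INPUT_SIZE = 28

# First convolutional layer: 9x9 kernels, stride 1.
conv1_size = conv_output_size(INPUT_SIZE, kernel=9, stride=1)

# Primary capsule layer: 9x9 kernels, stride 2.
primary_size = conv_output_size(conv1_size, kernel=9, stride=2)

PRIMARY_CHANNELS = 32  # capsule channels in the primary layer
PRIMARY_DIM = 8        # dimensions of each primary capsule

num_primary_capsules = PRIMARY_CHANNELS * primary_size * primary_size

print(conv1_size, primary_size, num_primary_capsules)  # 20 6 1152
```

Under these assumptions the 6×6 grid follows directly from the two 9×9 convolutions, and the primary layer yields 1152 capsules of 8 dimensions each.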


So the main differences are:


  • Replace neurons with capsules (and so scalars with vectors)
  • A new activation function (able to work with vectors)
  • Replace max-pooling with dynamic routing




Squash Function


As we have seen, we want the length of an activity vector to represent the probability of existence. We therefore use a non-linear function called “squashing”, which performs a sort of normalization of the length between zero and one.


Vectors with a length near zero are shrunk to almost zero, while long ones are shrunk to a length slightly below one.


The squashing function keeps the orientation of the extracted features.
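
A minimal sketch of such a function, assuming the form commonly given for capsule networks, v = (||s||^2 / (1 + ||s||^2)) (s / ||s||), where s is the input vector. The function names and the small eps term (added to avoid dividing by zero) are choices made for this example only.

```python
import math


def length(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def squash(s, eps=1e-9):
    """Rescale s so its length lies in [0, 1) while its direction is kept."""
    sq_norm = sum(x * x for x in s)
    # sq_norm / (1 + sq_norm) is the new length; dividing by the old
    # length turns s into a unit vector first.
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]


short = squash([0.01, 0.0])   # very short input vector
long_ = squash([30.0, 40.0])  # long input vector (length 50)

print(length(short))  # close to zero
print(length(long_))  # slightly below one
```

Every component is multiplied by the same positive factor, so the ratio between components, and therefore the orientation, is unchanged.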





Dynamic Routing


Dynamic Routing isn’t just a smarter max-pooling operator that preserves spatial information; it also gives us the opportunity to preserve the hierarchy of the parts. It is an alternative way to do the forward pass.


Dynamic Routing is an iterative process that connects capsules with similar activities.



In every iteration, each capsule computes a matrix multiplication with the outputs of the capsules in the layer below, keeping only those with a higher norm and a similar orientation.






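So the total input s_j of a capsule j at the upper level is given by: s_j = \sum_i c_{ij} \hat{u}_{j|i}, where \hat{u}_{j|i} is the prediction made by capsule i for capsule j.

Where c_{ij} is a non-negative scalar. For each capsule i at the lower level, the sum of all c_{ij} is equal to one (\sum_j c_{ij} = 1), and for each capsule the number of weights is equal to the number of capsules at the upper level.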


So with dynamic routing we first connect each capsule at one level with all the capsules at the upper layer, like an FC layer. Then, through the iterative process, we create a routing between capsules, like a sort of intelligent dropout that keeps only the most likely connections.
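
The iterative process can be sketched in plain Python. This is an illustrative version of routing-by-agreement, not a reference implementation: the prediction vectors are passed in directly (the transformation matrices are left out), the coupling weights come from a softmax over routing logits that start at zero, and three iterations is an assumed default. All names are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def squash(s, eps=1e-9):
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, num_iterations=3):
    """u_hat[i][j] is the prediction of lower capsule i for upper capsule j."""
    num_lower, num_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    # Routing logits start at zero: every connection is equally likely.
    b = [[0.0] * num_upper for _ in range(num_lower)]
    for _ in range(num_iterations):
        # Coupling weights: non-negative, summing to one per lower capsule.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_upper):
            # Weighted sum of the predictions for upper capsule j.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_lower))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement: scalar product between each prediction and the output.
        for i in range(num_lower):
            for j in range(num_upper):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c


# Three lower capsules, two upper capsules, 2D predictions.
# The predictions for upper capsule 0 agree; those for capsule 1 conflict.
u_hat = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.0, -1.0]],
    [[1.0, -0.1], [1.0, 0.0]],
]
v, c = dynamic_routing(u_hat)
print([round(norm(v_j), 3) for v_j in v])
print([[round(w, 3) for w in row] for row in c])
```

In this toy input the predictions for the first upper capsule point the same way, so its output grows longer and every lower capsule routes more than half of its weight to it.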




Training of weights and Loss Function


Once the forward pass is complete and the routing is fixed, we train the weights:

Two networks are used for this step:


  1. The first network, during the backpropagation step, maximizes the norm of the capsule of the target class (L^i_\mu)
  2. Another version of the network uses an MLP (like an autoencoder) to reconstruct the input image (L^i_p)


So the goal of these two networks is to minimize the following function:

Loss Function: L^i = L^i_\mu + \rho L^i_p, where L^i_p is weighted by a small coefficient \rho (\rho = 0.005)
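
As an illustration, the combined loss could look like the sketch below. The margin form used for the first term and its constants (m_plus = 0.9, m_minus = 0.1, lam = 0.5) are assumptions taken from the usual capsule-network formulation and are not given in this article; the function names are hypothetical, and rho uses the value quoted above.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss on capsule lengths; target is the index of the true class.

    Assumed form: the target capsule is pushed above m_plus and every
    other capsule is pushed below m_minus (down-weighted by lam).
    """
    total = 0.0
    for k, capsule_length in enumerate(lengths):
        if k == target:
            total += max(0.0, m_plus - capsule_length) ** 2
        else:
            total += lam * max(0.0, capsule_length - m_minus) ** 2
    return total


def total_loss(lengths, target, reconstruction_error, rho=0.005):
    """L = L_mu + rho * L_p, with rho as the small weighting coefficient."""
    return margin_loss(lengths, target) + rho * reconstruction_error


# Lengths of three class capsules; class 0 is the true class.
capsule_lengths = [0.95, 0.05, 0.2]
loss_margin = margin_loss(capsule_lengths, target=0)
loss_total = total_loss(capsule_lengths, target=0, reconstruction_error=10.0)
print(loss_margin, loss_total)
```

The small coefficient keeps the reconstruction term from dominating the term that drives classification.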




Reconstruction and Regularization


As we have seen in the training of weights, an MLP is used as a secondary network:

During training, we mask out everything except the activity vector of the correct digit capsule (a single class). Then, since the activity vector contains all the information from the Pose Matrix, we use it to reconstruct the input image.



The output of the digit capsule is fed into a decoder (MLP) that models the pixel intensities, and we minimize the sum of squared differences between the decoder output and the pixel intensities.
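
The masking step and the reconstruction error can be sketched as follows. The decoder itself is left out; the "reconstruction" in the example is a made-up list of pixel values, and the function names are hypothetical.

```python
def mask_capsules(capsules, target):
    """Zero every capsule except the target one, then flatten.

    The flat list is what would be fed to the decoder.
    """
    flat = []
    for k, vec in enumerate(capsules):
        flat.extend(vec if k == target else [0.0] * len(vec))
    return flat


def sum_squared_error(reconstruction, image):
    """Sum of squared differences between two flat lists of pixel values."""
    return sum((r - p) ** 2 for r, p in zip(reconstruction, image))


# Three class capsules of dimension 2; class 1 is the correct one.
capsules = [[0.1, 0.2], [0.7, 0.6], [0.0, 0.3]]
masked = mask_capsules(capsules, target=1)
print(masked)  # [0.0, 0.0, 0.7, 0.6, 0.0, 0.0]

# Made-up decoder output compared with a made-up 3-pixel image.
error = sum_squared_error([0.5, 0.0, 1.0], [1.0, 0.0, 1.0])
print(error)  # 0.25
```

Because only the target capsule survives the mask, the decoder has to rebuild the image from that one activity vector.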







Until now Capsule Networks, while achieving state-of-the-art results in image classification, have been tested only on MNIST data. Still, some relevant considerations can be made by analysing not only the accuracy achieved but also other factors and properties of the model.

We compare Capsule Networks under different combinations of two parameters:

  • number of routing iterations
  • Reconstruction (without reconstruction the Loss Function becomes: L^i = L^i_\mu)

And one optimized convolutional neural network (Baseline)


While Capsule Networks achieve the best results, the most relevant information is given by:


                                   Capsule Networks   CNN (Baseline)
Number of Parameters               8.2M               34.4M
Affine Transformation Robustness   79%                66%


The number of parameters is more than 4 times lower, and the equivariance properties of capsules allow us to achieve better results under affine transformations, thanks to improved generalization.


Summary of capsule properties:


  • Allow a hierarchy of parts and preserve spatial information
  • Robustness to affine transformations
  • Need less data thanks to their generalization properties
  • High discriminatory capacity on overlapping objects


However, Capsule Networks have not yet been tested on other datasets (large datasets like ImageNet), and the routing-by-agreement process slows down training.