A Visual Explanation of Gradient Descent Methods (Momentum, AdaGrad, RMSProp, Adam)

Author: Lili Jiang


Vanilla Gradient Descent


Step-by-step illustration of gradient descent algorithm.



The gradient descent with momentum algorithm (or Momentum for short) borrows an idea from physics. Imagine rolling a ball down the inside of a frictionless bowl. Instead of stopping at the bottom, the momentum it has accumulated pushes it forward, and the ball keeps rolling back and forth.

We can apply the concept of momentum to our vanilla gradient descent algorithm. In each step, in addition to the regular gradient, it also adds on the movement from the previous step. Mathematically, it is commonly expressed as:

delta = -learning_rate * gradient + previous_delta * decay_rate (eq. 1)

theta += delta (eq. 2)

I found it more intuitive if I massage this equation a little and keep track of the (decayed) cumulative sum of gradient instead. This will also make things easier when we introduce the Adam algorithm later.

sum_of_gradient = gradient + previous_sum_of_gradient * decay_rate (eq. 3)

delta = -learning_rate * sum_of_gradient (eq. 4)

theta += delta (eq. 5)

(What I did was factor out -learning_rate. To see the mathematical equivalence, you can substitute delta with -learning_rate * sum_of_gradient in eq. 1 to get eq. 3.)
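To check the equivalence numerically, here is a minimal sketch that runs both formulations on the same made-up gradient sequence. The function names, the starting point of 0, and the gradient values are assumptions chosen only for illustration.

```python
def momentum_eq1(gradients, learning_rate=0.1, decay_rate=0.9):
    """Eq. 1-2: track the previous step (delta) directly."""
    theta, delta = 0.0, 0.0
    path = []
    for gradient in gradients:
        delta = -learning_rate * gradient + delta * decay_rate
        theta += delta
        path.append(theta)
    return path


def momentum_eq3(gradients, learning_rate=0.1, decay_rate=0.9):
    """Eq. 3-5: track the decayed cumulative sum of gradients instead."""
    theta, sum_of_gradient = 0.0, 0.0
    path = []
    for gradient in gradients:
        sum_of_gradient = gradient + sum_of_gradient * decay_rate
        delta = -learning_rate * sum_of_gradient
        theta += delta
        path.append(theta)
    return path


gradients = [1.0, -0.5, 2.0, 0.25, -1.5]  # made-up gradient values
path_a = momentum_eq1(gradients)
path_b = momentum_eq3(gradients)
max_gap = max(abs(a - b) for a, b in zip(path_a, path_b))
print(max_gap)  # zero up to floating-point rounding
```

Both versions trace the same path; the second one just stores the quantity that Adam will reuse later.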


Step-by-step illustration of momentum descent. Watch live animation in the app. For the rest of this post, I sloppily use gradient x and gradient y in the visualization; in reality, because it’s gradient *descent*, it’s actually the negative of the gradient.


Let’s consider two extreme cases to understand this decay rate parameter better. If the decay rate is 0, then it is exactly the same as (vanilla) gradient descent. If the decay rate is 1 (and provided that the learning rate is reasonably small), then it rocks back and forth endlessly like the frictionless bowl analogy we mentioned in the beginning; you do not want that. Typically the decay rate is chosen around 0.8–0.9 — it’s like a surface with a little bit of friction so it eventually slows down and stops.
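The two extremes can be tried on a toy problem. The sketch below assumes a one-dimensional bowl f(x) = x²/2 (so the gradient is simply x), a learning rate of 0.01, and a start at x = 1; none of these values come from the app.

```python
def descend(decay_rate, learning_rate=0.01, steps=1000, start=1.0):
    """Run momentum descent (eq. 3-5) on f(x) = x**2 / 2, whose gradient is x."""
    x, sum_of_gradient = start, 0.0
    path = []
    for _ in range(steps):
        gradient = x
        sum_of_gradient = gradient + sum_of_gradient * decay_rate
        x += -learning_rate * sum_of_gradient
        path.append(x)
    return path


vanilla = descend(decay_rate=0.0)       # identical to plain gradient descent
damped = descend(decay_rate=0.9)        # a little friction
frictionless = descend(decay_rate=1.0)  # no friction at all

print(abs(damped[-1]))                            # very close to 0
print(max(abs(x) for x in frictionless[-100:]))   # still swinging back and forth
```

With decay 0 each step multiplies x by (1 - learning_rate), which is exactly vanilla gradient descent; with decay 1 the ball never settles.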


Step-by-step illustration of AdaGrad descent. Watch live animation in the app.





This property allows AdaGrad (and other similar gradient-squared-based methods like RMSProp and Adam) to escape a saddle point much better. AdaGrad will take a straight path, whereas gradient descent (or relatedly, Momentum) takes the approach of “let me slide down the steep slope first and maybe worry about the slower direction later”. Sometimes, vanilla gradient descent might just stop at the saddle point where gradients in both directions are 0 and be perfectly content there.


The problem with AdaGrad, however, is that it is incredibly slow. This is because the sum of gradient squared only grows and never shrinks. RMSProp (for Root Mean Square Propagation) fixes this issue by adding a decay factor.

sum_of_gradient_squared = previous_sum_of_gradient_squared * decay_rate + gradient² * (1 - decay_rate)

delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)

theta += delta

More precisely, the sum of gradient squared is actually the decayed sum of gradient squared. The decay rate is saying that only recent gradient² matters, and the ones from long ago are basically forgotten. As a side note, the term “decay rate” is a bit of a misnomer. Unlike the decay rate we saw in momentum, in addition to decaying, the decay rate here also has a scaling effect: it scales down the whole term by a factor of (1 - decay_rate). In other words, if the decay_rate is set at 0.99, in addition to decaying, the square root of the sum of gradient squared will be sqrt(1 - 0.99) = 0.1 times that of AdaGrad, and thus the step is on the order of 10x larger for the same learning rate.
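A small numeric sketch makes both effects visible. It assumes a constant made-up gradient of 2.0 and uses the standard AdaGrad accumulator (a plain running sum of gradient², with no decay) for comparison; the variable names are illustrative.

```python
import math

decay_rate = 0.99
gradient = 2.0      # a made-up constant gradient, for illustration only
adagrad_sum = 0.0   # assumed AdaGrad accumulator: plain running sum
rmsprop_sum = 0.0   # decayed sum from the RMSProp update above

# Scaling effect after a single step: sqrt(1 - decay_rate) = 0.1
first_step_ratio = (math.sqrt(gradient ** 2 * (1 - decay_rate))
                    / math.sqrt(gradient ** 2))

for _ in range(2000):
    adagrad_sum += gradient ** 2
    rmsprop_sum = rmsprop_sum * decay_rate + gradient ** 2 * (1 - decay_rate)

# The update is proportional to gradient / sqrt(sum); compare that factor.
adagrad_scale = gradient / math.sqrt(adagrad_sum)  # shrinks like 1 / sqrt(steps)
rmsprop_scale = gradient / math.sqrt(rmsprop_sum)  # settles near 1
print(first_step_ratio, adagrad_scale, rmsprop_scale)
```

AdaGrad's factor keeps shrinking as steps accumulate, while RMSProp's levels off because old gradient² terms are forgotten.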


To see the effect of the decaying, in this head-to-head comparison, AdaGrad (white) keeps up with RMSProp (green) initially, as expected with the tuned learning rate and decay rate. But the sums of gradient squared for AdaGrad accumulate so fast that they soon become humongous (demonstrated by the sizes of the squares in the animation). They take a heavy toll and eventually AdaGrad practically stops moving. RMSProp, on the other hand, has kept the squares under a manageable size the whole time, thanks to the decay rate. This makes RMSProp faster than AdaGrad.


Last but not least, Adam (short for Adaptive Moment Estimation) takes the best of both worlds of Momentum and RMSProp. Adam empirically works well, and thus in recent years, it is commonly the go-to choice of deep learning problems.

Let’s take a look at how it works:

sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 - beta1) [Momentum]

sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1 - beta2) [RMSProp]

delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)

theta += delta

Beta1 is the decay rate for the first moment, sum of gradient (aka momentum), commonly set at 0.9. Beta2 is the decay rate for the second moment, sum of gradient squared, and it is commonly set at 0.999.
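Here is a minimal sketch of the four update lines above applied to a toy problem. The objective f(theta) = (theta - 3)², the learning rate, and the step count are assumptions for illustration. The small eps in the denominator is an added guard against division by zero; the published Adam algorithm includes such a term and also bias-corrects both sums, which this simplified version skips.

```python
import math


def adam_minimize(grad_fn, theta, learning_rate=0.01, beta1=0.9, beta2=0.999,
                  steps=2000, eps=1e-8):
    """Follow the update lines above; eps is an added guard (assumption)."""
    sum_of_gradient = 0.0
    sum_of_gradient_squared = 0.0
    for _ in range(steps):
        gradient = grad_fn(theta)
        # First moment: decayed average of gradients (the Momentum part)
        sum_of_gradient = sum_of_gradient * beta1 + gradient * (1 - beta1)
        # Second moment: decayed average of squared gradients (the RMSProp part)
        sum_of_gradient_squared = (sum_of_gradient_squared * beta2
                                   + gradient ** 2 * (1 - beta2))
        delta = (-learning_rate * sum_of_gradient
                 / (math.sqrt(sum_of_gradient_squared) + eps))
        theta += delta
    return theta


# Hypothetical 1-D objective f(theta) = (theta - 3)**2, gradient 2 * (theta - 3).
result = adam_minimize(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(result)  # should land close to the minimum at 3
```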


Adam gets the speed from momentum and the ability to adapt gradients in different directions from RMSProp. The combination of the two makes it powerful.


Closing Words


Now that we have discussed all the methods, let’s watch a few races of all the descent methods we talked about so far! (There is some inevitable cherry-picking of parameters. The best way to get a taste is to play around yourself.)


In this terrain, there are two little hills blocking the way to the global minimum. Adam is the only one able to find its way to the global minimum. Whichever way the parameters are tuned, from this starting position at least, none of the other methods can get there. This means neither momentum nor adaptive gradient alone can do the trick. It’s really the combination of the two: first, momentum takes Adam beyond the local minimum where all the other balls stop; then, the adjustment from the sum of gradient squared pulls it sideways, because it is the direction less explored, leading to its final victory.

Here is another race. In this terrain, there is a flat region (plateau) surrounding the global minimum. With some parameter tuning, Momentum and Adam (thanks to its momentum component) can make it to the center, while the other methods can’t.

In summary, gradient descent is a class of algorithms that aims to find the minimum point on a function by following the gradient. Vanilla gradient descent just follows the gradient (scaled by learning rate). Two common tools to improve gradient descent are the sum of gradient (first moment) and the sum of the gradient squared (second moment). The Momentum method uses the first moment with a decay rate to gain speed. AdaGrad uses the second moment with no decay to deal with sparse features. RMSProp uses the second moment with a decay rate to speed up from AdaGrad. Adam uses both first and second moments, and is generally the best choice. There are a few other variations of gradient descent algorithms, such as Nesterov accelerated gradient, AdaDelta, etc., that are not covered in this post.

Lastly, I shall leave you with this momentum descent with no decay. Its path makes up a fun pattern. I see no practical use (yet) but present it here just for the funsies. [Edit: I take my word back about no practical use. Read more about this curve at]






Deep Learning for Object Detection: A Comprehensive Review

Author: Joyce Xu






With the rise of autonomous vehicles, smart video surveillance, facial detection and various people counting applications, fast and accurate object detection systems are rising in demand. These systems involve not only recognizing and classifying every object in an image, but localizing each one by drawing the appropriate bounding box around it. This makes object detection a significantly harder task than its traditional computer vision predecessor, image classification.

Fortunately, however, the most successful approaches to object detection are currently extensions of image classification models. A few months ago, Google released a new object detection API for Tensorflow. With this release came the pre-built architectures and weights for a few specific models:

In my last blog post, I covered the intuition behind the three base network architectures listed above: MobileNets, Inception, and ResNet. This time around, I want to do the same for Tensorflow’s object detection models: Faster R-CNN, R-FCN, and SSD. By the end of this post, we will hopefully have gained an understanding of how deep learning is applied to object detection, and how these object detection models both inspire and diverge from one another.

Faster R-CNN

Faster R-CNN is now a canonical model for deep learning-based object detection. It helped inspire many detection and segmentation models that came after it, including the two others we’re going to examine today. Unfortunately, we can’t really begin to understand Faster R-CNN without understanding its own predecessors, R-CNN and Fast R-CNN, so let’s take a quick dive into its ancestry.


R-CNN is the grand-daddy of Faster R-CNN. In other words, R-CNN really kicked things off.

R-CNN, or Region-based Convolutional Neural Network, consisted of 3 simple steps:

  1. Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
  2. Run a convolutional neural net (CNN) on top of each of these region proposals
  3. Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.

These 3 steps are illustrated in the image below:



In other words, we first propose regions, then extract features, and then classify those regions based on their features. In essence, we have turned object detection into an image classification problem. R-CNN was very intuitive, but very slow.

Fast R-CNN

R-CNN’s immediate descendant was Fast R-CNN. Fast R-CNN resembled the original in many ways, but improved on its detection speed through two main augmentations:

  1. Performing feature extraction over the image before proposing regions, thus only running one CNN over the entire image instead of 2000 CNNs over 2000 overlapping regions
  2. Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model

The new model looked something like this:



As we can see from the image, we are now generating region proposals based on the last feature map of the network, not from the original image itself. As a result, we can train just one CNN for the entire image.

In addition, instead of training many different SVM’s to classify each object class, there is a single softmax layer that outputs the class probabilities directly. Now we only have one neural net to train, as opposed to one neural net and many SVM’s.

Fast R-CNN performed much better in terms of speed. There was just one big bottleneck remaining: the selective search algorithm for generating region proposals.

Faster R-CNN

At this point, we’re back to our original target: Faster R-CNN. The main insight of Faster R-CNN was to replace the slow selective search algorithm with a fast neural net. Specifically, it introduced the region proposal network (RPN).

Here’s how the RPN worked:

  • At the last layer of an initial CNN, a 3×3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d)
  • For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes)
  • Each region proposal consists of a) an “objectness” score for that region and b) 4 coordinates representing the bounding box of the region

In other words, we look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are. This is what it looks like at one sliding window location:


The 2k scores represent the softmax probability of each of the k bounding boxes being an “object.” Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an “objectness” score above a certain threshold, that box’s coordinates get passed forward as a region proposal.
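To make the anchor boxes concrete, here is a rough sketch that builds the k boxes for one sliding-window location. The three scales and three aspect ratios are illustrative assumptions (they give k = 9), and the function name is hypothetical rather than part of any library.

```python
import math


def anchors_at(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes as (x1, y1, x2, y2).

    Each box has area scale**2 and width / height equal to ratio.
    The default scales and ratios are illustrative assumptions.
    """
    boxes = []
    for scale in scales:
        for ratio in ratios:
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            boxes.append((center_x - width / 2, center_y - height / 2,
                          center_x + width / 2, center_y + height / 2))
    return boxes


boxes = anchors_at(center_x=300, center_y=200)
k = len(boxes)
# Two objectness scores and four coordinates are predicted per anchor.
print(k, "anchors ->", 2 * k, "objectness scores and", 4 * k, "coordinates")
```

The network does not predict these boxes from scratch; it scores each fixed anchor and adjusts its coordinates.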

Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.



Altogether, Faster R-CNN achieved much better speeds and a state-of-the-art accuracy. It is worth noting that although future models did a lot to increase detection speeds, few models managed to outperform Faster R-CNN by a significant margin. In other words, Faster R-CNN may not be the simplest or fastest method for object detection, but it is still one of the best performing. Case in point, Tensorflow’s Faster R-CNN with Inception ResNet is their slowest but most accurate model.

At the end of the day, Faster R-CNN may look complicated, but its core design is the same as the original R-CNN: hypothesize object regions and then classify them. This is now the predominant pipeline for many object detection models, including our next one.


Remember how Fast R-CNN improved on the original’s detection speed by sharing a single CNN computation across all region proposals? That kind of thinking was also the motivation behind R-FCN: increase speed by maximizing shared computation.

R-FCN, or Region-based Fully Convolutional Net, shares 100% of the computations across every single output. Being fully convolutional, it ran into a unique problem in model design.

On the one hand, when performing classification of an object, we want to learn location invariance in a model: regardless of where the cat appears in the image, we want to classify it as a cat. On the other hand, when performing detection of the object, we want to learn location variance: if the cat is in the top left-hand corner, we want to draw a box in the top left-hand corner. So if we’re trying to share convolutional computations across 100% of the net, how do we compromise between location invariance and location variance?

R-FCN’s solution: position-sensitive score maps.

Each position-sensitive score map represents one relative position of one object class. For example, one score map might activate wherever it detects the top-right of a cat. Another score map might activate where it sees the bottom-left of a car. You get the point. Essentially, these score maps are convolutional feature maps that have been trained to recognize certain parts of each object.

Now, R-FCN works as follows:

  1. Run a CNN (in this case, ResNet) over the input image
  2. Add a fully convolutional layer to generate a score bank of the aforementioned “position-sensitive score maps.” There should be k²(C+1) score maps, with k² representing the number of relative positions to divide an object (e.g. 3² for a 3 by 3 grid) and C+1 representing the number of classes plus the background.
  3. Run a fully convolutional region proposal network (RPN) to generate regions of interest (RoI’s)
  4. For each RoI, divide it into the same k² “bins” or subregions as the score maps
  5. For each bin, check the score bank to see if that bin matches the corresponding position of some object. For example, if I’m on the “upper-left” bin, I will grab the score maps that correspond to the “upper-left” corner of an object and average those values in the RoI region. This process is repeated for each class.
  6. Once each of the k² bins has an “object match” value for each class, average the bins to get a single score per class.
  7. Classify the RoI with a softmax over the remaining C+1 dimensional vector
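Steps 4 through 7 can be sketched in a few lines. This is a toy version under stated assumptions: the score bank is a nested list indexed as score_maps[bin_row][bin_col][class], the RoI is given in whole feature-map cells, and the score values are made up. It is not how any real implementation stores its tensors.

```python
import math


def constant_grid(value, size):
    """A size x size grid filled with one value (toy stand-in for a score map)."""
    return [[value] * size for _ in range(size)]


def position_sensitive_scores(score_maps, roi, k, num_classes):
    """Average-pool each RoI bin from its own score map, then average the bins."""
    x1, y1, x2, y2 = roi
    bin_w, bin_h = (x2 - x1) / k, (y2 - y1) / k
    class_scores = []
    for c in range(num_classes):
        bin_values = []
        for i in range(k):        # bin row
            for j in range(k):    # bin column
                rows = range(int(y1 + i * bin_h), int(y1 + (i + 1) * bin_h))
                cols = range(int(x1 + j * bin_w), int(x1 + (j + 1) * bin_w))
                # Bin (i, j) only reads the map trained for position (i, j).
                cells = [score_maps[i][j][c][r][q] for r in rows for q in cols]
                bin_values.append(sum(cells) / len(cells))
        class_scores.append(sum(bin_values) / len(bin_values))
    return class_scores


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Toy score bank: k = 2, two classes (0 = background, 1 = "cat"), 4x4 maps.
# Every "cat" map is all ones and every background map is all zeros.
k, num_classes, size = 2, 2, 4
score_maps = [[[constant_grid(float(c), size) for c in range(num_classes)]
               for _ in range(k)] for _ in range(k)]
scores = position_sensitive_scores(score_maps, roi=(0, 0, 4, 4), k=k,
                                   num_classes=num_classes)
probabilities = softmax(scores)
print(scores, probabilities)
```

Note that no learned layers run per RoI here: the per-region work is only averaging, which is where the speed comes from.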

Altogether, R-FCN looks something like this, with an RPN generating the RoI’s:


Even with the explanation and the image, you might still be a little confused on how this model works. Honestly, R-FCN is much easier to understand when you can visualize what it’s doing. Here is one such example of an R-FCN in practice, detecting a baby:






Simply put, R-FCN considers each region proposal, divides it up into sub-regions, and iterates over the sub-regions asking: “does this look like the top-left of a baby?”, “does this look like the top-center of a baby?” “does this look like the top-right of a baby?”, etc. It repeats this for all possible classes. If enough of the sub-regions say “yes, I match up with that part of a baby!”, the RoI gets classified as a baby after a softmax over all the classes.

With this setup, R-FCN is able to simultaneously address location variance by proposing different object regions, and location invariance by having each region proposal refer back to the same bank of score maps. These score maps should learn to classify a cat as a cat, regardless of where the cat appears. Best of all, it is fully convolutional, meaning all of the computation is shared throughout the network.

As a result, R-FCN is several times faster than Faster R-CNN, and achieves comparable accuracy.


Our final model is SSD, which stands for Single-Shot Detector. Like R-FCN, it provides enormous speed gains over Faster R-CNN, but does so in a markedly different manner.

Our first two models performed region proposals and region classifications in two separate steps. First, they used a region proposal network to generate regions of interest; next, they used either fully-connected layers or position-sensitive convolutional layers to classify those regions. SSD does the two in a “single shot,” simultaneously predicting the bounding box and the class as it processes the image.

Concretely, given an input image and a set of ground truth labels, SSD does the following:

  1. Pass the image through a series of convolutional layers, yielding several sets of feature maps at different scales (e.g. 10×10, then 6×6, then 3×3, etc.)
  2. For each location in each of these feature maps, use a 3×3 convolutional filter to evaluate a small set of default bounding boxes. These default bounding boxes are essentially equivalent to Faster R-CNN’s anchor boxes.
  3. For each box, simultaneously predict a) the bounding box offset and b) the class probabilities
  4. During training, match the ground truth box with these predicted boxes based on IoU. The best predicted box will be labeled a “positive,” along with all other boxes that have an IoU with the truth >0.5.
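The matching rule in step 4 depends on IoU (intersection over union). Below is a small sketch with made-up boxes in (x1, y1, x2, y2) form; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


def match_positives(truth, candidates, threshold=0.5):
    """Indices labeled positive: the best-overlapping box plus any above threshold."""
    overlaps = [iou(truth, box) for box in candidates]
    best = max(range(len(candidates)), key=lambda i: overlaps[i])
    return sorted({best} | {i for i, o in enumerate(overlaps) if o > threshold})


truth = (0, 0, 10, 10)
candidates = [(0, 0, 10, 10), (0, 0, 10, 8), (5, 5, 15, 15), (20, 20, 30, 30)]
positives = match_positives(truth, candidates)
print(positives)  # [0, 1]
```

The third box overlaps the truth only a little (IoU of 25/175) and the fourth not at all, so both stay negative.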

SSD sounds straightforward, but training it has a unique challenge. With the previous two models, the region proposal network ensured that everything we tried to classify had some minimum probability of being an “object.” With SSD, however, we skip that filtering step. We classify and draw bounding boxes from every single position in the image, using multiple different shapes, at several different scales. As a result, we generate a much greater number of bounding boxes than the other models, and nearly all of them are negative examples.
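A quick count shows how large that number gets. The feature map sizes and boxes per location below are recalled from the SSD paper's 300×300 configuration and are best treated as an assumption rather than something stated in this post.

```python
# (feature map side length, default boxes per location); assumed values
# recalled from the SSD paper's 300x300 configuration.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total_boxes = sum(side * side * boxes for side, boxes in feature_maps)
print(total_boxes)  # 8732
```

With only a handful of objects in a typical image, almost every one of these boxes ends up as a negative.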

To fix this imbalance, SSD does two things. Firstly, it uses non-maximum suppression to group together highly-overlapping boxes into a single box. In other words, if four boxes of similar shapes, sizes, etc. contain the same dog, NMS would keep the one with the highest confidence and discard the rest. Secondly, the model uses a technique called hard negative mining to balance classes during training. In hard negative mining, only a subset of the negative examples with the highest training loss (i.e. false positives) are used at each iteration of training. SSD keeps a 3:1 ratio of negatives to positives.
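Both ideas are short enough to sketch. The version below is a simplified illustration with made-up boxes, scores, and losses: a greedy non-maximum suppression over a single class, and a hard negative selection that keeps the highest-loss negatives at a 3:1 ratio. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box of each overlapping group; return kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


def hard_negatives(negative_losses, num_positives, ratio=3):
    """Indices of the highest-loss negatives, at most ratio * num_positives."""
    order = sorted(range(len(negative_losses)),
                   key=lambda i: negative_losses[i], reverse=True)
    return order[:ratio * num_positives]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (0, 1, 10, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.6]
kept = non_maximum_suppression(boxes, scores)

negative_losses = [0.1, 2.3, 0.05, 1.7, 0.9, 0.4, 3.1]  # made-up losses
mined = hard_negatives(negative_losses, num_positives=1)
print(kept, mined)  # [0, 3] [6, 1, 3]
```

The three boxes around the same object collapse to the most confident one, and only the three worst-classified negatives are used for the single positive.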

Its architecture looks like this:



As I mentioned above, there are “extra feature layers” at the end that scale down in size. These varying-size feature maps help capture objects of different sizes. For example, here is SSD in action.


In smaller feature maps (e.g. 4×4), each cell covers a larger region of the image, enabling them to detect larger objects. Region proposal and classification are performed simultaneously: given p object classes, each bounding box is associated with a (4+p)-dimensional vector that outputs 4 box offset coordinates and p class probabilities. In the last step, softmax is again used to classify the object.

Ultimately, SSD is not so different from the first two models. It simply skips the “region proposal” step, instead considering every single bounding box in every location of the image simultaneously with its classification. Because SSD does everything in one shot, it is the fastest of the three models, and still performs quite comparably.


Faster R-CNN, R-FCN, and SSD are three of the best and most widely used object detection models out there right now. Other popular models tend to be fairly similar to these three, all relying on deep CNN’s (read: ResNet, Inception, etc.) to do the initial heavy lifting and largely following the same proposal/classification pipeline.

At this point, putting these models to use just requires knowing Tensorflow’s API. Tensorflow has a starter tutorial on using these models here. Give it a try, and happy hacking!



An intuitive guide to deep network architectures

Author: Joyce Xu




GoogLeNet, 2014

Over the past few years, much of the progress in deep learning for computer vision can be boiled down to just a handful of neural network architectures. Setting aside all the math, the code, and the implementation details, I wanted to explore one simple question: how and why do these models work?

At the time of writing, Keras ships with six of these pre-trained models already built into the library:

  • VGG16
  • VGG19
  • ResNet50
  • Inception v3
  • Xception
  • MobileNet

The VGG networks, along with the earlier AlexNet from 2012, follow the now archetypal layout of basic conv nets: a series of convolutional, max-pooling, and activation layers before some fully-connected classification layers at the end. MobileNet is essentially a streamlined version of the Xception architecture optimized for mobile applications. The remaining three, however, truly redefine the way we look at neural networks.

The rest of this post will focus on the intuition behind the ResNet, Inception, and Xception architectures, and why they have become building blocks for so many subsequent works in computer vision.


ResNet was born from a beautifully simple observation: why do very deep nets perform worse as you keep adding layers?

Intuitively, deeper nets should perform no worse than their shallower counterparts, at least at train time (when there is no risk of overfitting). As a thought experiment, let’s say we’ve built a net with n layers that achieves a certain accuracy. At minimum, a net with n+1 layers should be able to achieve the exact same accuracy, if only by copying over the same first n layers and performing an identity mapping for the last layer. Similarly, nets of n+2, n+3, and n+4 layers could all continue performing identity mappings and achieve the same accuracy. In practice, however, these deeper nets almost always degrade in performance.

The authors of ResNet boiled these problems down to a single hypothesis: direct mappings are hard to learn. And they proposed a fix: instead of trying to learn an underlying mapping from x to H(x), learn the difference between the two, or the “residual.” Then, to calculate H(x), we can just add the residual to the input.

Say the residual is F(x)=H(x)-x. Now, instead of trying to learn H(x) directly, our nets are trying to learn F(x)+x.

This gives rise to the famous ResNet (or “residual network”) block you’ve probably seen:



ResNet block


Each “block” in ResNet consists of a series of layers and a “shortcut” connection adding the input of the block to its output. The “add” operation is performed element-wise, and if the input and output are of different sizes, zero-padding or projections (via 1×1 convolutions) can be used to create matching dimensions.

If we go back to our thought experiment, this simplifies our construction of identity layers greatly. Intuitively, it’s much easier to learn to push F(x) to 0 and leave the output as x than to learn an identity transformation from scratch. In general, ResNet gives layers a “reference” point — x — to start learning from.
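A tiny sketch shows why the identity is so cheap for a residual block. It assumes F is two small linear layers with a ReLU in between, written with plain lists; the layer shapes and names are illustrative and not the actual ResNet layers.

```python
def linear(weights, x):
    """Multiply a square weight matrix (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, w2):
    """Compute F(x) + x, where F is two linear layers with a ReLU between."""
    residual = linear(w2, relu(linear(w1, x)))
    return [r + v for r, v in zip(residual, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
output = residual_block(x, zeros, zeros)
print(output)  # F(x) is all zeros, so the block passes x through unchanged
```

Pushing the weights toward zero is all it takes to get an identity mapping, whereas a plain layer would have to learn the identity matrix itself.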

This idea works astoundingly well in practice. Previously, deep neural nets often suffered from the problem of vanishing gradients, in which gradient signals from the error function decreased exponentially as they backpropagated to earlier layers. In essence, by the time the error signals traveled all the way back to the early layers, they were so small that the net couldn’t learn. However, because the gradient signal in ResNets could travel back directly to early layers via shortcut connections, we could suddenly build 50-layer, 101-layer, 152-layer, and even (apparently) 1000+ layer nets that still performed well. At the time, this was a huge leap forward from the previous state-of-the-art, which won the ILSVRC 2014 challenge with 22 layers.

ResNet is one of my personal favorite developments in the neural network world. So many deep learning papers come out with minor improvements from hacking away at the math, the optimizations, and the training process without thought to the underlying task of the model. ResNet fundamentally changed the way we understand neural networks and how they learn.

Fun facts:

  • The 1000+ layer net is open-source! I would not really recommend you try re-training it, but…
  • If you’re feeling functional and a little frisky, I recently ported ResNet50 to the open-source Clojure ML library Cortex. Try it out and see how it compares to Keras!


If ResNet was all about going deeper, the Inception Family™ is all about going wider. In particular, the authors of Inception were interested in the computational efficiency of training larger nets. In other words: how can we scale up neural nets without increasing computational cost?

The original paper focused on a new building block for deep nets, a block now known as the “Inception module.” At its core, this module is the product of two key insights.

The first insight relates to layer operations. In a traditional conv net, each layer extracts information from the previous layer in order to transform the input data into a more useful representation. However, each layer type extracts a different kind of information. The output of a 5×5 convolutional kernel tells us something different from the output of a 3×3 convolutional kernel, which tells us something different from the output of a max-pooling kernel, and so on and so on. At any given layer, how do we know what transformation provides the most “useful” information?

Insight #1: why not let the model choose?

An Inception module computes multiple different transformations over the same input map in parallel, concatenating their results into a single output. In other words, for each layer, Inception does a 5×5 convolutional transformation, and a 3×3, and a max-pool. And the next layer of the model gets to decide if (and how) to use each piece of information.

The increased information density of this model architecture comes with one glaring problem: we’ve drastically increased computational costs. Not only are large (e.g. 5×5) convolutional filters inherently expensive to compute, stacking multiple different filters side by side greatly increases the number of feature maps per layer. And this increase becomes a deadly bottleneck in our model.

Think about it this way. For each additional filter added, we have to convolve over all the input maps to calculate a single output. See the image below: creating one output map from a single filter involves computing over every single map from the previous layer.


Let’s say there are M input maps. One additional filter means convolving over M more maps; N additional filters means convolving over N*M more maps. In other words, as the authors note, “any uniform increase in the number of [filters] results in a quadratic increase of computation.” Our naive Inception module just tripled or quadrupled the number of filters. Computationally speaking, this is a Big Bad Thing.

This leads to insight #2: using 1×1 convolutions to perform dimensionality reduction. In order to solve the computational bottleneck, the authors of Inception used 1×1 convolutions to “filter” the depth of the outputs. A 1×1 convolution only looks at one spatial location at a time, but because it spans all of the channels at that location, it can combine cross-channel information and compress it down to a lower dimension. For example, using 20 1×1 filters, an input of size 64x64x100 (with 100 feature maps) can be compressed down to 64x64x20. By reducing the number of input maps, the authors of Inception were able to stack different layer transformations in parallel, resulting in nets that were simultaneously deep (many layers) and “wide” (many parallel operations).
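The savings are easy to estimate with some arithmetic. The sketch below counts multiply-adds for a 5×5 convolution applied directly to the 64x64x100 input, versus the same convolution applied after the 1×1 reduction to 20 maps. The 32 output filters, stride 1, and same padding are assumptions added for the example.

```python
def conv_multiply_adds(height, width, kernel, in_channels, out_channels):
    """Multiply-adds for a stride-1, same-padded convolution."""
    return height * width * kernel * kernel * in_channels * out_channels


height, width, in_maps = 64, 64, 100  # input size from the example above
reduced_maps = 20                     # number of 1x1 filters from the example
out_maps = 32                         # hypothetical number of 5x5 filters

direct = conv_multiply_adds(height, width, 5, in_maps, out_maps)
bottleneck = (conv_multiply_adds(height, width, 1, in_maps, reduced_maps)
              + conv_multiply_adds(height, width, 5, reduced_maps, out_maps))
print(direct, bottleneck, round(direct / bottleneck, 2))
# 327680000 73728000 4.44
```

Even after paying for the extra 1×1 layer, the reduced path needs well under a quarter of the work in this setup.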

How well did this work? The first version of Inception, dubbed “GoogLeNet,” was the 22-layer winner of the ILSVRC 2014 competition I mentioned earlier. Inception v2 and v3 were developed in a second paper a year later, and improved on the original in several ways, most notably by refactoring larger convolutions into consecutive smaller ones that were easier to learn. In v3, for example, the 5×5 convolution was replaced with 2 consecutive 3×3 convolutions.

Inception rapidly became a defining model architecture. The latest version of Inception, v4, even threw in residual connections within each module, creating an Inception-ResNet hybrid. Most importantly, however, Inception demonstrated the power of well-designed “network-in-network” architectures, adding yet another step to the representational power of neural networks.

Fun facts:

  • The original Inception paper literally cites the “we need to go deeper” internet meme as an inspiration for its name. This must be the first time got listed as the first reference of a Google paper.
  • The second Inception paper (with v2 and v3) was released just one day after the original ResNet paper. December 2015 was a good time for deep learning.


Xception stands for “extreme inception.” Rather like our previous two architectures, it reframes the way we look at neural nets — conv nets in particular. And, as the name suggests, it takes the principles of Inception to an extreme.

Here’s the hypothesis: “cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly.”

What does this mean? Well, in a traditional conv net, convolutional layers seek out correlations across both space and depth. Let’s take another look at our standard convolutional layer:


In the image above, the filter simultaneously considers a spatial dimension (each 2×2 colored square) and a cross-channel or “depth” dimension (the stack of four squares). At the input layer of an image, this is equivalent to a convolutional filter looking at a 2×2 patch of pixels across all three RGB channels. Here’s the question: is there any reason we need to consider both the image region and the channels at the same time?

In Inception, we began separating the two slightly. We used 1×1 convolutions to project the original input into several separate, smaller input spaces, and from each of those input spaces we used a different type of filter to transform those smaller 3D blocks of data. Xception takes this one step further. Instead of partitioning input data into several compressed chunks, it maps the spatial correlations for each output channel separately, and then performs a 1×1 convolution across the depth dimension to capture cross-channel correlation.


The author notes that this is essentially equivalent to an existing operation known as a “depthwise separable convolution,” which consists of a depthwise convolution (a spatial convolution performed independently for each channel) followed by a pointwise convolution (a 1×1 convolution across channels). We can think of this as looking for correlations across a 2D space first, followed by looking for correlations across a 1D space. Intuitively, this 2D + 1D mapping is easier to learn than a full 3D mapping.
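To make the 2D + 1D idea concrete, here is a minimal NumPy sketch of a depthwise separable convolution next to a standard one. It is an illustration only, not the Xception implementation: the function names, the tiny input shape, and the "valid" padding are all assumptions chosen for readability. As is usual in deep learning, "convolution" here means sliding a filter without flipping it.

```python
import numpy as np

def conv2d(x, w):
    """Standard convolution: every filter looks at space and depth jointly.
    x: (H, W, C_in), w: (k, k, C_in, C_out) -> (H-k+1, W-k+1, C_out)."""
    k = w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w, w.shape[3]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]               # (k, k, C_in)
            out[i, j] = np.tensordot(patch, w, axes=3)   # sum over k, k, C_in
    return out

def depthwise_separable_conv2d(x, depthwise_w, pointwise_w):
    """Depthwise step (one k x k filter per channel), then a 1x1 pointwise step.
    depthwise_w: (k, k, C_in), pointwise_w: (C_in, C_out)."""
    k = depthwise_w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    spatial = np.zeros((out_h, out_w, x.shape[2]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]
            # 2D step: each channel is filtered on its own, no mixing yet.
            spatial[i, j] = (patch * depthwise_w).sum(axis=(0, 1))
    # 1D step: mix channels at every pixel (this is the 1x1 convolution).
    return spatial @ pointwise_w

def param_counts(k, c_in, c_out):
    """Weights in a standard layer vs. a depthwise separable one (no biases)."""
    return k * k * c_in * c_out, k * k * c_in + c_in * c_out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
depthwise_w = rng.normal(size=(3, 3, 4))
pointwise_w = rng.normal(size=(4, 6))

y_separable = depthwise_separable_conv2d(x, depthwise_w, pointwise_w)

# The separable layer equals a standard layer whose 3D filters factor into
# a spatial part times a channel part.
full_w = depthwise_w[:, :, :, None] * pointwise_w[None, None, :, :]
y_full = conv2d(x, full_w)

print(np.allclose(y_separable, y_full))
print(param_counts(3, 64, 128))  # (73728, 8768)
```

The last line shows why the factorization is attractive: for a 3×3 layer going from 64 to 128 channels (numbers picked purely for illustration), the separable form needs roughly an eighth of the weights of the standard form.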

And it works! Xception slightly outperforms Inception v3 on the ImageNet dataset, and vastly outperforms it on a larger image classification dataset with 17,000 classes. Most importantly, it has the same number of model parameters as Inception, implying a greater computational efficiency. Xception is much newer (it came out in April 2017), but as mentioned above, its architecture is already powering Google’s mobile vision applications through MobileNet.

Fun facts:

  • The author of Xception is also the author of Keras. Francois Chollet is a living god.

Moving forward

That’s it for ResNet, Inception, and Xception! I firmly believe in having a strong intuitive understanding of these networks, because they are becoming ubiquitous in research and industry alike. We can even use them in our own applications with something called transfer learning.

Transfer learning is a technique in machine learning in which we apply knowledge from a source domain (e.g. ImageNet) to a target domain that may have significantly fewer data points. In practice, this generally involves initializing a model with pre-trained weights from ResNet, Inception, etc. and either using it as a feature extractor, or fine-tuning the last few layers on a new dataset. With transfer learning, these models can be re-purposed for any related task we want, from object detection for self-driving vehicles to generating captions for video clips.
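The "feature extractor" variant can be sketched in a few lines of NumPy. This is a toy under loud assumptions: the frozen backbone is a fixed random projection standing in for real pre-trained ResNet or Inception weights, the target data is synthetic, and the names `extract_features` and `train_head` are made up for this example. In practice you would load the pre-trained weights through a deep learning framework; the point here is only that the backbone is never updated and a small new layer is trained on top.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone (assumption: real weights would come
# from a model trained on a large source dataset such as ImageNet).
W_backbone = rng.normal(size=(20, 16)) / np.sqrt(20)

def extract_features(x):
    """Frozen feature extractor: fixed projection + ReLU, never trained here."""
    return np.maximum(x @ W_backbone, 0.0)

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(features, labels, W, b):
    probs = softmax(features @ W + b)
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())

def train_head(features, labels, num_classes, lr=0.1, epochs=200):
    """Train only a new softmax layer on top of the frozen features."""
    W = np.zeros((features.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        grad = (softmax(features @ W + b) - onehot) / len(labels)
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# A small synthetic "target domain" with far fewer examples than ImageNet.
x_target = rng.normal(size=(60, 20))
y_target = (x_target[:, 0] > 0).astype(int)

features = extract_features(x_target)
loss_before = cross_entropy(features, y_target, np.zeros((16, 2)), np.zeros(2))
W_head, b_head = train_head(features, y_target, num_classes=2)
loss_after = cross_entropy(features, y_target, W_head, b_head)
print(loss_before, loss_after)
```

Fine-tuning differs only in that some of the backbone weights (typically the last few layers) are also allowed to change, usually with a small learning rate.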

To get started with transfer learning, Keras has a wonderful guide to fine-tuning models here. If it sounds interesting to you, check it out — and happy hacking!


Using Deep Learning to improve FIFA 18 graphics

Author: Chintan Trivedi





Comparison of Cristiano Ronaldo’s face, with the left one from FIFA 18 and the right one generated by a Deep Neural Network.

Game studios spend millions of dollars and thousands of development hours designing game graphics to make them look as close to reality as possible. While the graphics have looked amazingly realistic in the last few years, it is still easy to distinguish them from the real world. However, with the massive advancements made in the field of image processing using Deep Neural Networks, is it time to leverage them to improve the graphics while also reducing the effort required to create them?

Let us try to answer that using the game FIFA 18…

Football (i.e. soccer) being my favorite sport, FIFA becomes the natural game of choice for all of my deep learning experiments. To find out whether the recent developments in deep learning can help me answer my question, I tried to focus on improving the player faces in FIFA using the (in?)famous deepfakes algorithm. It is a Deep Neural Network that can be trained to learn and generate extremely realistic human faces. My focus in this project lies on recreating the player faces from within the game and improving them to make them look exactly like the actual players.

Note: Here is a great explanation of how the deepfakes algorithm works. Tl;dr version: it can swap the face of anyone in a video with anybody else’s face using Autoencoders and Convolutional Neural Networks.

Gathering training data

Unlike the game developers, I could collect all required data from Google search without having to trouble Ronaldo with any motion-capture fancy dress.

Let us start by looking at one of the best designed faces in FIFA 18, that of Cristiano Ronaldo, and see if we can improve it. To gather the data required for the deepfakes algorithm, I simply recorded the player’s face from the instant replay option in the game. Now, we want to replace this face with the actual face of Ronaldo. For this, I downloaded a bunch of images from Google such that the images clearly show his face from different angles. That’s all that is needed to get us started with the training process of our model.

Model architecture & Training

The deepfakes algorithm involves training of deep neural networks called autoencoders. These networks are used for unsupervised learning and have an encoder that can encode an input to a compact representation called the “encoding”, and a decoder that can use this encoding to reconstruct the original input. This architecture forces the network to learn the underlying distribution of the input rather than simply parroting back the input. For images as our input, we use a convolutional net as our encoder and a deconvolutional net as our decoder. This architecture is trained to minimize the reconstruction error for unsupervised learning.

For our case, we train two autoencoder networks simultaneously. One network learns to recreate the face of Ronaldo from FIFA 18 graphics. The other network learns to recreate the face from actual pictures of Ronaldo. In deepfakes, both networks share the same encoder but are trained with different decoders. Thus, we now have two networks that have learnt what Ronaldo looks like in the game and in real life.

First autoencoder network learning from FIFA graphics

Second autoencoder network learning from actual pictures

When training using a pre-trained model on other faces, the total loss goes down from around 0.06 to 0.02 within 4 hours on a GTX 1070. In my case, I continued training on top of the original CageNet model that has been trained to generate Nicolas Cage’s face.

Using the trained models to swap faces

Now comes the fun part. The algorithm is able to swap faces by adopting a clever trick. The second autoencoder network is actually fed with the input of the first one. This way, the shared encoder is able to get the encoding from FIFA face, but the decoder reconstructs the real face using this encoding. Voila, this setup just converted the face from FIFA to the actual face of Ronaldo.

The second network converting FIFA face to real face of Ronaldo
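The shared-encoder trick is easier to see in code. The sketch below is a deliberately tiny stand-in, not the deepfakes implementation: it uses linear encoders and decoders on random vectors instead of convolutional nets on face images, and every name, size, and hyperparameter in it is an assumption made for illustration. What it does preserve is the structure: one encoder, two decoders, and a swap that feeds the game-side encoding into the real-side decoder.

```python
import numpy as np

def train_step(x, encoder, decoder, lr=0.02):
    """One gradient step on squared reconstruction error.
    Updates encoder and decoder in place; returns the MSE before the update."""
    code = x @ encoder
    err = code @ decoder - x
    grad_decoder = code.T @ err / len(x)
    grad_encoder = x.T @ (err @ decoder.T) / len(x)
    encoder -= lr * grad_encoder
    decoder -= lr * grad_decoder
    return float((err ** 2).mean())

rng = np.random.default_rng(0)
input_dim, code_dim = 12, 4

# Synthetic stand-ins for flattened face crops from the two domains.
game_faces = rng.normal(size=(50, code_dim)) @ rng.normal(size=(code_dim, input_dim)) / np.sqrt(input_dim)
real_faces = rng.normal(size=(50, code_dim)) @ rng.normal(size=(code_dim, input_dim)) / np.sqrt(input_dim)

# One encoder shared by both networks, one decoder per domain.
encoder = rng.normal(scale=0.1, size=(input_dim, code_dim))
decoder_game = rng.normal(scale=0.1, size=(code_dim, input_dim))
decoder_real = rng.normal(scale=0.1, size=(code_dim, input_dim))

losses_game, losses_real = [], []
for _ in range(1000):
    # Both networks are trained together; the encoder sees both domains.
    losses_game.append(train_step(game_faces, encoder, decoder_game))
    losses_real.append(train_step(real_faces, encoder, decoder_real))

# The swap: encode a game face, but decode it with the real-face decoder.
swapped = (game_faces @ encoder) @ decoder_real

print(losses_game[0], losses_game[-1])
print(losses_real[0], losses_real[-1])
print(swapped.shape)
```

Because the encoder has to serve both decoders, it is pushed toward an encoding that describes a face in terms both domains can use, which is what makes decoding with the "wrong" decoder meaningful.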


The GIF below shows a quick preview of results from running this algorithm on faces of other players. I think the improvement is astonishing, but maybe I am biased, so you be the judge.



What if you could play “The Journey” mode of the game as yourself instead of playing as Alex Hunter? All you have to do is upload a minute-long video of yourself and download the trained model in a few hours. There you go, you may now play the entire Journey mode as yourself. Now that’d be some next level of immersive gaming!

Where it excels and where it needs more work

The biggest advantage I feel we get with this approach is the amazing life-like faces and graphics that are hard to distinguish from the real world. All of this can be achieved with only a few hours of training, compared to years taken by game designers with the current approach. This means game publishers can come out with new titles much faster rather than spending decades in development. This also means that the studios can save millions of dollars that could now be put into hiring decent story-writers.

The glaring limitation so far is that these faces have been generated post facto, like CGI in movies, while games require them to be generated in real time. However, one big difference is that this approach does not require any human intervention for generating results once a model has been trained, and the only thing holding it back is the computation time required to generate the output image. I believe it is not going to be very long before we have lightweight, not-too-deep generative models that can run very fast without compromising output quality, just like we now have YOLO and SSD MobileNets for real-time object detection, something that wasn’t possible with previous models like RCNNs.


If someone like me, who has no experience in graphics design, can come up with improved faces within just a few hours, I truly believe that if game developers were to invest heavily in this direction it could change the face of the gaming industry (yes, pun intended) in the not-too-distant future. Now if only anyone from EA Sports was reading this…


Building a Deep Neural Network to play FIFA 18

Author: Chintan Trivedi



A.I. bots in gaming are usually built by hand-coding a bunch of rules that impart game-intelligence. For the most part, this approach does a fairly good job of making the bot imitate human-like behavior. However, for most games it is still easy to tell apart a bot from an actual human playing. If we want to make these bots behave more human-like, would it help to not build them using hand-coded rules? What if we simply let the bot figure out the game by making it learn from looking at how humans play?

Exploring this would require a game where it is possible to collect such data of humans playing the game ahead of developing the game itself. FIFA is one such game that let me explore this. Being able to play the game and record my in-game actions and decisions allowed me to train an end-to-end Deep Learning based bot without having to hard-code a single rule of the game.

The code for this project along with the trained model can be found here:

Mechanism for playing the game

The underlying mechanism to build such a bot needs to work without having access to any of the game’s internal code. Good thing then that the premise of this bot says we do not want to look at any such in-game information. A simple screenshot of the game window is all that is needed to feed into the bot’s game engine. It processes this visual information and outputs the action it wants to take which gets communicated to the game using a key-press simulation. Rinse and repeat.
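The capture, decide, act cycle described above can be sketched as a plain loop. This is only a skeleton under stated assumptions: `run_bot` and its three callables are hypothetical names, and the screen grabber, model, and key-press simulator are passed in as functions so that no particular screenshot or input library is implied. The demo at the bottom uses stand-in strings instead of real frames.

```python
def run_bot(grab_frame, choose_action, press_key, num_steps):
    """Capture a frame, let the model pick a key, send the key. Rinse and repeat."""
    for _ in range(num_steps):
        frame = grab_frame()          # screenshot of the game window
        key = choose_action(frame)    # the bot's decision for this frame
        if key is not None:           # None means "do nothing this frame"
            press_key(key)            # simulated key press sent to the game

# Stand-ins for demonstration only (assumptions, not real game I/O).
frames = iter(["frame-0", "frame-1", "frame-2"])
pressed = []

run_bot(
    grab_frame=lambda: next(frames),
    choose_action=lambda frame: "space" if frame.endswith("1") else "d",
    press_key=pressed.append,
    num_steps=3,
)
print(pressed)  # ['d', 'space', 'd']
```

Keeping the three pieces separate means the decision function can be swapped from a hand-written rule to a trained network without touching the loop itself.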

Now that we have a framework in place to feed input to the bot and to let its output control the game, we come to the interesting part: learning game intelligence. This is done in two steps: (1) using a convolutional neural network to understand the screenshot image and (2) using long short-term memory networks to decide on an appropriate action based on that understanding.

STEP 1: Training Convolution Neural Network (CNN)

CNNs are well known for their ability to detect objects in an image with high accuracy. Add to that fast GPUs and intelligent network architectures and we have a CNN model that can run in real time.

To make our bot understand the image it is given as input, I use an extremely lightweight and fast CNN called MobileNet. The feature map extracted from this network represents a high-level understanding of the image, like where the players and other objects of interest are located on the screen. This feature map is then used with a Single-Shot MultiBox detector (SSD) to detect the players on the pitch along with the ball and the goal.



STEP 2: Training Long Short Term Memory Networks (LSTM)




Now that we have an understanding of the image, we could go ahead and decide what move we want to make. However, we don’t want to look at just one frame and take action. We’d rather look at a short sequence of these images. This is where LSTMs come into picture as they are well known for being able to model temporal sequences in data. Consecutive frames are used as time steps in our sequence, and a feature map is extracted for each frame using the CNN model. These are then fed into two LSTM networks simultaneously.

The first LSTM performs the task of learning what movement the player needs to make. Thus, it’s a multi-class classification model. The second LSTM gets the same input and has to decide what action to take out of cross, through, pass and shoot: another multi-class classification model. The outputs from these two classification problems are then converted to key presses to control the actions in the game.
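Here is a rough sketch of how a short window of frames and the two classifier outputs could be turned into key presses. Everything specific in it is an assumption: the four movement classes, the key bindings, the window length of 5, and the stub "models" that return fixed scores in place of the two LSTMs. Only the four action names come from the description above.

```python
from collections import deque

MOVES = ["up", "down", "left", "right"]          # assumed movement classes
ACTIONS = ["cross", "through", "pass", "shoot"]  # the four actions named above
KEY_FOR = {                                      # hypothetical key bindings
    "up": "w", "down": "s", "left": "a", "right": "d",
    "cross": "j", "through": "k", "pass": "l", "shoot": "space",
}

def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def keys_for_window(window, movement_model, action_model):
    """Feed the same frame sequence to both classifiers and map each
    predicted class to a key."""
    sequence = list(window)
    move = MOVES[argmax(movement_model(sequence))]
    action = ACTIONS[argmax(action_model(sequence))]
    return KEY_FOR[move], KEY_FOR[action]

# Keep only the most recent frames; older ones fall off automatically.
window = deque(maxlen=5)
for t in range(8):
    window.append([float(t)])  # stand-in for the CNN feature map of frame t

# Stub models returning fixed class scores (stand-ins for the two LSTMs).
movement_model = lambda seq: [0.1, 0.1, 0.1, 0.7]
action_model = lambda seq: [0.2, 0.1, 0.6, 0.1]

keys = keys_for_window(window, movement_model, action_model)
print(len(window), keys)  # 5 ('d', 'l')
```

Treating movement and action as two separate classification problems lets the bot move and act at the same time, since each head contributes its own key.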

These networks have been trained on data collected by manually playing the game and recording the input image and the target key press. One of the few instances where gathering labelled data does not feel like a chore!

Evaluating the bot’s performance

I don’t know what accuracy measure to use in order to judge the bot’s performance, other than to let it just go out there and play the game. Based on only 400 minutes of training, the bot has learned to make runs towards the opponent’s goal, make forward passes and take shots when it detects the goal. In the beginner mode of FIFA 18, it has already scored 4 goals in about 6 games, 1 more than Paul Pogba has in the 17/18 season as of time of writing.

Video clips of the bot playing against the inbuilt bot can be found on my YouTube channel, with the video embedded below.


My initial impressions of this approach to building game bots are certainly positive. With limited training, the bot has already picked up on basic rules of the game: making movements towards the goal and putting the ball in the back of the net. I believe it can get very close to human level performance with many more hours of training data, something that would be easy for the game developer to collect. Moreover, extending the model training to learn from real world footage of matches played would enable the game developers to make the bot’s behavior much more natural and realistic. Now if only anyone from EA Sports was reading this…


How I Shipped a Neural Network on iOS with CoreML, PyTorch, and React Native