A Visual Explanation of Gradient Descent Methods (Momentum, AdaGrad, RMSProp, Adam)

Author: Lili Jiang

Cattura 2 1

Vanilla Gradient Descent

Cattura 3

Step-by-step illustration of gradient descent algorithm.


cattura 4

The gradient descent with momentum algorithm (or Momentum for short) borrows the idea from physics. Imagine rolling down a ball inside of a frictionless bowl. Instead of stopping at the bottom, the momentum it has accumulated pushes it forward, and the ball keeps rolling back and forth.

We can apply the concept of momentum to our vanilla gradient descent algorithm. In each step, in addition to the regular gradient, it also adds on the movement from the previous step. Mathematically, it is commonly expressed as:

delta = – learning_rate * gradient + previous_delta * decay_rate (eq. 1)

theta += delta (eq. 2)

I found it more intuitive if I massage this equation a little and keep track of the (decayed) cumulative sum of gradient instead. This will also make things easier when we introduce the Adam algorithm later.

sum_of_gradient = gradient + previous_sum_of_gradient * decay_rate (eq. 3)

delta = -learning_rate * sum_of_gradient (eq. 4)

theta += delta (eq. 5)

(What I did was factoring out -learning_rate. To see the mathematical equivalence, you can substitute delta with -learning_rate * sum_of_gradient in eq. 1 to get eq. 3.)


Step-by-step illustration of momentum descent. Watch live animation in the app. For the rest of this post, I sloppily use gradient x and gradient y in the visualization; in reality, because it’s gradient *descent*, it’s actually the negative of the gradient.


Let’s consider two extreme cases to understand this decay rate parameter better. If the decay rate is 0, then it is exactly the same as (vanilla) gradient descent. If the decay rate is 1 (and provided that the learning rate is reasonably small), then it rocks back and forth endlessly like the frictionless bowl analogy we mentioned in the beginning; you do not want that. Typically the decay rate is chosen around 0.8–0.9 — it’s like a surface with a little bit of friction so it eventually slows down and stops.

immage 6


Cattura 7

Step-by-step illustration of AdaGrad descent. Watch live animation in the app.


cattura 8



This property allows AdaGrad (and other similar gradient-squared-based methods like RMSProp and Adam) to escape a saddle point much better. AdaGrad will take a straight path, whereas gradient descent (or relatedly, Momentum) takes the approach of “let me slide down the steep slope first and maybe worry about the slower direction later”. Sometimes, vanilla gradient descent might just stop at the saddle point where gradients in both directions are 0 and be perfectly content there.


The problem of AdaGrad, however, is that it is incredibly slow. This is because the sum of gradient squared only grows and never shrinks. RMSProp (for Root Mean Square Propagation) fixes this issue by adding a decay factor.
sum_of_gradient_squared = previous_sum_of_gradient_squared * decay_rate+ gradient² * (1- decay_rate)
delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
theta += delta

More precisely, the sum of gradient squared is actually the decayed sum of gradient squared. The decay rate is saying only recent gradient² matters, and the ones from long ago are basically forgotten. As a side note, the term “decay rate” is a bit of a misnomer. Unlike the decay rate we saw in momentum, in addition to decaying, the decay rate here also has a scaling effect: it scales down the whole term by a factor of (1 – decay_rate). In other words, if the decay_rate is set at 0.99, in addition to decaying, the sum of gradient squared will be sqrt(1 – 0.99) = 0.1 that of AdaGrad, and thus the step is on the order of 10x larger for the same learning rate.


To see the effect of the decaying, in this head-to-head comparison, AdaGrad white) keeps up with RMSProp (green) initially, as expected with the tuned learning rate and decay rate. But the sums of gradient squared for AdaGrad accumulate so fast that they soon become humongous (demonstrated by the sizes of the squares in the animation). They take a heavy toll and eventually AdaGrad practically stops moving. RMSProp, on the other hand, has kept the squares under a manageable size the whole time, thanks to the decay rate. This makes RMSProp faster than AdaGrad.


Last but not least, Adam (short for Adaptive Moment Estimation) takes the best of both worlds of Momentum and RMSProp. Adam empirically works well, and thus in recent years, it is commonly the go-to choice of deep learning problems.

Let’s take a look at how it works:

sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 – beta1) [Momentum]

sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1- beta2) [RMSProp]

delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)

theta += delta

Beta1 is the decay rate for the first moment, sum of gradient (aka momentum), commonly set at 0.9. Beta 2 is the decay rate for the second moment, sum of gradient squared, and it is commonly set at 0.999.


Adam gets the speed from momentum and the ability to adapt gradients in different directions from RMSProp. The combination of the two makes it powerful.


Closing Words


Now that we have discussed all the methods, let’s watch a few races of all the descent methods we talked about so far! (There is some inevitable cherry-picking of parameters. The best way to get a taste is to play around yourself.)


In this terrain, there are two little hills blocking the way to the global minimum. Adam is the only one able to find its way to the global minimum. Whichever way the parameters are tuned, from this starting position at least, none of the other methods can get there. This means neither momentum nor adaptive gradient alone can do the trick. It’s really the combination of the two: first, momentum takes Adam beyond the local minimum where all the other balls stop; then, the adjustment from the sum of gradient squared pulls it sideway, because it is the direction less explored, leading to its final victory.

Here is another race. In this terrain, there is a flat region (plateau) surrounding the global minimum. With some parameter tuning, Momentum and Adam (thanks to its momentum component) can make it to the center, while the other methods can’t.
In summary, gradient descent is a class of algorithms that aims to find the minimum point on a function by following the gradient. Vanilla gradient descent just follows the gradient (scaled by learning rate). Two common tools to improve gradient descent are the sum of gradient (first moment) and the sum of the gradient squared (second momentum). The Momentum method uses the first moment with a decay rate to gain speed. AdaGrad uses the second moment with no decay to deal with sparse features. RMSProp uses the second moment by with a decay rate to speed up from AdaGrad. Adam uses both first and second moments, and is generally the best choice. There are a few other variations of gradient descent algorithms, such as Nesterov accelerated gradient, AdaDelta, etc., that are not covered in this post.
Lastly I shall leave you with this momentum descent with no decay. Its path makes up a fun pattern. I see no practical use (yet) but present it here just for the funsies. [Edit: I take my word back about no practical use. Read more about this curve at https://en.wikipedia.org/wiki/Lissajous_curve.]






How to get a job as a Data Scientist?

Author: Favio Vázquez



1 1

Hi everyone. This blog post comes from 3 post I did recently at LinkedIn. Here they are Part 1Part 2, and Part 3.

This is a hard question to answer. Hang with me in this one (and this is not the final answer about the universe, existence and everything).

This is one of the questions I receive most frequently from people I know or my connections @ LinkedIn.

I’ll tell you my experience. I’ve been working as a Data Scientist for some time now (even though some people still believe this is not a profession, or maybe not a new one). But I’ll talk about how I got my current job at BBVA Data & Analytics.

I finished my master last year, I did one in Physics working with Cosmology and Bayesian Machine Learning. Just before graduating I thought about what I wanted to do, and I decided that I wanted to work in the field of Data Science.

I applied to almost 125 jobs (for real, maybe you applied for much more), I got only like 25–30 replies. Some of them were just: Thanks but nope. And I got almost 15 interviews. I learned from each one. Got better. I had to deal with a lot of rejection. Something I was actually not prepared to. But I loved the process of getting interviewed (not all of them to be honest). I studied a lot, programmed everyday, read a lot of articles and posts. They helped a lot.

But, how did I get this job?



2.jpg 1

With a lot of patience. It was not easy, but by the 7th interview I realized several things:

  • Some people had no idea what Data Science was.
  • The recruiter is your best friend at the moment of interviews, they want to help you get in. So trust them, let them help you and ask questions!
  • People is more interested in how you solve problems and how you deal with some specific situations than your technical knowledge.

I was really prepared to answer questions about algorithms, machine learning, Python, Spark, etc., but I was not ready to answer questions about how did I solve a problem, or how would I tackle a situation.

By the 8th interview I review everything I did before as a Data Scientist, as a computer engineer, as a physicist and as a human being. I was ready to answer questions about real life work, how to deal with complicated situations, how to deal with new data, how to do a data science workflow, how to explain hard concepts to managers and more.

I did much better. I was also calmed. I knew that this people interviewing me were trying to get me into the company, this was not the inquisition.


My suggestions then to get a job a as a Data Scientist:

  • Be patient. You will apply for maybe hundreds of job before getting one.
  • A lot. Not only studying important concepts, programming and answering business questions, also remember that you will be an important piece of the organization, you will deal with different people and situations, be ready to answer questions about how would you behave in different work situations.
  • Have a portfolio. If you are looking for a serious paid job in data science do some projects with real data. If you can post them on GitHub. Apart from Kaggle competitions, find something that you love or a problem you want to solve and use your knowledge to do it.
  • The recruiter is your friend. The people interviewing you too. They want you to get in the company, that’s a powerful advise that I remember everyday.
  • Ask people about what they do. I recommend that you follow Matthew Mayopost on “A day in the life of a Data Scientist” to have a better idea of what we do.
  • If you want an internship, have your academic skills on point.

I wish you the best and success :).

Follow me here: https://www.linkedin.com/in/faviovazquez/.

A gentle introduction to Learning Methods

Author: Matteo Alberti


Traditionally, there have been three fundamentally different types of tasks in machine learning:


  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning



Supervised Learning

Supervised learning is the machine learning task of inferring a function from supervised training data.

The training data consist of a set of training examples where each example is a pair consisting of an input object (typically a vector) and a desired output value.

The goal is to learn a mapping from x to y, given a training set made of pairs (xi, yi) are sampled i.i.d from some distribution which here ranges over X x Y .


A supervised learning algorithm analyses the training data and produces an inferred function that can be further grouped into regression and classification problems:


  • Classification: A classification problem is when the output variable is a category (labels are discrete)
  • Regression: A regression problem is when the output variable is a real value ( more generally labels are continuous)


The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations.


Some popular examples of supervised learning algorithms are:

  • Linear regression for regression problems.
  • Random forest or Artificial Neural Networks for classification and regression problems.


1 2

1 2


Unsupervised Learning


When we just have only input data and no corresponding output variables and the goal is to model the underlying structure or distribution in the data in order to learn more about the data is called Unsupervised Learning, more formally can be written as:

Let X=(x1, … , xn) be a set of n points where  xi belongs to X for all i  in 1, …,n.

Typically, we assume that the points are independently and identically distributed (i.i.d) from a common distribution of X. The objective consists on estimating a density likely generated from X.

Unsupervised learning problems can be further grouped into clustering and association problems.

  • Clustering and dimensionality reduction: The goal of clustering is to discover the inherent groupings in the data
  • Association:  Association tries to discover rules that describe large portions of data
  • Quantile estimation

Some popular examples of unsupervised learning algorithms are:

  • k-means for clustering problems
  • Apriori algorithm for association rule learning problems
  • Principal Component Analysis


2.jpg 4

2.jpg 4


Semi-supervised Learning

Semi-supervised learning is halfway between supervised and unsupervised learning. Algorithm is provided with some supervision information but not for all examples (some data are unlabelled).


Dataset can be divided into two parts:

  • Xi = (x1, … , xl) for which labels Yl= (y1, …, yl) are provided
  • Xu = (xl+1, … x l+u) labels not known


One of the most interesting of semi-supervised learning is image classification where only some data is labelled.



Reinforcement Learning


Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.

The main example of reinforcement learning is Robotics applications where the robot is known as an agent, and is in some environment (surrounding). At each time step, it can take some action and it might receive some reward

3 1

3 1

A gentle overview on the Deep Learning and Machine Learning

Authors: Matteo Testi


During the last years, a buzz word arose in the field of Artificial Intelligence: Deep Learning”. Recently, there is a great interest in this kind of research, especially amongst business companies which are truly performing a “scavenger hunt” in order to find experts from Machine Learning and Deep Learning areas. These roles are increasingly being associated with the figure of the Data Scientist.

We can simply have a look at the development of “Deep Learning” by Google Trends, in time-slot ranging from 2014 to 2019.

google trend ML DL 2014 2019 1

Since the time DeepMind’s “Alpha-Go” software defeated South Korean Master Lee Se-dol in the board game Go earlier last year, the term “Artificial Intelligence” (AI) has become exponentially popular. The way the Deep Learning Engine within Alpha-Go worked consisted in combining the traditional Tree Navigation algorithm called “Monte-Carlo Tree Search” (MTS) with very deep “Convolutional Neural Networks” (CNN). Until then, MTS was a de-facto-standard for building record-breaking programs able to play Go game. However, the value-functions was still based on heuristic hand-crafted heuristic methods. The novelty introduced by DeepMind was the value-function inferred by a CNN trained on a Supervised Training Set before of million moves. Successively, a Deterministic Policy Gradients System based on an Actor-Critic Model was addressed to play against different versions of him-self for a long time. The result was a still unbeaten artificial GO player. In the popular Nature article named “Mastering the game of Go with deep neural networks and tree search,” DeepMind carefully describe all the Alpha-Go system.



The following diagram explains the difference in the Artificial Intelligence, Machine Learning and Deep Learning.


Machine Learning

Machine Learning (ML) is essentially a form of applied statistics meant to use computers to statistically estimate complex function. Mitchell in 1997 provides the following definition of Machine Learning: “A computer is said to learn from experience E concerning some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” Basically, ML is a set of techniques able to “learn” from data, and then make a decision or prediction about them. A Machine Learning system can be applied to “knowledge” from multiple sources to solve diverse tasks: facial classification, speech recognition, object recognition, etc. Unlike hard-coded algorithms, namely algorithms with specific instructions to solve a problem, Machine Learning enables a computer to learn how to recognize patterns on its own and make predictions.




The Machine Learning can be adapted for three different types of tasks:

  • Classification
  •  Clustering
  • Prediction

One of the most popular applications of machine learning has been Computer Vision, for many years. Most machine learning algorithms can be divided into the categories of supervised learning and unsupervised learning according to the training set comes supervised (with trainer’s external information or labels) or not supervised.

In Supervised Learning, labels are made by human to enable the machine to find out the relationship between input and labels



In Unsupervised learning, there are no labels available. In this situation, we are asking the computer to find clusters within the data.



Typical Machine learning algorithms are Random ForestLinear / Logistic Regression, Decision Tree, Support Vector Machines, PCA, K means, ICA, Naive Bayes etc.

Deep learning

The Deep Learning is a subarea of the Machine Learning that makes use of Deep Neural Networks (with many layers) and specific novel algorithms for the pre-processing of data and regularization of the model: word embeddings, dropout, data-augmentation. Deep Learning takes inspiration from Neuroscience since Neural Networks are a model of the neuronal network within the brain. Unlike the biological brain, where any neuron can connect to any other under some physical constraints, Artificial Neural Networks (ANNs) have a finite number of layers, connections, and fixed direction of the data propagation. So far, ANNs have been ignored by the research and business community. The problem is their computational cost.

Between 2006 and 2012 the research group led by Geoffrey Hinton at the University of Toronto finally parallelized the ANNs algorithms to parallel architectures. The main breakthroughs are the increased number of layers, neurons, and model parameters in general (even over than 10 million) allowing to compute massive amounts of data through the system to train it.



Therefore, the first requirement for training a Deep learning Model is having available a massive train-set. This makes Deep Learning a good fit for the Big Data age.



The reasons behind the popularity of Deep Learning are Big Data and Graphic Processing Unit (GPUs). Using a massive amount of data the algorithm and network learn how to accomplish goals and improve upon the process.



A deep learning algorithm could be instructed to “learn” what a dog looks like. It would take a massive data set of dog’s images to understand “features” that identify a dog and distinguish it from a wolf. We should keep in mind that, the Deep learning is also highly susceptible to bias. For example, in a supervised model, if the labels are made wrong, the model is going to learn from the mistakes.

When Google’s facial recognition system was initially rolled out, for instance, it tagged many black faces as gorillas.

“That’s an example of what happens if you have no African American faces in your training set” 

Said Anu Tewary, chief data officer for Mint at Intuit.

Deep Learning for Business

Deep learning affected business applications as never happened in Machine Learning before. It can deal with a huge amount of data—millions of images, for example—and recognise certain discriminative characteristics. Text-based searches, fraud detection, spam detection, handwriting recognition, image search, speech recognition, Street View detection, recommendation systems and translation are only some of the tasks that can be tackled by Deep Learning. At Google, Deep Networks have already replaced tens of “handcrafted rule-based systems”. Today, Deep Learning for Computer Vision already displays super-human capabilities, and that ranges from recognising common pictures like dog & cats to identifying cancer nodules in lung ct-scans.

The Take-Away

Machine Learning theory claims that an ML algorithm can generalise very well from a finite training set of examples. This contradicts the basic of logic: inferring general rules from a limited set of examples is not logically valid. In other words, to infer a rule describing every member of a set, we should have information about every member of that set. In part, ML addresses this problem with probabilistic rules rather than certain rules of logical reasoning. Unfortunately, this does not resolve the problem. According to the “No Free Lunch Theorem” (David Wolpert and William Macready, 1997): averaged over all possible data generating distributions, every classification algorithm shows the same error rate on unobserved data (test set). This means that the best ML algorithm cannot exist: so our goal must be to understand what kinds of distributions are relevant for the “real world” and what kind of ML algorithms perform well on the data drawn from distributions we focus on. In other words, Deep Learning is not universally better than Machine Learning, but it depends on the task domain. However, it seems that, in the future, Deep Learning will be likely solving many of our everyday computer, business, AI, marketing, etc.

As Andrew Yan-Tak Ng, former chief scientist at Baidu, where he led the company’s Artificial Intelligence Team says,

“AI is the new electricity”. 

We add: Deep Learning is the new light bulb.

We will deepen Deep Learning in the next tutorials, stay tuned…



Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., … & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature529(7587), 484-489.












Stochastic Neighbor Embedding (SNE) and its correction in t-SNE

Author: Matteo Alberti




In this tutorial we are willing to face with a significant tool for the Dimensionality Reduction problem:

Stochastic Neighbor Embedding or just “SNE” as it is commonly called.

The problem definition is:

“We have a large dataset, and we want to find a way to reduce the dimensionality,  both for pre-processing and classification/visualization”. However we have a constraint:

There might exist ambiguous elements.

Therefore, the first question is:

What do we mean by ambiguity?


Example We are dealing with a text classification problem and we address with the following word:


We are focused on binding this word to others like “finance” or “money” but it might arise that it is related to “river” (namely, bank river) in the same time.

The most popular common dimensionality reduction tools Givesuch as:

  • Multidimensional scaling (MDS)
  • PCA (Principal Component Analysis).
  • Local Geometry preserved tools (Local Linear Embedding)

Given that, every object from an high-dimensional space, can be associated with only one object within a low-dimensional space.

SNE tries to map high-dimensional into low-dimensional objects by preserving the neighbourhood relationship structure in spite of a trade-off consisting of a misclassification around the far objects. SNE has nothing to do with  non-dissimilarity measures but it is grounded on the concept of Entropy and Probability Divergences.

The main idea is to centre a Gaussian distribution for each input value (within the high dimensional space) in order to use its density to define a probability distribution of all neighbours. The aim is to approximate this probability distribution as much accurately as possible replicating the relations structure in a low dimensional space.

Let us introduce the appropriate mathematical formalism:

Given X1….Xn ∈ RK , we intend to to build a mapping function (?) ∈ R2.

We are going to centre a Gaussian distribution for each value and associate a probability to them.


This way, we wish to have a look at what we see during the transition from Rk to R2  as much as possible.

 R^k\{x_1, ... , x_k\} \frac{\varphi} {\to} \{Y_1, ..., Y_n\} \in R^2 \\ ... \\ R^k\{x_1, ... , x_k\} \leftrightarrow \{Q_1, ..., Q_n\} \in R^2


This relation occurs whenever {P1…Pk} and {Q1…Qn} are as similar as possible.

From this, our goal function (K-L) is to minimize:

C= \sum_{i=1}^k D(P_i||Q_i) = \sum_{i,j} P_{i|j}ln(\frac{P_{j|i}}{Q_{j|i}})


Explanatory note:

Kullback-Leibler Divergence or  Cross-Entropy:

Given two probability distribution :

Pi… Pk  and Qi…Qk

We define Kullback-Leibler Divergence as:

With the following properties:

1.      D(P||Q) ≥ 0

2.      D(P||Q) = 0                 ↔ Q=P

3.      D(P||Q) ≠ D(Q||P)      Not symmetrical


What is the meaning?

  •  If Pi increases the object is closer
  • If Qi increases while Pi decreases means approaching two far objects in Rk (not good!) But it’s a negligible error because our Qi is weighted by Pj|i. This is way SNE preserve local neighbour structure despite global one.

How does c change while we move within the space? How can we move?

This can be done by a study of partial derivatives:

Firstly, let us remind that the density of points is given by σ (Gaussian dilation and Smoothing parameter). This means that our values look further or closer.


How can we choose σ?

We have to define a new object called Perplexity:

 Perplexity: 2H(Pi)

If Perplexity is constant then it enables us to vary the neighbours by mutating the only σ.


Problem: In case of high-dimensionality, far values are accumulated in a “block around”. This slippage is due to our willingness to adequately represent the neighbour’s structure.


Resolution: its correction in t-SNE

Therefore, we are willing to remedy this concentration of miss-classification error. In order to do so, we will operate on the overall structure of the Gaussian distribution. The main insight is very simple. We intend to raise the queues of the distribution. By doing so, the originally close points (next to one) will be slipped away in a negligible factor. On the other hand, far values will undergo a considerable sliding, distancing them from each other.





Let us make use of it with Python

We just have to recall Scikit library where it is implemented:

from sklearn.manifold import TSNE

t-SNE  (we are going to discuss the most relevant parameters)

TSNE(n_components=2, perplexity=30.0, early_exaggeration=12.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric=’euclidean’, init=’random’, verbose=0, random_state=None, method=’barnes_hut’, angle=0.5)



  • n_components: default (equal to 2), it is the number of the ending dimensions.
  • Perplexity: It is the main parameter. The higher perplexity, the more the the number of neighbours. A good rule is setting it up between 5-50.

Pay attention!


If Perplexity is too small we will not capture enough neighbours; on the contrary, an excessive number will approach values too far. With large datasets, an enhancement of this parameter is recommended.


  • learning_rate: we suggest a range between 10 and 1000.

but how should we initialize it correctly in such a huge range?

Let us consider two cases:

 Former: Data concentrates in an almost equidistant space from each other. That means that an overly large value arises.

 Second: Strong Data concentration with only some outliers. The parameter is set too low.



Now we want to apply it to real data:

Dataset MNIST, consisting of over 60,000 images of handwritten digits.


Le us import the required packages and data.

%pylab inline
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA

# Load MNIST dataset
mnist = fetch_mldata("MNIST original")
X, y = mnist.data / 255.0, mnist.target

We make a first dimension reduction with PCA: useful when there are many compressed data, whereas, by spare data, SVD can be the best choose.

indices = arange(X.shape[0])
n_train_samples = 5000
X_pca = PCA(n_components=50).fit_transform(X)
X_train = X_pca[indices[:n_train_samples]]
y_train = y[indices[:n_train_samples]]

Le us start t-SNE, perplexity has been set at 40 after some attempts.

X_train_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_train)

Now let us depict our results:

matplotlib.rc('font', **{'family' : 'sans-serif',
                         'weight' : 'bold',
                         'size'   : 18})
matplotlib.rc('text', **{'usetex' : True})

def plot_mnist(X, y, X_embedded, name, min_dist=10.0):
    fig = figure(figsize=(10, 10))
    ax = axes(frameon=False)
    title("\\textbf{MNIST dataset} -- Two-dimensional "
          "embedding of 70,000 handwritten digits with %s" % name)
    setp(ax, xticks=(), yticks=())
    subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.9,
                    wspace=0.0, hspace=0.0)
    scatter(X_embedded[:, 0], X_embedded[:, 1],
            c=y, marker="x")

    if min_dist is not None:
        from matplotlib import offsetbox
        shown_images = np.array([[15., 15.]])
        indices = arange(X_embedded.shape[0])
        for i in indices[:5000]:
            dist = np.sum((X_embedded[i] - shown_images) ** 2, 1)
            if np.min(dist) < min_dist:
            shown_images = np.r_[shown_images, [X_embedded[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(X[i].reshape(28, 28),
                                      cmap=cm.gray_r), X_embedded[i])
plot_mnist(X[indices[:n_train_samples]], y_train, X_train_embedded,"t-SNE", min_dist=20.0)


t-SNE got in 17.51s


If we want to compare it with other dimensionality reduction techniques (as mentioned earlier), there are two critical issues to explain:


  1. Effective discrimination ability

Of course setting parameters correctly is not always easy. Hence, here you have a link showing how to modify the main parameters and display the data layout in real time: remind that the data you will find here are simulated and well known.



  1. Execution time:

In spite of other tools, t-SNE requires a greater number of time and resources. As you can see, while PCA takes about 0.01, t-SNE is 17s slower. We have to evaluate this issue if we coondised “time-lapse” as a crucial factor.

However, it is possible to provide some insights about how to tackle this problem. Here you are with two articles from the Computer Science Department at the University of California


Fast Optimization for t-SNE, Laurens van der Maaten (University of California)


Accelerating t-SNE using Tree-Based Algorithms, Laurens van der Maaten (University of California)





A manifold-based tools: ISOMAP algorithm

Author: Matteo Alberti


In this tutorial, we want to open a miniseries dedicated to manifold-based dimensionality reductions tools. So let’s start by understanding what a Manifold is and when it is important without deepening the underlying mathematics.

“manifold is a mathematical space that on small scale resembles the Euclidean space of a specific dimension”

So Why and When do we need a manifold?

The answer is simple: When linear methods fail



In linear methods (like PCA, MDS) data can only be summarized by a linear combination of features, this means that they cannot discover for example some 1D structures.

Let me show with this figure:

Linear Methods cannot discover more complex structures of data and reduce by it we will lose the geometry of data (and information)

manifold-Based tools otherwise succeed


  • Local Linear Embedding (LLE)

This because our goal is to “preserve local geometry”


The results in essentials in many applications (in computer vision it’s helpful for pose estimation, data compression, image de-noising and missing data interpolation)




So let’s start with the first tools: Isometric feature mapping

This algorithm has three main steps:

  • Determines which points are neighbours on a manifold based on distance (Euclidean distance) For each point, we connect all points within a fixed radius (where we have to choose radius) or like KNN (K nearest neighbouring algorithm) we have to choose K number of neighbours.

All neighbourhood relations are represented as a weighted graph

  • The second step, we will go to estimates the geodesic distance between all pairs of points. But while we are working on a manifold the shortest distance is given by the shortest path in the graph (for example Djkstra’s algorithm also used in routing/navigation)
  • In this last step, we will go to apply MDS (multidimensional scaling) to the matrix of graph distances due to constructing an embedding of the data in a d-dimensional space (Euclidean space) where we have preserved the manifold’s geometry. The coordinate vectors in the Euclidean-Space are chosen due to minimizing the cost function:

Where Dy is the Euclidean-based matrix of distances and L2 corresponds to the square of the sum of elements. τ is a function that converts distances to inner products (due to support efficient optimization)

The global minimum is achieved setting Euclidean-space’s coordinates to the top d eigenvectors of the matrix


Let me implement:

First of all, we have to import packages from scikit-learn

import numpy as np

import matplotlib.pyplot as plt

from sklearn import manifold


Isomap Class:

OUR_DATA= manifold.Isomap(n_neighbors=N, n_components=C, eigen_solver=’auto’, tol=0, max_iter=None, path_method=’auto’, neighbors_algorithm=’auto’, n_jobs=1)


We want to describe the main parameters:

  • n_neighbors= must be an integer, is the number of neighbours (like k-NN)
  • n_components= must be an integer, is the number of coordinators for manifold usually 2
  • path_method= auto is the best choice, but you can also set with FW for Floyd-Warshall or D for ours Dijkstra’s algorithm


If we want to plot our result:


from time import time


ISOMAP_DATA= manifold.Isomap(n_neighbors, n_components=2).fit_transform(OUR_DATA)


print("%s: %.2g sec" % ('ISO' t1-t0))

plot.embedding(ISOMAP_DATA, "ISOMAP ottenuto in " % (t1-t0)


Some considerations:

ISOMAP is a strong tool, but we have to do some considerations. While we can preserve the local structure (local geometry of data) this is an expense of two factors:

  • Sensitive to noise
  • Few free parameters

This because the functionality of the algorithm depends almost on the choice of the radius. This means that just a few outliers can break the mapping.

For these reasons ISOMAP highly suggests when we just have an idea of the structure of the data even because results computationally expensive (less than Entropy-based models)



http://scikit-learn.org/stable/index.html (Official Website of Scikit libraries)

http://web.mit.edu/cocosci/Papers/sci_reprint.pdf (documentation about ISOMAP)

Linear Dimensionality Reduction: Principal Component Analysis

Author: Matteo Alberti




Among all tools for the linear reduction of dimensionality PCA or Principal Components Analysis is certainly the main tools of Statistical Machine Learning.

Although we focus very often on non-linearity, the analysis of the principal components is the starting point for many analysis (also the core of preprocessing), and their knowledge becomes imperative in case the conditions on linearity are satisfied.

In this tutorial we are going to introduce at the mathematics level the extraction of PC, their implementation with python but above all their interpretation.


This is done by dividing the total variance into an equal number of starting variables than it will be possible to reduce the number based on the contribution that each Principal Component provides in the construction of our total variance.

We would like to remind you that the application of the PCA is useful when the starting variables are not independent


Let’s introduce them to the correct mathematical formalism:

Given a set of p quantitative variables X1, X2,. . . , Xp (centred or standardised variables) we want to determine a new set of k variables t.c k≤p indicated with Ys (s = 1,2, … k) that have the following properties:


uncorrelated, reproduce the largest possible portion of remaining variance following the construction of the first s-1 components (increasing order) and average equal to zero.

As a result, the linear combination will be:

We must, therefore, find the coefficients v that satisfy these constraints. This is a problem of maximum constraint where the first is called Normalization:

Our system becomes:

Where Variance can be written as:

And we can solve with Lagrange multiplier:

Calculate the gradient of L1 and its annulment:

The system : admits infinite solutions (which respect the constraint) by lowering the rank of the system coefficient matrix

which correspond to λs called eigenvalues of Σ.


Similarly, for the construction of the second PCA (and so for all the others) the Orthogonality Constraint replaces our system, given by our request that the PCs be uncorrelated, expressed as follows:

Than by setting the function of Lagrange in p + 2 variables:

From which we obtain the second eigenvalue (Y2) where we remember the following property:

Principal Component proprieties:


Each eigenvalue of Σ has a role in the variance of the respective PC

positive semidefinite

Total Variance

Generalized Variance



For the choice of the number k (with k <p) of PC to be maintained in analysis there is no universally accepted and valid criterion. It is therefore good practice to use them jointly and always keep the needs of the analysis in mind. We want to expose the main ones:



Cumulative percentage of total variance reproduced



absolute misure                                Normalization of Var                  % cumulative

set a threshold T on the cumulative percentage and keep the first k PC in the analysis that guarantees the exceeding of the threshold



It represents the eigenvalues concerning the order number of the component

Where the first k PC is selected based on the slope reduction. In this specific case, the PCs to be kept in analysis would be the first two.




Kaiser Criterion

Eigenvalue criterion greater than 1 (valid only for standardized variables)



Let’s go to implement with Python:

We have to import the necessary packages from scikit-learn

import numpy as np

from sklearn.decomposition import PCA

The class has the following attributes:

Sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver=’auto’, tol=0.0, iterated_power=’auto’, random_state=None)

We want to comment the main parameters:

  • n_components = number of components to be analyzed.
  • svd_solver = gives us some of the main alternatives. We want to remember that PCA does not support sparse data (for which you will need to load TruncatedSVD)


We are going to test it on new real data, we predict for example on the Wine data that can be imported through the script:

from sklearn import datasets

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

from sklearn import decomposition


wine = datasets.load_wine()

X = wine.data

y = wine.target

fig = plt.figure(1, figsize=(5, 4))


ax = Axes3D(fig, rect=[1, 0, 1, 0.9], elev=30, azim=222)


pca = decomposition.PCA(n_components=None)


X = pca.transform(X)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:

ax.text3D(X[y == label, 0].mean(),

X[y == label, 1].mean() + 1.5,

X[y == label, 2].mean(), name,

bbox=dict(alpha=.5, edgecolor='b', facecolor='w'))

# Reorder the labels to have colors matching the cluster results

y = np.choose(y, [1, 2, 0]).astype(np.float)

ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.spectral,






This will our result: