Stochastic Neighbor Embedding (SNE) and its correction in tSNE
Author: Matteo Alberti
In this tutorial we are willing to face with a significant tool for the Dimensionality Reduction problem:
Stochastic Neighbor Embedding or just “SNE” as it is commonly called.
The problem definition is:
“We have a large dataset, and we want to find a way to reduce the dimensionality, both for preprocessing and classification/visualization”. However we have a constraint:
There might exist ambiguous elements.
Therefore, the first question is:
What do we mean by ambiguity?
Example We are dealing with a text classification problem and we address with the following word:
Bank.
We are focused on binding this word to others like “finance” or “money” but it might arise that it is related to “river” (namely, bank river) in the same time.
The most popular common dimensionality reduction tools Givesuch as:
 Multidimensional scaling (MDS)
 PCA (Principal Component Analysis).
 Local Geometry preserved tools (Local Linear Embedding)
Given that, every object from an highdimensional space, can be associated with only one object within a lowdimensional space.
SNE tries to map highdimensional into lowdimensional objects by preserving the neighbourhood relationship structure in spite of a tradeoff consisting of a misclassification around the far objects. SNE has nothing to do with nondissimilarity measures but it is grounded on the concept of Entropy and Probability Divergences.
The main idea is to centre a Gaussian distribution for each input value (within the high dimensional space) in order to use its density to define a probability distribution of all neighbours. The aim is to approximate this probability distribution as much accurately as possible replicating the relations structure in a low dimensional space.
Let us introduce the appropriate mathematical formalism:
Given X_{1}….X_{n} ∈ R^{K} , we intend to to build a mapping function (?) ∈ R^{2}.
We are going to centre a Gaussian distribution for each value and associate a probability to them.
This way, we wish to have a look at what we see during the transition from R^{k} to R^{2 }as much as possible.
This relation occurs whenever {P1…Pk} and {Q1…Qn} are as similar as possible.
From this, our goal function (KL) is to minimize:
Explanatory note:
KullbackLeibler Divergence or CrossEntropy: Given two probability distribution : P_{i}… P_{k} and Q_{i}…Q_{k} We define KullbackLeibler Divergence as: With the following properties: 1. D(PQ) ≥ 0 2. D(PQ) = 0 ↔ Q=P 3. D(PQ) ≠ D(QP) Not symmetrical

What is the meaning?
 If Pi increases the object is closer
 If Qi increases while Pi decreases means approaching two far objects in R^{k} (not good!) But it’s a negligible error because our Qi is weighted by P_{ji}. This is way SNE preserve local neighbour structure despite global one.
How does c change while we move within the space? How can we move?
This can be done by a study of partial derivatives:
Firstly, let us remind that the density of points is given by σ (Gaussian dilation and Smoothing parameter). This means that our values look further or closer.
How can we choose σ?
We have to define a new object called Perplexity:
Perplexity: 2H(Pi)
If Perplexity is constant then it enables us to vary the neighbours by mutating the only σ.
Problem: In case of highdimensionality, far values are accumulated in a “block around”. This slippage is due to our willingness to adequately represent the neighbour’s structure.
Resolution: its correction in tSNE
Therefore, we are willing to remedy this concentration of missclassification error. In order to do so, we will operate on the overall structure of the Gaussian distribution. The main insight is very simple. We intend to raise the queues of the distribution. By doing so, the originally close points (next to one) will be slipped away in a negligible factor. On the other hand, far values will undergo a considerable sliding, distancing them from each other.
Let us make use of it with Python
We just have to recall Scikit library where it is implemented:
from sklearn.manifold import TSNE</span></td> 
tSNE (we are going to discuss the most relevant parameters)
TSNE(n_components=2, perplexity=30.0, early_exaggeration=12.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e07, metric=’euclidean’, init=’random’, verbose=0, random_state=None, method=’barnes_hut’, angle=0.5)

 n_components: default (equal to 2), it is the number of the ending dimensions.
 Perplexity: It is the main parameter. The higher perplexity, the more the the number of neighbours. A good rule is setting it up between 550.
Pay attention!
If Perplexity is too small we will not capture enough neighbours; on the contrary, an excessive number will approach values too far. With large datasets, an enhancement of this parameter is recommended.
 learning_rate: we suggest a range between 10 and 1000.
BUT how should we initialize it correctly in such a huge range?
Let us consider two cases:
Former: Data concentrates in an almost equidistant space from each other. That means that an overly large value arises.
Second: Strong Data concentration with only some outliers. The parameter is set too low.
Now we want to apply it to real data:
Dataset MNIST, consisting of over 60,000 images of handwritten digits.
Le us import the required packages and data.
%pylab inline from sklearn.datasets import fetch_mldata from sklearn.decomposition import PCA # Load MNIST dataset mnist = fetch_mldata("MNIST original") X, y = mnist.data / 255.0, mnist.target 
We make a first dimension reduction with PCA: useful when there are many compressed data, whereas, by spare data, SVD can be the best choose.
indices = arange(X.shape[0]) random.shuffle(indices) n_train_samples = 5000 X_pca = PCA(n_components=50).fit_transform(X) X_train = X_pca[indices[:n_train_samples]] y_train = y[indices[:n_train_samples]]
Le us start tSNE, perplexity has been set at 40 after some attempts.
X_train_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_train)

Now let us depict our results:
matplotlib.rc('font', **{'family' : 'sansserif', 'weight' : 'bold', 'size' : 18}) matplotlib.rc('text', **{'usetex' : True}) def plot_mnist(X, y, X_embedded, name, min_dist=10.0): fig = figure(figsize=(10, 10)) ax = axes(frameon=False) title("\\textbf{MNIST dataset}  Twodimensional " "embedding of 70,000 handwritten digits with %s" % name) setp(ax, xticks=(), yticks=()) subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.9, wspace=0.0, hspace=0.0) scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, marker="x") if min_dist is not None: from matplotlib import offsetbox shown_images = np.array([[15., 15.]]) indices = arange(X_embedded.shape[0]) random.shuffle(indices) for i in indices[:5000]: dist = np.sum((X_embedded[i]  shown_images) ** 2, 1) if np.min(dist) < min_dist: continue shown_images = np.r_[shown_images, [X_embedded[i]]] imagebox = offsetbox.AnnotationBbox( offsetbox.OffsetImage(X[i].reshape(28, 28), cmap=cm.gray_r), X_embedded[i]) ax.add_artist(imagebox)
plot_mnist(X[indices[:n_train_samples]], y_train, X_train_embedded, "tSNE", min_dist=20.0)

tSNE got in 17.51s
If we want to compare it with other dimensionality reduction techniques (as mentioned earlier), there are two critical issues to explain:
 Effective discrimination ability
Of course setting parameters correctly is not always easy. Hence, here you have a link showing how to modify the main parameters and display the data layout in real time: remind that the data you will find here are simulated and well known.
https://distill.pub/2016/misreadtsne/
 Execution time:
In spite of other tools, tSNE requires a greater number of time and resources. As you can see, while PCA takes about 0.01, tSNE is 17s slower. We have to evaluate this issue if we coondised “timelapse” as a crucial factor.
However, it is possible to provide some insights about how to tackle this problem. Here you are with two articles from the Computer Science Department at the University of California
Fast Optimization for tSNE, Laurens van der Maaten (University of California)
http://cseweb.ucsd.edu/~lvdmaaten/workshops/nips2010/papers/vandermaaten.pdf
Accelerating tSNE using TreeBased Algorithms, Laurens van der Maaten (University of California)
http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf
References:
 http://scikitlearn.org/stable/index.html (Official Website of Scikit libraries)
 http://www.cs.toronto.edu/~fritz/absps/sne.pdf (documentation about SNE)
 http://www.cs.toronto.edu/~hinton/absps/tsne.pdf (documentation about tSNE)