
Stochastic Neighbor Embedding (SNE) and its correction in t-SNE

Author: Matteo Alberti

 

 

 

In this tutorial we look at an important tool for the dimensionality reduction problem:

Stochastic Neighbor Embedding, or simply “SNE” as it is commonly called.

The problem definition is:

“We have a large dataset and we want to reduce its dimensionality, both for pre-processing and for classification/visualization.” There is, however, a constraint:

the data may contain ambiguous elements.

Therefore, the first question is:

What do we mean by ambiguity?

 

Example: we are dealing with a text classification problem and we come across the following word:

Bank.

We would like to link this word to others such as “finance” or “money”, but it may just as well be related to “river” (as in river bank) at the same time.

The most popular dimensionality reduction tools, such as:

  • Multidimensional Scaling (MDS)
  • Principal Component Analysis (PCA)
  • Local-geometry-preserving methods (Locally Linear Embedding)

assume that every object in the high-dimensional space can be associated with one, and only one, object in the low-dimensional space.

SNE, instead, tries to map high-dimensional objects into a low-dimensional space while preserving the neighbourhood structure, accepting as a trade-off some misplacement of the far-away objects. It is not based on a dissimilarity measure; it is grounded in the concepts of entropy and divergence between probability distributions.

The main idea is to centre a Gaussian distribution on each input point (in the high-dimensional space) and use its density to define a probability distribution over its neighbours. The aim is then to reproduce this probability distribution as accurately as possible in the low-dimensional space, replicating the structure of the relations.

Let us introduce the appropriate mathematical formalism:

Given X1, …, Xn ∈ R^K, we want to build a mapping φ that sends them to points in R^2.

We are going to centre a Gaussian distribution on each point and use it to associate a probability with each of its neighbours.
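Concretely, following the original SNE formulation, the probability that point x_i picks x_j as its neighbour can be written as:

P_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

where σ_i controls the width of the Gaussian centred on x_i; an analogous Q_{j|i} is defined on the low-dimensional points with a fixed bandwidth.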

 

In this way, we want what we see after the transition from R^k to R^2 to resemble the original picture as closely as possible.

\{x_1, \dots, x_n\} \in R^k \xrightarrow{\varphi} \{Y_1, \dots, Y_n\} \in R^2 \\ \{P_1, \dots, P_n\} \leftrightarrow \{Q_1, \dots, Q_n\}

 

This correspondence holds when {P1, …, Pn} and {Q1, …, Qn} are as similar as possible.

Hence, our objective function to minimize is the Kullback-Leibler divergence:

C = \sum_{i=1}^n D(P_i||Q_i) = \sum_{i,j} P_{j|i} \ln\left(\frac{P_{j|i}}{Q_{j|i}}\right)

 

Explanatory note:


Kullback-Leibler Divergence (also known as relative entropy):

Given two probability distributions

P_1, …, P_k and Q_1, …, Q_k

we define the Kullback-Leibler divergence as:

D(P||Q) = \sum_{i=1}^k P_i \ln\left(\frac{P_i}{Q_i}\right)

With the following properties:

1. D(P||Q) ≥ 0

2. D(P||Q) = 0 if and only if P = Q

3. D(P||Q) ≠ D(Q||P): it is not symmetric (see the short numerical check below)
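As a quick illustration of the asymmetry in property 3, here is a minimal check using scipy.stats.entropy, which computes the Kullback-Leibler divergence when given two distributions; the toy distributions P and Q below are arbitrary examples, not data from this tutorial:

import numpy as np
from scipy.stats import entropy

# Two arbitrary toy distributions over three outcomes
P = np.array([0.9, 0.05, 0.05])
Q = np.array([1/3, 1/3, 1/3])

print(entropy(P, Q))   # D(P||Q)
print(entropy(Q, P))   # D(Q||P): a different value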


 

What is the meaning?

  • If P_{j|i} is large, the two objects are close in the original space.
  • If Q_{j|i} is large while P_{j|i} is small, two objects that are far apart in R^k are being placed close together (not good!). However, this error is penalized only lightly, because the term is weighted by P_{j|i}. This is why SNE preserves the local neighbourhood structure at the expense of the global one.

How does C change as we move the points in the low-dimensional space? How should we move them?

This is done by studying the partial derivatives of C with respect to the embedded points:

First, let us remember that the density of points around each centre is governed by σ (the dilation/smoothing parameter of the Gaussian), which makes the other values look closer or further away.
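For reference, the gradient derived in the original SNE paper, which a gradient-descent implementation follows, has the form:

\frac{\partial C}{\partial y_i} = 2 \sum_j \left(P_{j|i} - Q_{j|i} + P_{i|j} - Q_{i|j}\right)(y_i - y_j)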

 

How can we choose σ?

We define a new quantity called Perplexity:

Perplexity(P_i) = 2^{H(P_i)}, \quad H(P_i) = -\sum_j P_{j|i} \log_2 P_{j|i}

Keeping the perplexity fixed at a chosen value, each σ_i is adjusted (for instance by binary search) so that the effective number of neighbours is the same for every point.
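The following minimal sketch shows how such a binary search on σ could look for a single point; it is illustrative only (the function name, tolerances and iteration limit are ours), not scikit-learn's internal routine:

import numpy as np

def sigma_for_perplexity(sq_dists, target_perplexity, tol=1e-5, max_iter=50):
    # Binary-search beta = 1 / (2 * sigma^2) so that the entropy of P_{j|i}
    # matches log2(target_perplexity). `sq_dists` holds the squared distances
    # from point i to every other point.
    beta, beta_lo, beta_hi = 1.0, 0.0, np.inf
    target_entropy = np.log2(target_perplexity)
    for _ in range(max_iter):
        p = np.exp(-beta * sq_dists)
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:      # distribution too flat: shrink sigma (raise beta)
            beta_lo = beta
            beta = beta * 2 if np.isinf(beta_hi) else (beta_lo + beta_hi) / 2
        else:                             # distribution too peaked: enlarge sigma (lower beta)
            beta_hi = beta
            beta = (beta_lo + beta_hi) / 2
    return np.sqrt(1.0 / (2.0 * beta)), p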

 

Problem: with high-dimensional data, the far-away points end up crowded together into a single block in the low-dimensional map (the so-called “crowding problem”). This distortion is the price we pay for representing the neighbourhood structure faithfully.

 

Solution: the correction in t-SNE

Therefore, we want to remedy this concentration of placement error. To do so, we act on the shape of the distribution used in the low-dimensional space. The key insight is very simple: we raise the tails of the distribution. In this way the originally close points (distances close to one) are shifted by a negligible amount, whereas the far-away points undergo a considerable shift that pushes them apart from each other.
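Concretely, t-SNE replaces the Gaussian in the low-dimensional space with a Student's t-distribution with one degree of freedom, whose heavier tails produce exactly this effect:

Q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}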

 

 

 

 

Let us now use it in Python

We just have to import it from the scikit-learn library, where it is implemented:

from sklearn.manifold import TSNE

Here is the TSNE constructor (we are going to discuss the most relevant parameters):

TSNE(n_components=2, perplexity=30.0, early_exaggeration=12.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', init='random', verbose=0, random_state=None, method='barnes_hut', angle=0.5)

 

 

  • n_components: the number of output dimensions (default 2).
  • perplexity: the main parameter. The higher the perplexity, the larger the number of neighbours taken into account. A good rule of thumb is to set it between 5 and 50.

Pay attention!

 

If the perplexity is too small we will not capture enough neighbours; on the contrary, an excessively large value will pull together points that are actually far apart. For large datasets, increasing this parameter is recommended.
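As a quick, purely illustrative experiment (run on the small digits dataset shipped with scikit-learn rather than MNIST), one can compare a few perplexity values and inspect the final KL divergence reached by each run; keep in mind that KL values obtained with different perplexities are not directly comparable, so this is only a rough sanity check:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Compare a few perplexity values on a small dataset
X_digits, _ = load_digits(return_X_y=True)
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embedding = tsne.fit_transform(X_digits)
    print(perplexity, tsne.kl_divergence_)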

 

  • learning_rate: we suggest a range between 10 and 1000.

But how should we choose it within such a wide range?

Let us consider two cases:

First case: the embedded points end up almost equidistant from one another, forming a compact ball. This means the learning rate is too high.

Second case: the points collapse into a dense cluster with only a few outliers. This means the learning rate is too low.

 

 

Now we want to apply it to real data:

The MNIST dataset, consisting of 70,000 images of handwritten digits.

 

Let us import the required packages and data.

%pylab inline
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA

# Load MNIST dataset (note: on recent scikit-learn versions fetch_mldata has been
# removed; fetch_openml("mnist_784") is the replacement)
mnist = fetch_mldata("MNIST original")
X, y = mnist.data / 255.0, mnist.target  # scale pixel values to [0, 1]

We first reduce the dimensionality with PCA: this works well for dense data, whereas for sparse data TruncatedSVD can be the better choice, as sketched after the code below.

indices = arange(X.shape[0])
random.shuffle(indices)                        # shuffle so the subset is drawn at random
n_train_samples = 5000                         # work on 5000 samples to keep t-SNE fast
X_pca = PCA(n_components=50).fit_transform(X)  # pre-reduce to 50 dimensions with PCA
X_train = X_pca[indices[:n_train_samples]]
y_train = y[indices[:n_train_samples]]
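For sparse inputs, TruncatedSVD would be the drop-in alternative to the PCA step above; a minimal sketch (X_sparse is a placeholder for a scipy sparse matrix, not a variable defined in this tutorial):

from sklearn.decomposition import TruncatedSVD

# TruncatedSVD works directly on scipy sparse matrices, while PCA requires dense input
X_svd = TruncatedSVD(n_components=50).fit_transform(X_sparse)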

Let us run t-SNE; the perplexity has been set to 40 after a few attempts.

X_train_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_train)

Now let us plot our results:

matplotlib.rc('font', **{'family' : 'sans-serif',
                         'weight' : 'bold',
                         'size'   : 18})
matplotlib.rc('text', **{'usetex' : True})

def plot_mnist(X, y, X_embedded, name, min_dist=10.0):
    fig = figure(figsize=(10, 10))
    ax = axes(frameon=False)
    title("\\textbf{MNIST dataset} -- Two-dimensional "
          "embedding of 70,000 handwritten digits with %s" % name)
    setp(ax, xticks=(), yticks=())
    subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=0.9,
                    wspace=0.0, hspace=0.0)
    # Scatter plot of the embedding, coloured by digit label
    scatter(X_embedded[:, 0], X_embedded[:, 1],
            c=y, marker="x")

    if min_dist is not None:
        # Overlay digit thumbnails, skipping locations that are too close
        # to an image that has already been drawn
        from matplotlib import offsetbox
        shown_images = np.array([[15., 15.]])
        indices = arange(X_embedded.shape[0])
        random.shuffle(indices)
        for i in indices[:5000]:
            dist = np.sum((X_embedded[i] - shown_images) ** 2, 1)
            if np.min(dist) < min_dist:
                continue
            shown_images = np.r_[shown_images, [X_embedded[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(X[i].reshape(28, 28),
                                      cmap=cm.gray_r), X_embedded[i])
            ax.add_artist(imagebox)

plot_mnist(X[indices[:n_train_samples]], y_train, X_train_embedded, "t-SNE", min_dist=20.0)

 

t-SNE finished in 17.51s

 

If we want to compare it with the other dimensionality reduction techniques mentioned earlier, there are two critical issues to point out:

 

  1. Effective discrimination ability

Setting the parameters correctly is not always easy. The following link lets you modify the main parameters and see the resulting data layout in real time; keep in mind that the data shown there are simulated and well understood.

https://distill.pub/2016/misread-tsne/

 

  2. Execution time:

Compared with other tools, t-SNE requires far more time and resources. As you can see, while PCA takes about 0.01s, t-SNE takes about 17s. This has to be taken into account whenever running time is a crucial factor.
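A simple way to measure this on your own machine is a rough timing sketch reusing the X_train subset built above (the actual numbers will vary with hardware and parameters):

import time
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Time PCA and t-SNE on the same 5000-sample subset
start = time.time()
PCA(n_components=2).fit_transform(X_train)
print("PCA:   %.2fs" % (time.time() - start))

start = time.time()
TSNE(n_components=2, perplexity=40).fit_transform(X_train)
print("t-SNE: %.2fs" % (time.time() - start))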

However, it is possible to provide some insights into how to tackle this problem. Here are two papers by Laurens van der Maaten on the subject:

 

Fast Optimization for t-SNE, Laurens van der Maaten

http://cseweb.ucsd.edu/~lvdmaaten/workshops/nips2010/papers/vandermaaten.pdf

Accelerating t-SNE using Tree-Based Algorithms, Laurens van der Maaten

http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf

 

