Author: Matteo Alberti
In this tutorial, we want to open a miniseries dedicated to manifold-based dimensionality reductions tools. So let’s start by understanding what a Manifold is and when it is important without deepening the underlying mathematics.
“manifold is a mathematical space that on small scale resembles the Euclidean space of a specific dimension”
So Why and When do we need a manifold?
The answer is simple: When linear methods fail
In linear methods (like PCA, MDS) data can only be summarized by a linear combination of features, this means that they cannot discover for example some 1D structures.
Let me show with this figure:
Linear Methods cannot discover more complex structures of data and reduce by it we will lose the geometry of data (and information)
manifold-Based tools otherwise succeed
- Local Linear Embedding (LLE)
This because our goal is to “preserve local geometry”
The results in essentials in many applications (in computer vision it’s helpful for pose estimation, data compression, image de-noising and missing data interpolation)
So let’s start with the first tools: Isometric feature mapping
This algorithm has three main steps:
- Determines which points are neighbours on a manifold based on distance (Euclidean distance) For each point, we connect all points within a fixed radius (where we have to choose radius) or like KNN (K nearest neighbouring algorithm) we have to choose K number of neighbours.
All neighbourhood relations are represented as a weighted graph
- The second step, we will go to estimates the geodesic distance between all pairs of points. But while we are working on a manifold the shortest distance is given by the shortest path in the graph (for example Djkstra’s algorithm also used in routing/navigation)
- In this last step, we will go to apply MDS (multidimensional scaling) to the matrix of graph distances due to constructing an embedding of the data in a d-dimensional space (Euclidean space) where we have preserved the manifold’s geometry. The coordinate vectors in the Euclidean-Space are chosen due to minimizing the cost function:
Where Dy is the Euclidean-based matrix of distances and L2 corresponds to the square of the sum of elements. τ is a function that converts distances to inner products (due to support efficient optimization)
The global minimum is achieved setting Euclidean-space’s coordinates to the top d eigenvectors of the matrix
Let me implement:
First of all, we have to import packages from scikit-learn
import numpy as np import matplotlib.pyplot as plt from sklearn import manifold
OUR_DATA= manifold.Isomap(n_neighbors=N, n_components=C, eigen_solver=’auto’, tol=0, max_iter=None, path_method=’auto’, neighbors_algorithm=’auto’, n_jobs=1)
We want to describe the main parameters:
- n_neighbors= must be an integer, is the number of neighbours (like k-NN)
- n_components= must be an integer, is the number of coordinators for manifold usually 2
- path_method= auto is the best choice, but you can also set with FW for Floyd-Warshall or D for ours Dijkstra’s algorithm
If we want to plot our result:
from time import time t0=time() ISOMAP_DATA= manifold.Isomap(n_neighbors, n_components=2).fit_transform(OUR_DATA) t1=time() print("%s: %.2g sec" % ('ISO' t1-t0)) plot.embedding(ISOMAP_DATA, "ISOMAP ottenuto in " % (t1-t0) plot.show()
ISOMAP is a strong tool, but we have to do some considerations. While we can preserve the local structure (local geometry of data) this is an expense of two factors:
- Sensitive to noise
- Few free parameters
This because the functionality of the algorithm depends almost on the choice of the radius. This means that just a few outliers can break the mapping.
For these reasons ISOMAP highly suggests when we just have an idea of the structure of the data even because results computationally expensive (less than Entropy-based models)
http://scikit-learn.org/stable/index.html (Official Website of Scikit libraries)
http://web.mit.edu/cocosci/Papers/sci_reprint.pdf (documentation about ISOMAP)