A Tutorial on Linear Algebra and Geometry ( Part 2)

 Author: Paolo Caressa


Points, vectors and their algebra

So far we dealt with points as identified with pairs (or triples for space) of real numbers: as far as Machine Learning is concerned, we are interested infinite sets of points, which represent particular objects to classify or phenomena to correlate as points in a Cartesian space. However, to find out some regular behaviour or pattern for those points, often we are interested in lines, planes and also more complex geometric figures, as conics or quadrics.

We have already seen, by an example, that some geometric figures may also be represented by single equations or systems of equations. Let us bound ourselves to lines and planes to understand in general how to represent them since often those representations are used in Machine Learning.


A line is, intuitively speaking, a set of collinear points: one of Euclid’s axioms claims that given two distinct points there is exactly one line passing through them so that to identify uniquely a line two of its points suffice.

To understand how to do that in the Cartesian plane, let us consider two points P ≠ Q, whose coordinates are P=(x1,y1) e Q=(x2,y2). Let us define the vector from P to Q as the pair of real numbers





Thus a vector is by definition the difference between two points, whereby “difference” we mean the coordinate-wise difference.


From the IT point of view, we may imagine a point as an array (or list) with two elements (which are numbers). Usually an array A = [x,y] is indexed starting by 0, thus A[0] = x and A[1] = y. Then, if A and B are arrays, the vector   \overline{AB}  may be represented by the array [B[0] – A[0], B[1] – A[1]].



Why do we call “vector” a difference of points? Because it is an object which has direction, sense and magnitude, arrays as vectors used in Physics.

Its direction is just the line passing through P and Q; its orientation is from P toward Q (the opposite vector would be    \overline{QP}  which runs from Q toward P: which are its coordinates?)

The magnitude is the distance between points P and Q. Therefore a vector determines a line, and a sense on it: of course there are infinite vectors sharing the same direction (the same line).


For example, the line passing through  P= (0,1)  and  Q= (2,-1)  is determined by the vector , but also by the vector  \overline{PQ} which has the same direction but different magnitude.

In general, for each number, a ≠ 0 the vector  (-a,a)  induces this line. We say that two vectors are paralleli if one is multiple of the other one by a non-zero scale factor.

Indeed, another way to determine a line is to pick one of its points   P and to provide a vector   \overline{V} so to be able to express all the line’s points as translated by vectors parallel to \overline{V}.


foto 14


This is called the parametric equation of the line since it expresses the generic point  on the line via a parameter  which varies through all real numbers.

Notice that we used the operation of the sum between a point and a vector, defined as follows:


foto 15



Although  vectors are useful to describe lines (but also planes and, in higher dimension, “hyperplanes”), they are interesting in themselves because of their algebraic properties, such as:

  1. Vectors may be added to get new vectors according to the rule

foto 16



foto 17

By a vector space, we mean a set of elements equipped with two operations, sum between elements and multiplication by a number, which satisfies the previous properties shared by vectors in the Cartesian plane as defined as differences of points.

Actually, in Machine Learning, one always uses finite-dimensional vector spaces whose vectors are expressed in coordinates, thus vectors are identified with n-tuples of numbers (their coordinates). This is a major element of confusion for beginners since both points and vectors in the Cartesian space are represented as a pair of numbers.


foto 18


Actually, in Machine Learning, one always uses finite-dimensional vector spaces whose vectors are expressed in coordinates, thus vectors are identified with n-tuples of numbers (their coordinates). This is a major element of confusion for beginners since both points and vectors in the Cartesian space are represented as a pair of numbers.

However, even if they are represented in the same way, points and vectors are conceptually different objects. A point identifies a single point in the space, while a vector identifies a displacement along a certain direction, with given sense and magnitude: points and vectors may be put in one-to-one correspondence as follows: To a point P it is associated the vector \overline{OP}  which starts from the origin and ends in P; to a vector  \overline{V} instead, we may associate the point 0 +\overline{V} .



This distinction between points and vectors is often overlooked but t is nevertheless important, also because it helps in understanding application of the theory: for example, in the following picture some words of a document corpus are represented, mapped by means of a Machine Learning algorithm to points in the plane:




These are points representing words. What about vectors in this case?

Let us take the vector which displaces “pizza” to “Italy” and let us apply it to the point “sushi”: we vet the point “Japan” (up to a certain approximation). We infer that that vector represents an answer to a question: “given a national dish which is the corresponding nation?”.

In this case, it is clear that points and vectors represent distinct concepts.





By applying the “arccosine” function, it is possible to compute this angle starting from the scalar product between two vectors and from their lengths: the geometrical meaning of angle for this quantity is explained via the trigonometric interpretation of the cosine function.



Recall that  cos\theta   is a number greater or equal to -1  and less or equal  1  to , such that:

  • If the two vectors have the same direction and the same sense then the cosine is equal to 1.
  • If the two vector has the same direction ut opposite since then the cosine is equal to -1 .
  • If the cosine is equal to zero, the two vectors are said to be orthogonal.


foto 21

Therefore, while the distance measures the nearness of two points, which the more are near the more their distance is zero, the cosine measures the similarity among directions and senses of the two vectors, so that the more the two vectors are aligned on the same line the more the absolute value of their cosine is near to one.

This cosine similarity is often employed in Machine Learning to classify objects in vector spaces.


The N-dimensional Cartesian space

So far, in the crash course on Cartesian geometry and linear algebra, we bounded ourselves to the case of dimension N=2 and N=3, so to be able to draw pictures and to develop the concept in a somewhat familiar environment: but in real life applications, the dimension of vector spaces may be quite high.

In those cases, we have to give up to geometric intuition, although all formulas and concept developed so far generalize trivially to the case of higher dimensions.

Indeed such formulas depend on sum and subtractions of points coordinates and vector components: if those coordinates are 2 or 20,000 it makes no real difference. Better still, all the theory is easily and efficiently implemented on a computer, which may deal with points and vectors in spaces of high dimensions without problems, since they are represented as arrays of numbers.

For example, let us consider the concept of a hyperplane: in the plane, this coincides with the concept of line and in the space with the concept of a plane.

Let us fix a dimension N and let us consider the Cartesian space RN.

A line in this space is, as in dimension 2, determined by a pair of points, or by a point and a vector: the parametric equation is just the same as in the dimension 2 case.

If N > 2 we may also consider parametric equations of the following form:


foto 22

In this case, we have two parameters which vary independently, so that, intuitively, the set of points X which satisfy this equation, when a and b vary in all real numbers, corresponds to the set of pairs (a,b) thus to the plane. In particular, it is a bidimensional object.


Actually this is not always true: for example if  \overline{v}= \overline{w} then the parametric equation becomes

foto 23

which actually describes a line.

Therefore, if we write a parametric equation with many parameters, the dimension of the set of points described by this equation depends on the relations between the vectors which appear in the parametric equation itself. In the previous case, if vectors   \overline{v}  and \overline{w}  are parallel then the equation represents a line and not a plane.

If in general, in the N-dimensional space, we write a parametric equation with N – 1 parameters, we get:


foto 24


foto 25

Matrices and their algebra

A major feature of linear algebra is the efficiency and universality of its numerical methods. Actually, it suffices to implement a single algorithm (or one of its variants), namely Gauss’ elimination, to be able to do practically everything in an effective way (solving equations, checking linear independence, etc.). These algorithms are available in each standard numerical computation library, such as Python’s numpy.linalg.

To close this tutorial (already far too long) it is worth to introduce the key notion which is involved in each of those algorithms, and which is also crucial in the conceptual developments of linear algebra: the concept of a matrix.

A matrix is just a table of numbers, and it may be considered as a bidimensional array. More formally, an n×m matrix is a table of numbers which are singly addressed by means of two indexes, i and j, where the first one addresses the row and the second one addresses the column. At the crossing of row I and column j, there is the number  which is pointed by those indexes (notice that in mathematics indexes usually runs from 1 on, not from 0 on as in computer science).

The tabular representation of a matrix is as follows:

foto 26

A matrix such as n = m is said to be a square matrix.

In practice, a matrix is just a vector of length nm whose elements are displayed by rows and not as a single sequence. However, this notation change is fundamental to use those objects.

In particular, matrices enrich vector algebra with a new operation, a multiplication. In the first place let us notice that we may add matrices and multiply them by a single number to get again matrices of the same type:

foto 27


Therefore, n×m matrices do form a vector space of dimension nm.

Recall that a vector may be multiplied by a number, to get a new vector and that two vectors may be multiplied to get a number (via the dot product). But we do not know how to multiply, say, a vector by itself to get a new vector.

If we write vector as matrices we can actually multiply them: indeed we may distinguish two kinds of vectors when written in matrix form, row vectors and column vectors. A row vector is a sequence of numbers, we have already met them, for example (1,2,0).


A column vector is a vector written from top to bottom as in


foto 28


At first sight, this is just a matter of notation, but actually, if we interpret a vector as a particular kind of matrix, a row vector is a 1×N matrix, while a column vector is an N×1 vector.

Now, given an n×m matrix A and n×r matrix B, we may multiply A by B to get a new n×r matrix. The entry in the matrix AB addressed by indexes i and j is defined as:


foto 29


Notice that this is the dot product of the row vector given by the i-th row in A by the column vector given by the j-th column in B.

Example: let us multiply a 2×3 matrix times a 2×3 matrix:



foto 30


Now we come back to vectors: we may multiply a row vector times a column vector to get a 1×1 matrix (which is just a number) and this is the dot product. But we can also multiply an N×1 column vector times a 1×N row vector as matrices and get an N×N matrix, as in:


foto 31


However in this way, we multiply two vectors belonging to N-dimensional spaces and we get a vector in a vector space with a different dimension, either 1 or N×N.

The identity matrix is the square matrix whose entries are zero but for the one on the diagonal which are 1 (diagonal elements in a matrix are the ones with the row index equal to the column index: a_i_i ). For example the 3,,×3 identity matrix is


foto 32


As the name suggests, on multiplying a matrix A times the identity matrix still we get A. Moreover, the matrix product is both associative and distributive with respect to matrix sum.

However, matrix algebra displays a particular and interesting feature: the matrix product is not commutative in general. Thus AB is generally different from BA (indeed BA may well be meaningless, for example, if n m).

For example:

foto 33



Another typical operation is the multiplication of an n×m matrix times a column vector with m components: the result is a column vector with n components.

Introduction to Dimensionality Reduction and Linear Algebra basics (part 2)

Author : Matteo Alberti



Dimensionality Reduction in a linear space. 1

Through the identification of a subspace. 1

Reduction through matrix approximations. 1

Basic case: Decomposition in Singular Values (SVD). 1

Matricial Norms. 1

Vectorial Norms. 1

Induced Norms. 1

Schattern Norms. 1

Frobenius Norms. 1

Base Cases: Cluster Analysis. 1

Definition of a Metrics. 1

Minkowski Distance (Manhattan, Euclidean, Lagrange).



Matricial Norms


At this point we have set the problem of dimensionality reduction of data as an approximation problem between matrices, we must now evaluate and then calculate the distance between the matrix of the original and the approximate data through the study of the different standards:

There are three main types of rules:

  • Vector norms
  • Induced norms
  • Schatten norms

Where we essentially refer, exceptions excluded, to the Frobenius norm (Euclidean distance)


Elements of algebra:


A standard (commonly marked with ‖ ∙ ‖) is a function from the vector space fo matrix if:


ecuation 18

Vectorial Norms


The vector norms family treats the  array as a vector of nk components where we can define the norm using any of the following rules:


ecuation 19





Setting p = 2 we are connected to the Euclidean norm


Induced Norms


An  X_n_x_k  matrix can be seen as a linear operator from   R^k\mapsto R^n .

Measuring in R^k  the lengths with a fixed norm and doing the same in  R^n , with a different norm, we can go to measure how  much X lengthens or shortens a vector   v \in  R^k , comparing the norm of v with the relative norm of his image Xv.

The induced norm  \left \| X \right \|_k  _n  is defined as:


ecuation 20


Schatten Norms


The Schatten norm, of order p, of an X matrix, is simply given by:


ecuation 21


Where  w_i are singular values


Frobenius Norms


The Frobenius norm of our  X_n_x_k  matrix is given by:


ecuation 22


Explicating the matrix product we obtain:


ecuation 23


It corresponds that the Frobenius norm is equal to the sum of square roots of the square.

A Euclidean norm is seen as a vector coincides with the vector rule of X of order 2.


Elements of algebra:




The trace operator, indicated by Tr (∙), is defined as the sum of the diagonal elements of the argument matrix



Base Cases: Cluster Analysis


Cluster analysis is a multivariate analysis technique through which it is possible to group statistical units, to minimize the internal “logical distance” to each group and to maximize the one between the groups.

It is among the unsupervised learning techniques.

It is therefore spontaneous to have to define what is meant by logical distance and based on which metric.




Definition of a Metrics


ecuation 24

If it contrarily enjoys only the first three properties, we can define it as an index of the distance


Minkowski Distance (Manhattan, Euclidean, Lagrange)


At this point we are going to analyze the main cases of distances belonging to the Minkowski distances family where:

ecuation 25


We highlight the following cases:


We highlight the following cases:

  • k=1  Manhattan distance
  •  k=2 Euclidean distance
  • k\to \infty  Lagrangian distance (Čebyšëv)

As we can see:


ecuation 26



Therefore, starting with the example of Cluster Analysis, it is essential to define the type of distance with which we want to deal with our analysis.

Mainly in the packages already implemented are the three variants of Minkowski distances (for quantitative variables)


Importing from sklearn:

AgglomerativeClustering(n_clusters=2, affinity=’euclidean’, memory=None, connectivity=None, compute_full_tree=’auto’, linkage=’ward’


ecuation 27

ecuation 27

Introduction to Dimensionality Reduction and Linear Algebra basics (Part 1)

Author: Matteo Alberti




Dimensionality Reduction in a linear space. 1

Through the identification of a subspace. 1

Reduction through matrix approximations. 1

Basic case: Decomposition in Singular Values (SVD). 1

Matricial Norms. 1

Vectorial Norms. 1

Induced Norms. 1

Schattern Norms. 1

Frobenius Norms. 1

Base Cases: Cluster Analysis. 1

Definition of a Metrics. 1

Minkowski Distance (Manhattan, Euclidean, Lagrange). 1



Dimensionality Reduction in a linear space


The aim of this first tutorial is to introduce the basic notions of dimensionality reduction from the mathematical point of view (spaces, subspaces, linear maps) and to retrieve the necessary elements of linear algebra (norms, isometry, isomorphism) for each machine learning algorithm.


Therefore, we are going to give a gentle introduction to what a linear reduction is. First of all, we need to identify the concept of “optimal” subspaces.


We have a set of k-dimensional vectors   x_1,..,x_n (n statistical units on k covariate) where we want to identify a p-dimensional subspace  V_p R^k in  and a linear map  \varphi (.)  from R^k   to  V_p such that the relative images  \varphi(x_i) sufficiently approximate our k-dimensional input vectors (in reference to some metric we will define)

We rewrite as:


ecuation 1

elements of algebra:

Vector space


We define a vector space on R a set V, whose elements are called vectors , with the following properties:

There exists in V a defined vector addition operation that associates with each   x,y\epsilon V the vector x+y\epsilon V

  • Vector addition is commutative and associative
  • There is a vector in V , set in 0 and defined as the origin
  • Each vector x\epsilon V has its opposite, indicated with –x t.c. + (-x)= 0

Exists in V a defined multiplication operation for the scalars which associates each x\epsilon V  and at each a\epsilon V  the vector  ax\epsilon V so that:

  • Multiplication between scalars is associative
  • 1x= x, \forall x \in V

It also applies:

  • Scalar multiplication is distributive concerning vector addition


Vector subspace

S is a non-empty subset of the subspace of V where for each  x-x= 0\in S  and every linear  combinations ax+\beta y\in S



Kernel and Image

Let V and W be two vector spaces  L: V\rightarrow W and be a linear application.

The kernel of L is the set of V vectors whose image is the null vector of W.

This set is indicated with ker L.


ecuation 2

The image of L is the set of the vectors of W which are images of some vector belonging to the domain V, that is:

ecuation 3

Linear Map

A linear map  f:V\mapsto W  between real vector spaces is a function with the following properties:



for every   \overline{u}, \overline{x} \in V  and  \lambda \in R 

We are going to define only the cases of main relevance:

Let  f: v\mapsto W  be a linear map, then f is said:

  • monomorphism if f is one-to-one
  • epimorphism if f is onto
  • isomorphism if f is bijective (injective + surjective)
  • endomorphism if V = W
  • automorphism if V = W and is bijective.
  • rank equal to r if r = dim f (V)

Suppose at this point that we have identified a subspace (specified later) v_p  that is sufficiently “approximate” and is  b_1,..., b_p a base  of   V_p (Note: the base is formed by k-dimensional vectors since is subspace of  R^k )

The linear map \varphi (.) associates to the input   the element    \varphi= (x_i) of  v_p , having the following form:

ecuation 5

a_i_j chosen appropriately.

The approximation error from  then from  to  will consist of the appropriate vector norm:

ecuation 6

Fixed our base  b_1,...,b_p  our vectors  \varphi(x_i) can be found directly from the coordinates  a_i_j .

Given the map  \varphi(.) that associates with the generic vector  \varphi(x_i) its vector of coordinates a_i_j and we compose with the map  \varphi (.) we obtain a map that associates the coordinates of their images with :


ecuation 7


ecuation 8

Through our map the k-dimensional input vectors are represented in p-dimensional vectors, i.e. as elements of R^p .

At this point, it is immediate that we can proceed with further analysis in R^p .


So let’s investigate how this reduction in dimensionality goes to keep of the original geometry (and to lose)

ecuation 8 1

ecuation 10

Going from k to p dimensions with   involves a loss of information and a deformation of the original geometry of the data. Also w,e have to remeber that any kind of linear maps can not be an isomorphism or an isometry.

This means   p\lt k that all rules, scalar products and distances are not preserved




v_p dimensional subspace


ecuation 11 1

orthogonal complement of R^k


Each vector can be decomposed into a sum of vectors (and a vector )

So, given two vectors  we have:


Each vector  x \in R^k  can be decomposed into a sum of vectors

ecuation 12

So, given two vectors  u,v , \in R^k  we have:

ecuation 13

As can be seen, the scalar product is not preserved.



Elements of algebra:

orthogonal complement

Let   S \subseteq V  be a subspace of V, we define orthogonal complement of D in V indicating it with tS^\bot   the subset of V defined by:


ecuation 14

That is a subset of all the vectors of orthogonal V to all the vectors of S

v_p\mapsto R^p


The linear map  \varphi\left ( . \right )  is an isomorphism the vector spaces  V_p  and  R^p  , this means that there will be no loss of information.

We consider the Euclidean product and suppose that the base  b_1 ....b_p   is orthonormal,  h,w  two vectors of  V_p  with the relative coordinate vectors:


ecuation 15

then results truly the following condition:


ecuation 16


It means that the product scalar between h, w is identical to the scalar product of the respective initial coordinate vectors (specify that the two scalar products are defined in different spaces although they have been written in the same way)


If the base was not orthonormal the factors  \left ( b_i...b_j \right )   would not be reduced to 0.

In the first case, the scalar product among the coordinate vectors is given by:  \left [ a^h \right ]^t I  a^w  while in the second case it would be given by  \left [ a^h \right ]^t G a^w  , with Gram matrix G it follows therefore that matrices built in the second way and associated with different bases are different from one another (having to readjust the calculation of the scalar products, norms, induced distances …)





Elements of algebra:



A base is defined as orthonormal when it is composed orthogonal vectors of unitary norms


Reduction through matrix approximations


We want to give a second vision to the dimensionality reduction based on the matrix approximation (what will be used in practice in the most of programming language)



imagen matriz 1

Then we can rewrite it as:


ecuation 17

The rows of θ are given by the linear combinations of the rows of B that is from our base, with the coefficients given by the rows of A, the coordinates on the chosen base.

Our problem dimensionality reduction, therefore, corresponds to identifying a vector subspace of dimension p (p <k) our chosen base (B) and of the relative coordinates given by the matrix A.


In data analysis, where our starting point is the data matrix, the different reduction techniques differ according to the type of approximation, decomposition and choice among the many possible bases.





Basic case: Decomposition in Singular Values (SVD)


Let’s implement in Python a simple decomposition in singular values (SVD) that is, let’s divide our starting matrix X into the two matrices A and B seen previously:

import numpy as np

X = np.array([3,2,4],[3,6,7],[2,1,4])

autoval, autovett = np.linalg.eig(X)





A Tutorial on Linear Algebra and Geometry ( Part 1)

 Author: Paolo Caressa





In machine learning and deep learning, as in any other interdisciplinary area, concepts and formalisms coming from different sources and fields are used, and often they require different mindsets to be actually understood and properly managed.

For this reason, one says usually that a data scientist should be a hybrid between a computer scientist, a statistician and a mathematician, but few people are actually well-versed in all of those three subjects, and it is unlikely that all those fields will be familiar, easy and intuitive for the same mind.

In particular, among the different mathematical notions which are truly needed to a data scientist, as for anyone seriously interested in understanding how a machine learning system works (or rather why it actually works), we find linear algebra.

This is quite surprising for many mathematicians, which consider linear algebra so easy to be userful only at a first approximation. But nevertheless, as I.M. Herstein wrote, linear algebra is just easy, useful and beautiful!

The aim of this tutorial is to explain, in a simple but precise way, the basic notions of linear algebra: thus, if you reader already know how to compute eigenvalues, or how to perform a singular value decomposition or how to compute a principal component analysis, this tutorial is not for you. Otherwise, enjoy reading!


The Cartesian plane and space

Let us start from the very beginning, and dig into our high school remembrances about Cartesian (or analytic) geometry.

This approach to geometry was devised in the XVIIth century by Descartes and Fermat, and the basic idea is to provide a numerical model both for the plane and space. Essentially, their discovery consists in identifying the points in the plane and in the space with respectively pairs and triples of numbers, so that the plane is one-to-one mapped into the set of pairs (x,y) of numbers where x and y vary independently and in all possible ways.

To begin with, let us recall that the points of the line correspond to the set of real numbers (thus rational and irrational numbers) totally ordered by the relation <.

Indeed, let us fix a point O on the line and let it correspond to 0, next let us fix a point U (different from O) on the line and let it correspond to 1. All other points are determined uniquely by their positions with respect to those two points:

foto 1



In particular, it is clear what we mean by the distance between two points P and Q on the line: if x and y are the real numbers corresponding to P and Q, the distance PQ is just |x – y| (absolute value of the difference):

foto 2


Therefore, on a line, once a point corresponding to 0 and a point corresponding to 1 are fixed we get also a direction for positive numbers: indeed O divides the line in twintosides, and positive numbers are the ones on the side of U, and negative ones are on the opposite side.


The line of Euclidean geometry corresponds one-to-one to the set R of real numbers

To each point it corresponds one and only one number, were we may think to a real number as either a decimal number with a finite number of digits after the decimal point, possibly repeating indefinitely, or a decimal number with infinite non periodic digits after the decimal dot (to be precise we should exclude a periodic repetition of the 9 digit).


Next, let us come to the plane.

Fix a point O in the plane and take the line x passing through it, and also a second line y, perpendicular to x and passing through O. On both x and y we choose the same measure unit (thus a point U different from O and state that the distance OU is 1).

Therefore we may measure the distance between two points on the same line x, as we did before.

The correspondence between a point P and a pair (a,b) of numbers is defined as follows:

  1. We project P on the x line, getting a point P’, and we put a = distance between P’ and O.
  2. We project P on the y line, getting a point P’’, and we put b = distance between P’’ and O.

In this way we uniquely associate P to the pair (a,b): to come back from coordinates (a,b) to points it suffices to find out points P’ and P’’ on lines x and y which have distances a and b from the origin O, to that


foto 3


In this way we uniquely associate P to the pair (a,b): to come back from coordinates (a,b) to points it suffices to find out points P’ and P’’ on lines x and y which have distances a and b from the origin O, to that we draw the perpendicular to the axes x and y through P’ and P’’ and get the point P as intersection point of these two perpendiculars





The plane of Euclidean corresponds one-to-one to the set R2 of pairs of real numbers: to each pair, it corresponds one and only one number.

The notation R2 denotes the ordered pairs of elements of R. In general, if X is a set, Xn is the set of n-tuples of elements of  X.

Which advantages does it bring to identify the plane with the pairs of numbers (a,b)?

  • In the first place, we know what they are, we need not axioms anymore to describe what points, lines, etc. are.
  • Next, we transform geometric problems, involving complex constructions, into algebraic problems, which may be solved by computing.

Thus, Cartesian geometry is the ideal geometry for a computer, since we may reduce logical deduction and cumbersome constructions to numbers and computations. It is not by chance that 2D libraries which are used to draw on a computer screen, or on the window of an application, use coordinates to represent points (plus come attributes for colours, etc.).

For example let us ask: how is a curve represented in the Cartesian plane?

There are two ways:

  • By means of a Cartesian equation, thus a single algebraic relations between the coordinates od any point belonging to the curve.
  • By means of a parametric equation, which describes the algebraic expression of a generic point in the curve, depending on a real parameter.

For example consider the equation x2+y2 = 1 and the equation 2x+y = 0. The former represents a circumference centred in the origin and with radius 1, the latter a line passing through the origin.

Suppose we want to compute their intersections: on using Euclidean geometry we should find out that intersection by means of a geometric construction somehow, while according to the Cartesian geometry approach we are actually asking: which are the points whose coordinates (x,y) simultaneously satisfy the following relations: x2+y2 = 1 e 2x+y = 0?

The latter equation implies y = –2x, which substituted in the former equation provides x2+(–2x)2 = 1, thus x2+4x2 = 1 hence x2 = 1/5, which has two solutions x = 1/√5 and x = –1/√5. Therefore we’ve found the two required intersection points: (1/√5, –2/√5) and (–1/√5, 2/√5).


One could proceed in this way toward higher dimensions: thus, Cartesian space is the set of triples (x,y,z) of real numbers, thus the set R3.

Therefore, a point in the space is uniquely determined by its three Cartesian coordinates, which we may plug into equations to solve geometric problems after we translated them into algebraic ones.


The distance between two points

We have said that along with a line distance are computed by the absolute value of the difference. How are distances in the plane computed?

Interestingly there are different ways to do that, which are all generalizations of the distance on a line: let us the three such distances, which are all particular cases of the so-called Minkowski distances:

  • Euclidean distance
  • Manhattan (or taxi-cab) distance
  • Uniform distance.

Euclidean distance between two points P = (x1,y1) and Q = (x2,y2) in the plane is computed by the following formula:

foto 4

Notice that, if both P and Q belong to the x-axis, this distance reduces to the absolute difference of coordinates: indeed, on supposing that P = (x1,0) and Q = (x2,0) the previous formula becomes   (as usual when we take the square root of a number we consider the positive one).

foto 5

foto 6


(as usual when we take the square root of a number we consider the positive one).

Euclidean distance should be actually called “Pythagorean distance”, since it may be inferred after Pythagoras’ theorem: indeed, let us consider a right triangle OPQ with a vertex in the origin, one cathetus on the x-axis and the other one parallel to the y-axis.



Therefore the coordinates of P will be (x,0) and the coordinates of Q will be (x,y): the distance between O and P is x and the distance between Q and P is y, so that, by Pythagoras’ theorem, d(O,Q)2 = d(O,P)2 + d(P,Q)2 = x2 + y2. Hence in this case, on taking the square root:

foto 7

and the formula is proven. In the general case the reasoning is similar, but coordinates of O (which in our case are 0) have also to be taken into account.


foto 8

Thus: Euclidean distance is just the distance between the end points of a linear segment, and it has a simple geometric description.

Distance among points in space is measured in the same way, just  by adding one more coordinate to points: if P = (x1,y1,z1) and Q = (x2,y2,z2) are points in the Cartesian space then their Euclidean distance is:

foto 9


For the sake of completeness we explain two more ways to compute distances in the plane, whose outcomes are different from Euclidean distance but which nevertheless are useful to measure when two points are near (a nice theorem which goes beyond the scope of this tutorial claims that as far as the “nearness” relation is concerned, all those distances are equivalent).

The first one is a common distance in Machine Learning, namely the “Manhattan” distance or “taxi-cab distance”:

foto 10


Notice that, if both P and Q are on the x-line then this distance coincides with the absolute value of their difference. The “taxi-cab” name stems from the fact that this distance does not involve Pythagoras’ theorem, which uses the diagonal shortcut, but starting from P describes the shortest path toward Q set up by horizontal and vertical segments.



foto 11

foto 11

This is like the path followed by a taxi to reach the opposite point of a building block: the car cannot pass through the block but has to follow its border.

Another distance which may prove to be useful is the “uniform distance”, defined as

foto 12

Thus this distance is computed by taking the maximum length of the segment given by the projection of the PQ segment on the x- and y-axes: again, if both points lie on the x-axis this distance equals the absolute value of the difference between point’s coordinates.

Given a distance, a (open) ball is the set of points whose distances are less or equal to a fixed one (its ray) from a fixed point O, the center: thus, if O is the center and r the ray then the ball contains all points such that d(P,O) < r. This terminology is self-apparent in the case of the Euclidean distance: as an exercise, try to figure out which subset of the Cartesian space is balls in the taxi-cab distance.








Functional programming for deep learning

Author: Joyce Xu

Before I started my most recent job at ThinkTopic, the concepts of “functional programming” and “machine learning” belonged to two different worlds entirely. One was a programming paradigm surging in popularity as the world turned towards simplicity, composability, and immutability to maintain complex scaling applications; the other was a tool to teach computers to autocomplete doodles and make music. Where was the overlap?

The more I worked with the two, the more I began realizing that the overlap is both practical and theoretical. Firstly, machine learning is not a stand-alone endeavor; it needs to be rapidly incorporated into complex scaling applications in industry. Secondly, machine learning — and deep learning in particular — is functional by design. Given the right ecosystem, there are several compelling reasons to perform deep learning in an entirely functional manner:

  • Deep learning models are compositionalFunctional programming is all about composing chains of higher-order functions to operate over simple data structures. Neural nets are designed the same way, chaining together function transformations from one layer to the next to operate over a simple matrix of input data. In fact, the entire process of deep learning can be viewed as optimizing a set of composed functions, meaning the models themselves are intrinsically functional.
  • Deep learning components are immutable.When functions operate over the input data, the data is not changed, a new set of values are outputted and passed on. Furthermore, when weights are updated, they do not need to be “mutated” — they can just be replaced by a new value. In theory, the updates to the weights can be applied in any order (i.e. they are not dependent on one another), so there is no need to keep track of a sequential, mutable state.
  • Functional programming offers easy parallelism. Most importantly, functions that are pure and composable are easy to parallelize. Parallelism means more speed and more compute power. Functional programming gives us concurrency and parallelism at essentially no cost, making it much easier to work with large, distributed models in deep learning.

There are many theories and perspectives regarding the combination of functional programming and deep learning, from mathematical arguments to practical overviews, but sometimes it’s most convincing (and useful) just to see it in practice. Here at ThinkTopic, we’ve been developing an open-source machine learning library called Cortex. For the rest of this post, I will introduce some ideas behind functional programming and put them to use in a Cortex deep learning model for anomaly detection.

Clojure Basics

Before we continue on our Cortex tutorial, I want to introduce some basics of Clojure. Clojure is a functional programming language that’s really good at two things: concurrency and data processing. Fortunately for us, both of those things are incredibly useful for machine learning. In fact, one of the primary reasons we use Clojure for machine learning is the fact that day-to-day work in preparing datasets for training (data manipulation, processing, etc.) can easily outweigh the work of implementing the algorithms, especially when we have a solid library such as Cortex for learning. Using Clojure and .edn (instead of C++ and protobuf), we can gain leverage and velocity on ML projects.

For a more in-depth introduction to the language, take a look at the community guide here.

On with the basics: Clojure code is made up of a bunch of expressionsthat are evaluated at run-time. These expressions are wrapped in parentheses, and are typically treated as function calls.

(+ 2 3)          ; => 5
(if false 1 0)   ; => 0

There are 4 basic collection data structures: vectors, lists, hash-maps, and sets. Commas are treated as whitespace, so they are typically omitted.

[1 2 3]            ; vector (ordered)
‘(1 2 3)           ; list (ordered)
{:a 1 :b 2 :c 3}   ; hashmap or map (unordered)
#{1 2 3}           ; set (unordered, unique values)

The single quote in front of the list simply prevents it from being evaluated as an expression.

Clojure also comes with many, many, built-in functions to operate over these data structures. Part of the beauty of Clojure is that it was designed to have many functions for very few data types, as opposed to to having a few specialized functions for each of many data types. Being an FP language, Clojure supports higher-order functions, meaning functions can be passed around as arguments to other functions.

(count [a b c])              ; => 3

(range 5)                    ; => (0 1 2 3 4)

(take 2 (drop 5 (range 10))) ; => (5 6)

(:b {:a 1 :b 2 :c 3})        ; use keyword as function => 2

(map inc [1 2 3])            ; map and increment => (2 3 4)

(filter even? (range 5))     ; filter collection based off predicate => (0 2 4)

(reduce + [1 2 3 4])         ; apply + to first two elements, then apply + to that result and the 3rd element, and so forth => 10

Of course, we can also write our own functions in Clojure, using defn. Clojure function definitions follow the form (defn fn-name [params*] expressions), and they always return the value of the last expression in the body.

(defn add2
(+ x 2))(add2 5)     ; => 7

let expressions create and bind variables within the lexical scope of the “let”. That is, in of the expression (let [a 4] (…)), the variable “a” takes on a value of 4 inside (and only inside) the inner parentheses. These variables are called “locals.”

(defn square-and-add
[a b]
(let [a-squared (* a a)
b-squared (* b b)]
(+ a-squared b-squared)))

(square-and-add 3 4)       ; => 225

That’s it for the basics! Now that we’ve learned some Clojure, let’s put the fun in functional programming and get back to some ML.


Cortex is written in Clojure, and is currently one of the largest and fastest-growing machine learning libraries that uses a functional programming language. The rest of this post will walk through how to build a state-of-the-art classification model in Cortex, and the functional programming paradigms and data augmentation techniques required to do so.

Data Preprocessing

Our dataset is going to be the credit card fraud detection data provided by Kaggle here. It turns out this dataset is incredibly imbalanced, containing only 492 positive fraud cases out of 284,807. That’s 0.172%. This is going to cause problems for us later, but first let’s just take a look at the data and see how the model does.

In order to ensure anonymity of personal data, all the original features except “time” and “amount” have already been transformed to PCA components (where each entry represents a new variable that contains the most relevant information from the raw data). A little data exploration will show that the first “time” variable is fairly uninformative, so we’ll drop that as we’re reading in the data. Here is what our initial code looks like:


(ns fraud-detection.core

(:require [clojure.java.io :as io]

[clojure.string :as string] [clojure.data.csv :as csv] [clojure.core.matrix :as mat] [clojure.core.matrix.stats :as matstats]

[cortex.nn.layers :as layers]

[cortex.nn.network :as network]

[cortex.nn.execute :as execute]

[cortex.optimize.adadelta :as adadelta]

[cortex.optimize.adam :as adam]

[cortex.metrics :as metrics]

[cortex.util :as util]

[cortex.experiment.util :as experiment-util]

[cortex.experiment.train :as experiment-train]))

(def orig-data-file “resources/creditcard.csv”)

(def log-file “training.log”)

(def network-file “trained-network.nippy”)

;; Read input csv and create a vector of maps {:data […] :label [..]},

;; where each map represents one training instance in the data

(defonce create-dataset


(fn []

(let [credit-data (with-open [infile (io/reader orig-data-file)]

(rest (doall (csv/read-csv infile))))

data (mapv #(mapv read-string %) (map #(drop 1 %) (map drop-last credit-data))) ; drop label and time

labels (mapv #(util/idx->one-hot (read-string %) 2) (map last credit-data))

dataset (mapv (fn [d l] {:data d :label l}) data labels)]



Cortex neural nets expect input data in the form of maps, such that each map represents a single labeled data point. For example, a classification dataset could look like [{:data [12 10 38] :label “cat”} {:data [20 39 3] :label “dog“} … ]. In our create-dataset function, we read in the csv data file, designate all but the last column to be the “data” (or features), and designate the last column to be the labels. In the process, we turn the labels into one-hot vectors (e.g. [0 1 0 0]) based on the classification class, because the last softmax layer of our neural net returns a vector of class probabilities, not the actual label. Finally, we create a map from these two variables and return it as the dataset.

Model Description

Creating a model in Cortex is fairly straightforward. First, we’re going to define a map of hyper-parameters to be used later during training. Then, to define a model, we simply string the layers together:


(def params

{:test-ds-size      50000 ;; total = 284807, test-ds ~= 17.5%

:optimizer         (adam/adam)   ;; alternately, (adadelta/adadelta)

:batch-size        100

:epoch-count       50

:epoch-size        200000})

(def network-description

[(layers/input (count (:data (first (create-dataset)))) 1 1 :id :data) ;width, height, channels, args

(layers/linear->relu 20) ; num-output & args

(layers/dropout 0.9)

(layers/linear->relu 10)

(layers/linear 2)

(layers/softmax :id :label)])

network-description is a vector of neural network layers. Our model consists of:

  • an input layer
  • a fully-connected (linear) layer with the ReLU activation function
  • a dropout layer
  • another fully-connected ReLU layer
  • an output layer of size 2 that is passed through the softmax function.

In both the first and the last layers, we need to specify an :id. This id refers to the key in the data map that our network should look at. (Recall that the data map looks like {:data […] :label […]}). For our input layer, we pass in the :data id to tell the model to grab the training data for its forward passes. In our final network layer, we provide :label as the :id, so the model can use the true label to calculate our error with.

Training and Evaluation

Here’s where it gets a little more difficult. The train function itself is actually not so complicated — Cortex provides a nice, high-level call for training, so all we have to do is pass in our parameters (the network, training and testing dataset, etc.). The only caveat is that that the system expects an effectively “infinite” dataset for training, but Cortex provides a function (infinite-class-balanced-dataset) to help us transform it.

(defn train

“Trains network for :epoch-count number of epochs”


(let [network (network/linear-network network-description)

[train-orig test-ds] (get-train-test-dataset)

train-ds (experiment-util/infinite-class-balanced-dataset train-orig

:class-key :label

:epoch-size (:epoch-size params))]

(experiment-train/train-n network train-ds test-ds

:batch-size (:batch-size params)

:epoch-count (:epoch-count params)

:optimizer (:optimizer params)

:test-fn f1-test-fn)))

The complicated part is the f1-test-fn. Here’s the thing: during training, the train-n function expects to be provided with a :test-fnthat evaluates how well the model is performing and determines whether or not it should be saved as the “best network.” There is a default test function that evaluates cross-entropy loss, but this loss value is not so easy to interpret, and it doesn’t suit our imbalanced dataset very well. To get around this problem, we’re going to write our own test function.

But how are we going to test the performance of the model? The standard metric in classification tasks is accuracy, but in a dataset as imbalanced as ours, accuracy is a fairly useless metric. Because positive (fraudulent) examples account for just 0.172% of our dataset, even a model that exclusively predicts negative examples would achieve 99.828% accuracy. 99.828% is a pretty darn good accuracy, but if Amazon really used this model, we may as well all turn to a life of crime and credit card fraud.

Thankfully, Amazon does not use this kind of model, and neither shall we. A much more telling set of metrics is precision, recall, and the F1 (or more generally F-beta) score.

1 1In layman’s terms, precision asks the question: “of all the examples I guessed were positive, what proportion were actually positive?” and recall asks the question: “of all the examples that were actually positive, what proportion did I correctly guess as positive?”

The F-beta score (a generalization of the traditional F1 score) is a weighted average of precision and recall, also measured on a scale of 0 to 1:

2.jpg 2

When beta = 1, we get the standard F1 measure of 2 * (precision * recall) / (precision + recall). In general, beta represents how many times more important recall should be than precision. For our fraud detection model, we’ll use the F1 score as our high score to track, but we’ll log the precision and recall scores as well to check the balance. This is our f1-test-fn:


(defn f-beta

“F-beta score, default uses F1”

([precision recall] (f-beta precision recall 1))

([precision recall beta]

(let [beta-squared (* beta beta)]

(* (+ 1 beta-squared)

(try                         ;; catch divide by 0 errors

(/ (* precision recall)

(+ (* beta-squared precision) recall))

(catch ArithmeticException e


(defn f1-test-fn

“Test function that takes in two map arguments, global info and local epoch info.

Compares F1 score of current network to that of the previous network,

and returns map:

{:best-network? boolean

:network (assoc new-network :evaluation-score-to-compare)}”

[;; global arguments

{:keys [batch-size context]}

;per-epoch arguments

{:keys [new-network old-network test-ds]} ]

(let [batch-size (long batch-size)

test-results (execute/run new-network test-ds

:batch-size batch-size

:loss-outputs? true

:context context)

;;; test metrics

test-actual (mapv #(vec->label [0.0 1.0] %) (map :label test-ds))

test-pred (mapv #(vec->label [0.0 1.0] % [1 0.9]) (map :label test-results))

precision (metrics/precision test-actual test-pred)

recall (metrics/recall test-actual test-pred)

f-beta (f-beta precision recall)

;; if current f-beta higher than the old network’s, current is best network

best-network? (or (nil? (get old-network :cv-score))

(> f-beta (get old-network :cv-score)))

updated-network (assoc new-network :cv-score f-beta)

epoch (get new-network :epoch-count)]

(experiment-train/save-network updated-network network-file)

(log (str “Epoch: ” epoch “\n”

“Precision: ” precision  “\n”

“Recall: ” recall “\n”

“F1: ” f-beta “\n\n”))

{:best-network? best-network?

:network updated-network}))

The function runs the current network on the test set, calculates the F1 score, and updates/saves the network accordingly. It also prints out our evaluation metrics at each epoch. If we run (train) in the REPL now, we get a high score that something that looks like this:

Epoch: 30
Precision: 0.2515923566878981
Recall: 0.9186046511627907
F1: 0.395

Haha. That’s pretty embarrassingly bad.

Data Augmentation

Here’s the problem. Remember how I said our highly imbalanced dataset was going to cause issues for us later? The model currently does not have enough positive examples to learn from. When we call experiment-util/infinite-class-balanced-dataset in our train function, we’re actually creating hundreds of copies of each positive training instance to balance out the dataset. As a result, the model is effectively memorizing those feature values and not actually learning the distinction between the classes.

One way around this problem is through data augmentation, in which we generate additional, artificial data based on the examples we already have. In order to create realistic positive training examples, we are going to add random amounts of noise to the feature vectors of each of our existing positive examples. The amount of noise we add will be dependent on the variance of each feature across the positive class, such that features with a large variance will be augmented with a large amount of noise, and vice versa for features with small variances.

Here is our code for data augmentation:

(defonce get-scaled-variances


(fn []

(let [{positives true negatives false} (group-by #(= (:label %) [0.0 1.0]) (create-dataset))

pos-data (mat/matrix (map #(:data %) positives))

variances (mat/matrix (map #(matstats/variance %) (mat/columns pos-data)))

scaled-vars (mat/mul (/ 5000 (mat/length variances)) variances)]


(defn add-rand-variance

“Given vector v, add random vector based off the variance of each feature”

[v scaled-vars]

(let [randv (map #(- (* 2 (rand %)) %) scaled-vars)]

(mapv + v randv)))

(defn augment-train-ds

“Takes train dataset and augments positive examples to reach 50/50 balance”


(let [{train-pos true train-neg false} (group-by #(= (:label %) [0.0 1.0]) orig-train)

pos-data (map #(:data %) train-pos)

num-augments (- (count train-neg) (count train-pos))

augments-per-sample (int (/ num-augments (count train-pos)))

augmented-data (apply concat (repeatedly augments-per-sample

#(mapv (fn [p] (add-rand-variance p (get-scaled-variances))) pos-data)))

augmented-ds (mapv (fn [d] {:data d :label [0 1]}) augmented-data)]

(shuffle (concat orig-train augmented-ds))))

augment-train-ds takes our original train dataset, calculates the number of augmentations that have to be made to reach a 50/50 class balance, and applies those augmentations to our existing samples by adding a random noise vector (add-rand-variance) based on the allowed variance (get-scaled-variances). In the end, we concatenate the augmented examples back in to the original dataset and return the balanced dataset.

During training, the model will be seeing an unrealistically large amount of positive examples, while the test set will still be only 0.172% positives. As a result, while the model may be able to learn the differences between the two classes better, it will over-predict positive examples during testing. In order to fix this, we can require a higher threshold of certainty to predict “positive” during testing. In other words, instead of requiring the model to be at least 50% certain that an example is positive in order to classify it as such, we can require it to be at least 70% certain. After some testing, I found the optimal value to be set at 90%. The code for this can be found in the vec->label function in the source code, and is called on line 31 of the f1-test-fn.

Using the new, augmented dataset for training, our high scores look something like this:

Epoch: 25
Precision: 0.8658536585365854
Recall: 0.8255813953488372
F1: 0.8452380952380953

Much better!


As always, the model can still be improved. Here are a few ideas for next steps:

  • Are all the PCA features informative? Take a look at the distribution of values for positive and negative examples across the features, and drop any features that do not help distinguish between the two classes.
  • Are there other neural net architectures, activation functions, etc. that perform better?
  • Are there different data augmentation techniques that would perform better?
  • How does model performance in Cortex compare to Keras/Tensorflow/Theano/Caffe?

The source code for the project can be found in its entirety here. I encourage you to try some of these next steps, test out new datasets, and explore different network architectures (we have a great image classification example for reference on conv nets). Cortex is pushing towards its 1.0 release, so if you have any thoughts, recommendations, or feedback, be sure to let us know. Happy hacking!


Deep Learning for Object Detection: A Comprehensive Review

Author Joyce Xu






With the rise of autonomous vehicles, smart video surveillance, facial detection and various people counting applications, fast and accurate object detection systems are rising in demand. These systems involve not only recognizing and classifying every object in an image, but localizing each one by drawing the appropriate bounding box around it. This makes object detection a significantly harder task than its traditional computer vision predecessor, image classification.

Fortunately, however, the most successful approaches to object detection are currently extensions of image classification models. A few months ago, Google released a new object detection API for Tensorflow. With this release came the pre-built architectures and weights for a few specific models:

In my last blog post, I covered the intuition behind the three base network architectures listed above: MobileNets, Inception, and ResNet. This time around, I want to do the same for Tensorflow’s object detection models: Faster R-CNN, R-FCN, and SSD. By the end of this post, we will hopefully have gained an understanding of how deep learning is applied to object detection, and how these object detection models both inspire and diverge from one another.

Faster R-CNN

Faster R-CNN is now a canonical model for deep learning-based object detection. It helped inspire many detection and segmentation models that came after it, including the two others we’re going to examine today. Unfortunately, we can’t really begin to understand Faster R-CNN without understanding its own predecessors, R-CNN and Fast R-CNN, so let’s take a quick dive into its ancestry.


R-CNN is the grand-daddy of Faster R-CNN. In other words, R-CNN reallykicked things off.

R-CNN, or Region-based Convolutional Neural Network, consisted of 3 simple steps:

  1. Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
  2. Run a convolutional neural net (CNN) on top of each of these region proposals
  3. Take the output of each CNNand feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.

These 3 steps are illustrated in the image below:



In other words, we first propose regions, then extract features, and then classify those regions based on their features. In essence, we have turned object detection into an image classification problem. R-CNN was very intuitive, but very slow.

Fast R-CNN

R-CNN’s immediate descendant was Fast-R-CNN. Fast R-CNN resembled the original in many ways, but improved on its detection speed through two main augmentations:

  1. Performing feature extraction over the image beforeproposing regions, thus only running one CNN over the entire image instead of 2000 CNN’s over 2000 overlapping regions
  2. Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model

The new model looked something like this:


3 1

As we can see from the image, we are now generating region proposals based on the last feature map of the network, not from the original image itself. As a result, we can train just one CNN for the entire image.

In addition, instead of training many different SVM’s to classify each object class, there is a single softmax layer that outputs the class probabilities directly. Now we only have one neural net to train, as opposed to one neural net and many SVM’s.

Fast R-CNN performed much better in terms of speed. There was just one big bottleneck remaining: the selective search algorithm for generating region proposals.

Faster R-CNN

At this point, we’re back to our original target: Faster R-CNN. The main insight of Faster R-CNN was to replace the slow selective search algorithm with a fast neural net. Specifically, it introduced the region proposal network (RPN).

Here’s how the RPN worked:

  • At the last layer of an initial CNN, a 3×3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d)
  • For each sliding-window location, it generates multiplepossible regions based on k fixed-ratio anchor boxes(default bounding boxes)
  • Each region proposal consists of a) an “objectness” score for that region and b) 4 coordinates representing the bounding box of the region

In other words, we look at each location in our last feature map and consider kdifferent boxes centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are. This is what it looks like at one sliding window location:

4 1

4 1

The 2k scores represent the softmax probability of each of the k bounding boxes being on “object.” Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an “objectness” score above a certain threshold, that box’s coordinates get passed forward as a region proposal.

Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.


5 1

5 1

Altogether, Faster R-CNN achieved much better speeds and a state-of-the-art accuracy. It is worth noting that although future models did a lot to increase detection speeds, few models managed to outperform Faster R-CNN by a significant margin. In other words, Faster R-CNN may not be the simplest or fastest method for object detection, but it is still one of the best performing. Case in point, Tensorflow’s Faster R-CNN with Inception ResNet is their slowest but most accurate model.

At the end of the day, Faster R-CNN may look complicated, but its core design is the same as the original R-CNN: hypothesize object regions and then classify them. This is now the predominant pipeline for many object detection models, including our next one.


Remember how Fast R-CNN improved on the original’s detection speed by sharing a single CNN computation across all region proposals? That kind of thinking was also the motivation behind R-FCN: increase speed by maximizing shared computation.

R-FCN, or Region-based FullyConvolutional Net, shares 100% of the computations across every single output. Being fully convolutional, it ran into a unique problem in model design.

On the one hand, when performing classification of an object, we want to learn location invariance in a model: regardless of where the cat appears in the image, we want to classify it as a cat. On the other hand, when performing detection of the object, we want to learn location variance: if the cat is in the top left-hand corner, we want to draw a box in the top left-hand corner. So if we’re trying to share convolutional computations across 100% of the net, how do we compromise between location invariance and location variance?

R-FCN’s solution: position-sensitive score maps.

Each position-sensitive score map represents one relative position of one object class. For example, one score map might activate wherever it detects the top-right of a cat. Another score map might activate where it sees the bottom-left of a car. You get the point. Essentially, these score maps are convolutional feature maps that have been trained to recognize certain parts of each object.

Now, R-FCN works as follows:

  1. Run a CNN (in this case, ResNet) over the input image
  2. Add a fully convolutional layer to generate a score bankof the aforementioned “position-sensitive score maps.” There should be k²(C+1) score maps, with k² representing the number of relative positions to divide an object (e.g. 3² for a 3 by 3 grid) and C+1 representing the number of classes plus the background.
  3. Run a fully convolutional region proposal network (RPN) to generate regions of interest (RoI’s)
  4. For each RoI, divide it into the same k² “bins” or subregions as the score maps
  5. For each bin, check the score bank to see if that bin matches the corresponding position of some object. For example, if I’m on the “upper-left” bin, I will grab the score maps that correspond to the “upper-left” corner of an object and average those values in the RoI region. This process is repeated for each class.
  6. Once each of the k² bins has an “object match” value for each class, average the bins to get a single score per class.
  7. Classify the RoI with a softmax over the remaining C+1 dimensional vector

Altogether, R-FCN looks something like this, with an RPN generating the RoI’s:

6 1

6 1

Even with the explanation and the image, you might still be a little confused on how this model works. Honestly, R-FCN is much easier to understand when you can visualize what it’s doing. Here is one such example of an R-FCN in practice, detecting a baby:






Simply put, R-FCN considers each region proposal, divides it up into sub-regions, and iterates over the sub-regions asking: “does this look like the top-left of a baby?”, “does this look like the top-center of a baby?” “does this look like the top-right of a baby?”, etc. It repeats this for all possible classes. If enough of the sub-regions say “yes, I match up with that part of a baby!”, the RoI gets classified as a baby after a softmax over all the classes.

With this setup, R-FCN is able to simultaneously address location variance by proposing different object regions, and location invariance by having each region proposal refer back to the same bank of score maps. These score maps should learn to classify a cat as a cat, regardless of where the cat appears. Best of all, it is fully convolutional, meaning all of the computation is shared throughout the network.

As a result, R-FCN is several times faster than Faster R-CNN, and achieves comparable accuracy.


Our final model is SSD, which stands for Single-Shot Detector. Like R-FCN, it provides enormous speed gains over Faster R-CNN, but does so in a markedly different manner.

Our first two models performed region proposals and region classifications in two separate steps. First, they used a region proposal network to generate regions of interest; next, they used either fully-connected layers or position-sensitive convolutional layers to classify those regions. SSD does the two in a “single shot,” simultaneously predicting the bounding box and the class as it processes the image.

Concretely, given an input image and a set of ground truth labels, SSD does the following:

  1. Pass the image through a series of convolutional layers, yielding several sets of feature maps at different scales (e.g. 10×10, then 6×6, then 3×3, etc.)
  2. For each location in eachof these feature maps, use a 3×3 convolutional filter to evaluate a small set of default bounding boxes. These default bounding boxes are essentially equivalent to Faster R-CNN’s anchor boxes.
  3. For each box, simultaneously predict a) the bounding box offset and b) the class probabilities
  4. During training, match the ground truth box with these predicted boxes based on IoU. The best predicted box will be labeled a “positive,” along with all other boxes that have an IoU with the truth >0.5.

SSD sounds straightforward, but training it has a unique challenge. With the previous two models, the region proposal network ensured that everything we tried to classify had some minimum probability of being an “object.” With SSD, however, we skip that filtering step. We classify and draw bounding boxes from every single position in the image, using multiple different shapes, at several different scales. As a result, we generate a much greater number of bounding boxes than the other models, and nearly all of the them are negative examples.

To fix this imbalance, SSD does two things. Firstly, it uses non-maximum suppression to group together highly-overlapping boxes into a single box. In other words, if four boxes of similar shapes, sizes, etc. contain the same dog, NMS would keep the one with the highest confidence and discard the rest. Secondly, the model uses a technique called hard negative mining to balance classes during training. In hard negative mining, only a subset of the negative examples with the highest training loss (i.e. false positives) are used at each iteration of training. SSD keeps a 3:1 ratio of negatives to positives.

Its architecture looks like this:



As I mentioned above, there are “extra feature layers” at the end that scale down in size. These varying-size feature maps help capture objects of different sizes. For example, here is SSD in action


In smaller feature maps (e.g. 4×4), each cell covers a larger region of the image, enabling them to detect larger objects. Region proposal and classification are performed simultaneously: given p object classes, each bounding box is associated with a (4+p)-dimensional vector that outputs 4 box offset coordinates and pclass probabilities. In the last step, softmax is again used to classify the object.

Ultimately, SSD is not so different from the first two models. It simply skips the “region proposal” step, instead considering every single bounding box in every location of the image simultaneously with its classification. Because SSD does everything in one shot, it is the fastest of the three models, and still performs quite comparably.


Faster R-CNN, R-FCN, and SSD are three of the best and most widely used object detection models out there right now. Other popular models tend to be fairly similar to these three, all relying on deep CNN’s (read: ResNet, Inception, etc.) to do the initial heavy lifting and largely following the same proposal/classification pipeline.

At this point, putting these models to use just requires knowing Tensorflow’s API. Tensorflow has a starter tutorial on using these models here. Give it a try, and happy hacking!



An intuitive guide to deep network architectures

Author: Joyce Xu




GoogLeNet, 2014

Over the past few years, much of the progress in deep learning for computer vision can be boiled down to just a handful of neural network architectures. Setting aside all the math, the code, and the implementation details, I wanted to explore one simple question: how and why do these models work?

At the time of writing, Keras ships with six of these pre-trained models already built into the library:

  • VGG16
  • VGG19
  • ResNet50
  • Inception v3
  • Xception
  • MobileNet

The VGG networks, along with the earlier AlexNet from 2012, follow the now archetypal layout of basic conv nets: a series of convolutional, max-pooling, and activation layers before some fully-connected classification layers at the end. MobileNet is essentially a streamlined version of the Xception architecture optimized for mobile applications. The remaining three, however, truly redefine the way we look at neural networks.

This rest of this post will focus on the intuition behind the ResNet, Inception, and Xception architectures, and why they have become building blocks for so many subsequent works in computer vision.


ResNet was born from a beautifully simple observation: why do very deep nets perform worse as you keep adding layers?

Intuitively, deeper nets should perform no worse than their shallower counterparts, at least at train time (when there is no risk of overfitting). As a thought experiment, let’s say we’ve built a net with n layers that achieves a certain accuracy. At minimum, a net with n+1layers should be able to achieve the exact same accuracy, if only by copying over the same first n layers and performing an identity mapping for the last layer. Similarly, nets of n+2n+3, and n+4layers could all continue performing identity mappings and achieve the same accuracy. In practice, however, these deeper nets almost always degrade in performance.

The authors of ResNet boiled these problems down to a single hypothesis: direct mappings are hard to learn. And they proposed a fix: instead of trying to learn an underlying mapping from x to H(x), learn the difference between the two, or the “residual.” Then, to calculate H(x), we can just add the residual to the input.

Say the residual is F(x)=H(x)-x. Now, instead of trying to learn H(x) directly, our nets are trying to learn F(x)+x.

This gives rise to the famous ResNet (or “residual network”) block you’ve probably seen:



ResNet block


Each “block” in ResNet consists of a series of layers and a “shortcut” connection adding the input of the block to its output. The “add” operation is performed element-wise, and if the input and output are of different sizes, zero-padding or projections (via 1×1 convolutions) can be used to create matching dimensions.

If we go back to our thought experiment, this simplifies our construction of identity layers greatly. Intuitively, it’s much easier to learn to push F(x) to 0 and leave the output as x than to learn an identity transformation from scratch. In general, ResNet gives layers a “reference” point — x — to start learning from.

This idea works astoundingly well in practice. Previously, deep neural nets often suffered from the problem of vanishing gradients, in which gradient signals from the error function decreased exponentially as they backpropogated to earlier layers. In essence, by the time the error signals traveled all the way back to the early layers, they were so small that the net couldn’t learn. However, because the gradient signal in ResNets could travel back directly to early layers via shortcut connections, we could suddenly build 50-layer, 101-layer, 152-layer, and even (apparently) 1000+ layer nets that still performed well. At the time, this was a huge leap forward from the previous state-of-the-art, which won the ILSVRC 2014 challenge with 22 layers.

ResNet is one of my personal favorite developments in the neural network world. So many deep learning papers come out with minor improvements from hacking away at the math, the optimizations, and the training process without thought to the underlying task of the model. ResNet fundamentally changed the way we understand neural networks and how they learn.

Fun facts:

  • The 1000+ layer net is open-source! I would not reallyrecommend you try re-training it, but…
  • If you’re feeling functional and a little frisky, I recently ported ResNet50 to the open-source Clojure ML library Cortex. Try it out and see how it compares to Keras!


If ResNet was all about going deeper, the Inception Family™ is all about going wider. In particular, the authors of Inception were interested in the computational efficiency of training larger nets. In other words: how can we scale up neural nets without increasing computational cost?

The original paper focused on a new building block for deep nets, a block now known as the “Inception module.” At its core, this module is the product of two key insights.

The first insight relates to layer operations. In a traditional conv net, each layer extracts information from the previous layer in order to transform the input data into a more useful representation. However, each layer type extracts a different kind of information. The output of a 5×5 convolutional kernel tells us something different from the output of a 3×3 convolutional kernel, which tells us something different from the output of a max-pooling kernel, and so on and so on. At any given layer, how do we know what transformation provides the most “useful” information?

Insight #1: why not let the model choose?

An Inception module computes multiple different transformations over the same input map in parallel, concatenating their results into a single output. In other words, for each layer, Inception does a 5×5 convolutional transformation, and a 3×3, and a max-pool. And the next layer of the model gets to decide if (and how) to use each piece of information.

3The increased information density of this model architecture comes with one glaring problem: we’ve drastically increased computational costs. Not only are large (e.g. 5×5) convolutional filters inherently expensive to compute, stacking multiple different filters side by side greatly increases the number of feature maps per layer. And this increase becomes a deadly bottleneck in our model.

Think about it this way. For each additional filter added, we have to convolve over all the input maps to calculate a single output. See the image below: creating one output map from a single filter involves computing over every single map from the previous layer.


Let’s say there are M input maps. One additional filter means convolving over Mmore maps; N additional filters means convolving over N*M more maps. In other words, as the authors note, “any uniform increase in the number of [filters] results in a quadratic increase of computation.” Our naive Inception module just tripled or quadrupled the number of filters. Computationally speaking, this is a Big Bad Thing.

This leads to insight #2: using 1×1 convolutions to perform dimensionality reduction. In order to solve the computational bottleneck, the authors of Inception used 1×1 convolutions to “filter” the depth of the outputs. A 1×1 convolution only looks at one value at a time, but across multiple channels, it can extract spatial information and compress it down to a lower dimension. For example, using 20 1×1 filters, an input of size 64x64x100 (with 100 feature maps) can be compressed down to 64x64x20. By reducing the number of input maps, the authors of Inception were able to stack different layer transformations in parallel, resulting in nets that were simultaneously deep (many layers) and “wide” (many parallel operations).

5How well did this work? The first version of Inception, dubbed “GoogLeNet,” was the 22-layer winner of the ILSVRC 2014 competition I mentioned earlier. Inception v2 and v3 were developed in a second paper a year later, and improved on the original in several ways — most notably by refactoring larger convolutions into consecutive smaller ones that were easier to learn. In v3, for example, the 5×5 convolution was replaced with 2 consecutive 3×3 convolutions.

Inception rapidly became a defining model architecture. The latest version of Inception, v4, even threw in residual connections within each module, creating an Inception-ResNet hybrid. Most importantly, however, Inception demonstrated the power of well-designed “network-in-network” architectures, adding yet another step to the representational power of neural networks.

Fun facts:

  • The original Inception paper literally cites the “we need to go deeper” internet meme as an inspiration for its name. This must be the first time knowyourmeme.com got listed as the first reference of a Google paper.
  • The second Inception paper (with v2 and v3) was released just one day after the original ResNet paper. December 2015 was a good time for deep learning.


Xception stands for “extreme inception.” Rather like our previous two architectures, it reframes the way we look at neural nets — conv nets in particular. And, as the name suggests, it takes the principles of Inception to an extreme.

Here’s the hypothesis: “cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly.”

What does this mean? Well, in a traditional conv net, convolutional layers seek out correlations across both spaceand depth. Let’s take another look at our standard convolutional layer:


In the image above, the filter simultaneously considers a spatial dimension (each 2×2 colored square) and a cross-channel or “depth” dimension (the stack of four squares). At the input layer of an image, this is equivalent to a convolutional filter looking at a 2×2 patch of pixels across all three RGB channels. Here’s the question: is there any reason we need to consider both the image region and the channels at the same time?

In Inception, we began separating the two slightly. We used 1×1 convolutions to project the original input into several separate, smaller input spaces, and from each of those input spaces we used a different type of filter to transform those smaller 3D blocks of data. Xception takes this one step further. Instead of partitioning input data into several compressed chunks, it maps the spatial correlations for each output channel separately, and then performs a 1×1 depthwise convolution to capture cross-channel correlation.


6The author notes that this is essentially equivalent to an existing operation known as a “depthwise separable convolution,” which consists of a depthwise convolution (a spatial convolution performed independently for each channel) followed by a pointwise convolution (a 1×1 convolution across channels). We can think of this as looking for correlations across a 2D space first, followed by looking for correlations across a 1D space. Intuitively, this 2D + 1D mapping is easier to learn than a full 3D mapping.

And it works! Xception slightly outperforms Inception v3 on the ImageNet dataset, and vastly outperforms it on a larger image classification dataset with 17,000 classes. Most importantly, it has the same number of model parameters as Inception, implying a greater computational efficiency. Xception is much newer (it came out in April 2017), but as mentioned above, its architecture is already powering Google’s mobile vision applications through MobileNet.

Fun facts:

  • The author of Xception is also the author of Keras. Francois Chollet is a living god.

Moving forward

That’s it for ResNet, Inception, and Xception! I firmly believe in having a strong intuitive understanding of these networks, because they are becoming ubiquitous in research and industry alike. We can even use them in our own applications with something called transfer learning.

Transfer learning is a technique in machine learning in which we apply knowledge from a source domain (e.g. ImageNet) to a target domain that may have significantly fewer data points. In practice, this generally involves initializing a model with pre-trained weights from ResNet, Inception, etc. and either using it as a feature extractor, or fine-tuning the last few layers on a new dataset. With transfer learning, these models can be re-purposed for any related task we want, from object detection for self-driving vehicles to generating captions for video clips.

To get started with transfer learning, Keras has a wonderful guide to fine-tuning models here. If it sounds interesting to you, check it out — and happy hacking!


Introduction to Autoencoders

Author: Nathan Hubens

Linkedind: https://www.linkedin.com/in/nathan-hubens

Autoencoders (AE) are neural networks that aims to copy their inputs to their outputs. They work by compressing the input into a latent-spacerepresentation, and then reconstructing the output from this representation. This kind of network is composed of two parts :

  1. Encoder: This is the part of the network that compresses the input into a latent-space representation. It can be represented by an encoding function h=f(x).
  2. Decoder: This part aims to reconstruct the input from the latent space representation. It can be represented by a decoding function r=g(h).
1 3

1 3

Architecture of an Autoencoder

The autoencoder as a whole can thus be described by the function g(f(x)) = rwhere you want r as close as the original input x.

Why copying the input to the output ?

If the only purpose of autoencoders was to copy the input to the output, they would be useless. Indeed, we hope that, by training the autoencoder to copy the input to the output, the latent representation will take on useful properties.

This can be achieved by creating constraints on the copying task. One way to obtain useful features from the autoencoder is to constrain h to have smaller dimensions than x, in this case the autoencoder is called undercomplete. By training an undercomplete representation, we force the autoencoder to learn the most salient features of the training data. If the autoencoder is given too much capacity, it can learn to perform the copying task without extracting any useful information about the distribution of the data. This can also occur if the dimension of the latent representation is the same as the input, and in the overcomplete case, where the dimension of the latent representation is greater than the input. In these cases, even a linear encoder and linear decoder can learn to copy the input to the output without learning anything useful about the data distribution. Ideally, one could train any architecture of autoencoder successfully, choosing the code dimension and the capacity of the encoder and decoder based on the complexity of distribution to be modeled.

What are autoencoders used for ?

Today data denoising and dimensionality reduction for data visualization are considered as two main interesting practical applications of autoencoders. With appropriate dimensionality and sparsity constraints, autoencoders can learn data projections that are more interesting than PCA or other basic techniques.

Autoencoders are learned automatically from data examples. It means that it is easy to train specialized instances of the algorithm that will perform well on a specific type of input and that it does not require any new engineering, only the appropriate training data.

However, autoencoders will do a poor job for image compression. As the autoencoder is trained on a given set of data, it will achieve reasonable compression results on data similar to the training set used but will be poor general-purpose image compressors. Compression techniques like JPEG will do vastly better.

Autoencoders are trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but are also trained to make the new representation have various nice properties. Different kinds of autoencoders aim to achieve different kinds of properties. We will focus on four types on autoencoders.

Types of autoencoder :

In this article, the four following types of autoencoders will be described:

  1. Vanilla autoencoder
  2. Multilayer autoencoder
  3. Convolutional autoencoder
  4. Regularized autoencoder

In order to illustrate the different types of autoencoder, an example of each has been created, using the Keras framework and the MNIST dataset. The code for each type of autoencoder is available on my GitHub.

Vanilla autoencoder

In its simplest form, the autoencoder is a three layers net, i.e. a neural net with one hidden layer. The input and output are the same, and we learn how to reconstruct the input, for example using the adam optimizer and the mean squared error loss function.


input_size = 784
hidden_size = 64
output_size = 784

x = Input(shape=(input_size,))

# Encoder
h = Dense(hidden_size, activation='relu')(x)

# Decoder
r = Dense(output_size, activation='sigmoid')(h)

autoencoder = Model(input=x, output=r)
autoencoder.compile(optimizer='adam', loss='mse')

Here, we see that we have an undercomplete autoencoder as the hidden layer dimension (64) is smaller than the input (784). This constraint will impose our neural net to learn a compressed representation of data.


Multilayer autoencoder

If one hidden layer is not enough, we can obviously extend the autoencoder to more hidden layers.


input_size = 784
hidden_size = 128
code_size = 64

x = Input(shape=(input_size,))

# Encoder
hidden_1 = Dense(hidden_size, activation='relu')(x)
h = Dense(code_size, activation='relu')(hidden_1)

# Decoder
hidden_2 = Dense(hidden_size, activation='relu')(h)
r = Dense(input_size, activation='sigmoid')(hidden_2)

autoencoder = Model(input=x, output=r)
autoencoder.compile(optimizer='adam', loss='mse')

Now our implementation uses 3 hidden layers instead of just one. Any of the hidden layers can be picked as the feature representation but we will make the network symmetrical and use the middle-most layer.


Convolutional autoencoder

We may also ask ourselves: can autoencoders be used with Convolutions instead of Fully-connected layers ?

The answer is yes and the principle is the same, but using images (3D vectors) instead of flattened 1D vectors. The input image is downsampled to give a latent representation of smaller dimensions and force the autoencoder to learn a compressed version of the images.


x = Input(shape=(28, 28,1)) 

# Encoder
conv1_1 = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
pool1 = MaxPooling2D((2, 2), padding='same')(conv1_1)
conv1_2 = Conv2D(8, (3, 3), activation='relu', padding='same')(pool1)
pool2 = MaxPooling2D((2, 2), padding='same')(conv1_2)
conv1_3 = Conv2D(8, (3, 3), activation='relu', padding='same')(pool2)
h = MaxPooling2D((2, 2), padding='same')(conv1_3)

# Decoder
conv2_1 = Conv2D(8, (3, 3), activation='relu', padding='same')(h)
up1 = UpSampling2D((2, 2))(conv2_1)
conv2_2 = Conv2D(8, (3, 3), activation='relu', padding='same')(up1)
up2 = UpSampling2D((2, 2))(conv2_2)
conv2_3 = Conv2D(16, (3, 3), activation='relu')(up2)
up3 = UpSampling2D((2, 2))(conv2_3)
r = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(up3)

autoencoder = Model(input=x, output=r)
autoencoder.compile(optimizer='adam', loss='mse')

Regularized autoencoder

There are other ways we can constraint the reconstruction of an autoencoder than to impose a hidden layer of smaller dimension than the input. Rather than limiting the model capacity by keeping the encoder and decoder shallow and the code size small, regularized autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. In practice, we usually find two types of regularized autoencoder: the sparse autoencoder and the denoising autoencoder.

Sparse autoencoder : Sparse autoencoders are typically used to learn features for another task such as classification. An autoencoder that has been regularized to be sparse must respond to unique statistical features of the dataset it has been trained on, rather than simply acting as an identity function. In this way, training to perform the copying task with a sparsity penalty can yield a model that has learned useful features as a byproduct.

Another way we can constraint the reconstruction of autoencoder is to impose a constraint in its loss. We could, for example, add a reguralization term in the loss function. Doing this will make our autoencoder learn sparse representation of data.


input_size = 784
hidden_size = 64
output_size = 784

x = Input(shape=(input_size,))

# Encoder
h = Dense(hidden_size, activation='relu', activity_regularizer=regularizers.l1(10e-5))(x)

# Decoder
r = Dense(output_size, activation='sigmoid')(h)

autoencoder = Model(input=x, output=r)
autoencoder.compile(optimizer='adam', loss='mse')

Notice in our hidden layer, we added an l1 activity regularizer, that will apply a penalty to the loss function during the optimization phase. As a result, the representation is now sparser compared to the vanilla autoencoder.


Denoising autoencoder : Rather than adding a penalty to the loss function, we can obtain an autoencoder that learns something useful by changing the reconstruction error term of the loss function. This can be done by adding some noise of the input image and make the autoencoder learn to remove it. By this means, the encoder will extract the most important features and learn a robuster representation of the data.

x = Input(shape=(28, 28, 1))

# Encoder
conv1_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
pool1 = MaxPooling2D((2, 2), padding='same')(conv1_1)
conv1_2 = Conv2D(32, (3, 3), activation='relu', padding='same')(pool1)
h = MaxPooling2D((2, 2), padding='same')(conv1_2)

# Decoder
conv2_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(h)
up1 = UpSampling2D((2, 2))(conv2_1)
conv2_2 = Conv2D(32, (3, 3), activation='relu', padding='same')(up1)
up2 = UpSampling2D((2, 2))(conv2_2)
r = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(up2)

autoencoder = Model(input=x, output=r)
autoencoder.compile(optimizer='adam', loss='mse')


In this article, we went through the basic architecture of autoencoders. We also looked at many different types of autoencoders: vanilla, multilayer, convolutional and regularized. Each has different properties depending on the imposed constraints : either the reduced dimension of the hidden layers or another kind of penalty.

Using Deep Learning to improve FIFA 18 graphics

Author: Chintan Trivedi





Comparison of Cristiano Ronaldo’s face, with the left one from FIFA 18 and the right one generated by a Deep Neural Network.

Game Studios spend millions of dollars and thousands of development hours designing game graphics in trying to make them look as close to reality as possible. While the graphics have looked amazingly realistic in the last few years, it is still easy to distinguish them from the real world. However, with the massive advancements made in the field of image processing using Deep Neural Networks, is it time we can leverage that to improve the graphics while simultaneously also reducing the efforts required to create them?

Let us try to answer that using the game FIFA 18…

Football (i.e. soccer) being my favorite sport, FIFA becomes the natural game of choice for all of my deep learning experiments. To find out whether the recent developments in deep learning can help me answer my question, I tried to focus on improving the player faces in FIFA using the (in?)famous deepfakesalgorithm. It is a Deep Neural Network that can be trained to learn and generate extremely realistic human faces. My focus in this project lies on recreating the player faces from within the game and improving them to make them look exactly like the actual players.

Note: Here is a great explanation of how the deepfakes algorithm works. Tl;dr version: it can swap the face of anyone in a video with anybody else’s face using Autoencoders and Convolutional Neural Networks.

Gathering training data

Unlike the game developers, I could collect all required data from Google search without having to trouble Ronaldo with any motion-capture fancy dress.

Let us start by looking at one of the best designed faces in FIFA 18, that of Cristiano Ronaldo, and see if we can improve it. To gather the data required for the deepfakes algorithm, I simply recorded the player’s face from the instant replay option in the game. Now, we want to replace this face with the actual face of Ronaldo. For this, I downloaded a bunch of images from Google such that the images clearly show his face from different angles. That’s all that is needed to get us started with the training process of our model.

Model architecture & Training

The deepfakes algorithm involves training of deep neural networks called autoencoders. These networks are used for unsupervised learning and have an encoder that can encode an input to a compact representation called the “encoding”, and a decoder that can use this encoding to reconstruct the original input. This architecture forces the network to learn the underlying distribution of the input rather than simply parroting back the input. For images as our input, we use a convolutional net as our encoder and a deconvolutional net as our decoder. This architecture is trained to minimize the reconstruction error for unsupervised learning.

For our case, we train two autoencoder networks simultaneously. One network learns to recreate face of Ronaldo from FIFA 18 graphics. The other network learns to recreate the face from actual pictures of Ronaldo. In deepfakes, both networks share the same encoder but are trained with different decoders. Thus, we now have two networks that have learnt how Ronaldo looks like in the game and in real life.

2.jpg 5

2.jpg 5

  1. First autoencoder network learning from FIFA graphics

3 2

Second autoencoder network from learning actual pictures

When training using a pre-trained model on other faces, the total loss goes down from around 0.06 to 0.02 within 4 hours on a GTX 1070. In my case, I continued training on top of the original CageNet model that has been trained to generate Nicolas Cage’s face.

Using the trained models to swap faces

Now comes the fun part. The algorithm is able to swap faces by adopting a clever trick. The second autoencoder network is actually fed with the input of the first one. This way, the shared encoder is able to get the encoding from FIFA face, but the decoder reconstructs the real face using this encoding. Voila, this setup just converted the face from FIFA to the actual face of Ronaldo.

4 2

The second network converting FIFA face to real face of Ronaldo


The GIF below shows a quick preview of results from running this algorithm on faces of other players. I think the improvement is astonishing, but maybe I am biased, so you be the judge.


5 1


6What if you could play “The Journey” mode of the game as yourself instead of playing as Alex Hunter? All you got to do is upload a minute long video of yourself and download the trained model in a few hours. There you go, you may now play the entire Journey mode as yourself. Now that’d be some next level of immersive gaming!

Where it excels and where it needs more work

The biggest advantage I feel we get with this approach is the amazing life-like faces and graphics that are hard to distinguish from the real world. All of this can be achieved with only a few hours of training, compared to years taken by game designers with the current approach. This means game publishers can come out with new titles much faster rather than spending decades in development. This also means that the studios can save millions of dollars that could now be put into hiring decent story-writers.

The glaring limitation so far is that these faces have been generated post facto, like CGI in movies, while games requires them to be generated in real time. However, one big difference is that this approach does not require any human intervention for generating results once a model has been trained, and the only thing holding it back is the computation time required in generating the output image. I believe it is not going to be very long before we have light weight, not-too-deep generative models that can run very fast without compromising output quality, just like we now have YOLO and SSD MobileNets for real-time object detection, something that wasn’t possible with previous models like RCNNs.


If someone like me, who has no experience in graphics designing, can come up with improved faces within just a few hours, I truly believe that if game developers were to invest heavily in this direction it could change the face of gaming industry (yes, intended) in the not-too-distant future. Now if only anyone from EA sports was reading this…


Building a Deep Neural Network to play FIFA 18

Author: Chintan Trivedi

Linkedin: https://www.linkedin.com/in/chintan-trivedi-78665774/


A.I. bots in gaming are usually built by hand-coding a bunch of rules that impart game-intelligence. For the most part, this approach does a fairly good job of making the bot imitate human-like behavior. However, for most games it is still easy to tell apart a bot from an actual human playing. If we want to make these bots behave more human-like, would it help to not build them using hand-coded rules? What if we simply let the bot figure out the game by making it learn from looking at how humans play?

Exploring this would require a game where it is possible to collect such data of humans playing the game ahead of developing the game itself. FIFA is one such game that let me explore this. Being able to play the game and record my in-game actions and decisions allowed me to train an end-to-end Deep Learning based bot without having to hard-code a single rule of the game.

The code for this project along with the trained model can be found here:https://github.com/ChintanTrivedi/DeepGamingAI_FIFA.git

Mechanism for playing the game

The underlying mechanism to build such a bot needs to work without having access to any of the game’s internal code. Good thing then that the premise of this bot says we do not want to look at any such in-game information. A simple screenshot of the game window is all that is needed to feed into the bot’s game engine. It processes this visual information and outputs the action it wants to take which gets communicated to the game using a key-press simulation. Rinse and repeat.

4 1

4 1

Now that we have a framework in place to feed input to the bot and to let its output control the game, we come to the interesting part: learning game intelligence. This is done in two steps by (1) using convolution neural network for understanding the screenshot image and (2) using long short term memory networks to decide appropriate action based on the understanding of the image.

STEP 1: Training Convolution Neural Network (CNN)

CNNs are well known for their ability to detect objects in an image with high accuracy. Add to that fast GPUs and intelligent network architectures and we have a CNN model that can run in real time.

5 2

5 2

For making our bot understand the image it is given as input, I use an extremely light weight and fast CNN called MobileNet. The feature map extracted from this network represents a high level understanding of the image, like where the players and other objects of interest are located on the screen. This feature map is then used with a Single-Shot Multi-Box to detect the players on the pitch along with the ball and the goal.



STEP 2: Training Long Short Term Memory Networks (LSTM)




Now that we have an understanding of the image, we could go ahead and decide what move we want to make. However, we don’t want to look at just one frame and take action. We’d rather look at a short sequence of these images. This is where LSTMs come into picture as they are well known for being able to model temporal sequences in data. Consecutive frames are used as time steps in our sequence, and a feature map is extracted for each frame using the CNN model. These are then fed into two LSTM networks simultaneously.

The first LSTM performs the task of learning what movement the player needs to make. Thus, it’s a multi-class classification model. The second LSTM gets the same input and has to decide what action to take out of cross, through, pass and shoot: another multi-class classification model. The outputs from these two classification problems are then converted to key presses to control the actions in the game.

These networks have been trained on data collected by manually playing the game and recording the input image and the target key press. One of the few instances where gathering labelled data does not feel like a chore!

Evaluating the bot’s performance

I don’t know what accuracy measure to use in order to judge the bot’s performance, other than to let it just go out there and play the game. Based on only 400 minutes of training, the bot has learned to make runs towards the opponent’s goal, make forward passes and take shots when it detects the goal. In the beginner mode of FIFA 18, it has already scored 4 goals in about 6 games, 1 more than Paul Pogba has in the 17/18 season as of time of writing.

Video clips of the bot playing against the inbuilt bot can be found on my YouTube channel, with the video embedded below.


My initial impressions on this approach of building game bots are certainly positive. With limited training, the bot has already picked up on basic rules of the game: making movements towards the goal and putting the ball in the back of the net. I believe it can get very close to human level performance with many more hours of training data, something that would be easy for the game developer to collect. Moreover, extending the model training to learn from real world footage of matches played would enable the game developers to make the bot’s behavior much more natural and realistic. Now if only anyone from EA sports was reading this…