## A Tutorial on Linear Algebra and Geometry (Part 2)

Author: Paolo Caressa

Points, vectors and their algebra

So far we dealt with points identified with pairs (or triples, for space) of real numbers: as far as Machine Learning is concerned, we are interested in finite sets of points, which represent particular objects to classify or phenomena to correlate as points in a Cartesian space. However, to find some regular behaviour or pattern in those points, often we are interested in lines, planes and also more complex geometric figures, such as conics or quadrics.

We have already seen, by an example, that some geometric figures may be represented by single equations or systems of equations. Let us restrict ourselves to lines and planes to understand in general how to represent them, since those representations are often used in Machine Learning.

A line is, intuitively speaking, a set of collinear points: one of Euclid’s axioms claims that given two distinct points there is exactly one line passing through them, so that two of its points suffice to identify a line uniquely.

To understand how to do that in the Cartesian plane, let us consider two points P ≠ Q, whose coordinates are P = (x1, y1) and Q = (x2, y2). Let us define the vector from P to Q as the pair of real numbers

(x2 – x1, y2 – y1)

Thus a vector is by definition the difference between two points, where by “difference” we mean the coordinate-wise difference.

From the IT point of view, we may imagine a point as an array (or list) with two elements (which are numbers). Usually an array A = [x, y] is indexed starting from 0, thus A[0] = x and A[1] = y. Then, if A and B are arrays, the vector from A to B may be represented by the array [B[0] – A[0], B[1] – A[1]].
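This representation can be sketched in a few lines of plain Python (the function name `vector` is ours, for illustration):

```python
# A point as a list of two coordinates; a vector as the
# coordinate-wise difference of two points.
def vector(A, B):
    """Return the vector from point A to point B."""
    return [B[0] - A[0], B[1] - A[1]]

P = [1, 2]
Q = [4, 6]
v = vector(P, Q)  # [3, 4]
```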

Why do we call a difference of points a “vector”? Because it is an object which has a direction, a sense and a magnitude, exactly like the vectors used in Physics.

Its direction is just the line passing through P and Q; its sense is from P toward Q (the opposite vector would be the one which runs from Q toward P: what are its coordinates?).

The magnitude is the distance between the points P and Q. Therefore a vector determines a line and a sense on it: of course there are infinitely many vectors sharing the same direction (the same line).

For example, the line passing through two points P and Q is determined by the vector from P to Q, but also by any non-zero multiple of it, which has the same direction but a different magnitude.

In general, for each number a ≠ 0, the vector av induces the same line as v. We say that two vectors are parallel if one is a multiple of the other by a non-zero scale factor.

Indeed, another way to determine a line is to pick one of its points P and to provide a vector v, so as to express all the line’s points as P translated by vectors parallel to v:

X = P + tv

This is called the parametric equation of the line, since it expresses the generic point X on the line via a parameter t which varies through all real numbers.

Notice that we used the operation of sum between a point and a vector, defined as follows: if P = (x, y) and v = (v1, v2), then P + v = (x + v1, y + v2).
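The parametric equation and the point-plus-vector sum can be sketched together in Python (function names are ours, for illustration):

```python
# Sum of a point and a vector: coordinate-wise addition.
def translate(P, v):
    return [P[0] + v[0], P[1] + v[1]]

# Generic point X = P + t*v of the line through P with direction v.
def line_point(P, v, t):
    return translate(P, [t * v[0], t * v[1]])

# Points of the line through (1, 1) with direction (2, 3):
X0 = line_point([1, 1], [2, 3], 0)   # the point P itself
X1 = line_point([1, 1], [2, 3], 1)   # P translated by v
```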

Although vectors are useful to describe lines (but also planes and, in higher dimensions, “hyperplanes”), they are interesting in themselves because of their algebraic properties, such as:

1. Vectors may be added to get new vectors, according to the rule (u1, u2) + (v1, v2) = (u1 + v1, u2 + v2).
2. Vectors may be multiplied by a number, according to the rule a(v1, v2) = (av1, av2).

By a vector space we mean a set of elements equipped with two operations, a sum between elements and a multiplication by a number, which satisfy the previous properties shared by vectors in the Cartesian plane defined as differences of points.

Actually, in Machine Learning, one always uses finite-dimensional vector spaces whose vectors are expressed in coordinates, thus vectors are identified with n-tuples of numbers (their coordinates). This is a major element of confusion for beginners since both points and vectors in the Cartesian space are represented as a pair of numbers.

However, even if they are represented in the same way, points and vectors are conceptually different objects. A point identifies a single location in space, while a vector identifies a displacement along a certain direction, with a given sense and magnitude. Points and vectors may be put in one-to-one correspondence as follows: to a point P we associate the vector which starts from the origin O and ends in P; to a vector v, instead, we associate the point O + v.

This distinction between points and vectors is often overlooked but it is nevertheless important, also because it helps in understanding applications of the theory: for example, in the following picture some words of a document corpus are represented, mapped by means of a Machine Learning algorithm to points in the plane:

These are points representing words. What about vectors in this case?

Let us take the vector which displaces “pizza” to “Italy” and let us apply it to the point “sushi”: we get the point “Japan” (up to a certain approximation). We infer that this vector represents the answer to a question: “given a national dish, which is the corresponding nation?”.

In this case, it is clear that points and vectors represent distinct concepts.
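The word-vector idea above can be sketched with numpy on toy coordinates (the numbers below are made up for illustration; real word embeddings would come from a trained model):

```python
import numpy as np

# Toy 2D "embeddings": apply the displacement pizza -> Italy
# to sushi and check it lands near Japan.
words = {
    "pizza": np.array([1.0, 1.0]),
    "Italy": np.array([3.0, 4.0]),
    "sushi": np.array([6.0, 1.5]),
    "Japan": np.array([8.0, 4.5]),
}

v = words["Italy"] - words["pizza"]   # the "dish -> nation" vector
prediction = words["sushi"] + v       # point obtained by applying v to sushi
error = np.linalg.norm(prediction - words["Japan"])
```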

By applying the “arccosine” function, it is possible to compute the angle between two vectors starting from their scalar product and from their lengths: the geometrical meaning of this quantity is explained via the trigonometric interpretation of the cosine function.

Recall that the cosine is a number greater than or equal to –1 and less than or equal to 1, such that:

• If the two vectors have the same direction and the same sense then the cosine is equal to 1.
• If the two vectors have the same direction but opposite sense then the cosine is equal to -1.
• If the cosine is equal to zero, the two vectors are said to be orthogonal.

Therefore, while the distance measures the nearness of two points (the nearer they are, the closer their distance is to zero), the cosine measures the similarity between the directions and senses of two vectors: the more the two vectors are aligned along the same line, the nearer the absolute value of their cosine is to one.

This cosine similarity is often employed in Machine Learning to classify objects in vector spaces.
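A minimal sketch of this computation with numpy (the function name `cosine` is ours):

```python
import numpy as np

# Cosine of the angle between two vectors, from dot product and lengths;
# np.arccos recovers the angle itself.
def cosine(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

c = cosine([1, 0], [0, 1])                             # orthogonal -> 0
angle = np.degrees(np.arccos(cosine([1, 0], [1, 1])))  # 45 degrees
```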

The N-dimensional Cartesian space

So far in this crash course on Cartesian geometry and linear algebra we restricted ourselves to the cases of dimension N = 2 and N = 3, so as to be able to draw pictures and to develop the concepts in a somewhat familiar environment: but in real life applications the dimension of vector spaces may be quite high.

In those cases we have to give up geometric intuition, although all the formulas and concepts developed so far generalize trivially to higher dimensions.

Indeed such formulas depend on sums and subtractions of point coordinates and vector components: whether those coordinates are 2 or 20,000 makes no real difference. Better still, all the theory is easily and efficiently implemented on a computer, which may deal with points and vectors in spaces of high dimension without problems, since they are represented as arrays of numbers.

For example, let us consider the concept of a hyperplane: in the plane, this coincides with the concept of line and in the space with the concept of a plane.

Let us fix a dimension N and let us consider the Cartesian space RN.

A line in this space is, as in dimension 2, determined by a pair of points, or by a point and a vector: the parametric equation is just the same as in the dimension 2 case.

If N > 2 we may also consider parametric equations of the following form:

In this case, we have two parameters which vary independently, so that, intuitively, the set of points X which satisfy this equation, when a and b vary over all real numbers, corresponds to the set of pairs (a,b), thus to a plane. In particular, it is a bidimensional object.

Actually this is not always true: for example, if the two vectors appearing in the equation are parallel, then the parametric equation collapses to one with a single direction vector, which actually describes a line.

Therefore, if we write a parametric equation with many parameters, the dimension of the set of points described by this equation depends on the relations between the vectors which appear in the parametric equation itself. In the previous case, if the two vectors are parallel then the equation represents a line and not a plane.

If, in general, in the N-dimensional space we write a parametric equation with N – 1 parameters (whose vectors satisfy no parallelism relations), we get a hyperplane.

Matrices and their algebra

A major feature of linear algebra is the efficiency and universality of its numerical methods. Actually, it suffices to implement a single algorithm (or one of its variants), namely Gaussian elimination, to be able to do practically everything in an effective way (solving equations, checking linear independence, etc.). These algorithms are available in every standard numerical computation library, such as Python’s numpy.linalg.

To close this tutorial (already far too long) it is worth introducing the key notion which is involved in each of those algorithms, and which is also crucial in the conceptual development of linear algebra: the concept of a matrix.

A matrix is just a table of numbers, and it may be considered as a bidimensional array. More formally, an n×m matrix is a table of numbers which are singly addressed by means of two indexes, i and j, where the first one addresses the row and the second one addresses the column. At the crossing of row i and column j there is the number aij pointed to by those indexes (notice that in mathematics indexes usually run from 1 on, not from 0 as in computer science).

The tabular representation of an n×m matrix is as follows:

a11 a12 … a1m
a21 a22 … a2m
… … … …
an1 an2 … anm

A matrix such that n = m is said to be a square matrix.

In practice, a matrix is just a vector of length nm whose elements are displayed by rows rather than as a single sequence. However, this change of notation is fundamental to the way those objects are used.

In particular, matrices enrich vector algebra with a new operation, a multiplication. In the first place let us notice that we may add matrices and multiply them by a single number to get again matrices of the same type: the entries of A + B are given by Aij + Bij, and the entries of cA by cAij.

Therefore, n×m matrices do form a vector space of dimension nm.

Recall that a vector may be multiplied by a number, to get a new vector and that two vectors may be multiplied to get a number (via the dot product). But we do not know how to multiply, say, a vector by itself to get a new vector.

If we write vectors as matrices we can actually multiply them: indeed we may distinguish two kinds of vectors when written in matrix form, row vectors and column vectors. A row vector is a sequence of numbers written in a row; we have already met them, for example (1, 2, 0).

A column vector is a vector written from top to bottom, as in

1
2
0

At first sight this is just a matter of notation, but actually, if we interpret a vector as a particular kind of matrix, a row vector is a 1×N matrix, while a column vector is an N×1 matrix.

Now, given an n×m matrix A and an m×r matrix B, we may multiply A by B to get a new n×r matrix. The entry of the matrix AB addressed by indexes i and j is defined as:

(AB)ij = Ai1 B1j + Ai2 B2j + … + Aim Bmj

Notice that this is the dot product of the row vector given by the i-th row in A by the column vector given by the j-th column in B.

Example: let us multiply a 2×3 matrix by a 3×2 matrix:

Now we come back to vectors: we may multiply a row vector times a column vector to get a 1×1 matrix (which is just a number) and this is the dot product. But we can also multiply an N×1 column vector times a 1×N row vector as matrices and get an N×N matrix, as in:

However, in this way we multiply two vectors belonging to N-dimensional spaces and we get an element of a vector space of a different dimension: either 1 or N² (the dimension of the space of N×N matrices).
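Both products are immediate with numpy (a minimal sketch; the entries are ours, for illustration):

```python
import numpy as np

# A row vector (1×N) times a column vector (N×1) gives a 1×1 matrix
# (the dot product); a column times a row gives an N×N matrix.
row = np.array([[1, 2, 0]])        # 1×3
col = np.array([[3], [1], [2]])    # 3×1

dot = row @ col       # 1×1 matrix: [[1*3 + 2*1 + 0*2]] = [[5]]
outer = col @ row     # 3×3 matrix
```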

The identity matrix is the square matrix whose entries are zero except for the ones on the diagonal, which are 1 (the diagonal elements of a matrix are the ones whose row index equals the column index). For example, the 3×3 identity matrix is

1 0 0
0 1 0
0 0 1

As the name suggests, on multiplying a matrix A by the identity matrix we still get A. Moreover, the matrix product is both associative and distributive with respect to the matrix sum.

However, matrix algebra displays a particular and interesting feature: the matrix product is not commutative in general. Thus AB is generally different from BA (indeed BA may well be meaningless: for example, if A is n×m and B is m×r, the product BA is defined only when r = n).

For example:
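Non-commutativity is easy to check with numpy (the two matrices below are ours, chosen for illustration):

```python
import numpy as np

# Matrix product is associative and distributive, but not commutative:
# with these two square matrices AB differs from BA.
A = np.array([[1, 2], [0, 1]])
B = np.array([[0, 1], [1, 0]])
I = np.eye(2)

AB = A @ B   # [[2, 1], [1, 0]]
BA = B @ A   # [[0, 1], [1, 2]]
```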

Another typical operation is the multiplication of an n×m matrix times a column vector with m components: the result is a column vector with n components.
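This matrix-times-vector operation can be sketched as follows (entries are ours, for illustration):

```python
import numpy as np

# An n×m matrix times a column vector with m components gives
# a column vector with n components.
A = np.array([[1, 2, 0],
              [3, 1, 4]])        # 2×3
x = np.array([[1], [0], [2]])    # 3×1 column vector

y = A @ x                        # 2×1 column vector
```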

## Introduction to Dimensionality Reduction and Linear Algebra basics (Part 2)

Author: Matteo Alberti

Contents

• Dimensionality Reduction in a linear space
• Through the identification of a subspace
• Reduction through matrix approximations
• Basic case: Decomposition in Singular Values (SVD)
• Matricial Norms
• Vectorial Norms
• Induced Norms
• Schatten Norms
• Frobenius Norms
• Base Cases: Cluster Analysis
• Definition of a Metric
• Minkowski Distance (Manhattan, Euclidean, Lagrange)

# Matricial Norms

At this point we have set the problem of dimensionality reduction as an approximation problem between matrices: we must now evaluate, and then calculate, the distance between the matrix of the original data and that of the approximated data through the study of the different norms:

There are three main families of norms:

• Vector norms
• Induced norms
• Schatten norms

In practice we shall essentially refer, exceptions aside, to the Frobenius norm (Euclidean distance).

Elements of algebra:

Norm

A norm (commonly denoted by ‖ ∙ ‖) is a function from a vector space of matrices to the real numbers such that:

• ‖X‖ ≥ 0, and ‖X‖ = 0 if and only if X = 0
• ‖aX‖ = |a| ‖X‖ for every scalar a
• ‖X + Y‖ ≤ ‖X‖ + ‖Y‖ (triangle inequality)

## Vectorial Norms

The vector norms family treats an n×k matrix X as a vector of nk components, on which we can define the norm of order p:

‖X‖p = (Σij |xij|^p)^(1/p)

Note:

Setting p = 2 we recover the Euclidean norm.
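With numpy, treating the matrix as a flat vector of its entries (a sketch; the example matrix is ours):

```python
import numpy as np

# Vector p-norms of a matrix treated as a flat vector of nk components;
# p = 2 gives the Euclidean (Frobenius) norm.
X = np.array([[3.0, 0.0], [0.0, 4.0]])
v = X.ravel()                         # the matrix as a vector

norm_1 = np.linalg.norm(v, 1)         # sum of absolute values: 7
norm_2 = np.linalg.norm(v, 2)         # Euclidean norm: 5
norm_inf = np.linalg.norm(v, np.inf)  # largest absolute value: 4
```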

## Induced Norms

An n×k matrix X can be seen as a linear operator from R^k to R^n.

Measuring lengths in R^k with a fixed norm, and doing the same in R^n with a possibly different norm, we can measure how much X lengthens or shortens a vector v, comparing the norm of v with the norm of its image Xv.

The induced norm of X is defined as:

‖X‖ = sup {‖Xv‖ / ‖v‖ : v ≠ 0}

## Schatten Norms

The Schatten norm of order p of a matrix X is simply given by:

‖X‖p = (Σi σi^p)^(1/p)

where σi are the singular values of X.

## Frobenius Norms

The Frobenius norm of an n×k matrix X is given by:

‖X‖F = √(Tr(X^T X))

Explicating the matrix product we obtain:

‖X‖F = √(Σij xij²)

That is, the Frobenius norm is the square root of the sum of the squares of all the entries of X. Seen as a vector norm, it coincides with the vector norm of X of order 2, i.e. the Euclidean norm.

Elements of algebra:

Trace

The trace operator, indicated by Tr (∙), is defined as the sum of the diagonal elements of the argument matrix
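The identity between the Frobenius norm and the trace form is easy to check numerically (a sketch; the matrix is ours):

```python
import numpy as np

# The squared Frobenius norm equals the trace of X^T X
# (the sum of the squares of all entries).
X = np.array([[1.0, 2.0], [3.0, 4.0]])

frob = np.linalg.norm(X, 'fro')           # sqrt(1 + 4 + 9 + 16)
trace_form = np.sqrt(np.trace(X.T @ X))   # same value via the trace
```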

# Base Cases: Cluster Analysis

Cluster analysis is a multivariate analysis technique through which it is possible to group statistical units so as to minimize the “logical distance” within each group and to maximize the one between groups.

It is among the unsupervised learning techniques.

It is therefore natural to define what is meant by logical distance, and with respect to which metric.

## Definition of a Metric

A metric (or distance) d is a function satisfying, for all points x, y, z:

• d(x, y) ≥ 0 (non-negativity)
• d(x, y) = 0 if and only if x = y (identity)
• d(x, y) = d(y, x) (symmetry)
• d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)

If it enjoys only the first three properties, we can define it as an index of distance.

## Minkowski Distance (Manhattan, Euclidean, Lagrange)

At this point we are going to analyze the main cases of distances belonging to the Minkowski distances family, where:

d(x, y) = (Σi |xi – yi|^p)^(1/p)

We highlight the following cases:

• p = 1: Manhattan distance
• p = 2: Euclidean distance
• p → ∞: Lagrangian distance (Čebyšëv)


Therefore, returning to the example of Cluster Analysis, it is essential to define the type of distance with which we want to carry out our analysis.

The packages already implemented mainly provide the three variants of the Minkowski distance (for quantitative variables).

Importing from sklearn:

AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward')

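A minimal runnable sketch on made-up data (the toy coordinates are ours; ward linkage uses the Euclidean distance by default, so the `affinity` argument can be omitted in recent sklearn versions):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated toy groups of points in the plane.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

model = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = model.fit_predict(X)   # one cluster label per row of X
```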

## Introduction to Dimensionality Reduction and Linear Algebra basics (Part 1)

Author: Matteo Alberti


# Dimensionality Reduction in a linear space

The aim of this first tutorial is to introduce the basic notions of dimensionality reduction from the mathematical point of view (spaces, subspaces, linear maps) and to recall the elements of linear algebra (norms, isometries, isomorphisms) needed by every machine learning algorithm.

Therefore, we are going to give a gentle introduction to what a linear reduction is. First of all, we need to identify the concept of “optimal” subspaces.

We have a set of n k-dimensional vectors (n statistical units on k covariates) where we want to identify a p-dimensional subspace of R^k and a linear map from R^k to that subspace, such that the relative images sufficiently approximate our k-dimensional input vectors (with reference to some metric we will define).

We rewrite as:

Elements of algebra:

Vector space

We define a vector space on R a set V, whose elements are called vectors , with the following properties:

There exists in V a vector addition operation that associates with each pair of vectors x, y the vector x + y

• Vector addition is commutative and associative
• There is a vector in V, denoted by 0 and called the origin, which is neutral with respect to addition
• Each vector x has its opposite, indicated with –x, such that x + (–x) = 0

There exists in V a multiplication operation by scalars, which associates with each a in R and each x in V the vector ax, so that:

• Multiplication between scalars is associative

It also applies:

• Scalar multiplication is distributive with respect to vector addition

Vector subspace

S is a non-empty subset of V which is itself a subspace: for each pair of its vectors, every linear combination of them still belongs to S.

Kernel and Image

Let V and W be two vector spaces and let L : V → W be a linear map.

The kernel of L is the set of V vectors whose image is the null vector of W.

This set is indicated with ker L.

The image of L is the set of the vectors of W which are images of some vector belonging to the domain V, that is, the set of vectors w = L(v) for some v in V.

Linear Map

A linear map f : V → W between real vector spaces is a function with the following properties:

f(x + y) = f(x) + f(y) and f(ax) = a f(x)

for every x, y in V and every a in R.

We are going to define only the cases of main relevance:

Let f : V → W be a linear map; then f is said to be:

• a monomorphism if f is one-to-one (injective)
• an epimorphism if f is onto (surjective)
• an isomorphism if f is bijective (injective + surjective)
• an endomorphism if V = W
• an automorphism if V = W and f is bijective
• of rank r if r = dim f(V)

Suppose at this point that we have identified a sufficiently “approximating” subspace (how, will be specified later), and that we have a base of it (note: the base is formed by k-dimensional vectors, since it is a subspace of R^k).

The linear map associates to each input vector an element of the subspace, expressed as a linear combination of the base vectors with coefficients chosen appropriately.

The approximation error will then be measured by an appropriate vector norm of the difference between each input vector and its image.

Once our base is fixed, our vectors can be recovered directly from their coordinates.

Given the map that associates with a generic vector its vector of coordinates, and composing it with the previous map, we obtain a map that associates with each input vector the coordinates of its image:

Through our map the k-dimensional input vectors are represented as p-dimensional vectors, i.e. as elements of R^p.

At this point, we can immediately proceed with further analysis in R^p.

So let us investigate what this reduction in dimensionality keeps of the original geometry (and what it loses).

Going from k to p dimensions with p < k involves a loss of information and a deformation of the original geometry of the data. Also, we have to remember that such a linear map cannot, in general, be an isomorphism or an isometry.

This means that norms, scalar products and distances are, in general, not preserved.

Example: let S be a p-dimensional subspace of R^k and let S⊥ be the orthogonal complement of S.

Each vector can be decomposed into the sum of a vector in S and a vector in S⊥.

So, given two vectors, their scalar product splits into the scalar product of their components in S plus the scalar product of their components in S⊥; projecting onto S drops the second term.

As can be seen, the scalar product is not preserved.

Elements of algebra:

orthogonal complement

Let S be a subspace of V; we define the orthogonal complement of S in V, indicated with S⊥, as the subset of V defined by:

S⊥ = {v in V : v ∙ s = 0 for every s in S}

That is, the subset of all the vectors of V orthogonal to all the vectors of S.

When the linear map between the subspace and R^p is an isomorphism, there will be no loss of information.

We consider the Euclidean product and suppose that the base is orthonormal; let h and w be two vectors of the subspace with relative coordinate vectors a and b.

Then the following condition holds:

h ∙ w = a ∙ b

It means that the scalar product between h and w is identical to the scalar product of the respective coordinate vectors (note that the two scalar products are defined in different spaces, although they are written in the same way).

If the base were not orthonormal, the cross terms would not reduce to 0.

In the first case, the scalar product of the coordinate vectors is given by a^T b, while in the second case it would be given by a^T G b, with G the Gram matrix of the base. It follows that matrices built in the second way and associated with different bases differ from one another (requiring the computation of scalar products, norms and induced distances to be readjusted).

Elements of algebra:

orthonormal:

A base is defined as orthonormal when it is composed of pairwise orthogonal vectors of unitary norm.

## Reduction through matrix approximations

We want to give a second view of dimensionality reduction, based on matrix approximation (which is what is used in practice in most programming languages).

Given:

Then we can rewrite it as:

The rows of θ are given by linear combinations of the rows of B, that is, of our base, with the coefficients given by the rows of A, the coordinates on the chosen base.

Our dimensionality reduction problem, therefore, corresponds to identifying a vector subspace of dimension p (p < k), a base for it (the matrix B) and the relative coordinates (the matrix A).

In data analysis, where our starting point is the data matrix, the different reduction techniques differ according to the type of approximation, decomposition and choice among the many possible bases.

## Basic case: Decomposition in Singular Values (SVD)

Let us implement in Python a simple decomposition in singular values (SVD); that is, let us factor our starting matrix X into the two matrices A and B seen previously:

import numpy as np

X = np.array([[3, 2, 4], [3, 6, 7], [2, 1, 4]])

U, s, Vt = np.linalg.svd(X)
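A hedged sketch of how the SVD factors yield the A·B factorization and a rank-p approximation (we use np.linalg.svd, since the singular values, not the eigenvalues, are what the decomposition requires; the variable names are ours):

```python
import numpy as np

# SVD factors X as U * diag(s) * Vt; keeping the p largest singular
# values gives a rank-p approximation: A holds the coordinates,
# B the chosen base, as in the A*B factorization described above.
X = np.array([[3.0, 2.0, 4.0], [3.0, 6.0, 7.0], [2.0, 1.0, 4.0]])
U, s, Vt = np.linalg.svd(X)

p = 2
A = U[:, :p] * s[:p]      # coordinates on the chosen base
B = Vt[:p, :]             # base of the p-dimensional subspace
X_approx = A @ B          # rank-p approximation of X
error = np.linalg.norm(X - X_approx, 'fro')
```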

## A Tutorial on Linear Algebra and Geometry (Part 1)

Author: Paolo Caressa

Foreword

In machine learning and deep learning, as in any other interdisciplinary area, concepts and formalisms coming from different sources and fields are used, and often they require different mindsets to be actually understood and properly managed.

For this reason, one says usually that a data scientist should be a hybrid between a computer scientist, a statistician and a mathematician, but few people are actually well-versed in all of those three subjects, and it is unlikely that all those fields will be familiar, easy and intuitive for the same mind.

In particular, among the different mathematical notions which are truly needed by a data scientist, as by anyone seriously interested in understanding how a machine learning system works (or rather why it actually works), we find linear algebra.

This is quite surprising for many mathematicians, who consider linear algebra so easy as to be useful only as a first approximation. But nevertheless, as I.N. Herstein wrote, linear algebra is just easy, useful and beautiful!

The aim of this tutorial is to explain, in a simple but precise way, the basic notions of linear algebra: thus, if you, reader, already know how to compute eigenvalues, or how to perform a singular value decomposition, or how to compute a principal component analysis, this tutorial is not for you. Otherwise, enjoy reading!

The Cartesian plane and space

Let us start from the very beginning, and dig into our high school remembrances about Cartesian (or analytic) geometry.

This approach to geometry was devised in the XVIIth century by Descartes and Fermat, and the basic idea is to provide a numerical model both for the plane and space. Essentially, their discovery consists in identifying the points in the plane and in the space with respectively pairs and triples of numbers, so that the plane is one-to-one mapped into the set of pairs (x,y) of numbers where x and y vary independently and in all possible ways.

To begin with, let us recall that the points of the line correspond to the set of real numbers (thus rational and irrational numbers) totally ordered by the relation <.

Indeed, let us fix a point O on the line and let it correspond to 0, next let us fix a point U (different from O) on the line and let it correspond to 1. All other points are determined uniquely by their positions with respect to those two points:

In particular, it is clear what we mean by the distance between two points P and Q on the line: if x and y are the real numbers corresponding to P and Q, the distance PQ is just |x – y| (absolute value of the difference):

Therefore, on a line, once a point corresponding to 0 and a point corresponding to 1 are fixed, we also get a direction for positive numbers: indeed O divides the line into two sides, and positive numbers are the ones on the side of U, while negative ones are on the opposite side.

Hence:

The line of Euclidean geometry corresponds one-to-one to the set R of real numbers

To each point there corresponds one and only one number, where we may think of a real number as either a decimal number with finitely many digits after the decimal point, possibly repeating indefinitely, or a decimal number with infinitely many non-periodic digits after the decimal point (to be precise, we should exclude a periodic repetition of the digit 9).

Next, let us come to the plane.

Fix a point O in the plane and take the line x passing through it, and also a second line y, perpendicular to x and passing through O. On both x and y we choose the same measure unit (thus we fix a point U different from O and state that the distance OU is 1).

Therefore we may measure the distance between two points on the same line x, as we did before.

The correspondence between a point P and a pair (a,b) of numbers is defined as follows:

1. We project P on the x line, getting a point P’, and we put a = distance between P’ and O.
2. We project P on the y line, getting a point P’’, and we put b = distance between P’’ and O.

In this way we uniquely associate P to the pair (a,b): to come back from coordinates (a,b) to points, it suffices to find the points P’ and P’’ on the lines x and y which have distances a and b from the origin O; we then draw the perpendiculars to the axes x and y through P’ and P’’ and get P as the intersection point of these two perpendiculars.

Hence:

The plane of Euclidean geometry corresponds one-to-one to the set R2 of pairs of real numbers: to each point there corresponds one and only one pair.

The notation R2 denotes the ordered pairs of elements of R. In general, if X is a set, Xn is the set of n-tuples of elements of  X.

Which advantages does it bring to identify the plane with the pairs of numbers (a,b)?

• In the first place, we know what they are: we no longer need axioms to describe what points, lines, etc. are.
• Next, we transform geometric problems, involving complex constructions, into algebraic problems, which may be solved by computing.

Thus, Cartesian geometry is the ideal geometry for a computer, since we may reduce logical deduction and cumbersome constructions to numbers and computations. It is not by chance that the 2D libraries which are used to draw on a computer screen, or on the window of an application, use coordinates to represent points (plus some attributes for colours, etc.).

For example let us ask: how is a curve represented in the Cartesian plane?

There are two ways:

• By means of a Cartesian equation, thus a single algebraic relation between the coordinates of any point belonging to the curve.
• By means of a parametric equation, which describes the algebraic expression of a generic point in the curve, depending on a real parameter.

For example consider the equation x2+y2 = 1 and the equation 2x+y = 0. The former represents a circumference centred in the origin and with radius 1, the latter a line passing through the origin.

Suppose we want to compute their intersections: on using Euclidean geometry we should somehow find that intersection by means of a geometric construction, while according to the Cartesian geometry approach we are actually asking: which are the points whose coordinates (x,y) simultaneously satisfy the following relations: x2+y2 = 1 and 2x+y = 0?

The latter equation implies y = –2x, which substituted in the former equation provides x2+(–2x)2 = 1, thus x2+4x2 = 1 hence x2 = 1/5, which has two solutions x = 1/√5 and x = –1/√5. Therefore we’ve found the two required intersection points: (1/√5, –2/√5) and (–1/√5, 2/√5).
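This computation is easy to verify numerically; a minimal check in Python:

```python
import numpy as np

# Check that the two points found above lie both on the circle
# x^2 + y^2 = 1 and on the line 2x + y = 0.
points = [(1/np.sqrt(5), -2/np.sqrt(5)), (-1/np.sqrt(5), 2/np.sqrt(5))]

on_circle = [abs(x**2 + y**2 - 1) < 1e-12 for x, y in points]
on_line = [abs(2*x + y) < 1e-12 for x, y in points]
```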

One could proceed in this way toward higher dimensions: thus, Cartesian space is the set of triples (x,y,z) of real numbers, thus the set R3.

Therefore, a point in the space is uniquely determined by its three Cartesian coordinates, which we may plug into equations to solve geometric problems after we translated them into algebraic ones.

The distance between two points

We have said that along a line distances are computed by the absolute value of the difference. How are distances in the plane computed?

Interestingly, there are different ways to do that, which are all generalizations of the distance on a line: let us see three such distances, which are all particular cases of the so-called Minkowski distances:

• Euclidean distance
• Manhattan (or taxi-cab) distance
• Uniform distance.

Euclidean distance between two points P = (x1,y1) and Q = (x2,y2) in the plane is computed by the following formula:

d(P,Q) = √((x1 – x2)² + (y1 – y2)²)

Notice that, if both P and Q belong to the x-axis, this distance reduces to the absolute difference of coordinates: indeed, on supposing that P = (x1,0) and Q = (x2,0), the previous formula becomes √((x1 – x2)²) = |x1 – x2| (as usual, when we take the square root of a number we consider the positive one).

Euclidean distance should be actually called “Pythagorean distance”, since it may be inferred after Pythagoras’ theorem: indeed, let us consider a right triangle OPQ with a vertex in the origin, one cathetus on the x-axis and the other one parallel to the y-axis.

Therefore the coordinates of P will be (x,0) and the coordinates of Q will be (x,y): the distance between O and P is x and the distance between Q and P is y, so that, by Pythagoras’ theorem, d(O,Q)² = d(O,P)² + d(P,Q)² = x² + y². Hence in this case, on taking the square root:

d(O,Q) = √(x² + y²)

and the formula is proven. In the general case the reasoning is similar, but coordinates of O (which in our case are 0) have also to be taken into account.

Thus: the Euclidean distance is just the length of the line segment joining the two points, and it has a simple geometric description.

Distance among points in space is measured in the same way, just by adding one more coordinate to points: if P = (x1,y1,z1) and Q = (x2,y2,z2) are points in the Cartesian space then their Euclidean distance is:

d(P,Q) = √((x2 – x1)² + (y2 – y1)² + (z2 – z1)²)

For the sake of completeness we explain two more ways to compute distances in the plane, whose outcomes are different from Euclidean distance but which nevertheless are useful to measure when two points are near (a nice theorem which goes beyond the scope of this tutorial claims that as far as the “nearness” relation is concerned, all those distances are equivalent).

The first one is a distance common in Machine Learning, namely the “Manhattan” distance or “taxi-cab” distance:

d(P,Q) = |x2 – x1| + |y2 – y1|

Notice that, if both P and Q lie on the x-axis, this distance coincides with the absolute value of the difference of their coordinates. The “taxi-cab” name stems from the fact that this distance does not take the diagonal shortcut granted by Pythagoras’ theorem, but goes from P to Q along the shortest path made up of horizontal and vertical segments.


This is like the path followed by a taxi to reach the opposite point of a building block: the car cannot pass through the block but has to follow its border.

Another distance which may prove to be useful is the “uniform distance”, defined as

d(P,Q) = max(|x2 – x1|, |y2 – y1|)

Thus this distance is computed by taking the maximum of the lengths of the projections of the segment PQ on the x- and y-axes: again, if both points lie on the x-axis this distance equals the absolute value of the difference of the points’ coordinates.
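All three distances are easy to compute in code; here is a minimal Python sketch (the function names are ours, chosen for illustration):

```python
import math

def euclidean(p, q):
    """Minkowski distance with exponent 2: the length of the segment PQ."""
    return math.sqrt((q[0] - p[0])**2 + (q[1] - p[1])**2)

def manhattan(p, q):
    """Minkowski distance with exponent 1: horizontal plus vertical steps."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

def uniform(p, q):
    """Limit case of the Minkowski distances: the larger projection."""
    return max(abs(q[0] - p[0]), abs(q[1] - p[1]))

P, Q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(P, Q))  # 5.0
print(manhattan(P, Q))  # 7.0
print(uniform(P, Q))    # 4.0
```

Note that on two points sharing the y-coordinate all three functions return the same value, the absolute difference of the x-coordinates, exactly as stated above.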

Given a distance, an (open) ball is the set of points whose distance from a fixed point O, the center, is less than a fixed value r, its radius: thus, if O is the center and r the radius, the ball contains all points P such that d(P,O) < r. This terminology is self-apparent in the case of the Euclidean distance: as an exercise, try to figure out which subsets of the Cartesian plane the balls of the taxi-cab distance are.

## Derivatives – An intuition tutorial

Author: Davide Coppola

Although sometimes overlooked, math is a fundamental part of machine learning (ML) and deep learning (DL). Indeed, it is the basis on which both disciplines stand: without notions of algebra or calculus they could not exist. A key factor in ML, coming from calculus, is the notion of derivative. But you should not be scared by this concept; it is much easier than you may think!

First of all, let us define a function: it can be thought of as a black box (Fig. 1): a number n of input values, or independent variables, enters the box; they are processed in a specific way, determined by the equation(s) describing the function, and finally m new output values, or dependent variables, exit the box.

Fig. 1: Any function can be seen as a black box where independent variables enter and dependent variables exit with new values.

For the rest of this tutorial, we will focus on unidimensional functions, i.e. functions that have only one input and one output. Common examples of this kind of function are:

1. y = mx + q
2. y = ax² + bx + c
3. y = ln(x)

Where m, q, a, b and c are just numerical coefficients; think of them as any fixed numbers. Equation 1 is the equation of a straight line, 2 describes a parabola and 3 is the natural logarithm function. As you can see, they all have an independent variable (x) and a dependent variable (y): a function describes the relation that stands between the two variables, and thus determines the curve’s “shape” in space.
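These three examples translate directly into Python; the coefficient values below are arbitrary choices for illustration (the text fixes none of them):

```python
import math

# Arbitrary illustrative coefficients.
m, q = 2.0, 1.0
a, b, c = 1.0, 0.0, -1.0

def line(x):
    return m * x + q             # 1: straight line

def parabola(x):
    return a * x**2 + b * x + c  # 2: parabola

def natural_log(x):
    return math.log(x)           # 3: natural logarithm

print(line(3.0))            # 7.0
print(parabola(3.0))        # 8.0
print(natural_log(math.e))  # close to 1
```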

If a function already describes our curve, then why do we need the derivative?

Generally speaking, functions are usually not as straightforward as the examples given above, and it might be impossible or impractical to try out all the possible values of the independent variable to understand their behavior. The derivative of a function gives us additional information about the curve we are studying.

What is a derivative then? The derivative of a function f(x) is another function f'(x), derived from the original, that describes the variability of f(x), i.e. how the rate of change of the function behaves with respect to the independent variable. The derivative, evaluated at a point x0, describes how the function is changing in a neighborhood of x0. For example, if the derivative is positive, we can expect the points following x0 to have higher values of f(x): this means that the function grows as x increases. Likewise, if the derivative at x0 is negative, the value of the function decreases as x increases. Thus, the derivative at a given point indicates the slope of the line tangent to the curve at that point.

The slope defines the ratio between the height and the horizontal length, for example, of an inclined plane or a right triangle. You surely have experience of this concept from road signs (Fig. 3). In general, the slope is given by the equation

slope = Δy / Δx

The rigorous definition of the derivative is, in fact, the limit of the incremental ratio:

f'(x) = lim (h → 0) [f(x + h) – f(x)] / h

This ratio describes the slope of a secant to the curve passing through the points (x, f(x)) and (x + h, f(x + h)). In fact, the numerator f(x + h) – f(x) can be seen as the height of an inclined plane, whose horizontal length is h. The limit tells us that h should be a number infinitely close to zero, meaning that the distance between the two points becomes practically non-existent. As a matter of fact, what was once a secant becomes a tangent to the curve, as can be seen in the animation in Fig. 4.

Fig. 4: As the distance between the two points becomes zero, the points overlap and the secant line becomes a tangent to the curve.
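This limiting process can also be mimicked numerically: as the increment h shrinks, the slope of the secant approaches the slope of the tangent. A minimal Python sketch using the example f(x) = x² at x = 1 (our choice; the exact derivative there is 2):

```python
def f(x):
    return x**2

x = 1.0
for h in (1.0, 0.1, 0.01, 0.001):
    # Incremental ratio: slope of the secant through (x, f(x)) and (x+h, f(x+h)).
    secant_slope = (f(x + h) - f(x)) / h
    print(h, secant_slope)  # the slope tends to 2 as h shrinks
```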

Bear in mind that there are some particular cases where the derivative cannot be defined at one or more points of the function: the function has to be continuous at that point, although continuity alone is not sufficient for the derivative to exist.

Before looking at a simple example let us revise the key concepts; the derivative…

• … represents the variability of the primitive function with respect to the independent variable;
• … of a function is a function;
• … evaluated at any given point, represents the slope of the tangent to the curve at that point.

Fig. 5: A parabola and its derivative. The green and the blue straight lines are the tangents to the curve at the points x = –2 and x = +2, respectively.

In the example (Fig. 5) we have the graphs of a function (f) and its derivative (f’): the former is a parabola, whereas the latter is a straight line. Functions and their derivatives are usually represented with their graphs one above the other; this is because the independent variable is the same, and this disposition makes it easier to understand their relation.

Looking at x < 0, you can see that the derivative is positive, which means that the primitive function grows with x, i.e. the slope of any tangent line to f for x < 0 is positive. However, the value of the derivative decreases at a constant rate, meaning that the growth rate of f decreases as well. Consequently, the tangent lines tend more and more to a horizontal line.

This extreme situation occurs at x = 0, which corresponds to the apex of the parabola and to the point where the derivative is 0. Points where the derivative equals 0 are called critical points or stationary points. They play a crucial role in calculus, and in machine learning as well, because they correspond to the maxima, the minima and the saddle points of a function. Many machine learning algorithms revolve around the search for the minima of a function, which is why it is important to have some understanding of derivatives and their meaning.

For x > 0, the derivative is negative and its absolute value keeps growing. This means that the primitive function decreases in value as x grows, and that its rate of decrease grows as well at each step. As a matter of fact, this is exactly what happens to the parabola.
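To see why critical points matter in practice, here is a minimal gradient-descent sketch (our illustrative example, not the parabola of Fig. 5): it finds the minimum of the hypothetical function f(x) = (x – 3)², whose derivative f'(x) = 2(x – 3) vanishes exactly at the minimum x = 3, by repeatedly stepping against the derivative.

```python
def df(x):
    # Derivative of f(x) = (x - 3)^2.
    return 2 * (x - 3)

x = 0.0   # starting guess
lr = 0.1  # learning rate (step size)
for _ in range(100):
    x -= lr * df(x)  # move downhill, against the slope

print(round(x, 6))  # 3.0, the point where the derivative vanishes
```

This is, in miniature, the idea behind many machine learning training loops: follow the derivative downhill until it is (nearly) zero.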

The aim of this intuition tutorial was to give you a general understanding of how a derivative works and what it means, without using too many equations. Of course, a more in-depth and rigorous analysis of this topic is necessary if you want to fully understand the more complex matters that arise in machine learning. But don’t be scared: it is not that complicated after all!

Fig. 2, 3 and 4 are taken from Wikipedia.