Long-term memory neural network 1 – Introduction

Author: Daniele D’armiento

Cognitive skills, such as prediction, reckoning, ability to answer questions and to undertake actions, all involve retrieval of previously stored information.

The current challenges in the development of Artificial Intelligence lie in being able both to store large amounts of data in memory and to retrieve them quickly.

But it is no news that computers are able to store a huge amount of data (today we estimate that all the world's data amounts to much more than 1 zettabyte, that is, more than 10^21 bytes), and there is also nothing surprising about the existence of large databases and SQL queries for every need.

Moreover, the human brain does not have a stable memory like the silicon one, but thanks to this deficiency, our lack of stability and reliability, we can intuitively process big data and retrieve information. In this way we can overcome the so-called “curse of dimensionality”.

No research has yet unveiled the mysteries of the human brain, but the Deep Learning breakthrough brings us closer to a finer description of what intelligence is.

It has produced a model (which originates from biological neural networks) that is able to learn different signals encoded in images and sounds, to classify them and to build inner representations, in order to organize large amounts of data and quickly recover information distributed among all the nodes of the network (as opposed to the old style, which stored data at precise memory addresses). This is completely automatic, without the need for any sequential instructions or algorithms.


In the late 1980s, well before Deep Learning arrived, computers were far slower than they are today, and low speed means long processing times. This is obvious but not trivial, since nobody would start experiments and simulations that could take such a long time. Yet the key to unveiling that world already existed.

The learning algorithm is, today as then, backpropagation which, together with gradient descent, allows us to find a better approximation of the network weights, reaching a minimum error on the training data. It requires many update steps and many data samples from which to learn. All this amounts to a massive computation and a large time consumption.

This is an example of the error hypersurface in the parameter space of training: the SGD algorithm searches for the parameters with the minimum error.
 Source: https://thoughtsahead.com/2017/01/27/machine-learning-series-introduction-to-machine-learning-linear-regression-and-gradient-descent/
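The update loop described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a toy linear model with two parameters and a mean squared error; the data, the learning rate and the function name `gradient_descent` are illustrative choices, and the loop uses the full batch at every step (SGD would use random mini-batches instead).

```python
import numpy as np

def gradient_descent(x, y, lr=0.1, steps=200):
    """Fit y ≈ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = w * x + b - y
        # Gradients of the mean squared error with respect to w and b
        grad_w = 2.0 * np.mean(error * x)
        grad_b = 2.0 * np.mean(error)
        # Move the parameters a small step against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy training data generated from the line y = 3x + 1
x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x + 1.0
w, b = gradient_descent(x, y)
print(w, b)  # both values should end up close to 3 and 1
```

Even on this tiny problem hundreds of update steps are needed, which gives an idea of why training large networks was out of reach for slow hardware.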

For this reason, old models had to be lightweight and could not exploit the big data necessary to obtain better performance, so it was the custom of that period to build shallow architectures with fewer parameters and with one or two layers of neurons.

However, a more complex model is necessary to achieve a higher level of abstraction. It also needs to store data in complex structures that are able to preserve a huge number of observed characteristics and, at the same time, able to generalize, which means recognizing those features in objects never observed before.

So, we need to store more data not as a simple repetition, but rather as the “Eureka!” light bulb which occurs when we comprehend a new unifying scheme that describes many sides of the same entity, sides that previously seemed uncorrelated.

Left: a “Shallow” neural network, a one hidden layer network.  Right: a “Deep” neural network, with many hidden layers.  Source: https://www.quora.com/What-is-the-difference-between-deep-and-shallow-neural-networks


If a Deep Learning model is able to recognize objects in images, to understand words spoken by humans, or to answer written questions in a way that makes sense, then it means that the model can grasp the meaning without retrieving it from a database. It synthesizes a concept on its own, the same way we do.

This is possible thanks to the “Deep” structure, which enables the model to store more information than the former pre-2010 models.

GoogLeNet: a neural network used by Google for image recognition.
Source: https://research.googleblog.com/2017/05/using-machine-learning-to-explore.html

Regarding linguistic models for translation, NLP and NLU, as well as conversational models, a big step forward was possible thanks to Deep Learning.

Performance can be measured, but intuition alone is often enough to tell whether a certain model gives a good translation or makes a decent chatbot assisting with a service.

When it does well, the cause of this intelligent behaviour is something different from a lookup table or a simple, quick algorithm; it needs a long-term memory to link words and phrases distant in time and semantic space. N-gram statistics will certainly fail to do that.

A neural network with Episodic memory: Dynamic Memory Network (DMN)

Source: https://yerevann.github.io/2016/02/05/implementing-dynamic-memory-networks/


There are many types of neural networks for language models, ranging from Recurrent Neural Network (RNN) to Convolutional Neural Network (CNN).

We are going to show a new promising model: the Dynamic Memory Network (DMN).

This model is trained on elements that each consist of several input phrases, a question and an answer, and its strength lies in the so-called Episodic Memory, which can perform multi-step processing of the phrases, from which it builds a context and extracts the required information.








Capsule Neural Networks

Author: Matteo Alberti



CNNs perform very well when the validation set is very close to the training data, but if the images are rotated, tilted or subject to any other kind of affine transformation, then CNNs perform poorly. Until now, the main remedy has been to add more layers (to improve generalization), where the first layers extract simple features like small curves and edges, while higher layers combine simple features into complex ones.


To carry out this computation in a reasonable time we use a max-pooling operator after each convolutional layer. However, this operator loses the positional data, creating a sort of positional invariance.



So the recent advance of Capsule Networks is to achieve not invariance (a small tolerance to changes) but equivariance (the representation changes accordingly when the input is transformed), replacing the max-pooling operator with a new one: dynamic routing.





A capsule is a group of neurons. The activity vector of a capsule represents the instantiation parameters when an object (or a part of it) is detected by the network.

The length of this activity vector represents the probability of existence, while the orientation of the vector encodes all the Pose-Matrix information (rotation, translation).



The Pose Matrix draws inspiration from rendering (computer graphics), that is, the construction of an image starting from a hierarchical representation of geometric data. In this case the goal is to build a sort of inverse rendering:


The image is deconstructed on the basis of a hierarchical representation, matching it with the extracted features.


When a capsule at a lower level is active, it makes predictions for higher-level capsules via transformation matrices. If multiple predictions agree, a higher-level capsule becomes active. This result is achieved by replacing max-pooling with an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule.






The simple Capsule Network architecture is very shallow, with only two convolutional layers and one fully-connected layer.


The first convolutional layer has 256 kernels (9×9, stride equal to one) with a ReLU activation. We need it to perform a spatial reduction and to convert pixel intensities into the activities of local feature detectors, which are used as inputs to the primary capsules.

Primary capsules are the lowest level of multidimensional entities and correspond to the inverse-rendering process.

The second layer, the primary capsules, is a convolutional one: 32 channels of 8-dimensional capsules, where each primary capsule has 8 convolutional units (9×9 kernels with stride equal to 2). Each capsule corresponds to a feature. The primary capsules give a 32×6×6 output of 8D vectors, where the capsules inside each 6×6 grid share their weights.

The final layer, called DigitCaps, produces a 16D capsule for each target class (like a softmax layer, but with neurons replaced by capsules).
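The layer sizes quoted above can be checked with a little arithmetic. The sketch below assumes a 28×28 input image (the MNIST size) and convolutions without padding; both are assumptions made for illustration, and `conv_output_size` is a helper name introduced here.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Side length of the output of an unpadded ('valid') convolution."""
    return (input_size - kernel_size) // stride + 1

# First convolutional layer: 9x9 kernels, stride 1, on an assumed 28x28 image
conv1_side = conv_output_size(28, kernel_size=9, stride=1)

# Primary capsules: 9x9 kernels, stride 2, applied to the first layer's output
primary_side = conv_output_size(conv1_side, kernel_size=9, stride=2)

# 32 channels of capsules on a primary_side x primary_side grid
n_primary_capsules = 32 * primary_side * primary_side

print(conv1_side, primary_side, n_primary_capsules)  # 20 6 1152
```

Under these assumptions the 6×6 grid of the primary capsules follows directly from the kernel sizes and strides given in the text.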


So the main differences are:


  • Neurons are replaced with capsules (and so scalars with vectors)
  • A new activation function (able to work with vectors)
  • Max-pooling is replaced with dynamic routing




Squash Function


As we have seen, we want the length of an activity vector to represent the probability of existence; therefore we use a non-linear function called “squashing” that performs a sort of normalization between zero and one.


Vectors near zero are shrunk to almost zero, while long ones are shrunk to a length slightly below one.


The squashing function keeps the orientation of the extracted features.
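As a sketch of this behaviour, the squashing function from the original Capsule Networks paper (Sabour, Frosst and Hinton, 2017) rescales a vector s to v = (‖s‖² / (1 + ‖s‖²)) · (s / ‖s‖). The NumPy version below is a minimal illustration; the function name and the small `eps` added for numerical safety are choices made here.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink the length of s into [0, 1) while keeping its direction."""
    squared_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    scale = squared_norm / (1.0 + squared_norm)
    # eps avoids a division by zero for the all-zero vector
    return scale * s / np.sqrt(squared_norm + eps)

short = squash(np.array([0.01, 0.0]))
long = squash(np.array([100.0, 0.0]))
tilted = squash(np.array([3.0, 4.0]))

print(np.linalg.norm(short))  # almost zero
print(np.linalg.norm(long))   # slightly below one
print(tilted / np.linalg.norm(tilted))  # same direction as [3, 4]
```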





Dynamic Routing


Dynamic Routing is not just a smarter max-pooling operator that preserves spatial information; it also gives us the opportunity to preserve the hierarchy of the parts. It is an alternative way to do the forward pass.


Dynamic Routing is an iterative process that connects capsules with similar activities.



In every iteration, each capsule computes a matrix multiplication with the outputs of the capsules, keeping only those with a higher norm and a similar orientation.






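So $s_j$ is given by: $s_j = \sum_i c_{ij} \hat{u}_{j|i}$, with $\hat{u}_{j|i} = W_{ij} u_i$

where $c_{ij}$ is a non-negative scalar. For each capsule $i$ at the lower level, the sum of all $c_{ij}$ is equal to one ($\sum_j c_{ij} = 1$), and for each capsule the number of weights is equal to the number of capsules at the upper level.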


So with dynamic routing we first connect each capsule at one level with all the capsules at the upper layer, like a fully-connected (FC) layer; then, through the iterative process, we create a routing between capsules, like a sort of intelligent dropout, keeping only the most likely connections.
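The iterative process can be sketched in NumPy as follows. This is a simplified illustration of routing-by-agreement as described in the original Capsule Networks paper, not a full implementation: the predictions `u_hat` are random stand-ins for the outputs of the transformation matrices, and all names and sizes are assumptions made for the example.

```python
import numpy as np

def squash(s, eps=1e-9):
    squared_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (squared_norm / (1.0 + squared_norm)) * s / np.sqrt(squared_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, n_iterations=3):
    """u_hat has shape (n_lower, n_upper, dim): one prediction per capsule pair."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))           # routing logits, start fully connected
    for _ in range(n_iterations):
        c = softmax(b, axis=1)                 # couplings of each lower capsule sum to one
        s = np.sum(c[:, :, None] * u_hat, axis=0)   # weighted sum per upper capsule
        v = squash(s)                          # upper-level activity vectors
        # Agreement: scalar product between each prediction and the upper output
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))             # 6 lower capsules, 3 upper capsules, 4D
v, c = dynamic_routing(u_hat)
print(v.shape, c.sum(axis=1))
```

Predictions that agree with an upper capsule's output get a larger coupling at the next iteration, which is how the initially uniform connections become selective.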




Training of weights and Loss Function


When the forward pass is completed and fixed, we train our weights.

We have two neural networks for this step:


  1. The first network, during the backpropagation step, maximizes the norm of the capsule of the target class ($L^i_\mu$)
  2. Another network uses an MLP (like an autoencoder) to reconstruct the input image ($L^i_p$)


So the goal of these two networks is to minimize the following function:

Loss Function: $L^i = L^i_\mu + \rho L^i_p$, where the reconstruction term is weighted by a small coefficient $\rho$ ($\rho = 0.005$)




Reconstruction and Regularization


As we have seen in the training of the weights, we use an MLP as a secondary network:

During the training step, we mask out everything except the activity vector of the correct digit capsule (the predicted class); then, since the activity vector contains all the information from the Pose Matrix, we reconstruct the input image.



The output of the digit capsule is fed into a decoder (MLP) that models the pixel intensities; then we minimize the sum of squared differences between the decoder output and the pixel intensities.
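The masking and the reconstruction term can be sketched as follows. This is a toy illustration only: the decoder is replaced by a single random linear map (a hypothetical stand-in for the MLP), the sizes assume 10 classes of 16D capsules and a 28×28 image, and the value used for the first loss term is an arbitrary placeholder.

```python
import numpy as np

def mask_capsules(digit_caps, target):
    """Zero every capsule except the one of the target class, then flatten."""
    mask = np.zeros_like(digit_caps)
    mask[target] = 1.0
    return (digit_caps * mask).ravel()

def reconstruction_loss(reconstruction, image):
    """Sum of squared differences between decoder output and pixel intensities."""
    return float(np.sum((reconstruction - image.ravel()) ** 2))

def total_loss(l_mu, l_p, rho=0.005):
    """Combined loss: first term plus the down-weighted reconstruction term."""
    return l_mu + rho * l_p

rng = np.random.default_rng(0)
digit_caps = rng.normal(size=(10, 16))          # 10 classes, 16D capsules
image = rng.random((28, 28))                    # stand-in input image
decoder_weights = rng.normal(scale=0.01, size=(160, 784))  # hypothetical decoder

masked = mask_capsules(digit_caps, target=3)
reconstruction = masked @ decoder_weights
loss = total_loss(l_mu=0.1, l_p=reconstruction_loss(reconstruction, image))
print(np.count_nonzero(masked), loss)
```

Only the 16 values of the target capsule reach the decoder, so the reconstruction has to rely entirely on what that capsule encodes.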







Until now Capsule Networks, while achieving state-of-the-art results in image classification, have been tested only on MNIST data. Still, there are some relevant considerations to make by analysing not only the accuracy achieved but also other factors and properties of the model.

We compare Capsule Networks with 2 different combinations of parameters:

  • number of routing iterations
  • Reconstruction (without reconstruction the Loss Function becomes $L^i = L^i_\mu$)

And one optimized convolutional neural network (Baseline)


While Capsule Networks achieve the best results, the most relevant information is given by:


                                   Capsule Networks   CNN (Baseline)
Number of Parameters               8.2M               34.4M
Affine Transformation Robustness   79%                66%


The number of parameters is more than 4 times lower, and the equivariance properties of capsules allow us to achieve the best results under affine transformations, thanks to improved generalization.


Summing up the properties of capsules:


  • They allow a hierarchy of parts and preserve spatial information
  • Robustness to affine transformations
  • They need less data thanks to their generalization properties
  • High discriminatory capacity on overlapping objects


However, Capsule Networks have not yet been tested on other datasets (large datasets like ImageNet), and the routing-by-agreement process slows down the training step.


Derivatives – An intuition tutorial

Author: Davide Coppola



Although sometimes overlooked, math is a fundamental part of machine learning (ML) and deep learning (DL). Indeed, it is the basis on which both disciplines stand: without notions of algebra or calculus they could not exist. A key factor in ML, coming from calculus, is the notion of derivative. But you should not be scared by this concept; it is much easier than you may think!

First of all, let us define a function: it can be thought of as a black box (Fig. 1): a number n of input values or independent variables enters the box; they are processed in a specific way determined by the equation(s) describing the function and finally m new output values or dependent variables exit the box.

Fig. 1: Any function can be seen as a black box where independent variables enter the box and obtain a new value.

For the rest of this tutorial, we will focus on unidimensional functions, i.e. functions that have only one input and one output. Common examples of this kind of function are:

(1) y = mx + q

(2) y = ax^2 + bx + c

(3) y = ln(x)

Where m, q, a, b and c are just numerical coefficients; think of them as any fixed number. Equation 1 is the equation of a straight line, 2 describes a parabola and 3 is the natural logarithm function. As you can see they all have an independent variable (x) and a dependent variable (y): a function describes the relation that stands between the two variables, and thus determines its “shape” in space.

If a function already describes our curve, then why do we need the derivative?

Generally speaking, functions usually are not as straightforward as the examples given above and it might be impossible or impractical to try out all the possible values of the independent variable to understand their behavior. Therefore, the derivative of a function gives additional information on the curve we are studying.



What is a derivative then? The derivative of a function $f(x)$ is another function $f'(x)$, deriving from the original, that describes the variability of $f(x)$, i.e. how the rate of change of the function behaves with respect to the independent variable. The derivative, evaluated at a point $x_0$, describes how the function is changing in a neighborhood of $x_0$. For example, if the derivative is positive we can expect the points following $x_0$ to have higher values of $f(x)$. This means that the function is growing as $x$ increases. Likewise, if the derivative at $x_0$ is negative, the value of the function decreases as $x$ increases. Thus, the derivative at a given point indicates the slope of the line tangent to the curve at that point.


The slope defines the ratio between the height and the horizontal length, for example, of an inclined plane or a right triangle. You surely have experience of this concept from road signs (Fig. 3). In general, the slope is given by the equation

$$\text{slope} = \frac{\Delta y}{\Delta x}$$


Fig. 3: Road sign indicating the slope of the road.

The rigorous definition of a derivative is, in fact, the limit of the incremental ratio:

$$f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}$$

This ratio describes the slope of a secant to the curve passing through the points $(x_0, f(x_0))$ and $(x_0 + h, f(x_0 + h))$. In fact, the numerator $f(x_0 + h) - f(x_0)$ can be seen as the height of an inclined plane, whose horizontal length is $h$. The limit tells us that $h$ should be a number infinitely close to zero, meaning that the distance between the two points becomes practically non-existent. As a matter of fact, what was once a secant becomes a tangent to the curve, as can be seen in the animation in Fig. 4.

Fig. 4: As the distance between the two points becomes zero, the points overlap and the secant line becomes a tangent to the curve.
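The passage from secant to tangent can also be observed numerically. The short sketch below evaluates the incremental ratio for smaller and smaller distances between the two points; the parabola y = x^2 and the point x = 3 are arbitrary choices made for the example.

```python
def difference_quotient(f, x0, h):
    """Slope of the secant through (x0, f(x0)) and (x0 + h, f(x0 + h))."""
    return (f(x0 + h) - f(x0)) / h

def parabola(x):
    return x ** 2

# As h shrinks, the secant slope approaches the tangent slope at x0 = 3
for h in (1.0, 0.1, 0.01, 0.001):
    print(h, difference_quotient(parabola, 3.0, h))
```

For this parabola the ratio equals 6 + h, so the printed slopes get closer and closer to 6, which is the value of the derivative 2x at x = 3.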

Bear in mind that there are some particular cases where the derivative cannot be defined in one or more points of the function; the function has to be continuous in that point, although continuity alone is not sufficient for the derivative to exist.

Before looking at a simple example let us revise the key concepts; the derivative…

  • … represents the variability of the primitive function with respect to the independent variable;
  • … of a function is a function;
  • … evaluated at any given point, represents the slope of the tangent to the curve at that point.


Fig. 5: A parabola and its derivative. The green and the blue straight lines are the tangents to the curve at the points x=-2 and x=+2, respectively.

In the example (Fig. 5) we have the graphs of a function (f) and its derivative (f’): the former is a parabola, whereas the latter is a straight line. Functions and their derivatives are usually represented with their graphs one above the other; this is because the independent variable is the same and this arrangement makes it easier to understand their relation.

Looking at x<0 , you can see that the derivative is positive, which means that the primitive function is growing with x , i.e. the slope of any tangent line to f for x<0 is positive. However, the value of the derivative is decreasing with a constant rate, meaning that the growth of the value of f is decreasing as well. Consequently, the tangent lines are more and more tending to a horizontal line.

This extreme situation occurs for x=0 , which corresponds to the apex of the parabola and to the point where the derivative is 0 . Points that have a derivative equal to 0 are called critical points or stationary points. They play a crucial role in calculus, and in machine learning as well, because they represent the points corresponding to the maxima, the minima and the saddle points of a function. Many machine learning algorithms revolve around the search for the minima of a function, which is why it is important to have a little understanding of derivatives and their meaning.

With x>0 , the derivative is negative and its absolute value keeps growing. This means that the primitive function will decrease in value with x and that its decrease rate will also grow with each step. As a matter of fact, this is exactly what happens to the parabola.
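The three situations discussed for Fig. 5 can be reproduced numerically. The exact parabola of the figure is not given in the text, so the sketch below assumes f(x) = -x^2, a downward-opening parabola with its apex at x = 0; the derivative is approximated with a symmetric incremental ratio.

```python
def f(x):
    return -x ** 2  # assumed parabola, chosen to match the behaviour in the text

def derivative(func, x, h=1e-6):
    """Approximate the derivative with a symmetric incremental ratio."""
    return (func(x + h) - func(x - h)) / (2 * h)

# Positive slope on the left, zero at the apex, negative slope on the right
for x in (-2.0, 0.0, 2.0):
    print(x, round(derivative(f, x), 3))
```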

The aim of this intuition tutorial was to give you a general understanding of how a derivative works and its meaning without using too many equations. Of course, a more in-depth and rigorous analysis of this topic is necessary if you want to fully understand more complex matters that arise in machine learning. But don’t be scared, it is not that complicated after all!

Fig. 2, 3 and 4 are taken from Wikipedia.

Linear Dimensionality Reduction: Principal Component Analysis

Author: Matteo Alberti




Among all the tools for linear dimensionality reduction, PCA, or Principal Component Analysis, is certainly the main tool of Statistical Machine Learning.

Although we very often focus on non-linearity, principal component analysis is the starting point for many analyses (and the core of preprocessing), and knowing it becomes imperative when the linearity conditions are satisfied.

In this tutorial we are going to introduce the extraction of the PCs at the mathematical level, their implementation in Python and, above all, their interpretation.


PCA works by splitting the total variance among a number of components equal to the number of starting variables; it is then possible to reduce that number based on the contribution that each Principal Component makes to the total variance.

We would like to remind you that applying PCA is useful when the starting variables are not independent.


Let us introduce them with the correct mathematical formalism:

Given a set of p quantitative variables X1, X2, ..., Xp (centred or standardised variables), we want to determine a new set of k variables, with k ≤ p, denoted Ys (s = 1, 2, ..., k), that have the following properties:


they are uncorrelated, each one reproduces the largest possible portion of the variance remaining after the construction of the first s-1 components (taken in order), and they have mean equal to zero.

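As a result, the linear combination will be:

$$Y_s = v_{s1} X_1 + v_{s2} X_2 + \dots + v_{sp} X_p = v_s^\top X$$

We must, therefore, find the coefficients $v_s$ that satisfy these constraints. This is a constrained maximization problem, where the first constraint is called Normalization:

$$v_1^\top v_1 = 1$$

Our system becomes:

$$\max_{v_1} \operatorname{Var}(Y_1) \quad \text{subject to} \quad v_1^\top v_1 = 1$$

Where the variance can be written as:

$$\operatorname{Var}(Y_1) = \operatorname{Var}(v_1^\top X) = v_1^\top \Sigma v_1$$

And we can solve it with a Lagrange multiplier:

$$L_1 = v_1^\top \Sigma v_1 - \lambda_1 (v_1^\top v_1 - 1)$$

We calculate the gradient of $L_1$ and set it to zero:

$$\frac{\partial L_1}{\partial v_1} = 2 \Sigma v_1 - 2 \lambda_1 v_1 = 0$$

The system $(\Sigma - \lambda_1 I) v_1 = 0$ admits infinitely many solutions (among which we take those that respect the constraint) when the rank of the coefficient matrix of the system is lowered, that is, when

$$\det(\Sigma - \lambda I) = 0$$

whose solutions are the values λ called eigenvalues of Σ.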


Similarly, for the construction of the second PC (and so for all the others), the Orthogonality Constraint is added to our system. It follows from our request that the PCs be uncorrelated, and is expressed as follows:

$$v_2^\top v_1 = 0$$

Then, by setting up the Lagrange function in p + 2 variables:

$$L_2 = v_2^\top \Sigma v_2 - \lambda_2 (v_2^\top v_2 - 1) - \mu \, v_2^\top v_1$$

From which we obtain the second eigenvalue (Y2) where we remember the following property:

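Principal Component properties:


Each eigenvalue of Σ gives the variance of the respective PC: $\operatorname{Var}(Y_s) = \lambda_s$

Σ is positive semidefinite, so its eigenvalues are non-negative: $\lambda_s \ge 0$

Total Variance: $\operatorname{tr}(\Sigma) = \sum_{s=1}^{p} \lambda_s$

Generalized Variance: $\det(\Sigma) = \prod_{s=1}^{p} \lambda_s$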
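These properties can be verified numerically with NumPy. The sketch below is only an illustration: the data are randomly generated (a hypothetical set of 200 observations of p = 3 correlated variables), and the principal components are obtained from the eigendecomposition of the covariance matrix rather than from a library routine.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 200 observations of p = 3 variables
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing
X = X - X.mean(axis=0)                   # centre the variables

sigma = np.cov(X, rowvar=False)          # covariance matrix Σ
eigvals, eigvecs = np.linalg.eigh(sigma) # eigh: Σ is symmetric
order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Y = X @ eigvecs                          # principal components Y_s

print(np.allclose(Y.var(axis=0, ddof=1), eigvals))       # variance of each PC is λ_s
print(np.allclose(np.trace(sigma), eigvals.sum()))       # total variance
print(np.allclose(np.linalg.det(sigma), eigvals.prod())) # generalized variance
print(np.allclose(np.cov(Y, rowvar=False), np.diag(eigvals)))  # PCs are uncorrelated
```

All four checks should print True, up to floating-point tolerance.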



For the choice of the number k (with k < p) of PCs to keep in the analysis, there is no universally accepted and valid criterion. It is therefore good practice to use several criteria jointly and always keep the needs of the analysis in mind. We present the main ones:



Cumulative percentage of total variance reproduced



Absolute measure: $\lambda_s$. Normalized variance: $\lambda_s / \sum_{j=1}^{p} \lambda_j$. Cumulative percentage: $\sum_{s=1}^{k} \lambda_s / \sum_{j=1}^{p} \lambda_j$

Set a threshold T on the cumulative percentage and keep in the analysis the first k PCs that guarantee the threshold is exceeded.



Scree plot

It plots the eigenvalues against the order number of the component.

The first k PCs are selected based on where the slope flattens. In this specific case, the PCs to be kept in the analysis would be the first two.




Kaiser Criterion

Keep only the PCs whose eigenvalue is greater than 1 (valid only for standardized variables).
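Two of these criteria are easy to express in code. The sketch below is a minimal illustration with hypothetical eigenvalues (they sum to 4, as they would for four standardized variables); the function names and the 80% threshold are assumptions made for the example.

```python
import numpy as np

def n_components_by_threshold(eigvals, threshold=0.8):
    """Smallest k whose cumulative share of total variance reaches the threshold."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

def n_components_by_kaiser(eigvals):
    """Number of eigenvalues greater than 1 (standardized variables only)."""
    return int(np.sum(np.asarray(eigvals, dtype=float) > 1.0))

eigvals = [2.5, 1.2, 0.2, 0.1]   # hypothetical eigenvalues
k_threshold = n_components_by_threshold(eigvals, threshold=0.8)
k_kaiser = n_components_by_kaiser(eigvals)
print(k_threshold, k_kaiser)  # 2 2
```

Here the two criteria agree, but in general they can give different answers, which is why it is worth comparing them.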



Let us now implement it in Python:

We have to import NumPy and the necessary class from scikit-learn

import numpy as np

from sklearn.decomposition import PCA

The class has the following signature:

sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)

We comment on the main parameters:

  • n_components = number of components to keep.
  • svd_solver = selects among the main SVD solver alternatives. Remember that PCA does not support sparse data (for which you will need TruncatedSVD)


We are going to test it on real data, for example on the Wine data, which can be imported through the script:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib
from sklearn import datasets, decomposition

# Load the Wine data (13 features, 3 classes)
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Fit the PCA and project the data onto the principal components
pca = decomposition.PCA(n_components=None)
X = pca.fit_transform(X)

fig = plt.figure(1, figsize=(5, 4))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=30, azim=222)

# Write the name of each class near the centroid of its points
for label, name in enumerate(wine.target_names):
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              bbox=dict(alpha=.5, edgecolor='b', facecolor='w'))

# Scatter plot of the first three principal components, colored by class
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.nipy_spectral)
plt.show()






This will be our result:

Installation of Keras/Tensorflow – Theano on Windows

Authors: Francesco Pugliese & Matteo Testi


In this post, we are going to tackle the tough issue of installing, on Windows, the popular Deep Learning framework “Keras” and its whole backend stack “Tensorflow / Theano”.

Installation starts from the need to download the Python 3 package. Let us choose Miniconda and download it at the following link: https://conda.io/miniconda.html that will show the following screen:



Select Python 3.6 and the operating system version: Windows 64-bit or 32-bit. Run the downloaded package and install it with the default settings. At the end of the installation, accept the system reboot.

Once the PC has rebooted, type cmd.exe in the Windows search box and run the prompt. Then run the script c:\Users\-user-\Miniconda3\Scripts\activate.bat, which will launch the Anaconda prompt (replace -user- with the current account name).

Then type: conda install numpy scipy mkl-service libpython m2w64-toolchain in order to install:

  1. “numpy”, a Python library which is very useful for managing matrices and arrays.
  2. “scipy”, a scientific computing library for Python.
  3. “mkl-service”, an optimization library with vector math routines to speed up mathematical functions and applications.
  4. “libpython”, a library for Python 3 for Machine Learning and effective code development.
  5. “m2w64-toolchain”, which provides a GCC-compatible compiler, so it is strongly recommended.

Other optional libraries are:

  1. “nose” library for programs testing in Python.
  2. “nose-parameterized” for the parametric testing.
  3. “sphinx” library for building stylish program documentation in diverse formats (HTML, PDF, ePub, etc.).
  4. “pydot-ng” interface for the graphic rendering language Graphviz’s Dot.

Once the environment settings are finished, you can install the Cuda drivers from the following link:


This will open the following view with the different operating system options:



Download the local version (recommended) of the installation file and proceed with the installation of the Cuda drivers. Cuda is the set of parallel programming libraries for the Nvidia GPU (Graphics Processing Unit), which is part of the video card. It might be necessary to install the card drivers as well, in case they are not up to date or not working properly.

When the Cuda driver installation finally ends (together with any graphics card drivers), run the Theano installation plus the additional supporting library “libgpuarray”, which is required to handle tensors on the GPU, with the command:

conda install theano pygpu

Theano NOTE 1: In order to install Theano we suggest always using a Cuda version at least one release older than the current one. This is due to the poor maintenance of Theano, which is not updated quickly, leading to compilation errors after installation with the current version of Cuda. For instance, at this time, the most stable version of Theano is 0.9.0, for which we suggest using Cuda 8.0 instead of Cuda 9.0. There may be some tricks online to make Cuda 9 and Theano 0.9 work together, but they turn out to be fiddly and time-consuming, and in the end they are not worth the risk. Our advice is to stick to a stable Cuda-Theano configuration such as the one recommended.

Now you need to install Visual Studio, which provides Theano with the C++ compiler for Windows (the previously installed GCC covers only the C compiler). In order to do this, download Visual Studio Community from the link: https://www.visualstudio.com/it/downloads/ and follow all the required steps, trying to install only the basic components for C++.

Theano NOTE 2: Apparently, after the next release, Theano will be discontinued; Bengio himself explains it in this: link There are multiple reasons for this choice; we believe it is essentially due to the recent massive competition from the other Deep Learning frameworks (mxnet, tensorflow, deeplearning4j, gluon, to name a few), which are better maintained. As we just showed, Theano constantly suffers from update problems on the part of the MILA team. However, we believe Theano is still a milestone for Deep Learning: it was the first to introduce automatic differentiation and a clear and effective parallelization of matrix operations on the GPU, which enabled the spread of deep neural networks on GPUs. Hence we feel the need to give the right prestige to this brilliant framework; after all, it still offers advantages in terms of versatility and speed when used as a Keras backend.

Visual Studio NOTE: Visual Studio is also affected by compatibility problems with Theano. Basically, Visual Studio 2017 will raise an exception during the import of Theano with both Cuda 9 and Cuda 8. Therefore we suggest installing a stable earlier version such as Visual Studio 2013.

Once you have installed Visual Studio you need to fill in .theanorc, which is the Theano configuration file; you can find it at the path: c:\Users\-user-\.theanorc

Fill in .theanorc as follows, assuming that you have decided to install Cuda 8 and Visual Studio 2013:

[global]
device = gpu
floatX = float32

[cuda]
root = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0

[nvcc]
compiler_bindir = C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin

[lib]
cnmem = 0

Let us pay attention to these parameters: “device” in the [global] section defines whether you want to use the CPU or the GPU; “root” in the [cuda] section sets the path of the Cuda libraries; “compiler_bindir” in the [nvcc] section defines the path of the Visual Studio C++ compiler, which is critical for compiling Theano programs. CNMeM refers to a library (built into Theano) that lets you set, by means of a value between 0 and 1, how the Deep Learning framework handles the GPU memory, which is a way to speed up neural network computation on Theano. For example, for video cards shared with the monitor we suggest a value around 0.8, whereas stand-alone graphics cards work with cnmem equal to 1.
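If you prefer to generate the file rather than type it by hand, the same settings can be written with Python's standard configparser module. This is just a convenience sketch: `build_theanorc` is a helper name introduced here, the paths are the ones used in this tutorial, and the cnmem option is placed in the [lib] section, following Theano's `lib.cnmem` configuration flag.

```python
import configparser
import io

def build_theanorc(cuda_root, compiler_bindir, device="gpu", cnmem=0.0):
    """Return the text of a .theanorc file with the sections discussed above."""
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # keep option names such as floatX case-sensitive
    config["global"] = {"device": device, "floatX": "float32"}
    config["cuda"] = {"root": cuda_root}
    config["nvcc"] = {"compiler_bindir": compiler_bindir}
    config["lib"] = {"cnmem": str(cnmem)}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()

text = build_theanorc(
    r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0",
    r"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin",
    cnmem=0.8,
)
print(text)
```

The returned text can then be saved to c:\Users\-user-\.theanorc after checking that the paths match your installation.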

Another very important parameter for boosting the computation, especially for convolutions, is the “enabled” setting in the [dnn] section, which enables or disables the Nvidia CuDNN libraries. This is basically a library supplying optimized primitives for deep neural networks, which speeds up the training and testing stages and saves energy.

In order to install CuDNN you need to go to this link: https://developer.nvidia.com/cudnn and click on the download button. Proceed with the download (Nvidia membership registration might be necessary); the following screen should pop up:



cuDNN NOTE: in this case too, as previously stated, we advise against downloading the latest version of cuDNN; choose instead one of the two preceding versions, since the latest may not be “seen” by either Cuda 8 or Theano 0.9. In this case we recommend cuDNN 6.0. Even so, a warning may arise with Theano 0.9 indicating that the cuDNN version is too recent and could cause problems. We noticed incompatibility problems between cuDNN and TensorFlow as well.

After extracting the downloaded file, you will obtain 3 folders: bin, lib and include. All you need to do is copy the content of these folders into the folders of the same name within the Cuda directory, namely inside: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0

That is, copy cudnn64_6.dll into bin of the Cuda path, copy cudnn.h into include and finally copy cudnn.lib into lib.

Once you have installed cuDNN, go with the installation of Keras by means of pip:

pip install keras

This command will install all the dependencies as well as the latest version of Keras (currently Keras 2.0.9). To set Theano as the Keras backend, go into the folder c:\users\-user-\.keras and edit the file keras.json as follows, namely setting “theano” as the “backend” item.

{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "theano",
    "image_data_format": "channels_last"
}
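The same edit can be scripted with Python's standard json module, which is handy when switching backends often. This is only a sketch: `set_keras_backend` is a helper name introduced here, and the demonstration works on a temporary file rather than on the real c:\users\-user-\.keras\keras.json.

```python
import json
import tempfile
from pathlib import Path

def set_keras_backend(config_path, backend):
    """Set the "backend" item of a keras.json file and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    config["backend"] = backend
    path.write_text(json.dumps(config, indent=4))
    return config

# Demonstration on a temporary copy of the configuration shown above
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "keras.json"
    demo_path.write_text(json.dumps({
        "floatx": "float32",
        "epsilon": 1e-07,
        "backend": "theano",
        "image_data_format": "channels_last",
    }))
    updated = set_keras_backend(demo_path, "tensorflow")
    reloaded = json.loads(demo_path.read_text())

print(updated["backend"])
```

The other items of the file are left untouched, so only the backend changes.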

To check that everything is going the right way, launch the Anaconda prompt and then launch python. From the python prompt type: import keras. If everything went well the following screen will appear:



Notice the warning we mentioned earlier when speaking of cuDNN, shown by Theano itself: if you run into the problems listed, downgrade cuDNN to version 5.1, as advised by the team itself. When a stable version of Theano 0.10 comes out, it will probably solve all these compatibility problems.

In any case, we know that the environment configured with Keras and Theano in this way works perfectly on a variety of models that we previously trained and tested. We decided to use Theano as the backend because it very often turns out to be faster than TensorFlow in some Computer Vision training tasks.

In any case, if you want to use TensorFlow as the backend you need to install it. To install TensorFlow for GPU, run the following command:

pip install --upgrade tensorflow-gpu

This command will install the latest version (1.4.0) of tensorflow-gpu. To try it with Keras, replace “theano” with the string “tensorflow” within the file keras.json, restart the Anaconda prompt and type import keras again.

TensorFlow NOTE: it is not supported on 32-bit platforms; the installation program will download only the wheel for the 64-bit framework. Furthermore, in order to download the CPU version you just need to use the following command (without gpu): pip install --upgrade tensorflow.

If everything went fine, you will see TensorFlow appearing as keras backend this time:


Other useful packages to work with Keras are:

  1. scikit-image: A very useful library for image processing in Python, which allows us to save matrices and tensors as JPEG pictures or in many other supported formats. Installable with: conda install scikit-image.
  2. gensim: The word embeddings library implementing the word2vec algorithm, among other things. Installable with: conda install gensim.
  3. h5py: The library interfacing with the HDF5 format from Python; this is necessary to save trained models to disk in Keras. Installable with: pip install h5py.

At this point, the Keras/Tf-Th environment on Windows is ready to go, so you can test your code and your models while natively harnessing the GPU.


See you at the next tutorial.

Greetings from Deep Learning Italia.


For any information or clarification here you have our emails:

Francesco Pugliese – f.pugliese@deeplearningitalia.com

Matteo Testi – m.testi@deeplearningitalia.com