# Analysis of Deep Learning Models using Deep Echo State Networks (DeepESNs)

**Author: Luca Pedrelli**

In recent years, deep neural networks have attracted great interest in the field of artificial intelligence. Based on a hierarchy of multiple layers, such models provide a distributed, hierarchical and compositional representation of the input information. Stacked neural networks have achieved good results in many application domains, significantly outperforming shallow neural networks.

Many recent studies in deep learning research aim to analyse the abilities of deep models in representing data. Typically, such studies focus on application domains where the human interpretation of the information is intuitive, such as computer vision and natural language processing. On the other hand, understanding what happens inside deep recurrent neural networks (DeepRNNs) during the elaboration of temporal sequences is harder: intuitively interpreting a temporal feature is more difficult than interpreting a visual or textual feature.

In this tutorial, we present recent studies focused on the comprehension of the temporal computation performed by DeepRNNs in the field of time-series processing. In particular, the introduction of Deep Echo State Networks (DeepESNs) [1] makes it possible to analyze and understand the intrinsic properties of DeepRNNs, while at the same time providing design solutions for efficient deep recurrent architectures.

**Deep Models vs Shallow Models in feed-forward neural networks**

In order to understand the characterization of deep learning models, we first consider the case of feed-forward neural networks. A feed-forward neural network can be seen as a stack of layers connected by feed-forward connections. In this kind of model, the input layer is fed by an input vector, the hidden layers compute multiple non-linear transformations, and, finally, the output layer calculates the output vector.

While deep models are characterized by a stack of non-linear transformations represented by a hierarchy of multiple hidden layers, shallow models are typically composed of a single hidden layer. Both kinds of models are universal approximators of continuous functions. However, deep models are better at providing a distributed, hierarchical and compositional representation of the input information.

**Analysis of Deep Learning Models: an example**

Let us consider an example of a typical analysis of deep feed-forward models within the domain of Computer Vision: a Convolutional Neural Network (CNN) trained to classify images.

If we have a look at the neural activity of a deep architecture, we can notice that the first layers direct their attention towards specific details such as borders and shapes. Central layers, instead, focus on patterns such as eyes and noses, and successive deeper layers focus on more abstract concepts such as faces or poses of objects.

**Recurrent Neural Networks**

Recurrent Neural Networks (RNNs) are a class of models able to represent temporal dependencies in sequences of vectors. Therefore, they are particularly suitable for time-series processing. RNNs implement a memory mechanism through feedback connections within the recurrent hidden layer: for each input vector fed into the input layer, a state vector is updated to preserve the memory of the input history.
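The state update just described can be sketched in a few lines of NumPy. The dimensions, the tanh activation and the uniform weight initialization below are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_state = 3, 10  # illustrative sizes

W_in = rng.uniform(-1, 1, (n_state, n_in))      # input-to-state weights
W_rec = rng.uniform(-1, 1, (n_state, n_state))  # recurrent feedback weights

def update(state, u):
    # The new state depends on the current input u and on the previous
    # state, which carries the memory of the input history.
    return np.tanh(W_in @ u + W_rec @ state)

# Feed a short input sequence; the state accumulates the input history.
state = np.zeros(n_state)
for u in rng.uniform(-1, 1, (5, n_in)):
    state = update(state, u)
```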

**Deep Recurrent Neural Networks**

As in the feed-forward case, DeepRNN models are characterized by a stack of multiple recurrent layers, while shallow RNN models are composed of a single recurrent layer.

DeepRNNs have achieved breakthroughs in many recent applications in the field of temporal sequences. Basically, DeepRNNs are capable of providing a hierarchical representation of the temporal information, which might make them suitable to address tasks characterized by sequences with multiple time-scales dynamics.

The principal open questions in the literature on DeepRNNs are:

- Why does stacking recurrent layers achieve good performance?
- How can we understand the temporal features inside DeepRNNs?
- What is the inherent role of layers in deep recurrent architectures?
- How many layers are sufficient to have a good temporal representation?

In [1,2], some answers are proposed through the study of the DeepESN model as an analysis tool.

**Deep Echo State Networks [1]**

So far, the models presented have typically been trained with back-propagation algorithms; in this case, all the free parameters of the neural network are learned in order to perform a supervised task. However, with these numerical approaches it can be very hard to understand the internal dynamics of a neural network when all the parameters are empirically trained.

Here we consider the Reservoir Computing (RC) approach. RC is a paradigm for designing RNNs in which the parameters of the recurrent layer are randomly initialized and left untrained. The only part of the network that is trained is a linear output layer called the readout. Training the readout typically consists of finding a closed-form solution of a simple linear optimization problem.
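A minimal ESN-style sketch of this recipe, assuming NumPy, a toy one-step-delay task, and common but here arbitrary choices (e.g. rescaling the reservoir matrix to spectral radius 0.9). Only the linear readout is trained, in closed form via ridge regression.

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_res, washout = 1, 100, 50  # illustrative sizes

# Random, untrained reservoir; rescaling the spectral radius below 1
# is a common heuristic for stable reservoir dynamics.
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-1, 1, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(inputs):
    states, x = [], np.zeros(n_res)
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)  # untrained recurrent dynamics
        states.append(x)
    return np.array(states)

# Toy task: output the input delayed by one step (a pure memory task).
U = rng.uniform(-1, 1, (500, n_in))
y = np.roll(U[:, 0], 1)
X = run_reservoir(U)[washout:]       # discard the initial transient
t = y[washout:]

# Closed-form ridge regression: the readout is the only trained part.
reg = 1e-6
W_out = np.linalg.solve(X.T @ X + reg * np.eye(n_res), X.T @ t)
pred = X @ W_out
```

The closed-form solve replaces back-propagation entirely, which is what makes the internal dynamics open to direct analysis.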

The extension of RC within the deep learning paradigm was introduced with the DeepESN model in [1]. DeepESNs are characterized by a stack of randomly initialized recurrent layers. The output is computed as a linear combination of the recurrent units in all the recurrent layers, which allows the readout to exploit the multiple dynamics developed in the network state.
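The stacked-reservoir idea can be sketched as follows, under the same illustrative assumptions as before: each untrained layer is fed by the state of the layer below it, and the (would-be) readout sees the concatenation of all layer states.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_units, n_layers = 1, 50, 3  # illustrative sizes

def make_reservoir(n_input):
    # Untrained layer: random input weights plus a recurrent matrix
    # rescaled to spectral radius 0.9 (an illustrative choice).
    W_in = rng.uniform(-0.5, 0.5, (n_units, n_input))
    W = rng.uniform(-1, 1, (n_units, n_units))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

# The first layer reads the external input; each deeper layer reads
# the state of the layer below it.
layers = [make_reservoir(n_in)] + [make_reservoir(n_units)
                                   for _ in range(n_layers - 1)]

def run_deep(inputs):
    states = [np.zeros(n_units) for _ in range(n_layers)]
    collected = []
    for u in inputs:
        x = u
        for i, (W_in, W) in enumerate(layers):
            states[i] = np.tanh(W_in @ x + W @ states[i])
            x = states[i]
        # The linear readout combines the units of *all* layers.
        collected.append(np.concatenate(states))
    return np.array(collected)

X = run_deep(rng.uniform(-1, 1, (20, n_in)))
```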

The aim of such an approach is to introduce tools for the study and design of efficient DeepRNNs. Moreover, the use of untrained recurrent connections fosters the study of the intrinsic properties of deep recurrent architectures.

**Multiple Time-scales Diversification [1]**

The analysis proposed in [1] on DeepESN models highlights the hierarchical organization of the layers in DeepRNN architectures, which are intrinsically (i.e., with the learning of the recurrent connections switched off) able to diversify the multiple time-scales dynamics developed inside the network state.

The blue lines in the figure below correspond to the recurrent layers of a DeepESN that receives a perturbed input sequence in the input layer; darker lines represent deeper layers. The y-axis depicts the deviation between the state of a DeepESN fed with the input sequence perturbed at a certain time-step and the state of a DeepESN fed with the original input sequence. The x-axis represents the number of time-steps after the perturbation. The effects of the input perturbation last longer in the deeper layers. Moreover, the dynamics are ordered along the architecture’s hierarchy.

Another analysis in [1] compares the DeepESN with two other architectures to study which kind of recurrent structure develops richer temporal dynamics.

Interestingly, the diversification of the multiple time-scales dynamics is weak when the inputs are connected directly to the deeper layers (DeepESN-IA) or when the layers are not hierarchically connected (GroupedESN). In these two cases, the dynamics present in the architectures are more “flat”, similar to the dynamics developed in shallow RNNs.
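The perturbation experiment described above can be reproduced in sketch form: run the same untrained stacked network on an input sequence and on a copy of it perturbed at a single time-step, then measure the per-layer distance between the two state trajectories. All sizes and scalings below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n_in, n_units, n_layers, T, t_pert = 1, 50, 3, 200, 100

def make_layer(n_input):
    W_in = rng.uniform(-0.5, 0.5, (n_units, n_input))
    W = rng.uniform(-1, 1, (n_units, n_units))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

layers = [make_layer(n_in)] + [make_layer(n_units)
                               for _ in range(n_layers - 1)]

def run(inputs):
    # Record every layer's state at every time-step.
    states = [np.zeros(n_units) for _ in range(n_layers)]
    history = []
    for u in inputs:
        x = u
        for i, (W_in, W) in enumerate(layers):
            states[i] = np.tanh(W_in @ x + W @ states[i])
            x = states[i]
        history.append([s.copy() for s in states])
    return history

U = rng.uniform(-1, 1, (T, n_in))
U_pert = U.copy()
U_pert[t_pert] += 1.0  # perturb one single time-step

h, h_pert = run(U), run(U_pert)
# Per-layer Euclidean deviation between the two runs, from the
# perturbation onwards: the decay profile of each layer.
dev = np.array([[np.linalg.norm(a - b) for a, b in zip(h[t], h_pert[t])]
                for t in range(t_pert, T)])
```

Plotting the columns of `dev` against time reproduces the kind of figure discussed above, with one decay curve per layer.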

**Ordered Structure of temporal information [2]**

In [2], a deep echo state network architecture with linear activation, called L-DeepESN, has been introduced. This model exposes the internal mechanisms of a hierarchical recurrent architecture, enabling the study of the inner role of layering in DeepRNNs. Interestingly, deep recurrent architectures are naturally (without the need for learning) capable of providing hierarchical and distributed features of the temporal information, through multiple frequency representations. Specifically, the magnitudes of the frequency components are diversified across the successive layers, and the temporal structure is ordered according to the depth of the layers, with the higher layers focused on lower frequencies.

The use of linear activations in L-DeepESN makes it possible to derive an algebraic expression of the equivalent linear system with only one recurrent layer. The matrix of the recurrent weights of such a system has a particular structure (a triangular block matrix) that highlights the characteristics of layering in deep recurrent architectures. In particular, the layering in the architecture determines the structure of the matrix, and this structure produces the ordered temporal features in the network state.
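Under the linear-activation assumption, this equivalence can be verified numerically. The sketch below uses a two-layer stack with deliberately tiny, illustrative sizes: substituting the first layer's update into the second yields a single linear system over the stacked state whose recurrent matrix is block triangular, because layers only influence downwards.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n = 1, 4  # deliberately tiny, illustrative sizes

# A two-layer *linear* DeepESN: x1(t) = W1 x1(t-1) + Win1 u(t),
#                               x2(t) = W2 x2(t-1) + Win2 x1(t).
W_in1 = rng.uniform(-1, 1, (n, n_in))
W1 = 0.5 * rng.uniform(-1, 1, (n, n))
W_in2 = rng.uniform(-1, 1, (n, n))
W2 = 0.5 * rng.uniform(-1, 1, (n, n))

# Substituting x1(t) into the second equation gives one linear system
# z(t) = A z(t-1) + B u(t) over the stacked state z = [x1; x2]; the
# zero block in A reflects that deeper layers never feed back upwards.
A = np.block([[W1, np.zeros((n, n))],
              [W_in2 @ W1, W2]])
B = np.vstack([W_in1, W_in2 @ W_in1])

# Numerical check that the two formulations coincide.
x1, x2 = np.zeros(n), np.zeros(n)
z = np.zeros(2 * n)
for u in rng.uniform(-1, 1, (10, n_in)):
    x1 = W1 @ x1 + W_in1 @ u
    x2 = W2 @ x2 + W_in2 @ x1
    z = A @ z + B @ u
```

After the loop, `z` matches the concatenation of `x1` and `x2`, confirming that the layered network and the single block-triangular system compute the same state.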

From an application point of view, DeepESNs achieved a performance improvement of several orders of magnitude over standard RC approaches on forecasting tasks characterized by multiple time-scales dynamics [2].

**Conclusions**

In conclusion, DeepESNs foster the study and analysis of the intrinsic architectural properties of DeepRNNs [1,2]. In particular, as far as the architecture’s dynamics are concerned, the hierarchical organization of recurrent layers enhances the memory capacity, the entropy (the quantity of information) and the multiple time-scales diversification [1]. Moreover, the use of a linear activation in L-DeepESNs [2] facilitates the identification of the features represented by DeepRNNs and makes it possible to define an algebraic expression that characterizes the effect of deep recurrent architectures on the state dynamics. Finally, DeepESNs provide efficient solutions to design DeepRNNs suitable for applications with temporal sequences characterized by multiple time-scales dynamics. An up-to-date survey on the advancements in the development, analysis and applications of DeepESNs can be found in [3].

In the next tutorials, we will show recent advances in challenging real-world applications achieved by deep Echo State Networks, stay tuned! 😀

**References**

- [1]: C. Gallicchio, A. Micheli, L. Pedrelli. “Deep Reservoir Computing: A Critical Experimental Analysis”, Neurocomputing (2017), Elsevier, vol. 268, pp. 87-99, DOI: 10.1016/j.neucom.2016.12.089

- [2]: C. Gallicchio, A. Micheli, L. Pedrelli. “Hierarchical temporal representation in linear reservoir computing”, WIRN (2017), preprint arXiv:1705.05782

- [3]: C. Gallicchio, A. Micheli. “Deep Echo State Network (DeepESN): A Brief Survey”, arXiv preprint arXiv:1712.04323 (2017)