This study presents a biologically plausible method, based on the operation of the mammalian retina, that continuously converts raw input into a form suitable for both the training and deployment stages of neural networks. It can be applied to both classical and spiking neural networks and can ultimately be used to advance computer vision applications that rely on adaptive processing of the visual field. Vision accounts for more than 80% of sensory input in humans, so computer vision is likely to be an essential interaction modality in the development of autonomous virtual assistants and companions that can support humans in their everyday lives.

One of the most sophisticated biological sensors is the mammalian retina, which acts as an aggressive filter that discards 99% of the information that enters the eye and passes only the most essential portion of that information to the brain for further processing. This filter is highly adaptive, being able to operate efficiently over a very broad range of light intensities (the difference in luminance between midday and midnight is 9 orders of magnitude). This efficiency is propagated into the visual cortex, which can detect objects in a scene within about 120 ms, with an average of only one spike per neuron.

The operation of the mammalian retina can be modelled using a simple statistical technique (exponential moving average; EMA) paired with a transfer function which normalises the values to fall in a compressed range (either (0,1) or (-1,1)). This technique can be combined with a neural network to enable the latter to adapt on the fly to changes in the input data over time (which is related to the notion of “concept drift”). It acts as a self-regulating filter which emphasises data points that deviate from the mean of the input data (equivalent to the mean luminance in the case of the retina).
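As a rough illustration of the idea in the paragraph above, the following minimal sketch tracks an exponential moving average of the input and squashes the deviation from that average into (-1,1) with tanh. The class name `EmaNormaliser`, the smoothing factor `alpha`, and the use of an EMA-based variance estimate as the scale are assumptions made for this example; the text does not specify the exact formulas used in the study.

```python
import math


class EmaNormaliser:
    """Illustrative EMA-based normaliser (a sketch, not the study's exact model)."""

    def __init__(self, alpha=0.05, eps=1e-8):
        self.alpha = alpha  # smoothing factor (assumed hyperparameter)
        self.eps = eps      # guards against division by zero
        self.mean = None    # running EMA of the input
        self.var = 0.0      # running EMA-style estimate of the variance (assumption)

    def update(self, x):
        """Consume one raw value and return its normalised value in [-1, 1]."""
        if self.mean is None:
            # First sample: nothing to compare against yet.
            self.mean = x
            return 0.0
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
        scale = math.sqrt(self.var) + self.eps
        # Values close to the running mean map near 0; strong deviations saturate.
        return math.tanh((x - self.mean) / scale)


if __name__ == "__main__":
    norm = EmaNormaliser(alpha=0.05)
    stream = [20.0] * 50 + [35.0] + [20.0] * 10  # made-up readings with one jump
    for step, value in enumerate(stream):
        y = norm.update(value)
        if abs(y) > 0.9:
            print(f"step {step}: raw={value}, normalised={y:.3f}")
```

Because the statistics are updated on every sample, no pass over the whole data set is needed beforehand, which is what allows the filter to follow slow changes in the input while still reacting to sudden deviations.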

When digital data (images, text, audio...) are processed using a spiking neural network (SNN), the input values (real numbers) have to be converted into discrete spikes. In this case, the proposed method requires only one input neuron per variable rather than an entire population of neurons, which is substantially more economical than the commonly used Gaussian receptive fields method. Another advantage is that the data need no preprocessing before being fed to the network, so the method can process raw, non-normalised streaming data.

One application where this can be extremely useful is anomaly detection. Since spikes are discrete, the output of the network can be interpreted as a binary decision (yes/no) at each step (cf. Fig. 1).

Figure 1: Anomalies in the temperature of a certain industrial machine detected with an SNN using the proposed model. An output is considered a true positive (TP) only if it falls within the gray vertical stripes (areas manually labelled as anomalies by an expert); otherwise, it is a false positive (FP). The SNN does detect the third anomaly, which is considered to be a precursor to a catastrophic failure indicated by the fourth anomaly.
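The TP/FP rule stated in the caption of Fig. 1 can be written down in a few lines. This is only a sketch of the scoring rule, not of the SNN itself: the function name `classify_detections`, the representation of labelled regions as inclusive `(start, end)` index pairs, and the numbers in the usage example are all hypothetical.

```python
def classify_detections(spike_steps, anomaly_windows):
    """Split detection time steps into true and false positives.

    spike_steps: time steps at which the network emitted an output spike.
    anomaly_windows: (start, end) index pairs, inclusive at both ends,
        marking the regions labelled as anomalous (assumed representation).
    """
    true_pos, false_pos = [], []
    for t in spike_steps:
        # A detection counts as TP only if it falls inside a labelled window.
        inside = any(start <= t <= end for start, end in anomaly_windows)
        (true_pos if inside else false_pos).append(t)
    return true_pos, false_pos


if __name__ == "__main__":
    # Made-up example values, not taken from the experiment in Fig. 1.
    windows = [(10, 20), (50, 60)]
    spikes = [5, 12, 55, 70]
    tp, fp = classify_detections(spikes, windows)
    print("true positives:", tp)
    print("false positives:", fp)
```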

This is all well and good, but we lack algorithms for online training of SNNs, so the result in Fig. 1 was obtained after applying heuristics to the sole model parameter. Fortunately, we can also apply the same algorithm to a conventional neural network (such as an LSTM) and train it on the fly using backpropagation. In this case, the anomaly detection task turns into a prediction task (trying to predict future values in a time series). Here is the result of a one-step prediction obtained with an LSTM with an unbounded output (regression) layer:

Figure 2: (a) One-step prediction results for the temperature of a certain industrial machine (same data as in Fig. 1) with the proposed input conversion method and raw value regression (the network output is unbounded). (b) A zoomed view of the 6K to 8K portion of the plot in (a). (c) Training loss for this experiment.

Not too bad, but we can go even further. If we apply the same technique in reverse to the output of the network (assuming that the logistic function is used at the output), the data are normalised at the input and de-normalised at the output to obtain the regressand. This leads to increased regression accuracy, training robustness and speed of convergence (Fig. 3). Note that the same network is used in both cases, but in the second case the proposed model is applied in reverse to the output of the network, which is bounded to the interval (-1,1).


Figure 3: Results for the same task as in Fig. 2, but with the proposed method applied in reverse at the output.
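The reverse step can be sketched as the inverse of the input transform. The example below assumes a tanh squashing function, matching the (-1,1) interval mentioned above, and assumes that the running `mean` and `scale` from the input side are reused at the output. For a logistic output in (0,1), the logit function would take the place of `atanh`. The function names and the `clip` guard are hypothetical and not taken from the study.

```python
import math


def normalise(x, mean, scale):
    """Forward transform: squash the scaled deviation from the mean into (-1, 1)."""
    return math.tanh((x - mean) / scale)


def denormalise(y, mean, scale, clip=1e-7):
    """Reverse transform: map a bounded network output back to a raw value.

    `clip` keeps y strictly inside (-1, 1) so that atanh stays finite
    (an assumed safeguard for saturated outputs).
    """
    y = max(-1.0 + clip, min(1.0 - clip, y))
    return mean + scale * math.atanh(y)


if __name__ == "__main__":
    mean, scale = 20.0, 3.0  # made-up running statistics
    y = normalise(23.5, mean, scale)
    print("normalised:", y)
    print("recovered:", denormalise(y, mean, scale))
```

With this arrangement the network only ever has to produce values in a fixed, bounded range, while the running statistics carry the information about the current level and spread of the raw signal.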

The fit in Fig. 3 is much closer and smoother than that in Fig. 2, but a more comprehensive set of tests is needed to establish whether this still holds for multi-step prediction. More importantly, it would be instructive to conduct experiments on applications relying on an adaptive vision system akin to the retina. At any rate, the method is generic and applicable to any kind of data representable as a stream of real numbers. The ability to process streaming data on the fly is a step towards enabling autonomous agents to learn continuously from multi-modal input while engaging with their environment.