This is an old experiment I did back in 2020 to explore the trade-off between memory and depth in neural networks when solving the same tasks.
It was part of my research proposal for what I would have liked to explore in post-doctoral work (link here). Unfortunately, my supervisor wasn't interested, and I fucked up the interview with OpenAI in 2020 by not knowing what LeetCode is...
If we use an RNN on a task like image recognition, where the network looks at a window/segment of the data at a time, how does it compare to a feed-forward neural network, which looks at the whole data point at once?
The idea comes from my current work on reasoning systems using neural networks. One of the issues I've faced is establishing some sort of upper bound on what the performance of a reasoning system can look like, compared to a traditional non-reasoning system.
I define reasoning as a sequential process (thus it includes the use of memory) for making a decision about an input. In contrast to models like the feed-forward multi-layer perceptron (MLP), where the decision is conditioned on the whole input at once, a reasoning system looks at windows of the input one at a time, records some information/notes about each, and uses the recorded information to reach a decision.
We can think of recurrent neural networks as reasoning systems, although more constrained ones: at each time step, the model is assumed to make only one step of reasoning (which does not need to be the case in general).
To investigate this question, I used the following protocol.
I compare the following networks for a wide range of hyper-parameters:
MLP, with a building block of a linear layer + Tanh activation
# of hidden layers: 0, 1, 2
size of hidden layers: 2, 8, 16, 32, 64
Vanilla RNN, with Tanh activation
# of RNN layers: 1, 2
size of the hidden state: 2, 8, 16, 32, 64
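The post doesn't name a framework, so as an assumption here is a minimal PyTorch-style sketch of the two model families (the factory function and class names are mine, not from the original experiment):

```python
import torch.nn as nn

def make_mlp(n_hidden_layers, hidden_size, in_dim=28 * 28, n_classes=10):
    # Stack linear + Tanh blocks; 0 hidden layers reduces to a single linear map.
    layers, dim = [], in_dim
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(dim, hidden_size), nn.Tanh()]
        dim = hidden_size
    layers.append(nn.Linear(dim, n_classes))
    return nn.Sequential(*layers)

class WindowRNN(nn.Module):
    # Vanilla RNN that reads one window of columns per time step and
    # classifies from the last hidden state.
    def __init__(self, n_layers, hidden_size, window_dim=3 * 28, n_classes=10):
        super().__init__()
        self.rnn = nn.RNN(window_dim, hidden_size, num_layers=n_layers,
                          nonlinearity='tanh', batch_first=True)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, windows):          # windows: (batch, n_windows, window_dim)
        out, _ = self.rnn(windows)
        return self.head(out[:, -1])     # decision from the final step
```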
The window size for the RNN is just 3 columns of the image at a time. The last column is dropped (this choice was arbitrary).
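Concretely, a 28x28 image yields 9 windows of 3 columns each (27 of the 28 columns used, the leftover column dropped). A sketch of that slicing, assuming images arrive as (batch, 28, 28) tensors:

```python
import torch

def columns_to_windows(images, window=3):
    # images: (batch, 28, 28). Split the 28 columns into 28 // 3 = 9 windows
    # of 3 columns each; the remainder (the 28th column) is dropped.
    batch, n_rows, n_cols = images.shape
    n_windows = n_cols // window                       # 9 for MNIST
    cols = images[:, :, : n_windows * window]          # drop the leftover column
    # (batch, rows, n_windows, window) -> (batch, n_windows, window * rows)
    windows = cols.reshape(batch, n_rows, n_windows, window)
    return windows.permute(0, 2, 3, 1).reshape(batch, n_windows, window * n_rows)
```

Each time step of the RNN then sees an 84-dimensional vector (3 columns x 28 rows).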
All models are trained for 200 epochs with the Adam optimizer and a learning rate of 0.0001.
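The training loop itself is standard; a hedged sketch with the settings from the text (PyTorch assumed, and the `train` helper name is mine):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=200, lr=1e-4, device='cpu'):
    # Adam with the fixed learning rate from the text; cross-entropy loss.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model
```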
I measure efficiency as the performance divided by the number of parameters in the model.
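That metric is straightforward to compute from a model's parameter count; a sketch assuming PyTorch models (helper names are mine):

```python
def n_params(model):
    # Total number of trainable parameters in the model.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def efficiency(performance, model):
    # Efficiency as defined in the text: performance per parameter.
    return performance / n_params(model)
```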
For each hyper-parameter configuration, the experiment was repeated 10 times with different random weight initializations.
I report the following metrics:
Mean accuracy and cross-entropy on the train set (to gauge the capacity of the model) and on the test set (to gauge its generalization).
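Both metrics can be computed in one pass over a dataset; a sketch of that evaluation, again assuming PyTorch:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def evaluate(model, loader, device='cpu'):
    # Returns (mean accuracy, mean cross-entropy) over the dataset.
    loss_fn = nn.CrossEntropyLoss(reduction='sum')
    model.to(device).eval()
    correct, total, loss_sum = 0, 0, 0.0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        loss_sum += loss_fn(logits, y).item()
        correct += (logits.argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total, loss_sum / total
```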
I perform this experiment over two datasets: MNIST and Fashion-MNIST.
MNIST is a standard benchmark for machine learning. Fashion-MNIST is a more complex visual dataset, further testing the models.
First, I look at the accuracy on the test data. The RNN consistently outperforms the MLP across all hyper-parameter settings. The cross-entropy results show a similar pattern. All of this makes a good case for sequential reasoning over the MLP's all-at-once style.
It is also interesting to look at the results on the training data, as they indicate the capacity of the different models at each parameter count. Notably, the RNN has a larger capacity to learn than the MLP across all parameter counts.
In this exploratory experiment, I showed that, for the same number of parameters, an RNN outperforms an MLP on the static task of image recognition (usually the domain of MLPs). I demonstrated this on two datasets and across a wide variety of hyper-parameters for both base models.
This doesn't prove the superiority of RNNs over MLPs; a more comprehensive experiment (with a formal comparison) would be needed to address that question. The intention, however, is to make the case for reasoning systems, and to show that they can perform comparably well.
I would love to hear your comments and suggestions on this experiment.