In this article, we introduce an algorithm that separately trains multiple neural networks, without the usual supervision, to achieve a behaviour comparable to human verbal mimicry, verbal imitation, or early lexical acquisition.
Unsupervised learning (UL) and self-supervised learning (SSL) describe the ways an algorithm can train a model to find patterns in data without explicit supervision from dedicated labels. UL has made great strides in the last 10 years, as deep learning (DL) algorithms became increasingly able to tackle complex challenges thanks to pretraining with UL algorithms.
This article will consider SSL and UL to be closely related even if there are still debates on the appropriate terminology. [LeCun on UL / SSL]
- In Natural Language Processing (NLP) with DL, SSL was popularized with word2vec (2013). It was then used with models like GloVe (2014), Transformers (2017), BERT (2018), and GPT-3 (2020). It is the usual way to pretrain text models thanks to data abundance, representation sizes and computation speed. These algorithms build a representation of words and other syntactic units in a way that doesn’t require explicit dedicated labels. These representations, called embeddings, are mathematical vectors; for words, they are word embeddings.
- In Computer Vision (CV) with DL, UL / SSL methods aren’t as common as they are in NLP. Many unsupervised tasks have been defined to try to surpass supervised learning algorithms, with competitive alternatives only emerging in recent years. CV benefits from data abundance thanks to the internet, which gives easy access to billions of images and labels made by the community. Recent SSL methods include MoCo v2 (2019–2020), SwAV (2020) and BYOL (2020). State-of-the-art algorithms use contrastive learning: they train neural networks to produce similar representations for two different views of the same original image while avoiding the situation where all representations are similar. These methods are however computationally challenging (1,2).
- In Computer Audition (CA) with DL, several processing tasks benefit from UL algorithms. CA DL algorithms can be linked with CV DL algorithms because a spectrogram can be interpreted as a particular kind of image. Examples include audio inpainting (2019), audio contrastive learning (2020) and the use of temporal proximity (2020). SSL has also been used to improve speech recognition with an unsupervised “next time step prediction” task (2019).
- In Video Processing (VP) with DL, UL/SSL algorithms can use both spatial and temporal information in each video, and they can also use the link between sounds and images. One of the many methods aims at predicting the future of a video while separately handling the image with a classic CNN, and the temporal coherence with an LSTM (survey 1, survey 2, 2020).
In the past years, UL with DL has shown its ability to process several types of data sources which are natural for humans. This project is in line with this trend.
1. Project and theory
The idea of the project is to reproduce a behavior that could look like human verbal imitation or early lexical acquisition based on videos. The steps are as follows:
- The algorithm saw some images and heard some sounds related to these images in the past,
- it should repeat the correct sound corresponding to the image it sees in the present.
The artificial neural representations we’ll associate shouldn’t be targeted at solving one dedicated task, the modules providing them should work autonomously. This is because humans don’t need to have hearing to understand what they see, and they don’t need to see to be able to understand language. The constraint is to provide two independent representations of sounds and images that could be used for the dedicated task we defined, and other tasks as well. Finally, we also introduce a real-time constraint to be able to process live recordings, and an online-learning constraint where data becomes available progressively and the algorithm must continuously adapt to it.
The goal is not to start from pre-trained models, nor to generate image/audio representations that are dedicated to this task alone.
According to these constraints, the global task will be decomposed in 3 sub-tasks:
- Building a representation of an image without supervision
- Building a representation of an audio signal of arbitrary length without supervision
- Associating the representation of an image to the representation of a close audio signal
In this article we propose:
- The aforementioned Artificial Early Language Acquisition task
- A very short dataset we made to evaluate our models: AELA-d0
- The AELA0 model, our baseline for this task
A bit of linguistic theory
One of the theoretical bases of this work is the semiotic triangle. The semiotic triangle introduces three concepts:
- the object in reality (a dog for example, the referent),
- the way we refer to this object (“a dog” in English, the symbol),
- and the idea of this object (the same dog but in my head, the thought).
In this project, there are four main elements:
- The raw image, a mathematical tensor of shape 3×H×W (the referent);
- the audio signal as a waveform, a vector (the symbol);
- the representation of the image, a vector
- and the representation of the audio signal, a vector.
The last two can be assimilated to the thought. The theory behind this project also supports the idea that thoughts can partly be reproduced with artificial neural representations (or embeddings): the idea of thought vectors.
Audio signals representing words are harder to process than written texts, even when both contain the same information. Written text is a curated version of the audio signal containing human language: it standardizes speech and removes a lot of information such as hesitations, stutters and silences. Because of this standardization, research on joint image-text learning and text-audio learning is more advanced than research on direct image-audio learning.
Despite that, many attempts at algorithms that could learn like humans from videos have been made. Roy et al. (2001) implemented a long-term and a short-term memory to associate the shape of an object with phonemes. They faced many technical limits (colour, size and texture of objects were ignored; phonemes required supervised learning; computational means were limited). More recent works include Harwath et al. (NIPS 2016, 2017, 2018) and Rouditchenko et al. (2020). These works aim at associating an image and an audio signal with embeddings that are dedicated to this task: without images, the audio model cannot train; without audio, the image model cannot train. These recent works also don’t use a global memory for the model and don’t aim at learning on data that only becomes available gradually.
The “learning like a child” idea has been of great interest to the AI research community. Proposed methods include reinforcement learning, few shot learning, meta-learning, and self-supervised learning. Humans probably do need to do all of these and more on multiple data sources to train properly. For SSL on images we can bring attention to Orhan et al (2020) which trained a MobileNet with MoCoV2 on a dataset made of images from a camera mounted on a child’s head (SAYCam). They observe the ability of SSL algorithms in extracting high-level visual features with this kind of dataset. For SSL on sounds, Lakhotia et al (2021) noted the still open-question of systems that learn from natural interactions and how it could be important to learn human behaviors through prosody and intonations.
Our project isn’t going to go as far as all the recent works on all the specific aspects of SSL, it instead aims at continuing some of these works on a new specific task with new constraints.
2. Data and target
2.1 Input, Output
The input data of the project is a list of videos, or, a live recording. The output data is sound.
Videos are preprocessed in real-time and turned into a list of images with temporal information and a list of short sounds with temporal information. The temporal information is only used as a way to link an image with a close sound.
For some specific parameters, the algorithm selects images at 0.5–1.0 fps and excludes blurry images. Images keep their aspect ratio, with a minimum size of 224 pixels. The audio length ranges from 1s to 10s, at 44.1kHz. The algorithm starts recording audio only when the signal amplitude rises above a certain threshold (above background noise). We mainly aim at selecting individual words.
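The frame selection and the audio gate can be sketched as follows. This is a minimal NumPy sketch: the thresholds and the Laplacian-variance blur test are illustrative assumptions, not the project’s exact implementation.

```python
import numpy as np

# Illustrative thresholds -- the real values are tuning parameters.
BLUR_THRESHOLD = 100.0   # Laplacian variance below this => frame is "blurry"
AUDIO_GATE = 0.02        # amplitude above this => start recording

def laplacian_variance(gray):
    """Sharpness score: variance of a 4-neighbour Laplacian of a grayscale image."""
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

def keep_frame(gray):
    """Keep only frames that are sharp enough."""
    return laplacian_variance(gray) >= BLUR_THRESHOLD

def audio_is_active(chunk):
    """Gate: record only when the signal rises above the background-noise floor."""
    return float(np.abs(chunk).max()) > AUDIO_GATE
```

A blurred frame has a nearly flat Laplacian, so its variance drops below the threshold and the frame is skipped.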
Compared to a random dialogue between adults, this audio is curated. But when talking to a toddler, we likewise use simpler words, or just give the name of an object multiple times, hoping that they can repeat it.
After preprocessing, we have a list of clean images and audio signals. We write them in a folder on disk. This folder acts as a perfect memory of raw data (images and sounds). All of this happens in real time while reading the video and training the models.
For images, we only need to have a valuable representation of the information it contains. Images are an input of the global model.
However, for audio, we also need to be able to reproduce a sound. Sounds are used as an input and an output of the global model. Waveform data will be processed as mel-spectrograms during training.
To evaluate some algorithms independently, standard datasets have been used. These datasets are ImageNet and CommonVoice FR 0.6.1. The goal of these tests isn’t to outperform a SOTA but to independently assess the performance of the algorithms on standard datasets.
To evaluate the ability of the algorithm to fit all our specific constraints, custom videos have been made. These videos focus on one specific object with a human voice stating the name of the object explicitly and multiple times. The objects are: an acorn, a swiss army knife, alphabet from A to J, texts “AL”, “PHA”, “BET”, a stainless steel bottle, a pen, a spoon, a thermometer and a toothbrush.
A test set with some of these objects in close but different conditions has also been made to evaluate the ability of the algorithm to learn on this custom dataset without overfitting too much.
The training set is cleaner than the test set. The test set comes from one video which includes a hand manipulating the objects or no objects at all.
The goal isn’t to reach the best possible accuracy on this custom dataset but just to provide a working proof of concept of early language acquisition.
Now we have a database with images, and a database with sounds. This is the top part of the following graph. The bottom part is about the models:
We named this model AELA0, for “Artificial Early Language Acquisition zero”. In this global model, four concrete DL models train at the same time: a model dedicated to images, two models dedicated to sounds, and one model dedicated to the association.
We feed raw images to the image model so it can train to produce accurate image embeddings [left]. We do the same with sounds for audio models [right]. Both are used by the association model in the middle. This model reads raw images and sounds [middle], converts them into embeddings [middle bottom] thanks to the other models, and predicts the audio embedding based on an image embedding [left to right flow]. Then we can re-use the audio model to produce the sound from the predicted audio embedding [bottom right].
All models have undergone hyperparameter optimization at some points during the development of the project. Configurations are given as examples. We indiscriminately use “embeddings”, “representations”, “features” and “artificial neural representations” to designate the same concept.
3.1 — Building unsupervised image representations
The algorithm chosen for CV SSL is SwAV. The main reasons: it was one of the state-of-the-art methods for CV SSL when the project began; it includes experiments with small batches; it seemed quite straightforward to re-implement; the source code is available; and the authors evaluated its accuracy with frozen features, while our algorithm will only use frozen features for the end task.
SwAV roughly works like other contrastive learning (CL) algorithms: the loss decreases when representations of different transformations of the same image are close, while avoiding the trivial solution of having the same representation everywhere (collapse).
But SwAV introduces or reuses some tricks, here are the ones we kept:
A/ Instead of directly comparing the features that represent two different images, SwAV uses a proxy of these features [Fig1]. The network isn’t constrained to produce features with identical values for similar images; it is constrained to produce features that contain the same information. The loss is a cross entropy applied to a post-processing of the extracted features, so the constraints aren’t directly applied to the features themselves. We hypothesize that applying these constraints indirectly gives the features more freedom to represent a larger variety of visual elements.
B/ Rather than assigning explicit positive and negative pairs, SwAV only uses indirect positive pairs. It constrains multiple parts of the algorithm during training so that it has implicit negative pairs:
- The first output of representations, corresponding to the first set of augmentations, is used as predictions and constrained with a softmax. It outputs probabilities between 0 and 1: a row constraint (along dimension 1 of the batch_size × embed_size matrix) such that the post-processed representation of one image sums to 1.
- The second output of representations, corresponding to the second set of augmentations, is used as labels and constrained with the Sinkhorn-Knopp algorithm. It outputs a globally constrained matrix with a row constraint (each row sums to 1) and a column constraint. The algorithm iteratively alternates between row and column constraints and ends with a row constraint.
This final constraint is a global constraint on the energy (sum of all elements) of the matrix [Asano et al]. For one list of representations, this constraint means that the algorithm can’t create a value that is above average somewhere without creating a value that is under average elsewhere. All values are related. If a feature gets higher, the same feature of another image has to get lower, and another feature of the same image has to get lower, etc. These values are used as labels, or clusters, such that the algorithm is free to predict with one view the constrained label resulting from another view of the same image.
The algorithm also uses temperature to avoid collapsing: It artificially amplifies the variance of representations to avoid having the same representation everywhere. The paper introduces multicrop as well but we didn’t find a need to implement it and we didn’t want to overcomplicate the model.
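The label-building step above can be sketched in a few lines of NumPy: a simplified Sinkhorn-Knopp normalization and the temperature softmax, using the epsilon and temperature values from our configuration. The real SwAV implementation differs in details (batching across GPUs, a prototype layer, etc.), so treat this as an illustration of the constraints, not as the official code.

```python
import numpy as np

def sinkhorn(scores, epsilon=5e-3, n_iters=3):
    """Turn raw scores into soft labels: alternate column and row normalizations,
    ending with a row constraint so each image's label sums to 1."""
    Q = np.exp((scores - scores.max()) / epsilon)   # stabilized exponential
    B, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)           # columns: prototypes used equally
        Q /= K
        Q /= Q.sum(axis=1, keepdims=True)           # rows: one distribution per image
        Q /= B
    return Q * B                                    # rescale rows to sum to 1

def softmax(scores, temperature=2e-2):
    """Predictions: sharpened probabilities, one distribution per image."""
    z = scores / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def swav_loss(scores_view1, scores_view2):
    """Cross entropy: predict with one view the constrained labels of the other."""
    Q = sinkhorn(scores_view2)
    P = softmax(scores_view1)
    return float(-(Q * np.log(P + 1e-12)).sum(axis=1).mean())
```

The column normalization is what provides the implicit negative pairs: no prototype can absorb every image, so representations cannot all collapse to the same value.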
During training, our implementation constantly reads new images from the progressively expanding memory and learns to build representations of these new images.
The model used in our working implementation is ResNet18. Some parameters for SwAV are: embed_size: 64, temperature: 2e-2, epsilon: 5e-3, batch_size: 128.
3.2 — Building unsupervised representations of audio signals
The algorithm chosen for CA SSL is a 2d-convolutional autoencoder to compress the spectrogram and an LSTM autoencoder to compress this previous representation of variable length into a fixed length representation.
It respects two constraints:
- (1) being able to re-generate audio-signals, and
- (2) representing audio signals of arbitrary lengths with a unique representation.
Starting from a waveform, we apply FFT and mel scale to get a mel spectrogram of each second. Then a convolutional autoencoder processes and trains to reproduce these mel spectrograms, giving multiple intermediate representations for one audio signal. Finally an LSTM autoencoder processes and trains to reproduce these intermediate representations, giving only one intermediate representation for the complete audio signal, no matter its length.
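The waveform-to-mel front end can be sketched with NumPy. This is a simplified STFT with a triangular mel filterbank; the FFT size, hop length and number of mel bands are illustrative assumptions, not our exact configuration.

```python
import numpy as np

def stft_mag(wave, n_fft=2048, hop=512):
    """Magnitude spectrogram: Hann-windowed frames -> |rFFT|."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))      # (n_frames, n_fft // 2 + 1)

def mel_filterbank(n_mels=64, n_fft=2048, sr=44100):
    """Triangular filters evenly spaced on the mel scale (simplified)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    freqs = mel_inv(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * freqs / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):               # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):              # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wave, sr=44100, n_fft=2048, hop=512, n_mels=64):
    return stft_mag(wave, n_fft, hop) @ mel_filterbank(n_mels, n_fft, sr).T
```

Libraries such as librosa or torchaudio provide production-grade versions of these transforms, including the Griffin-Lim inversion used at the end of the pipeline.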
The convolutional autoencoder uses 7 convolutions and 6 transposed convolutions.
The LSTM autoencoder uses 2 layers to encode and 2 layers to decode with the same size for every representation. This autoencoder also encodes the length of the audio signal in the intermediate representation such that it’s able to rebuild the right size for the audio signal at test time. To do so, it uses a linear layer on the intermediate representation to predict the length with the following dedicated loss: L2(pred_length, true_length).
Once we have a representation, we can put it in the decoder part of the LSTM autoencoder, and then put multiple representations in the decoder part of the convolutional autoencoder. We end up with a reconstructed Mel Spectrogram. We reconstruct the normal spectrogram with an inverse of the Mel scale and apply the Griffin-Lim transformation to get the waveform and produce a sound.
This algorithm can deteriorate the quality of the audio but the goal isn’t to make a perfect reproduction of the voice. In the worst situation, the algorithm compresses 441000 values (10 sec audio at 44.1kHz) into 32–1024 values, depending on the configuration.
The data feeding procedure is the same as for images. The algorithm constantly refreshes the list of audio signals to read based on the memory, which is constantly fed from videos.
3.3 — Associating representations
Now that we are able to build representations for audio signals and images, we can use the temporal information (when they occurred) to associate their representations.
Example: if one image occurred at t, and an audio signal occurred at t+2s, the two are close enough that their representations will be associated, and the model will train to predict this audio representation from that image representation.
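The pairing rule can be sketched as follows. The 3-second window is a hypothetical value chosen for illustration (the example above uses a 2-second gap).

```python
# Hypothetical pairing rule: each image is associated with the nearest audio
# clip in time, as long as the gap stays within a small window.
WINDOW = 3.0  # seconds (illustrative value)

def associate(image_times, audio_times, window=WINDOW):
    """Return (image_index, audio_index) pairs whose timestamps are close enough."""
    pairs = []
    for i, t_img in enumerate(image_times):
        gap, j = min((abs(t_aud - t_img), j) for j, t_aud in enumerate(audio_times))
        if gap <= window:
            pairs.append((i, j))
    return pairs
```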
The model used here is a 5-layer multi-layer perceptron (MLP). The input is the image representation of size 64 and the output is the audio representation of size 32.
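The shape of this network can be sketched as a NumPy forward pass. Only the input size (64) and output size (32) come from the text; the hidden-layer sizes are assumptions.

```python
import numpy as np

def init_mlp(sizes=(64, 128, 128, 128, 64, 32), seed=0):
    """He-initialized weights for a 5-layer MLP; hidden sizes are assumptions."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Map a batch of 64-d image embeddings to 32-d audio embeddings."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:   # ReLU on hidden layers, linear output
            x = np.maximum(x, 0.0)
    return x
```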
With all these models and processings, we are now able to process videos and repeat sounds based on images with a completely autonomous training procedure that doesn’t require standard supervised learning. The supervision for the association of images and audio signals comes from humans because the algorithm needs a sound, but it trains from scratch the same way animals could, by listening and observing.
4. Training procedure
The precise training procedure contains 7 main scripts: 3 scripts for scheduling and data management (preprocessing for memory, building embeddings, scheduling), and 4 scripts for models (audio x 2, image, association).
4.1 The scheduler
The scheduling of all algorithms goes as follows:
- The script that builds the memory starts processing videos,
- then the image and sound models train on this memory.
- The algorithm builds the embeddings for the association model,
- and finally the model that does the image-audio association trains with these embeddings.
Some conditions are variable and not listed; for example, models can be scheduled to start based on the accuracy of another model, or based on the ability to build one or two batches without re-using the same data twice (i.e. waiting for a large enough memory).
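A minimal sketch of this gating logic, assuming a batch-size-based readiness condition (the threshold and stage names are illustrative, not our exact scheduler):

```python
# Illustrative gating: a stage starts once its data source can fill
# at least one full batch without reusing samples.
BATCH_SIZE = 128

def ready_stages(n_images, n_sounds, n_embeddings, batch_size=BATCH_SIZE):
    stages = []
    if n_images >= batch_size:
        stages.append("train_image_model")
    if n_sounds >= batch_size:
        stages.append("train_audio_models")
    if n_embeddings >= batch_size:
        stages.append("train_association_model")
    return stages
```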
4.2 Data augmentations for the association
The algorithm that builds embeddings for the association reuses the image model and the audio autoencoder to build representations. In order to artificially amplify the amount of data it also provides representations for the augmented versions of both sounds and images.
4.3 Building meaningful image representations
In order to build meaningful image representations, the contrastive algorithm needs enough diverse data. To achieve this, we can show special accelerated videos to the algorithm representing multiple real world situations. Videos we tried come from a web series. These videos last 11 minutes in total, without audio, so they are only used by the image model. We don’t use ImageNet because we want a situation that wouldn’t differ too much from what a human could watch. This version is an extended version of our dataset (AELA-d0e).
Based on t-SNE representations of our data with models trained on ImageNet we think that using it would allow us to build much more accurate representations but we don’t think that it would be a fair reproduction of a real-life situation.
5.1 Model evaluations
5.1.1 SwAV evaluations
For information purposes, we provide the Top 1 accuracy of different models we trained and evaluated with SwAV. Unsurprisingly, SwAV requires a lot of resources to train properly, which does not match our training conditions. In the paper, ResNet50 manages to reach 75.3% top-1 accuracy after seeing 1e9 images with 64 GPUs. On ImageNet with ResNet34 and one GPU, we manage to reach 33% top-1 accuracy (2e7 images). With ResNet18 on our extended custom dataset (AELA-d0e) and with online learning, we reach 4% top-1 accuracy (2e6 images).
The accuracy is evaluated with a linear classifier trained on frozen features produced by the model. Our evaluation method is different and runs in harder conditions, in which the official SwAV model reaches 60% after 800 epochs. These conditions include fewer optimization steps (3000), a different optimizer, etc.
ResNet34 is probably better than ResNet50 on our GPU only because it allows us to use a bigger batch size during training. We ended up using ResNet18 because of the same hardware constraint with all other models.
Training conditions and hyperparameters are always different, we only provide the accuracy to get an idea of the model’s theoretical and practical abilities to understand images based on computational constraints, dataset constraints and training conditions.
On our custom dataset, we use a distance matrix to assess the ability of the algorithm to cluster the embeddings:
All the dark clusters (squares) indicate that the algorithm is able to produce proper representations of the objects during training AND also on data it never saw but which are similar (circles).
The specific pattern on “Al”, “Pha”, “Bet” is because the video does a loop going from Al to Pha to Bet and then it reverts to Pha and to Al.
Based on this distance matrix, we know a posteriori that our model built great representations. Our association algorithm will be able to learn on these representations during training without overfitting.
As a sanity check, we also verified, through top-1 accuracy and through the distance matrix, that ResNet18 represents images in low dimensions much better than raw pixel values do, even in our very special and simplified example, and despite the low accuracy our model got on ImageNet.
5.1.2 Audio autoencoder evaluations
To evaluate audio representations, we use a metric based on the sign function. For a predicted representation rp and a true representation rt:
sign_accuracy = mean(sign(rp) == sign(rt))
The idea of this metric is to be more interpretable by a human than the distance between raw values in representations: is an average distance of 0.1 between two representations good or bad? If the absolute values inside the representations are around 2, it’s good; if they are around 0.2, it’s bad. Normalizing with the maximum value could also remove the influence of smaller features.
The sign accuracy is a way to eliminate this problem and because we don’t use it as a loss function the algorithm can’t trick the metric by only getting the good sign without appropriately reducing the distance. It’s a practical way to evaluate a predicted embedding. It ranges from 0.5 (random model) to 1 (perfect model).
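Following the formula above, the metric itself is a one-liner in NumPy:

```python
import numpy as np

def sign_accuracy(pred, true):
    """Fraction of coordinates whose sign matches: ~0.5 random, 1.0 perfect."""
    return float(np.mean(np.sign(pred) == np.sign(true)))
```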
Reminder: The CNN for audio autoencoding trains to reproduce mel spectrograms from multiple representations, and then the LSTM trains to reproduce multiple representations from one representation. The sign accuracy evaluated for the LSTM is at 95% on our validation set of Common Voice. The CNN accuracy isn’t easily interpretable, the loss of the CNN is based on the ability of the model to reproduce the mel spectrogram, per “pixel”:
loss(pred, truth) = L2(log(pred+1), log(truth+1)).
- “log()” reduces the amplitude of large values.
- “+1” avoids infinite negative values in the log.
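A minimal NumPy version of this loss, reading the L2 term as a mean squared error over the mel-spectrogram “pixels”:

```python
import numpy as np

def spectrogram_loss(pred, truth):
    """Per-"pixel" L2 between log-compressed mel spectrograms."""
    return float(np.mean((np.log(pred + 1.0) - np.log(truth + 1.0)) ** 2))
```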
We tried using a perceptual loss (vgg16) but it was less efficient in our experiments.
On our custom dataset, we also use a distance matrix and provide the same analysis we did for images:
As the large dark squares in the distance matrix show, we can still find an accurate enough clustering in the training set and in the test set. However, the circles show that the ability to produce similar representations of the same pronounced words across the training set and the test set is worse than for images.
This is interesting, because it means that our audio representations aren’t able to always correctly represent the same underlying information pronounced in different conditions. However, it won’t really affect our model because the only use we have for the audio embeddings of the test set is to evaluate our model numerically.
If in the training set someone says “acorn” and in the test set this person says “acoooooooorrrrnnnn”, the two embeddings will be different but it won’t affect the model, it’ll only affect our metric of the quality of the model. The association model will just say “acorn” instead of the “accoooooorrn” we anticipated.
However, it does mean that the embeddings aren’t as accurate as we would want them to be. This is a problem when we process audio signals rather than words directly: we can’t have a perfect, uniform representation of the “same” information. But that isn’t exactly what we want in theory either, because how words are pronounced is also information.
5.1.3 Association model evaluations
The task of the association model is to predict a sound embedding, so since we are measuring a distance between embeddings, we also measure the accuracy with the sign function.
We have no external dataset to propose a reference on this metric. Because of the small size of the dataset (~300 images), the algorithm can easily overfit on the training set and reach 100% sign accuracy. This is a problem because the embeddings of images aren’t completely identical between the training set and the test set, so we regularize the algorithm mainly with data augmentations, dropout and weight decay in order to reduce overfitting.
For information purposes, one of our association models reaches 77% sign accuracy on the training set and 67% on the test set. The worst overfitting would give ~100% train and ~50% test; the best model would give 100%/100%; the worst underfitting, 50%/50%.
It’s important to remember that all our algorithms are constantly training on new and old data. When the association model is training, it’s also training on old embeddings. The reason is that we can’t permanently regenerate all embeddings at the same time for the whole dataset while simultaneously and continuously training the audio models and the image model. Representations these models produce need to be partially stabilized in order to efficiently train the association model and also in order to evaluate it properly. We do permanently regenerate older embeddings with updated models but it’s done progressively while generating new embeddings for new images and new sounds, and while training all models.
Also, as we saw on the distance matrix for audio embeddings, the evaluation couldn’t be perfect anyway because the representations of the sounds we expect the model to produce on test images don’t correspond to audio representations it saw while training.
We are able to produce the distance matrix between audio representations the model predicted:
Unsurprisingly, this result is similar to the result for images because if two image representations are close, it’s hard for the model to produce two very different audio representations.
Because representations for images were accurate between the training set and the test set, our model is able to produce audio representations on the test set which are similar to the corresponding audio representations on the training set.
We have a model that is effectively able to continuously learn from what it hears and reproduce the names of the objects it sees.
This experiment takes ~1 hour to run on two middle-end GPUs.
5.2.1 — 1h on our custom dataset
We manually added the “Out-of-distribution” remark when what the algorithm could be predicting on was out of the training distribution. Example: a white area without objects.
The training set and the test set are short in absolute terms, but they’re large enough for a proof of concept on our very specific task. Demonstrating the ability of our algorithm to learn more objects on a bigger dataset would be interesting. However, our next goal would mostly be to increase the difficulty of the task and give new abilities to the model.
We can hear that even with a very simple audio autoencoder, the model is able to reconstruct proper sounds after using the Griffin-Lim algorithm.
During this test, the algorithm made 17 predictions, 5 of which are considered out-of-distribution. The algorithm said “acorn” correctly 5 times. For the “swiss army knife” sequence, the algorithm said it correctly 3 times; once, at the beginning, the reconstructed audio embedding was probably not accurate enough and we couldn’t hear it well. The algorithm said “stainless steel bottle” correctly one time. The second time, the output was cropped, probably because the algorithm predicted the wrong length for the audio (one second instead of two). The third time was unintelligible; the hand was in front of the bottle at this moment, which could be considered out-of-distribution, but data augmentation could make the algorithm more robust in this case. The accuracy on the test set in these conditions could be evaluated at (5 acorn + 3 sak + 1.5 ssb)/12 = 79%.
We consider that this result is enough for a proof of concept on a small dataset. We just needed to be out of a case where the algorithm could decide by chance. These results were reproduced by multiple trainings and we already provided the careful analysis on all embeddings with distance matrices. We are also conscious that evaluating the precision and the recall could be more accurate, as well as automatically detecting when the input is out-of-distribution. Some future works are discussed in the next sections.
While training and predicting on small datasets could be thought to be easier, it required more hyperparameter tuning because our algorithms were initially configured on very large datasets (ImageNet and CommonVoice). Training on a small dataset is a constraint but it also allows us to quickly evaluate the model.
5.2.2 — Live video (not yet available)
Providing proof that the algorithm works on live recordings is one of our ambitions. However, it requires much more video editing work in order to process and show the complete video input with a synchronized output of the algorithm.
Improving the abilities of the algorithm is a higher priority for us for now.
Still, we did experiment with the algorithm on live recordings, and it worked. It uses the same methods as on offline videos; only the content source for the memory is different. However, the environment is much harder to control: if we make a sound which isn’t what the algorithm has to reproduce, the algorithm will train to reproduce that sound. Also, we can’t easily move the camera in a large environment for now. Training freely on a live recording of a real-life situation in a wide environment would probably require us to broadcast live images from a laptop or to use USB device export through the LAN.
This section is important to detail the constraints we had and others could face while building this kind of algorithm.
First, we are manipulating embeddings and representations.
While it isn’t hard to look at an image or listen to a sound, we can’t easily interpret a set of values that is supposed to represent a thought. For sounds at least, we use an autoencoder, so we can always go from a representation back to a signal we can hear; that isn’t true for images with our algorithm.
We extensively used t-SNE to see what clusters our image model was forming on images. Making DL algorithms more interpretable is an even bigger constraint in a global model where representations are the main component.
We also hypothesize that current algorithms don’t give us the ability to properly manipulate information through rightful constraints on these representations. For example, we can constrain the model with weight decay, and this will indirectly constrain the embeddings; but it is much harder to add a constraint directly on the embeddings themselves. We might want to associate a certain number of embeddings with a certain proportion of the embedding space. In a 2D space, for example, we could want 4 clusters in the top-left, top-right, bottom-right and bottom-left positions, or 16 clusters; in a 64D space, we could want 10,000 clusters. The algorithm would then progressively fill the space with representations of new objects or sounds without disrupting older representations too much.
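As an illustration of the kind of constraint we have in mind, here is a minimal numpy sketch; the `anchor_penalty` name and the anchor layout are ours, not part of our algorithm. Embeddings are penalized by their squared distance to the nearest pre-allocated anchor, a term a training loop could add to its loss:

```python
import numpy as np

def anchor_penalty(embeddings: np.ndarray, anchors: np.ndarray) -> float:
    """Mean squared distance from each embedding to its nearest anchor."""
    # Pairwise squared distances, shape (n_embeddings, n_anchors).
    d2 = ((embeddings[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).mean())

# 4 pre-allocated clusters at the corners of a 2D square, as in the
# example above; a 64D space could use thousands of anchors instead.
corners = np.array([[-1.0, 1.0], [1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
print(anchor_penalty(corners, corners))        # embeddings on the anchors: no penalty
print(anchor_penalty(corners + 0.1, corners))  # shifted embeddings: small penalty
```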
In order to easily manipulate embeddings, we need hyperparameters and methods to specify what we expect from them. In current models, these representations are very free: we almost only expect them to fit the task, no matter how. But deciding how they fit it can be important in order to manipulate them more easily for other tasks.
Continuously training multiple models is another big constraint of our algorithm. It deeply complicates keeping our models synchronized, since they are all training and changing simultaneously. This ends up causing a structural memory loss of learned behaviours: even if the association model previously learned to produce a good representation, the newer representations of the same data it receives from the image and audio models live in another, “updated” location/paradigm of their embedding space, so it can’t fully reuse what it learned in the past. We hypothesize that part of why we as humans need years to train, and why we can’t remember our lives as toddlers, is that we need to stabilize the thoughts exchanged between the modules of our brain.
Our global abstract model has a similar issue, and probably any set of interconnected models that continuously and independently learn on new data while exchanging embeddings would have it too.
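A simple way to quantify this drift, sketched below with random vectors standing in for real embeddings, is to compare the embeddings a model produces for the same inputs before and after a training update:

```python
import numpy as np

def cosine_drift(before: np.ndarray, after: np.ndarray) -> float:
    """Mean cosine similarity between the embeddings produced for the
    same inputs before and after an update: near 1.0 means the space is
    stable, lower means downstream models are partly obsolete."""
    num = (before * after).sum(axis=1)
    den = np.linalg.norm(before, axis=1) * np.linalg.norm(after, axis=1)
    return float((num / den).mean())

rng = np.random.default_rng(0)
old = rng.standard_normal((32, 64))              # embeddings before an update
new = old + 0.5 * rng.standard_normal((32, 64))  # same inputs, after an update
print(cosine_drift(old, old))  # ≈ 1.0: identical spaces
print(cosine_drift(old, new))  # < 1.0: the association model's inputs moved
```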
The hyperparameters we selected are also somewhat data-dependent. We modified them multiple times between our evaluations on ImageNet/CommonVoice, custom videos, and live recordings. This is an important difficulty, because we can’t prove that the algorithm does what we claim it can do if it only works on our custom dataset and not on a completely different one. This constraint is amplified by the online learning setting, in which our results could depend not only on the data but also on the order in which the data became available.
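A toy illustration of this order dependence, with a constant-learning-rate online estimate standing in for our continuously trained models:

```python
def online_mean(stream, lr=0.3):
    """Constant-learning-rate online estimate, standing in for a model
    that keeps adapting to each new sample as it arrives."""
    m = 0.0
    for x in stream:
        m += lr * (x - m)  # step toward the newest sample
    return m

data = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
print(online_mean(data))        # dominated by the late samples
print(online_mean(data[::-1]))  # same data, reversed order: a different result
```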
It’s possible that we could reduce some of these issues by letting the model train and stabilize for a long time, but it’s harder to evaluate and adapt our algorithms quickly if each experiment takes multiple days to complete, and rapid adaptation to new data is one of our constraints.
We also want to point out that many of our constraints aren’t DL constraints but purely algorithmic ones: manipulating videos, live recordings and audio, synchronizing every script, etc. DL is the core, but it isn’t the whole model.
The final difficulty lies mainly in the dataset and the preprocessing. Our dataset was engineered to be quite clean: the person saying words isn’t coughing, talking about other things, or laughing. The task is harder when any kind of noise is allowed. So while our audio-image dataset is more realistic than a pure text-image dataset, it doesn’t contain “any audio”: it uses highly curated audio. We can only suggest that our short, highly curated dataset would lead to the same trends as a large, non-curated one.
7. Future works
Based on these difficulties, we propose some solutions that we think could alleviate a couple of issues.
A first enhancement we find interesting would be to use an autoencoder for images. Autoencoders aren’t the state of the art for UL on images, but they would give us a much better understanding of what the algorithm is learning: we could generate images from sounds and directly see what the algorithm pictures when we talk about an object. It would also reduce some of the difficulties we had in handling the intermediate representations produced by SwAV. The way SwAV trains the model reduces the total batch size available because of the need for positive pairs, and it makes understanding how models learn harder because it requires some tricks to work properly: using multiple temperatures, keeping a memory bank of embeddings, freezing some parameters during training, etc. In our experiments, SwAV also required a high diversity of images within batches, which holds for ImageNet but not for every dataset.
Contrastive learning may not be the best method to use when one of the goals can be to quickly train on small datasets. We think that autoencoders may not have some of these issues.
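To make the contrast concrete, here is a minimal linear autoencoder trained with full-batch gradient descent on random data. It is only a sketch of the principle (real image autoencoders are convolutional and non-linear), but it shows the property we want: a decoder that maps any embedding back to input space, with no positive pairs or other contrastive tricks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 32))          # stand-in for image features
W_enc = 0.1 * rng.standard_normal((32, 8))  # 32-d input -> 8-d embedding
W_dec = 0.1 * rng.standard_normal((8, 32))  # 8-d embedding -> reconstruction

def reconstruction_error():
    return float(((X @ W_enc @ W_dec - X) ** 2).mean())

err_before = reconstruction_error()
for _ in range(500):
    H = X @ W_enc                   # embeddings
    R = H @ W_dec                   # reconstructions
    G = 2.0 * (R - X) / X.size      # dLoss/dR for the mean squared error
    grad_dec = H.T @ G              # dLoss/dW_dec
    grad_enc = X.T @ (G @ W_dec.T)  # dLoss/dW_enc
    W_dec -= 0.2 * grad_dec
    W_enc -= 0.2 * grad_enc
err_after = reconstruction_error()
print(err_before > err_after)  # training reduces the reconstruction error
```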
Processing bigger datasets is also an area for improvement. The main problem is either the model’s ability to process any kind of audio signal, or the ability to create a very large dataset with clean audio signals. Datasets for studying how toddlers learn exist for images: we can cite the Toybox dataset (publicly available), and we already cited work on the SAYCam dataset (not publicly available). However, we didn’t find a publicly available dataset that captures how children jointly learn words and images. Large video datasets dedicated to language acquisition exist, like the CHILDES corpus [wiki/dataset], but we didn’t find any egocentric videos in it. The Human Speechome Project has the same issue and isn’t publicly available. These datasets are probably more useful for studying language acquisition than for reproducing it. We can also mention the recent Ego4D dataset (not publicly available yet), which contains many first-person videos but isn’t targeted at solving early language acquisition.
A strategic enhancement would be the ability to process arbitrary sentences and autonomously segment and understand their words. A future project of ours is to improve the NLP abilities of our model. Our current model does not truly understand words within a verbal context, and it can’t understand sentences as well as models that work directly on written text. This enhancement would allow the model to grasp more abstract concepts, and we could then apply the usual SSL NLP algorithms. Our algorithm could be appropriate as a module of a bigger model reproducing the holophrastic stage of life, but not later stages.
Part of the work ahead is making the generation of audio signals more realistic. We as humans have our own voices, but the model we introduced can only repeat a voice it previously heard. Maybe this characteristic is purely physical and isn’t linked to any kind of intelligence, but we don’t know that for sure. We tried using a “Pink Trombone” (sound warning) algorithm to generate audio from an intermediate representation, but these algorithms are hard to manipulate programmatically. We didn’t find a clean interface that would allow us to programmatically do:
[embedding] => [parameters for Pink Trombone] => [Pink Trombone] => [waveform]
We think this could be required to achieve a lot of vocalizations toddlers do like “dadadadada”, “nanananana” or “oooOOOoooo”.
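A hypothetical sketch of that missing interface, where a toy sine oscillator and two made-up controls (pitch, amplitude) stand in for the real articulatory synthesizer; an actual Pink-Trombone-like synthesizer exposes many more parameters, and the embedding-to-parameter mapping would itself have to be learned:

```python
import numpy as np

def embedding_to_params(embedding: np.ndarray) -> dict:
    """Map an embedding to articulatory controls, squashed into safe
    ranges with a sigmoid. Both parameter names are illustrative."""
    pitch = 100.0 + 150.0 / (1.0 + np.exp(-embedding[0]))  # 100-250 Hz
    amplitude = 1.0 / (1.0 + np.exp(-embedding[1]))        # 0-1
    return {"pitch": float(pitch), "amplitude": float(amplitude)}

def synthesize(params: dict, duration: float = 0.2, sr: int = 16000) -> np.ndarray:
    """Toy oscillator standing in for the articulatory synthesizer."""
    t = np.arange(int(duration * sr)) / sr
    return params["amplitude"] * np.sin(2 * np.pi * params["pitch"] * t)

embedding = np.array([0.3, -0.1])
waveform = synthesize(embedding_to_params(embedding))
print(waveform.shape)  # (3200,)
```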
There are multiple reasons why AELA0 is an incomplete model: it can’t properly process arbitrary audio signals and noises, it can’t autonomously extract words from a sentence, it can’t make a coherent sentence based on time or on what it previously said, it has no need to ask questions, etc. In short, it misses many abilities we find in humans, and even in several modern chatbots.
The model also lacks a “will” for now: it doesn’t want anything, so it doesn’t need to express itself or communicate with us. Its only goal is to say what it is seeing. Even if we had the perfect model of what we do, we would still need to artificially induce goals in it; as humans, we have biological and psychological goals in our lives.
Some constraints on the task could also change. The current task allows the use of a perfect memory, but our human memory is probably closer to an imperfect memory of representations, like the one used by SwAV. We’re not sure this constraint matters, since our algorithm can’t use the perfect memory during inference.
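The two memory regimes can be sketched as follows; the queue size is illustrative:

```python
from collections import deque

# A perfect memory keeps every (image, audio) pair forever, while a
# SwAV-style imperfect memory is a bounded queue that evicts old entries.
perfect_memory = []
bounded_memory = deque(maxlen=4)  # illustrative size

for step in range(10):
    pair = (f"image_{step}", f"audio_{step}")
    perfect_memory.append(pair)
    bounded_memory.append(pair)  # silently drops the oldest entry once full

print(len(perfect_memory))      # 10: everything is retained
print(list(bounded_memory)[0])  # ('image_6', 'audio_6'): oldest surviving pair
```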
Finally, future works could focus on using the complete range of DL models’ abilities. We showed that our model can work solely with DL models, even under many learning constraints, though this part still requires improvements. It also suggests that we could theoretically train a model that ends up with excellent accuracy on ImageNet and an excellent ability to understand images and link them with sounds on its own.
While we have imposed on ourselves multiple constraints to be in conditions that we think are close to the learning conditions of humans, these constraints can still be relaxed depending on our goal.
If we wanted to, we could start from a ResNet18 pretrained on ImageNet and from autoencoders pretrained on CommonVoice. Better hardware would also allow us to process data faster and to process more of it.
Artificially reproducing human language acquisition is a great linguistic challenge. Solving it would help us build algorithms that can start to learn and interact with us the way humans do.
This project is an incremental step towards this goal. By combining complex learning requirements with a memory of images and sounds, we added new constraints to previous works in this field. We then introduced multiple independent neural modules that can learn directly from natural signals under these constraints.
We hope that this work can be one of the many steps we need to bridge the gap between cognitive sciences, linguistics and deep learning.
Author - “Elie” at firstname.lastname@example.org