
How to use the Transformer for Audio Classification

Deza Pasquale

With special consideration for the positional encoding, and an experiment on it.

The Transformer was introduced to address several issues in the NLP field, mainly in seq2seq tasks where RNNs become computationally inefficient as sequences get long [1].

The paper “Attention Is All You Need” [1] introduces a new architecture named the “Transformer”, which follows an encoder-decoder schema.

Figure: the Transformer architecture, from ‘Attention Is All You Need’ [1]

Before our input reaches the first encoder layer, each word is embedded and a positional encoding is added. Then:

  1. It flows into a Multi-Head Attention sublayer.
  2. A residual connection adds the input to the result of the Multi-Head Attention, and then layer normalization is applied. In other words, the output of this step is LayerNorm(x + Sublayer(x)), where Sublayer is the function implemented by the sublayer itself.
  3. The result passes through a position-wise fully connected feed-forward network.
  4. A residual connection and layer normalization are applied again (a code sketch of steps 1-4 follows this list).
  5. If there is more than one encoder layer, the output goes back to step 1 for the next encoder layer.
  6. The output of the N-th (last) encoder layer is the one that goes to each layer of the decoder.
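
To make steps 1-4 concrete, here is a minimal sketch of a single encoder layer in TensorFlow/Keras. Note that it uses the built-in MultiHeadAttention layer (TensorFlow 2.4+) rather than the custom attention code from the tutorial used later; the function and argument names are illustrative only.

import tensorflow as tf

def encoder_layer(d_model, num_heads, units, dropout):
    # One Transformer encoder layer, following steps 1-4 above.
    inputs = tf.keras.Input(shape=(None, d_model))

    # Step 1: multi-head self-attention (query = key = value = inputs)
    attention = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=d_model // num_heads)(inputs, inputs)
    attention = tf.keras.layers.Dropout(dropout)(attention)

    # Step 2: residual connection + layer normalization, i.e. LayerNorm(x + Sublayer(x))
    attention = tf.keras.layers.LayerNormalization(epsilon=1e-6)(inputs + attention)

    # Step 3: position-wise fully connected feed-forward network
    outputs = tf.keras.layers.Dense(units, activation="relu")(attention)
    outputs = tf.keras.layers.Dense(d_model)(outputs)
    outputs = tf.keras.layers.Dropout(dropout)(outputs)

    # Step 4: residual connection + layer normalization again
    outputs = tf.keras.layers.LayerNormalization(epsilon=1e-6)(attention + outputs)

    return tf.keras.Model(inputs=inputs, outputs=outputs)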

The decoder is similar, but it has two differences:

  1. The input of the decoder is masked, which prevents the decoder from seeing the “future”, and
  2. It has two Multi-Head Attention sublayers in a row before the position-wise fully connected feed-forward network: one performs (masked) self-attention over the decoder input and the other attends over the encoder output.

Finally, after the last residual connection and layer normalization, the output of the decoder goes through a linear projection and then a softmax, which produces the probabilities for each word.

This is a short explanation of how the Transformer works, but I highly recommend taking a look at this tutorial for a deeper explanation and understanding.

Now let's find out how to implement this with audio data.

Dataset

We will use the Freesound Audio Tagging dataset from Kaggle [5], which provides two training subsets: a curated one and a noisy one. We will use the curated subset: 4970 audio clips with a total duration of 10.5 hours, whose individual durations range from 0.3 s to 30 s. The full description is here.

Pre-processing

Before feeding our algorithm, we need to process our data and compute the mel-spectrogram for each audio clip. The mel-spectrogram is the spectrogram on the mel scale, which reflects how people hear sounds. For example, we notice a big change when a frequency of 100 Hz rises to 150 Hz, but not when a frequency of 10000 Hz rises to 10050 Hz; accordingly, the mel-spectrogram shows a bigger change in the former case than in the latter, even though the absolute value of the change is the same.
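
For reference, one common formula for converting a frequency in Hz to mels (the HTK-style variant, one of the options librosa supports) makes the example above concrete:

import numpy as np

def hz_to_mel(f_hz):
    # A standard (HTK-style) conversion from Hertz to the mel scale
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# The same 50 Hz step is perceived very differently at low and high frequencies:
print(hz_to_mel(150) - hz_to_mel(100))      # ~68 mels
print(hz_to_mel(10050) - hz_to_mel(10000))  # ~5 mels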

We need to define some parameters that control how we process our audio clips. The n_fft parameter sets how many bins we have for allocating the different frequencies; more bins mean we can distinguish more frequencies and therefore get better resolution along this vertical (frequency) axis.

The hop length follows a similar idea but for the horizontal (time) axis, so a smaller hop length gives us more time steps. For example, with a duration of 7 seconds, a sampling rate of 44100 and a hop length of 128, we get around 2411 time steps.

number_time_steps = (duration * sampling_rate) / hop_length
number_time_steps = samples / hop_length
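
As a quick sanity check of this formula with the settings we will define below (4-second clips at 44100 Hz and a hop length of 300):

sampling_rate = 44100
duration = 4                                 # seconds
hop_length = 300
samples = sampling_rate * duration           # 176400 samples
number_time_steps = samples / hop_length     # 588.0 time steps
# Note: librosa may return one extra frame (589) depending on how it pads/centers the signal.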

We will also trim the audio clips if they are longer than ‘samples’ and pad them if they are shorter. As described in the previous section, their durations range between 0.3 s and 30 s, and it is easier to work with audio clips that all have the same duration.

Another preprocessing step we can apply is to reshape our data [2], reducing the number of time steps we have.
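
As a rough sketch of the idea (my own frame-stacking variant, similar in spirit to the downsampling discussed in [2], not necessarily the exact formula), k consecutive frames are stacked along the feature axis, so a spectrogram of shape (time_steps, n_mels) becomes (time_steps / k, k * n_mels):

import numpy as np

def stack_frames(mel_spec, k=4):
    # mel_spec: array of shape (time_steps, n_mels)
    # Returns shape (time_steps // k, k * n_mels), i.e. k times fewer time steps.
    time_steps, n_mels = mel_spec.shape
    usable = (time_steps // k) * k           # drop leftover frames that do not fill a group
    return mel_spec[:usable].reshape(-1, k * n_mels)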

In the next block of code, we set our pre-processing settings:

class conf:
    sampling_rate = 44100
    duration = 4  # in seconds
    hop_length = 300
    fmin = 20
    fmax = sampling_rate // 2
    n_mels = 128
    n_fft = n_mels * 20
    samples = sampling_rate * duration
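
With these settings, the mel-spectrogram for one clip can be computed with librosa roughly as follows (a simplified sketch of the loading and trim/pad logic, not the exact code from the full repository):

import librosa
import numpy as np

def audio_to_melspectrogram(path, conf=conf):
    # Load the clip and trim/pad it to exactly conf.samples samples
    y, _ = librosa.load(path, sr=conf.sampling_rate)
    if len(y) > conf.samples:
        y = y[:conf.samples]
    else:
        y = np.pad(y, (0, conf.samples - len(y)), mode="constant")

    # Compute the mel-spectrogram and convert power to decibels
    mel = librosa.feature.melspectrogram(
        y=y,
        sr=conf.sampling_rate,
        n_fft=conf.n_fft,
        hop_length=conf.hop_length,
        n_mels=conf.n_mels,
        fmin=conf.fmin,
        fmax=conf.fmax,
    )
    return librosa.power_to_db(mel).astype(np.float32)  # shape: (n_mels, time_steps)

Note that the result has shape (n_mels, time_steps); for the Transformer input we transpose it so that the time steps form the sequence axis and the mel bands form the feature axis.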

If you are eager to learn more, you can check this course on Coursera and/or this kernel on Kaggle.

Experiment

For our experiment, we want to compare the performance of applying a linear projection before adding the positional encoding [3] against the scenario where we simply add the positional encoding to the mel-spectrogram. The latter might be harmful to the learning process [2], because the mel-spectrogram, unlike a word embedding, is not learnable; that is the reason for trying the linear projection.
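
In code, the two scenarios differ only in one step, roughly like this (a sketch: positional_encoding stands for a precomputed sinusoidal encoding tensor of shape (1, max_len, d_model), as in the tutorial code, and the function name is illustrative):

import tensorflow as tf

def prepare_encoder_input(mel_inputs, d_model, positional_encoding, use_linear_projection):
    # mel_inputs: tensor of shape (batch, TIME_STEPS, n_mels)
    if use_linear_projection:
        # Scenario A: learnable linear projection of the mel bands first, as in [3]
        x = tf.keras.layers.Dense(d_model)(mel_inputs)
    else:
        # Scenario B: feed the mel-spectrogram directly (requires n_mels == d_model)
        x = mel_inputs
    # Both scenarios then add the fixed (non-learnable) positional encoding;
    # clips are trimmed/padded to the same length, so the shapes match.
    return x + positional_encoding[:, :x.shape[1], :]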

In this experiment, we will fit 20 models of 10 epochs each with an identical configuration, except that 10 models will be trained with the linear projection and the other 10 without it. After that, we will select the highest val_accuracy for each model across all its epochs. Finally, we will average these values for each scenario and compare them.
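
The selection and averaging can be done directly from the Keras training histories, along these lines (a sketch: build_model, X_train, y_train, X_val and y_val are assumed to exist and are not part of the original code):

import numpy as np

def run_scenario(use_linear_projection, runs=10, epochs=10):
    best_scores = []
    for _ in range(runs):
        model = build_model(use_linear_projection)  # hypothetical builder for either scenario
        history = model.fit(X_train, y_train,
                            validation_data=(X_val, y_val),
                            epochs=epochs, verbose=0)
        # Keep the best epoch of this run
        best_scores.append(max(history.history["val_accuracy"]))
    return np.mean(best_scores), np.std(best_scores)

# mean_proj, std_proj = run_scenario(use_linear_projection=True)
# mean_raw, std_raw = run_scenario(use_linear_projection=False)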

Model architecture

For our purpose, we will use only the encoder structure, and its output will feed a fully connected layer with a softmax activation function to classify the different sounds.

We will have:

NUM_LAYERS = 2            # number of encoder layers
D_MODEL = X.shape[2]      # model dimension = number of mel bands (128)
NUM_HEADS = 4             # number of attention heads
UNITS = 1024              # hidden units in the feed-forward network
DROPOUT = 0.1             # dropout rate
TIME_STEPS = X.shape[1]   # time steps per mel-spectrogram
OUTPUT_SIZE = 80          # number of sound classes

To implement this, we will use most of the code from the TensorFlow tutorial [4], adapting it to our task. The full code is here.
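
Putting it together, the classification model looks roughly like this (a sketch: `encoder` stands for the encoder stack adapted from the tutorial code, and collapsing the encoder output with Flatten before the softmax layer is an assumption of mine, not necessarily what the full code does):

import tensorflow as tf

def build_classifier(encoder):
    # One mel-spectrogram per clip: (TIME_STEPS, D_MODEL)
    inputs = tf.keras.Input(shape=(TIME_STEPS, D_MODEL))

    # Encoder-only Transformer; output keeps the shape (TIME_STEPS, D_MODEL)
    x = encoder(inputs)

    # Collapse the sequence and classify into the 80 sound categories
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(OUTPUT_SIZE, activation="softmax")(x)

    return tf.keras.Model(inputs=inputs, outputs=outputs)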

Results

After running the experiment, the mean val_accuracy for each scenario is as follows:

And the standard deviation is:

Conclusions

First, the accuracy is very low for both scenarios. We can attribute this to the fact that [3] suggests 48 layers, 2048 feed-forward units, 8 heads and a model dimension of 512, while we trained with only 2 layers, 1024 units, 4 heads and a model dimension of 128.

Second, using the linear projection described in [3] yields, on average, an improvement of 17% compared to just adding the positional encoding.

Finally, there is plenty of room for more experimentation, such as adding pooling layers, implementing another kind of positional encoding, implementing the learning rate schedule explained in [1], modifying the Transformer settings (more layers [3], number of heads, etc.) and applying other pre-processing or feature engineering to the audio clips.

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017, December 6). Attention Is All You Need. Retrieved from https://arxiv.org/abs/1706.03762

[2] Sperber, M., Niehues, J., Neubig, G., Stüker, S., & Waibel, A. (2018, June 18). Self-Attentional Acoustic Models. Retrieved from https://arxiv.org/abs/1803.09519

[3] Pham, N.-Q., Nguyen, T.-S., Niehues, J., Muller, M., Stuker, S., & Waibel, A. (2019, May 3). Very Deep Self-Attention Networks for End-to-End Speech Recognition. Retrieved from https://arxiv.org/abs/1904.13377

[4] A transformer chatbot tutorial with Tensorflow 2.0. Retrieved from https://medium.com/tensorflow/a-transformer-chatbot-tutorial-with-tensorflow-2-0-88bf59e66fe2

[5] Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. Audio tagging with noisy labels and minimal supervision. In Proceedings of DCASE2019 Workshop, NYC, US (2019). Retrieved from https://arxiv.org/abs/1906.02975

[6] Beginner's Guide to Audio Data 2 (Kaggle kernel). Retrieved from https://www.kaggle.com/maxwell110/beginner-s-guide-to-audio-data-2

[7] CNN 2D Basic Solution Powered by fast.ai (Kaggle kernel). Retrieved from https://www.kaggle.com/daisukelab/cnn-2d-basic-solution-powered-by-fast-ai

Did you enjoy this post?
Recommend it by sharing it on your social networks

Do you want to read and study data science with me?
Follow me on Medium
