The dominant neural network models for tasks like language translation and text summarization have long relied on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). These models typically include an encoder that reads the input sequence and a decoder that generates the output. The best-performing models also connect the encoder and decoder with an attention mechanism, which helps the model focus on relevant parts of the input when generating each part of the output. However, recurrent models have a fundamental limitation: they process a sequence one position at a time, so the computation within a training example cannot be parallelized, which makes them slow to train. This sequential nature becomes especially problematic with longer sequences.
In a groundbreaking paper titled "Attention Is All You Need," a team of researchers from Google and the University of Toronto proposed a new architecture called the Transformer. As the title suggests, the Transformer is based solely on attention mechanisms, completely doing away with recurrence and convolution. This design allows for significantly more parallelization during training and achieved state-of-the-art results on machine translation tasks in a fraction of the time previously required.
The Transformer follows the classic encoder-decoder structure but with a radical simplification. The encoder is a stack of six identical layers. Each layer contains two sub-layers: a multi-head self-attention mechanism and a simple position-wise feed-forward network. The decoder is also a stack of six identical layers, but each has three sub-layers: a self-attention layer (masked so that a position cannot attend to later positions), a layer that performs attention over the output of the encoder, and a feed-forward network. Each sub-layer has a residual connection around it, followed by layer normalization. Because the layers contain no recurrence, the elements of a sequence are not processed step by step; within each self-attention sub-layer, every element can interact directly with every other element in a single operation.
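The residual-plus-normalization wrapping around each sub-layer can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the toy sizes, and the random weights are all assumptions made for the example, and the learned gain and bias of layer normalization are left out. The feed-forward sub-layer is written as two linear maps with a ReLU between them, which is how the paper defines it.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    # The learned gain and bias are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise feed-forward network: linear map, ReLU, linear map.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def residual_sublayer(x, sublayer):
    # Residual connection around the sub-layer, then layer normalization:
    # LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
x = rng.normal(size=(seq_len, d_model))
w1 = rng.normal(size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model))
b2 = np.zeros(d_model)

out = residual_sublayer(x, lambda h: feed_forward(h, w1, b1, w2, b2))
print(out.shape)
```

The same `residual_sublayer` wrapper would apply to an attention sub-layer; only the function passed in changes.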
The key innovation is the attention function itself. The researchers use what they call "Scaled Dot-Product Attention." In essence, the model computes attention by taking a query and a set of key-value pairs. The output is a weighted sum of the values, where the weight for each value is determined by how compatible the query is with the corresponding key. The compatibility is measured by taking the dot product of the query and key, dividing it by the square root of the key dimension, and then applying a softmax function. The scaling matters because large dot products push the softmax into regions where its gradients are extremely small. All of this is computed in parallel for every position in a sequence using matrix multiplications, making it very efficient.
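As a rough sketch, the computation described above is softmax(Q Kᵀ / sqrt(d_k)) V, which in NumPy might look like the following. The function names, shapes, and random inputs are illustrative assumptions; batching and masking are left out.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q: (n_queries, d_k), k: (n_keys, d_k), v: (n_keys, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # query-key compatibility, scaled
    weights = softmax(scores)         # each row sums to 1
    return weights @ v, weights       # weighted sum of the values

# Random toy inputs, chosen only for illustration.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 6))

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.shape)
```

Each row of `weights` says how much one query draws from each value, and all queries are handled by the same pair of matrix multiplications.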
But the Transformer doesn't just use one attention function; it uses several. This is called Multi-Head Attention. Instead of performing a single attention function, the model linearly projects the queries, keys, and values multiple times into different subspaces, then performs the attention function in parallel on each of these projected versions. The outputs are concatenated and projected again to produce the final values. This allows the model to jointly attend to information from different representation subspaces at different positions. For example, one head might focus on syntactic relationships while another focuses on semantic meaning. The researchers used eight parallel attention heads, and reported that both a single head and too many heads reduced translation quality.
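The project, attend, concatenate, and project-again steps can be sketched as follows. This is a minimal self-attention version under stated assumptions: the weight names (`w_q`, `w_k`, `w_v`, `w_o`), the toy model width of 16, and the random weights are hypothetical, and each head is computed in a simple Python loop rather than in one batched operation.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for a single head.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    # w_q, w_k, w_v: (num_heads, d_model, d_head)
    # w_o: (num_heads * d_head, d_model)
    heads = [
        attention(x @ wq, x @ wk, x @ wv)   # project, then attend, per head
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    # Concatenate the head outputs and apply the final projection.
    return np.concatenate(heads, axis=-1) @ w_o

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 8
d_head = d_model // num_heads
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(num_heads, d_model, d_head))
w_k = rng.normal(size=(num_heads, d_model, d_head))
w_v = rng.normal(size=(num_heads, d_model, d_head))
w_o = rng.normal(size=(num_heads * d_head, d_model))

out = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)
```

Because each head works in a subspace of size `d_model // num_heads`, the total cost stays close to that of one full-width attention function.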
Since the Transformer has no recurrence and no convolution, it has no inherent sense of the order of the sequence. To give the model information about the position of each token in the sequence, the researchers add "positional encodings" to the input embeddings. These encodings are not learned but are fixed using sine and cosine functions of different frequencies. The authors chose this form because they hypothesized it would let the model easily learn to attend by relative positions, and because it may allow the model to extrapolate to sequence lengths longer than those seen during training.
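A small sketch of these fixed encodings, following the paper's definition: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function name and the toy sizes are assumptions for the example, and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Assumes d_model is even.
    positions = np.arange(max_len)[:, None]         # shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Toy sizes, chosen only for illustration.
pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)
```

The resulting matrix is added element-wise to the token embeddings, so row `pos` marks every token at that position with the same pattern.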
One of the major advantages of the Transformer over recurrent models is its ability to handle long-range dependencies. In recurrent networks, signals must travel sequentially across many steps, making it hard to learn relationships between distant positions. In the Transformer, every position is directly connected to every other position in a constant number of operations, drastically shortening the path for forward and backward signals. This is measured by the maximum path length: for self-attention it is just one operation, while for recurrent networks it grows linearly with the sequence length. Additionally, the Transformer can be parallelized much more effectively because it does not require sequential state updates.
The researchers trained the Transformer on two standard machine translation benchmarks: English-to-German and English-to-French. On English-to-German, the big Transformer model achieved a new state-of-the-art BLEU score of 28.4, outperforming all previously reported models including ensembles. On English-to-French, it achieved a score of 41.0, surpassing all single models while using less than a quarter of the training cost of the previous best. The training took only 3.5 days on eight GPUs for the big model, and the base model could be trained in just 12 hours. The model used the Adam optimizer with a custom learning rate schedule that first increases linearly then decreases with the inverse square root of the step number. Regularization techniques like dropout and label smoothing were employed to prevent overfitting.
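The learning rate schedule can be written as a one-line function. The sketch below follows the formula given in the paper, with its base-model width of 512 and 4,000 warmup steps as defaults; neither value appears in this summary, and the function name is an assumption for the example.

```python
def transformer_learning_rate(step, d_model=512, warmup_steps=4000):
    # step counts from 1. The rate rises linearly for warmup_steps steps,
    # then decays with the inverse square root of the step number.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks at the end of warmup and falls off afterwards.
for step in (1, 2000, 4000, 16000, 100000):
    print(step, transformer_learning_rate(step))
```

The two arguments to `min` are equal exactly at `step == warmup_steps`, which is where the schedule switches from linear growth to decay.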
The paper concludes by presenting the Transformer as the first sequence transduction model based entirely on attention, replacing the recurrent layers commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, it can be trained significantly faster than architectures based on recurrent or convolutional layers while achieving superior results. The authors express excitement about the future of attention-based models and plan to extend the Transformer to other tasks beyond text, such as images, audio, and video, and to investigate restricted attention mechanisms for handling large inputs and outputs efficiently. The Transformer has since become the foundation for many subsequent breakthroughs in natural language processing, proving that sometimes, attention really is all you need.