      [Repost | Translation] Visualizing A Neural Machine Translation Model (a visual explanation of the neural machine translation model)

      A repost and translation of a blog post by Jay Alammar: Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

      Original post: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/


       

      Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)


       

      Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014; Cho et al., 2014).


       

      I found, however, that understanding the model well enough to implement it requires unraveling a series of concepts that build on top of each other. I thought that a bunch of these ideas would be more accessible if expressed visually. That’s what I aim to do in this post. You’ll need some previous understanding of deep learning to get through this post. I hope it can be a useful companion to reading the papers mentioned above (and the attention papers linked later in the post).


       

      A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items. A trained model would work like this:

      [Translator's note: the input items should presumably be homogeneous, e.g. all words or all images rather than a mixture, and likewise for the output items.]

       

      In neural machine translation, a sequence is a series of words, processed one after another. The output is, likewise, a series of words:


       

      Looking under the hood

      Under the hood, the model is composed of an encoder and a decoder.

      The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.


       

      The same applies in the case of machine translation.


       

      The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks (Be sure to check out Luis Serrano’s A friendly introduction to Recurrent Neural Networks for an intro to RNNs).


       

      The context is a vector of floats. Later in this post we will visualize vectors in color by assigning brighter colors to the cells with higher values.


       

      You can set the size of the context vector when you set up your model. It is basically the number of hidden units in the encoder RNN. These visualizations show a vector of size 4, but in real world applications the context vector would be of a size like 256, 512, or 1024. 

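      To make the shapes concrete, here is a minimal NumPy sketch of a context vector (not from the original post): the size-4 vector mirrors the figures, and `hidden_size` is the setting you would raise to 256, 512, or 1024 in a real system. The values are made up purely for illustration.

```python
import numpy as np

# The context is just a vector of floats whose length equals the number of
# hidden units in the encoder RNN. Size 4 matches the figures in this post;
# real systems typically use 256, 512, or 1024.
hidden_size = 4
context = np.array([0.2, -1.3, 0.8, 1.1], dtype=np.float32)  # illustrative values

assert context.shape == (hidden_size,)
print(context)
```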

       

      By design, an RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence), and a hidden state. The word, however, needs to be represented by a vector. To transform a word into a vector, we turn to the class of methods called “word embedding” algorithms. These turn words into vector spaces that capture a lot of the meaning/semantic information of the words (e.g. king - man + woman = queen).


       

       We need to turn the input words into vectors before processing them.

      That transformation is done using a word embedding algorithm.

      We can use pre-trained embeddings or train our own embedding on our dataset.

      Embedding vectors of size 200 or 300 are typical; we're showing a vector of size four for simplicity.

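      As a rough illustration of the lookup described above (a sketch, not the original post's code): the toy vocabulary, the random embedding matrix, and the size-4 vectors are all made-up stand-ins. In practice the matrix would either be loaded from pre-trained embeddings (typically 200- or 300-dimensional) or trained together with the rest of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary for the running example "je suis étudiant".
vocab = {"<END>": 0, "je": 1, "suis": 2, "étudiant": 3}

embedding_size = 4                                    # 200-300 is more typical
embedding_matrix = rng.normal(size=(len(vocab), embedding_size))

def embed(word):
    """Turn a word into its embedding vector by a simple row lookup."""
    return embedding_matrix[vocab[word]]

embedded_inputs = [embed(w) for w in ["je", "suis", "étudiant"]]  # three size-4 vectors
print(embedded_inputs[0])
```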

       

      Now that we’ve introduced our main vectors/tensors, let’s recap the mechanics of an RNN and establish a visual language to describe these models:


       

      [Translator's addition: the animation above shows time step 1, in which the RNN takes the first input vector (input vector #1) and the initial hidden state (hidden state #0), and after processing them produces an output vector (output vector #1) and a new hidden state (hidden state #1).]

      The next RNN step takes the second input vector and hidden state #1 to create the output of that time step. Later in the post, we’ll use an animation like this to describe the vectors inside a neural machine translation model.

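      As a concrete companion to the two animations, here is a minimal vanilla (tanh) RNN step in NumPy. This is a sketch under simplifying assumptions: all weights are random stand-ins, and real NMT systems usually use LSTM or GRU cells rather than a plain tanh cell.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size = hidden_size = 4

# Random stand-in parameters; a trained model would have learned these.
W_x = rng.normal(size=(hidden_size, input_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(x, h_prev):
    """One time step: combine an input vector with the previous hidden state,
    returning (output vector, new hidden state). In a vanilla RNN the output
    is simply the new hidden state."""
    h_new = np.tanh(W_x @ x + W_h @ h_prev + b)
    return h_new, h_new

x1, x2 = rng.normal(size=input_size), rng.normal(size=input_size)  # input vectors #1 and #2
h0 = np.zeros(hidden_size)                                         # hidden state #0

output1, h1 = rnn_step(x1, h0)   # time step 1: uses input vector #1 and hidden state #0
output2, h2 = rnn_step(x2, h1)   # time step 2: uses input vector #2 and hidden state #1
```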

       

      In the following visualization, each pulse for the encoder or decoder is that RNN processing its inputs and generating an output for that time step. Since the encoder and decoder are both RNNs, at each time step one of the RNNs does some processing: it updates its hidden state based on its current input and the previous inputs it has seen.


       

      Let’s look at the hidden states for the encoder. Notice how the last hidden state is actually the context we pass along to the decoder. 


      The decoder also maintains a hidden state that it passes from one time step to the next. We just didn’t visualize it in this graphic because we’re concerned with the major parts of the model for now.

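      The encoder pass can be sketched as the loop below (again with random stand-in weights and embeddings): the same RNN step is applied to each embedded input word, every hidden state is kept because the attention mechanism will need them later, and the last one is the context handed to the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_size = hidden_size = 4

# Random stand-ins for the embedded input words ("je", "suis", "étudiant")
# and for the encoder parameters.
embedded_inputs = [rng.normal(size=embedding_size) for _ in range(3)]
W_x = rng.normal(size=(hidden_size, embedding_size))
W_h = rng.normal(size=(hidden_size, hidden_size))

def rnn_step(x, h_prev):
    return np.tanh(W_x @ x + W_h @ h_prev)

h = np.zeros(hidden_size)                 # hidden state #0
encoder_hidden_states = []
for x in embedded_inputs:                 # one time step per input word
    h = rnn_step(x, h)
    encoder_hidden_states.append(h)

context = encoder_hidden_states[-1]       # the last hidden state is the context
```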

       

      Let’s now look at another way to visualize a sequence-to-sequence model. This animation will make it easier to understand the static graphics that describe these models. This is called an “unrolled” view where instead of showing the one decoder, we show a copy of it for each time step. This way we can look at the inputs and outputs of each time step.

      [Translator's note: in other words, the RNN at every time step, together with what it processes, is laid out in a single picture.]

        

      Let’s Pay Attention Now 

      The context vector turned out to be a bottleneck for these types of models. It made it challenging for the models to deal with long sentences. A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called “Attention”, which highly improved the quality of machine translation systems. Attention allows the model to focus on the relevant parts of the input sequence as needed.


       

      At time step 7, the attention mechanism enables the decoder to focus on the word "étudiant" ("student" in French) before it generates the English translation.

      This ability to amplify the signal from the relevant part of the input sequence makes attention models produce better results than models without attention.


       

      Let’s continue looking at attention models at this high level of abstraction. An attention model differs from a classic sequence-to-sequence model in two main ways:

      First, the encoder passes a lot more data to the decoder. Instead of passing the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder:

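      In code the difference is small but important. Continuing the toy example (the hidden states below are random placeholders), a classic seq2seq encoder forwards only its final hidden state, while an attention encoder forwards all of them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in encoder hidden states, one per input word ("je", "suis", "étudiant").
encoder_hidden_states = [rng.normal(size=4) for _ in range(3)]

classic_context = encoder_hidden_states[-1]           # classic seq2seq: the last hidden state only
passed_to_attention_decoder = encoder_hidden_states   # attention model: every hidden state
```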

       

       

      Second, an attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder does the following:

      1. Look at the set of encoder hidden states it received – each encoder hidden state is most associated with a certain word in the input sentence

      2. Give each hidden state a score (let’s ignore how the scoring is done for now)

      3. Multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores (a minimal sketch of this scoring appears below)


       

      This scoring exercise is done at each time step on the decoder side.

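      Here is a minimal sketch of that scoring exercise for a single decoder time step. It uses plain dot-product scoring, which is one common choice in the spirit of Luong et al., 2015; the papers also describe other scoring functions, and all vectors below are random placeholders rather than trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 4

encoder_hidden_states = [rng.normal(size=hidden_size) for _ in range(3)]  # one per input word
decoder_hidden_state = rng.normal(size=hidden_size)                       # current decoder state

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# 1.-2. Score each encoder hidden state against the current decoder hidden state.
scores = np.array([decoder_hidden_state @ h for h in encoder_hidden_states])

# 3. Softmax the scores and take the weighted sum: high-scoring hidden states
#    are amplified, low-scoring ones are drowned out.
attention_weights = softmax(scores)
context_vector = sum(w * h for w, h in zip(attention_weights, encoder_hidden_states))
```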

       

      Let us now bring the whole thing together in the following visualization and look at how the attention process works:

      1. The attention decoder RNN takes in the embedding of the <END> token, and an initial decoder hidden state.
      2. The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
      3. Attention Step: We use the encoder hidden states and the h4 vector to calculate a context vector (C4) for this time step.
      4. We concatenate h4 and C4 into one vector.
      5. We pass this vector through a feedforward neural network (one trained jointly with the model).
      6. The output of the feedforward neural network indicates the output word of this time step.
      7. Repeat for the next time steps (a minimal code sketch of this loop appears after the list)

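      Putting the seven steps together, here is a minimal end-to-end sketch of the attention decoder loop. It is only a shape-level illustration: every weight, embedding, and vocabulary entry is a random or made-up stand-in, the RNN cell is a plain tanh cell, and the scoring function is a dot product; a real system learns all of these jointly and typically uses LSTM or GRU cells.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = emb_size = 4
vocab = ["<END>", "i", "am", "a", "student"]            # toy output vocabulary

# Random stand-in parameters (learned jointly in a real model).
E = rng.normal(size=(len(vocab), emb_size))             # output-word embeddings
W_x = rng.normal(size=(hidden_size, emb_size))          # decoder RNN input weights
W_h = rng.normal(size=(hidden_size, hidden_size))       # decoder RNN recurrent weights
W_out = rng.normal(size=(len(vocab), 2 * hidden_size))  # feedforward layer applied to [h; C]
encoder_hidden_states = [rng.normal(size=hidden_size) for _ in range(3)]

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def rnn_step(x, h_prev):
    return np.tanh(W_x @ x + W_h @ h_prev)

h = np.zeros(hidden_size)                     # 1. initial decoder hidden state
word = "<END>"                                # 1. start from the <END> token
output_words = []
for _ in range(10):                           # 7. repeat for the next time steps
    h = rnn_step(E[vocab.index(word)], h)     # 2. RNN step -> new hidden state (e.g. h4)
    scores = np.array([h @ s for s in encoder_hidden_states])
    weights = softmax(scores)                 # 3. attention step: score the encoder states ...
    C = sum(w * s for w, s in zip(weights, encoder_hidden_states))  # ... and build C4
    logits = W_out @ np.concatenate([h, C])   # 4.-5. concatenate [h; C], run the feedforward layer
    word = vocab[int(np.argmax(logits))]      # 6. the highest-scoring entry is this step's word
    if word == "<END>":
        break
    output_words.append(word)

print(output_words)   # gibberish here, since nothing has been trained
```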

       

       

      This is another way to look at which part of the input sentence we’re paying attention to at each decoding step:

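      One rough way to produce such a view yourself is to collect the softmaxed attention weights from every decoding step into a matrix (one row per output word, one column per input word) and print or plot it. The numbers below are hypothetical values chosen to mimic the figure, not the output of a trained model.

```python
import numpy as np

input_words = ["je", "suis", "étudiant"]
output_words = ["I", "am", "a", "student"]

# Hypothetical attention weights: each row sums to 1 and says how much the
# decoder attended to each input word while producing that output word.
attention = np.array([
    [0.90, 0.07, 0.03],   # "I"       mostly attends to "je"
    [0.10, 0.80, 0.10],   # "am"      mostly attends to "suis"
    [0.05, 0.20, 0.75],   # "a"       leans on "étudiant"
    [0.02, 0.08, 0.90],   # "student" mostly attends to "étudiant"
])

print("".join(f"{w:>12}" for w in [""] + input_words))
for out_word, row in zip(output_words, attention):
    print(f"{out_word:>12}" + "".join(f"{w:>12.2f}" for w in row))
```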

       

      Note that the model isn’t just mindlessly aligning the first word at the output with the first word from the input. It actually learned from the training phase how to align words in that language pair (French and English in our example). An example of how precise this mechanism can be comes from the attention papers listed above:


       

      You can see how the model paid attention correctly when outputting "European Economic Area".

      In French, the order of these words is reversed ("européenne économique zone") as compared to English.

      Every other word in the sentence is in similar order.


      [Translator's note: in other words, in a translation task where most of the words keep the same order across the two languages, the model accurately picked out exactly the part whose word order needed to be rearranged.]

       

       If you feel you’re ready to learn the implementation, be sure to check TensorFlow’s Neural Machine Translation (seq2seq) Tutorial.


      [End of translation]
