[Paper Review] Attention: Neural Machine Translation by Jointly Learning to Align and Translate


johyeongseob 2024. 11. 29. 17:03

Authors: Dzmitry Bahdanau (Jacobs University Bremen, Germany), KyungHyun Cho and Yoshua Bengio* (Université de Montréal)

* CIFAR Senior Fellow

Citation: Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

Tutorial code: https://tutorials.pytorch.kr/intermediate/seq2seq_translation_tutorial.html

Dataset (Korean-English): https://github.com/jungyeul/korean-parallel-corpora/tree/master/korean-english-jhe

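0. Abstract

Neural machine translation (NMT) is a recently proposed approach to machine translation. These models typically consist of an encoder and a decoder: the encoder maps the source sentence to a fixed-length vector, and the decoder generates the translation from that vector. The fixed-length vector, however, is a bottleneck that holds back performance. The authors therefore propose a new model that, when generating each target word, automatically searches for the parts of the source sentence relevant to that word and weights them accordingly. On English-to-French translation, the new model achieves performance comparable to the existing state-of-the-art phrase-based system.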

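1. Introduction

NMT is a recently proposed approach to machine translation, introduced by Kalchbrenner and Blunsom (2013), Sutskever et al. (2014) and Cho et al. (2014b), among others. As noted in the abstract, the encoder-decoder approach compresses all the necessary information of the source sentence into a fixed-length vector, which degrades performance as the source sentence gets longer. To address this, the authors propose an encoder-decoder model that learns to align source and target sentences and to translate at the same time, as the paper's title indicates. The proposed model does not encode the source sentence into a single fixed-length vector. Instead, it encodes the source sentence into a sequence of vectors and, while generating each target word, automatically selects the vector(s) most relevant to that word. As a result, the model no longer has to squeeze all the information of the source sentence into one vector, regardless of the sentence's length.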

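2. Background: Neural Machine Translation

Translation is the task of finding, for a source sentence $x$, the target sentence $y$ that maximizes the conditional probability $p(y \mid x)$. In NMT, a model is trained on a parallel corpus of source-target sentence pairs to maximize this conditional probability.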

2.1 RNN Encoder-Decoder
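The authors first describe the underlying framework, the RNN encoder-decoder, and then build on it to propose a new model that learns to align and translate simultaneously. The RNN encoder-decoder works as follows. The encoder reads the source sentence, $x = (x_1, ..., x_T)$, into a vector $c$: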

$\begin{align*} & h_t=f(x_t, h_{t-1}) \tag{1} \end{align*}$

and

$\begin{align*} & c=q(\{h_1, \ldots , h_T\}), \end{align*}$

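where $h_t \in \mathbb{R}^{n}$ is the hidden state at time $t$, and $f$ and $q$ are nonlinear functions. The decoder uses $c$ to predict the words of the translation $y=(y_1, ... , y_T)$ one at a time (here $T$ denotes the target length, which in general differs from the source length):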

$\begin{align*} & p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \cdots, y_{t-1}\}, c), \tag{2} \end{align*}$
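
To make the bottleneck concrete, here is a minimal sketch, assuming a GRU plays the role of $f$ and that $q$ simply returns the last hidden state, $c = h_T$ (one common choice, not the only possible one). The sizes and random inputs are made up for illustration; the point is only that $c$ has the same size no matter how long the source sentence is.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 8  # hypothetical size, chosen only for this demo
rnn = nn.GRU(input_size=hidden_size, hidden_size=hidden_size, batch_first=True)


def encode(x):
    """Return the fixed-length context c = h_T for a batch of embedded sentences."""
    _, h_last = rnn(x)        # h_last: [num_layers=1, batch, hidden]
    return h_last.squeeze(0)  # c: [batch, hidden]


short_sentence = torch.randn(1, 3, hidden_size)   # T = 3 (random stand-in embeddings)
long_sentence = torch.randn(1, 30, hidden_size)   # T = 30

c_short = encode(short_sentence)
c_long = encode(long_sentence)
print(c_short.shape, c_long.shape)  # both have shape [1, hidden_size]
```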

3. Learning to Align and Translate

The authors propose a new architecture for NMT. It uses a bidirectional RNN as the encoder, and a decoder that searches through the source sentence while decoding the translation.

3.1 Decoder: General Description
The authors define the conditional probability as follows.

$\begin{align*} & p(y_i|y_1, ... , y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i), \tag{3} \end{align*}$
That is, $y_i$ is predicted from the previously predicted word $y_{i-1}$, $s_i$, and $c_i$, where $s_i$ is the RNN hidden state at time $i$:

$\begin{align*}     &s_i = f(s_{i-1}, y_{i-1}, c_i). \end{align*}$
The context vector $c_i$ depends on a sequence of annotations ($h_1, ... ,h_T$) that carry the information of the source sentence, and is computed as a weighted sum of these annotations:
$\begin{align*}     & c_i = \sum_{j=1}^{T} \alpha_{ij} h_j.     \tag{4} \end{align*}$
The weight $\alpha_{ij}$ of each annotation $h_j$ is computed as
$\begin{align*}     & \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})},     \tag{5} \end{align*}$
where $e_{ij}$ is computed by the alignment model $a$:
$\begin{align*}     & e_{ij}=a(s_{i-1}, h_j) \end{align*}$
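The alignment model is trained jointly with the translation model, so the gradient of the translation cost is backpropagated through the alignment weights to the annotation of each individual source word. The weight $\alpha_{ij}$ reflects how important the annotation $h_j$ is, relative to the previous hidden state $s_{i-1}$, for deciding the next state $s_i$ and generating $y_i$.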
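
As a small worked example of Eq. (4) and Eq. (5), the sketch below takes the alignment scores $e_{ij}$ for one decoder step as given and turns them into weights and a context vector. All numbers are hypothetical and chosen only so the arithmetic is easy to follow; how the scores themselves are produced by $a$ is not shown here.

```python
import torch

# Hypothetical values for one decoder step i over T = 3 source positions.
e_i = torch.tensor([2.0, 1.0, 0.1])   # alignment scores e_{i1}, e_{i2}, e_{i3}
h = torch.tensor([[1.0, 0.0],         # annotation h_1
                  [0.0, 1.0],         # annotation h_2
                  [1.0, 1.0]])        # annotation h_3

alpha_i = torch.softmax(e_i, dim=0)   # Eq. (5): weights are positive and sum to 1
c_i = alpha_i @ h                     # Eq. (4): c_i = sum_j alpha_ij * h_j

print(alpha_i)  # approximately [0.659, 0.242, 0.099]
print(c_i)      # approximately [0.758, 0.341]
```

The largest score gets the largest weight, so $c_i$ is pulled toward $h_1$ while still mixing in the other annotations.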

3.2 Encoder: Bidirectional RNN for Annotating Sequences
(omitted)
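
Since the details are omitted here, the following is only a rough sketch of the idea named in the section title: each annotation $h_j$ concatenates a forward and a backward hidden state, so it summarizes both the preceding and the following words. It assumes PyTorch's `nn.GRU` with `bidirectional=True`, and the sizes are made up. This is not the code used in this review; the tutorial encoder in section 8 is unidirectional.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb_size, hidden_size, T = 4, 5, 6   # hypothetical sizes for this demo
birnn = nn.GRU(emb_size, hidden_size, batch_first=True, bidirectional=True)

x = torch.randn(1, T, emb_size)      # one embedded source sentence (random stand-in)
annotations, h_n = birnn(x)          # annotations: [1, T, 2 * hidden_size]

# For every position j, the last dimension is [forward state ; backward state].
forward_part = annotations[..., :hidden_size]
backward_part = annotations[..., hidden_size:]

print(annotations.shape)
```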

5. Results

5.1 Quantitative Results

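The alignment plot (source sentence on the x-axis, generated translation on the y-axis) shows that, for each word of the translation, the model attends to the appropriate words of the source sentence.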
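
One way to read such a plot programmatically is to take, for each target word, the source word with the largest attention weight. The sketch below does this on a toy matrix; the tokens and weights are made up for illustration and are not results from the paper.

```python
import torch

# Made-up tokens and attention weights, for illustration only.
source = ["I", "love", "cats"]
target = ["J'", "aime", "les", "chats"]
attn = torch.tensor([[0.90, 0.07, 0.03],
                     [0.10, 0.85, 0.05],
                     [0.05, 0.15, 0.80],
                     [0.02, 0.08, 0.90]])  # rows: target words, columns: source words

best = attn.argmax(dim=1)  # index of the most-attended source word per target word
pairs = [(t, source[j]) for t, j in zip(target, best.tolist())]
print(pairs)
```

Because the weights are soft, two target words can point at the same source word, as "les" and "chats" do here.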

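7. Conclusion

To overcome the fixed-length context vector, the key limitation of the conventional encoder-decoder, the authors propose a new model. When translating, the model dynamically searches over the words of the source sentence for each target word, which prevents the information loss caused by compression into a single vector. This was especially effective on long sentences.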

8. PyTorch code

1. RNN encoder

import torch
import torch.nn as nn


class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_p=0.1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        # Each input index selects one row of the embedding weight matrix [input_size, hidden_size]
        self.embedding = nn.Embedding(num_embeddings=input_size, embedding_dim=hidden_size)
        # hidden_size is used for both the GRU input size and its hidden-state size
        self.gru = nn.GRU(input_size=hidden_size, hidden_size=hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        embedded = self.dropout(self.embedding(input))  # [batch_size, seq_len, hidden_size]
        # outputs: hidden states at every time step [batch_size, seq_len, hidden_size]
        # hidden: hidden state at the last time step [num_layers, batch_size, hidden_size]
        outputs, hidden = self.gru(embedded)
        return outputs, hidden
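
The embedding lookup mentioned in the comment above can be checked in isolation. A minimal sketch with made-up sizes and token indices: the vector returned for index 2 is exactly row 2 of the weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, hidden_size = 5, 3                 # hypothetical sizes
embedding = nn.Embedding(vocab_size, hidden_size)

tokens = torch.tensor([[2, 0, 4]])             # [batch_size=1, seq_len=3]
embedded = embedding(tokens)                   # [1, 3, hidden_size]

# Each output vector is the corresponding row of the weight matrix.
same = torch.equal(embedded[0, 0], embedding.weight[2])
print(embedded.shape, same)
```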

2. Bahdanau attention

class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, query, keys):
        # query: decoder hidden state [batch_size, 1, hidden_size]
        # keys: encoder outputs [batch_size, seq_len, hidden_size]
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))  # [batch_size, seq_len, 1]
        # Move seq_len to the last dimension so softmax normalizes over source positions:
        # [batch_size, seq_len, 1] -> [batch_size, 1, seq_len]
        scores = scores.squeeze(2).unsqueeze(1)

        weights = self.softmax(scores)
        context = torch.bmm(weights, keys)  # [B,N,M] x [B,M,P] = [B,N,P]

        return context, weights
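
The shape bookkeeping in `forward` is easy to get wrong, so here is a step-by-step sketch of the same computation with freshly initialized layers. The batch size, source length, and hidden size are hypothetical, and the inputs are random stand-ins for real decoder and encoder states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, S, H = 2, 4, 8                          # hypothetical batch, source length, hidden size
Wa, Ua, Va = nn.Linear(H, H), nn.Linear(H, H), nn.Linear(H, 1)

query = torch.randn(B, 1, H)               # stand-in for the decoder hidden state
keys = torch.randn(B, S, H)                # stand-in for the encoder outputs

summed = Wa(query) + Ua(keys)              # [B,1,H] + [B,S,H] broadcasts to [B,S,H]
scores = Va(torch.tanh(summed))            # [B,S,1]
scores = scores.squeeze(2).unsqueeze(1)    # [B,1,S]
weights = torch.softmax(scores, dim=-1)    # normalized over the S source positions
context = torch.bmm(weights, keys)         # [B,1,S] x [B,S,H] -> [B,1,H]

print(summed.shape, scores.shape, context.shape)
```

The single query vector is broadcast against every encoder position, which is what lets one decoder state score all source words at once.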

3. Attention-based decoder

# Assumes `device`, `SOS_token`, and `MAX_LENGTH` are defined at module level, as in the linked tutorial.
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=output_size, embedding_dim=hidden_size)
        self.attention = BahdanauAttention(hidden_size)
        self.gru = nn.GRU(input_size=2 * hidden_size, hidden_size=hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_p)
        self.LogSoftmax = nn.LogSoftmax(dim=-1)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        # Start-of-sentence token: a [B, 1] tensor filled with SOS_token
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_token)
        # Initialize with the encoder's last hidden state [num_layers, batch_size, hidden_size]
        decoder_hidden = encoder_hidden
        decoder_outputs = []
        attentions = []

        for i in range(MAX_LENGTH):
            decoder_output, decoder_hidden, attn_weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )
            decoder_outputs.append(decoder_output)
            attentions.append(attn_weights)

            if target_tensor is not None:
                # With teacher forcing: feed the ground-truth token as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1)
            else:
                # Without teacher forcing: feed the model's own prediction as the next input
                _, top_idx = decoder_output.topk(1)
                decoder_input = top_idx.squeeze(-1).detach()  # detach from the graph before reuse as input

        # [batch_size, seq_len, vocab_size]; log-softmax over vocab_size gives per-word log-probabilities
        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = self.LogSoftmax(decoder_outputs)
        attentions = torch.cat(attentions, dim=1)

        return decoder_outputs, decoder_hidden, attentions

    def forward_step(self, input, hidden, encoder_outputs):
        # input: [batch_size, 1] -> embedded: [batch_size, 1, hidden_size]
        embedded = self.dropout(self.embedding(input))

        # Decoder hidden state as the query: [1, batch_size, hidden_size] -> [batch_size, 1, hidden_size]
        query = hidden.permute(1, 0, 2)
        context, attn_weights = self.attention(query, encoder_outputs)
        input_gru = torch.cat((embedded, context), dim=2)  # [batch_size, 1, 2 * hidden_size]

        output, hidden = self.gru(input_gru, hidden)
        # Scores (logits) over every word in the target vocabulary
        output = self.out(output)

        return output, hidden, attn_weights
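
Because the decoder returns log-probabilities rather than raw logits, it pairs naturally with `nn.NLLLoss`. The following is a minimal sketch of how such a loss might be computed, assuming the `[batch_size, seq_len, vocab_size]` output is flattened over batch and time. The tensors are random stand-ins with hypothetical sizes, not outputs of the model above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

B, L, V = 2, 5, 11                           # hypothetical batch, length, vocab size
logits = torch.randn(B, L, V)                # stand-in for the stacked self.out(...) scores
log_probs = F.log_softmax(logits, dim=-1)    # same form as the decoder's returned output
targets = torch.randint(0, V, (B, L))        # stand-in target token indices

criterion = nn.NLLLoss()
loss = criterion(log_probs.view(-1, V), targets.view(-1))

# NLLLoss on log-probabilities matches cross-entropy on the raw logits.
reference = F.cross_entropy(logits.view(-1, V), targets.view(-1))
print(loss.item(), reference.item())
```

If the decoder returned raw logits instead, `nn.CrossEntropyLoss` would be the equivalent choice, since it applies the log-softmax internally.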
