์ด ๊ธ€์—์„œ๋Š” attention์ด ๋ฌด์—‡์ธ์ง€, ๋ช‡ ๊ฐœ์˜ ์ค‘์š”ํ•œ ๋…ผ๋ฌธ๋“ค์„ ์ค‘์‹ฌ์œผ๋กœ ์ •๋ฆฌํ•˜๊ณ  NLP์—์„œ ์–ด๋–ป๊ฒŒ ์“ฐ์ด๋Š” ์ง€๋ฅผ ์ •๋ฆฌํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค.

Problems with the Conventional Encoder-Decoder Architecture

Encoder-Decoder ๊ตฌ์กฐ์—์„œ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ถ€๋ถ„์€ input sequence๋ฅผ ์–ด๋–ป๊ฒŒ vectorํ™”ํ•  ๊ฒƒ์ด๋ƒ๋Š” ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค. ํŠนํžˆ NLP์—์„œ๋Š” input sequence์ด๊ฐ€ dynamicํ•  ๊ตฌ์กฐ์ผ ๋•Œ๊ฐ€ ๋งŽ์œผ๋ฏ€๋กœ, ์ด๋ฅผ ๊ณ ์ •๋œ ๊ธธ์ด์˜ ๋ฒกํ„ฐ๋กœ ๋งŒ๋“ค๋ฉด์„œ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ์ฆ‰, “์•ˆ๋…•” ์ด๋ผ๋Š” ๋ฌธ์žฅ์ด๋‚˜ “์˜ค๋Š˜ ๋‚ ์”จ๋Š” ์ข‹๋˜๋ฐ ๋ฏธ์„ธ๋จผ์ง€๋Š” ์‹ฌํ•˜๋‹ˆ๊น ๋‚˜๊ฐˆ ๋•Œ ๋งˆ์Šคํฌ ๊ผญ ์“ฐ๊ณ  ๋‚˜๊ฐ€๋ ด!” ์ด๋ผ๋Š” ๋ฌธ์žฅ์ด ๋‹ด๊ณ  ์žˆ๋Š” ์ •๋ณด์˜ ์–‘์ด ๋งค์šฐ ๋‹ค๋ฆ„์—๋„ encoder-decoder๊ตฌ์กฐ์—์„œ๋Š” ๊ฐ™์€ ๊ธธ์ด์˜ vector๋กœ ๋ฐ”๊ฟ”์•ผ ํ•˜์ฃ . Attention์€ ๊ทธ ๋‹จ์–ด์—์„œ ์•Œ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, sequence data์—์„œ ์ƒํ™ฉ์— ๋”ฐ๋ผ ์–ด๋А ๋ถ€๋ถ„์— ํŠนํžˆ ๋” ์ฃผ๋ชฉ์„ ํ•ด์•ผํ•˜๋Š” ์ง€๋ฅผ ๋ฐ˜์˜ํ•จ์œผ๋กœ์จ ์ •๋ณด ์†์‹ค๋„ ์ค„์ด๊ณ  ๋” ์ง๊ด€์ ์œผ๋กœ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ฒ˜์Œ ์ œ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Basic Idea (Bahdanau Attention)

Paper: Neural Machine Translation by Jointly Learning to Align and Translate

๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ์•„์ด๋””์–ด๋Š” encodeํ•  ๋•Œ๋Š” ๊ฐ๊ฐ์˜ ๋‹จ์–ด๋ฅผ vector๋กœ ๋งŒ๋“ค๊ณ , ๊ฐ๊ฐ์„ attention weight์— ๋”ฐ๋ผ weighted sum์„ ํ•œ ๋‹ค์Œ, ์ด๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋‹ค์Œ ๋‹จ์–ด๊ฐ€ ๋ฌด์—‡์ผ ์ง€๋ฅผ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๋…ผ๋ฌธ์€ ์ด ๋ฐฉ์‹์„ NMT์— ์‚ฌ์šฉํ•˜์˜€๋Š”๋ฐ์š”, bidirectional RNN์„ encoder๋กœ ์‚ฌ์šฉํ•˜๊ณ , $i$๋ฒˆ์งธ ๋‹จ์–ด์— ๋Œ€ํ•ด ๋ชจ๋“  ๋‹จ์–ด์— ๋Œ€ํ•œ encoder output์„ ํ•ฉ์ณ์„œ context vector๋กœ ๋งŒ๋“œ๋Š”๋ฐ, ์ด ๋•Œ ๋‹จ์ˆœ sum์ด ์•„๋‹Œ weight $\alpha_{ij}$๋ฅผ ๊ณฑํ•ด์„œ weighted sum์„ ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค(์•„๋ž˜ ์ฒซ๋ฒˆ์งธ ์ˆ˜์‹). ์ด ๋•Œ $i$๋ฒˆ์งธ ๋‹จ์–ด์— ๋Œ€ํ•œ $j$๋ฒˆ์งธ ๋‹จ์–ด์˜ attention weight๋Š” ์•„๋ž˜ ์ˆ˜์‹ ์ฒ˜๋Ÿผ $i$๋ฒˆ์งธ ๋‹จ์–ด์™€ $j$๋ฒˆ์งธ์˜ ์›๋ž˜ encoder output๋ผ๋ฆฌ๋ฅผ feedforward neural network(attention weight๋ฅผ ๋งŒ๋“œ๋Š” ๋ชจ๋ธ์„ ๋…ผ๋ฌธ์—์„œ๋Š” align ๋ชจ๋ธ์ด๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค)๋ฅผ ํƒœ์›Œ์„œ ๋งŒ๋“ญ๋‹ˆ๋‹ค(์•„๋ž˜ ๋‘๋ฒˆ์งธ ์ˆ˜์‹).

$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} $$

$$ e_{ij} = a(s_{i-1}, h_j) $$

align ๋ชจ๋ธ์„ Multi-layer Perceptron์œผ๋กœ ๋งŒ๋“  ์ด์œ ๋Š” ๋น„์„ ํ˜•์„ฑ์„ ๋ฐ˜์˜ํ•˜๊ณ ์ž ํ•œ ๊ฒƒ์ด๋ผ๊ณ  ํ•˜๊ตฌ์š”, ๊ฒฐ๊ตญ ์ด align ๋ชจ๋ธ์€ NMT์—์„œ ๊ฐ™์€ ์˜๋ฏธ๋ฅผ ๊ฐ€์ง„ ๋‹จ์–ด๋ฅผ ์ž˜ ์ •๋ ฌํ•˜๊ณ (๊ทธ๋ž˜์„œ align) ์ง์ง€์–ด ์ฃผ๊ธฐ ์œ„ํ•ด์„œ ์žˆ๋Š” ๊ฒ๋‹ˆ๋‹ค. NMT์—์„œ์˜ cost function ์ž์ฒด๋ฅผ loss๋กœ backpropagation ํ–ˆ๊ตฌ์š”.

Attention Score Functions

์œ„ ๋…ผ๋ฌธ ์ดํ›„๋กœ ์ด attention score $\alpha_{ij}$๋ฅผ ์–ด๋–ป๊ฒŒ ๋งŒ๋“ค ์ง€์— ๋Œ€ํ•œ ๋ช‡๊ฐ€์ง€ ๋ณ€ํ˜•๋“ค์ด ์ƒ๊ฒผ๋Š”๋ฐ์š”, ์ด๋ฅผ ์ •๋ฆฌํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๋‹จ์–ด๋ฅผ ํ†ต์ผํ•˜๊ธฐ ์œ„ํ•ด ๋งŒ๋“ค๊ณ ์ž ํ•˜๋Š” decoder state๋ฅผ $q$ (query vector), ์—ฌ๊ธฐ์— ์“ฐ์ด๋Š” ๋ชจ๋“  encoder states๋ฅผ $k$ (key vector)๋ผ๊ณ  ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค(์ด๋Š” ๋’ค์—์„œ ๋‹ค๋ฃฐ Attention is All You Need ๋…ผ๋ฌธ์—์„œ ๋‚˜์˜จ ์ •์˜์ž…๋‹ˆ๋‹ค). ์ด ๋‹จ์–ด๋ฅผ ์ด์šฉํ•œ๋‹ค๋ฉด $\alpha_{ij}$๋Š” $i$๋ฒˆ์งธ์˜ query vector๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•œ $i-j$ key vector๋“ค ์‚ฌ์ด์˜ attention score๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๊ฒ ์ฃ .

(1) Multi-layer Perceptron (Bahdanau et al. 2015)

$$ a(q, k) = w_2^T \tanh (W_1[q;k]) $$

์œ„ ๋…ผ๋ฌธ์˜ MLP๋ฅผ ๋‹ค์‹œ ์ ์€ ๊ฑด๋ฐ์š”, ์ด ๋ฐฉ๋ฒ•์€ ๋‚˜๋ฆ„ ์œ ์—ฐํ•˜๊ณ  ํฐ ๋ฐ์ดํ„ฐ์— ํ™œ์šฉํ•˜๊ธฐ ์ข‹๋‹ค๋Š” ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

(2) Bilinear (Luong et al. 2015)

$$ a(q, k) = q^TWk $$

๊ฐ™์€ ์—ฐ๋„์— ๋‚˜์˜จ Lunong Attention์€ $q$์™€ $k$ ์‚ฌ์ด์— weight matrix $W$ ํ•˜๋‚˜๋ฅผ ๊ณฑํ•ด์„œ ๋งŒ๋“ค์–ด์ค๋‹ˆ๋‹ค.

(3) Dot Product (Luong et al. 2015)

$$ a(q, k) = q^Tk $$

Similar to (2), but simply taking the dot product of $q$ and $k$ as the attention score was also proposed. It has no parameters to learn at all, which is appealing, but it has the drawback that $q$ and $k$ must have the same dimension.

(4) Scaled Dot Product (Vaswani et al. 2017)

$$ a(q, k) = \frac{q^Tk}{\sqrt{\mid{k}\mid}} $$

์ตœ๊ทผ์— ๋‚˜์˜จ ๋…ผ๋ฌธ ์ค‘์—์„œ 3์„ ๊ฐœ์„  ์‹œํ‚จ ๋…ผ๋ฌธ์ธ๋ฐ์š”. ๊ธฐ๋ณธ์ ์ธ ์ ‘๊ทผ์€ dot product ๊ฒฐ๊ณผ๊ฐ€ $q$์™€ $k$์˜ ์ฐจ์›์— ๋น„๋ก€ํ•˜์—ฌ ์ฆ๊ฐ€ํ•˜๋ฏ€๋กœ, ์ด๋ฅผ ๋ฒกํ„ฐ์˜ ํฌ๊ธฐ๋กœ ๋‚˜๋ˆ ์ฃผ๋Š” ๊ฒ๋‹ˆ๋‹ค.

What Do We Attend To?

์ง€๊ธˆ๊นŒ์ง€์˜ ๋ฐฉ๋ฒ•๋ก ๋“ค์€ ๋‹ค input sentence์˜ RNN output์—๋‹ค๊ฐ€ attention์„ ์จ์„œ ์ด๋ฅผ decoding์— ํ™œ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด์ œ ์ข€๋” ๋‹ค์–‘ํ•œ ๋ฐฉ์‹์œผ๋กœ attention์„ ๋งฅ์ด๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

(1) Input Sentence

๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ๋ฐฉ๋ฒ•์œผ๋กœ ๊ทธ ์ „/ ๊ทธ ํ›„ input sentence๋“ค์—๋‹ค๊ฐ€ attention์„ ์ฃผ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

- Copying Mechanism (Gu et al. 2016)

Paper: Incorporating Copying Mechanism in Sequence-to-Sequence Learning

์ด ๋ฐฉ๋ฒ•์€ output sequence์— input sequences์˜ ๋‹จ์–ด๋“ค์ด ์ž์ฃผ ์ค‘๋ณต๋  ๋•Œ, ์ด๋ฅผ ์ž˜ copyํ•˜๊ธฐ ์œ„ํ•ด ์ฒ˜์Œ ์ œ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ๋Œ€ํ™”๋ฅผ ์ด๋Œ์–ด ๋‚˜๊ฐˆ ๋•Œ, ๊ธฐ์กด์— ๋‚˜์™”๋˜ ๋‹จ์–ด๋“ค์„ ํ™œ์šฉํ•ด์„œ ๋Œ€๋‹ตํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์ฃ .

์ด ๊ธ€์€ ์›๋ณธ์˜ ์ผ๋ถ€๋งŒ ํฌํ•จํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ „์ฒด ๋‚ด์šฉ์€ ์ด์ „ ๋ธ”๋กœ๊ทธ์—์„œ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.