RNN

Vanilla RNN

  • How do we capture context? Use the notion of recurrence: the current state depends on the previous state.
  • Two outputs at each step: a prediction and a state (the hidden state of the recurrent network, passed on to the next step)

Learning goals

  1. Let r = input vector dimension, s = hidden state dimension, and t = output dimension.
$$
U_{s\times r},\quad W_{s\times s},\quad V_{t\times s}\\
s_i = f(U x_i + W s_{i-1} + b), \qquad y_i = V s_i
$$
  2. BPTT: Backpropagation Through Time, a slight variation of standard backpropagation in which gradients flow backward through the unrolled sequence.
    Due to vanishing/exploding gradients, it is sensitive to sequence length. For text data we therefore set a maximum length: inputs longer than that are truncated, and shorter ones are padded with zeros (see the sketch below).
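A minimal NumPy sketch of the forward pass and the truncate/pad preprocessing described above; the dimensions, the random parameters, and the `max_len` value are illustrative assumptions, not a trained model.

```python
import numpy as np

r, s, t = 4, 8, 3                 # input, hidden, and output dimensions (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(size=(s, r))       # input-to-hidden weights
W = rng.normal(size=(s, s))       # hidden-to-hidden (transition) weights
V = rng.normal(size=(t, s))       # hidden-to-output weights
b = np.zeros(s)

def rnn_forward(xs, max_len=10):
    # Truncate sequences longer than max_len, pad shorter ones with zero vectors.
    xs = list(xs)[:max_len]
    xs += [np.zeros(r)] * (max_len - len(xs))
    s_prev = np.zeros(s)
    outputs = []
    for x in xs:
        s_prev = np.tanh(U @ x + W @ s_prev + b)  # s_i = f(U x_i + W s_{i-1} + b)
        outputs.append(V @ s_prev)                # y_i = V s_i
    return outputs, s_prev                        # the two outputs: predictions and final state

ys, final_state = rnn_forward([rng.normal(size=r) for _ in range(6)])
```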

LSTM

  • Why not a vanilla RNN? Repeatedly applying the transition matrix necessarily weakens the signal; we need a structure that can leave some dimensions unchanged over many steps.
  • Augment with gate units
    • Input Gate
    • Forget Gate
    • Output Gate
    Each gate has its own parameters $U_{s\times r}$, $W_{s\times s}$, $b_{s\times 1}$:

$$
\begin{align}
i &= \sigma(W_i s_{i-1} + U_i x_i + b_i)\\
f &= \sigma(W_f s_{i-1} + U_f x_i + b_f)\\
o &= \sigma(W_o s_{i-1} + U_o x_i + b_o)\\
g &= \tanh(W_g s_{i-1} + U_g x_i + b_g)\\
c_i &= f \odot c_{i-1} + i \odot g\\
s_i &= o \odot \tanh(c_i)
\end{align}
$$
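A minimal sketch of a single LSTM step following the gate equations above. The `params` layout (one `(W, U, b)` triple per gate) is an assumption for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, s_prev, c_prev, params):
    """One LSTM step: x is the input, s_prev the previous hidden state,
    c_prev the previous cell state; params maps gate name -> (W, U, b)."""
    def gate(name, activation):
        W, U, b = params[name]
        return activation(W @ s_prev + U @ x + b)

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", np.tanh)    # candidate cell update
    c = f * c_prev + i * g    # keep part of the old memory, write some new content
    s = o * np.tanh(c)        # expose part of the memory as the new hidden state
    return s, c
```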

GRU

  • Removes the cell state; only the hidden state remains
  • Augments the hidden-state update with two gates:
    • Reset gate $r$
    • Update gate $z$
  • Performs similarly to the LSTM, with less computation
$$
\begin{align}
r &= \sigma(W_r s_{i-1} + U_r x_i + b_r)\\
z &= \sigma(W_z s_{i-1} + U_z x_i + b_z)\\
g &= \tanh(W_g (r \odot s_{i-1}) + U_g x_i + b_g)\\
s_i &= (1 - z) \odot s_{i-1} + z \odot g
\end{align}
$$
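A matching sketch of one GRU step under the same assumed `params` layout; note that there is no separate cell state, only the hidden state itself is updated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, s_prev, params):
    """One GRU step: params maps gate name -> (W, U, b); no cell state."""
    Wr, Ur, br = params["r"]
    Wz, Uz, bz = params["z"]
    Wg, Ug, bg = params["g"]
    r = sigmoid(Wr @ s_prev + Ur @ x + br)         # reset gate
    z = sigmoid(Wz @ s_prev + Uz @ x + bz)         # update gate
    g = np.tanh(Wg @ (r * s_prev) + Ug @ x + bg)   # candidate state
    return (1.0 - z) * s_prev + z * g              # blend old state and candidate
```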

Comparison

RNN vs LSTM

  • RNN:
    • Only a hidden state; no separate long-term (cell-state) memory
    • No gates
  • LSTM:
    • Has an explicit memory (the cell state)
    • Has input, forget, and output gates
    • Can be used for sequence-to-sequence learning, and copes better with long sequences

RNN vs GRU

  • RNN:
    • Only a hidden state; no gates
    • Can be used for sequence-to-sequence learning
  • GRU:
    • Has reset and update gates, but no separate cell state
    • Can be used for sequence-to-sequence learning
    • Performs similarly to the LSTM with less computation

RNN vs Convolutional Neural Network

  • RNN:
    • Designed for sequential data; can be used for sequence-to-sequence learning
  • Convolutional Neural Network:
    • Designed for grid-like data such as images
    • Can be used for image recognition
    • Can be used for video recognition

Additional topics

Beam search is introduced to address the weakness of greedy inference. In seq2seq, the decoder predicts the output sequence one word at a time; if it commits to a single wrong word, we may end up with a completely wrong sequence.
So beam search keeps multiple hypotheses (partial sentences) in parallel and then checks which full sentence is most likely.
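A small beam-search sketch over a decoder that is only assumed here: `next_log_probs(prefix)` is a hypothetical function returning a log-probability for every vocabulary word given the current prefix, and `eos` marks the end of a sentence.

```python
import heapq

def beam_search(next_log_probs, vocab, eos, beam_width=3, max_len=20):
    beams = [(0.0, [])]                          # (cumulative log-prob, token sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq and seq[-1] == eos:           # finished hypothesis: keep it unchanged
                candidates.append((score, seq))
                continue
            for tok, lp in zip(vocab, next_log_probs(seq)):
                candidates.append((score + lp, seq + [tok]))
        # keep only the beam_width most likely hypotheses so far
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(seq and seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])        # most likely full sentence
```

Greedy inference is the special case `beam_width=1`.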

Attention

Look at how close the encoder's vector for each source-language word is to the current state of our decoder, and use that similarity to weight the source words when producing the next output. This works well when the word order differs between the two languages.
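A dot-product attention sketch under assumed names: `enc_states` holds one encoder vector per source word, `dec_state` is the current decoder state, and the softmax weights say which source words to look at.

```python
import numpy as np

def attention(dec_state, enc_states):
    scores = enc_states @ dec_state            # similarity of each source word to the decoder state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over source positions
    context = weights @ enc_states             # weighted sum of encoder states
    return context, weights
```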
