Deep Sequence Modeling
Ava Soleimany, MIT 6.S191, January 27, 2020
6.S191 Introduction to Deep Learning
introtodeeplearning.com
@MITDeepLearning
Given an image of a ball, can you predict where it will go next?
From a single frame alone, any guess is as good as another; given the ball's previous positions, the problem becomes much easier.
Sequences in the Wild
Audio: an audio waveform is a sequence in time.
Text: a sentence can be represented as a sequence of characters or a sequence of words.
A Sequence Modeling Problem: Predict the Next Word
"This morning I took my cat for a walk."
Given these words, predict the next word.
(H. Suresh, 6.S191 2018)
Idea #1: Use a Fixed Window
"This morning I took my cat for a walk."
Given these two words ("for a"), predict the next word.
One-hot feature encoding tells us what each word is: the window is represented by concatenating the one-hot vectors of its words, e.g. [1 0 0 0 0 0 1 0 0 0] for "for" and "a", and this fixed-length vector is fed to a model to make the prediction.
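As a concrete illustration of the fixed-window idea, here is a minimal sketch; the toy vocabulary, window size, and layer sizes are assumptions for illustration, not values from the lecture:

import tensorflow as tf

vocab = ["this", "morning", "i", "took", "my", "cat", "for", "a", "walk"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot_window(words):
    # Concatenate the one-hot vectors of a fixed window of words
    vecs = [tf.one_hot(word_to_idx[w], len(vocab)) for w in words]
    return tf.concat(vecs, axis=0)

x = tf.expand_dims(one_hot_window(["for", "a"]), 0)   # fixed window of two words, plus batch dim

# Feed the fixed-length vector to a small network that predicts the next word
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(len(vocab), activation="softmax"),
])
next_word_probs = model(x)   # probability distribution over the vocabulary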
Problem #1: Can't Model Long-Term Dependencies
"France is where I grew up, but I now live in Boston. I speak fluent ___." (J'aime 6.S191!)
We need information from the distant past to accurately predict the correct word.
Idea #2: Use Entire Sequence as Set of Counts
"This morning I took my cat for a"
"Bag of words": [0 1 0 0 1 0 0 … 0 0 1 1 0 0 0 1], a vector of word counts that is fed to a model to make the prediction.
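A minimal bag-of-words sketch, again assuming the same toy vocabulary (illustrative only):

from collections import Counter
import tensorflow as tf

vocab = ["this", "morning", "i", "took", "my", "cat", "for", "a", "walk"]

def bag_of_words(words):
    # Count how often each vocabulary word appears; the order of the words is discarded
    counts = Counter(words)
    return tf.constant([float(counts[w]) for w in vocab])

x = bag_of_words("this morning i took my cat for a".split())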
Problem #2: Counts Don't Preserve Order
"The food was good, not bad at all." vs. "The food was bad, not good at all."
These two sentences have identical bag-of-words representations but opposite meanings.
Idea #3: Use a Really Big Fixed Window
"This morning I took my cat for a walk."
Given these words, predict the next word.
[1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 … ] encodes "this", "morning", "took", …, "cat" as one long concatenated one-hot vector, which is fed to a model to make the prediction.
Problem #3: No Parameter Sharing
[1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 … ]   this morning took the cat
Each of these inputs has a separate parameter:
[0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 … ]   the same words ("this morning") appearing later in the sequence use entirely different parameters.
Things we learn about the sequence won't transfer if they appear elsewhere in the sequence.
Sequence Modeling: Design Criteria
To model sequences, we need to:
1. Handle variable-length sequences
2. Track long-term dependencies
3. Maintain information about order
4. Share parameters across the sequence
Today: Recurrent Neural Networks (RNNs) as an approach to sequence modeling problems.
Recurrent Neural Networks (RNNs)
Standard Feed-Forward Neural Network
One to One: "vanilla" neural network.

Recurrent Neural Networks for Sequence Modeling
One to One: "vanilla" neural network.
Many to One: sentiment classification.
Many to Many: music generation (6.S191 Lab!).
… and many other architectures and applications.
Standard "Vanilla" Neural Network
input vector x_t → network → output vector ŷ_t
Recurrent Neural Network (RNN)
input vector x_t → recurrent cell (with internal state h_t) → output vector ŷ_t

Apply a recurrence relation at every time step to process a sequence:

    h_t = f_W(h_{t-1}, x_t)

where h_t is the cell state, f_W is a function parameterized by weights W, h_{t-1} is the old state, and x_t is the input vector at time step t.

Note: the same function and set of parameters are used at every time step.
RNN Intuition

my_rnn = RNN()
hidden_state = [0, 0, 0, 0]

sentence = ["I", "love", "recurrent", "neural"]

for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

next_word_prediction = prediction  # >>> "networks!"
RNN State Update and Output
Input vector:         x_t
Update hidden state:  h_t = tanh(W_hh h_{t-1} + W_xh x_t)
Output vector:        ŷ_t = W_hy h_t
RNNs: Computational Graph Across Time
The RNN can be represented as a computational graph unrolled across time: at each time step t, the input x_t updates the hidden state and produces an output ŷ_t.
Re-use the same weight matrices at every time step: W_xh (input to hidden), W_hh (hidden to hidden), and W_hy (hidden to output).
Forward pass: a loss L_t is computed at each time step from the prediction ŷ_t, and the individual losses are combined into a total loss L.
RNNs from Scratch

import tensorflow as tf

class MyRNNCell(tf.keras.layers.Layer):
    def __init__(self, rnn_units, input_dim, output_dim):
        super(MyRNNCell, self).__init__()

        # Initialize weight matrices
        self.W_xh = self.add_weight(shape=[rnn_units, input_dim])
        self.W_hh = self.add_weight(shape=[rnn_units, rnn_units])
        self.W_hy = self.add_weight(shape=[output_dim, rnn_units])

        # Initialize hidden state to zeros
        self.h = tf.zeros([rnn_units, 1])

    def call(self, x):
        # x: input vector of shape [input_dim, 1]
        # Update the hidden state: h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        self.h = tf.math.tanh(tf.matmul(self.W_hh, self.h) + tf.matmul(self.W_xh, x))

        # Compute the output: ŷ_t = W_hy h_t
        output = tf.matmul(self.W_hy, self.h)

        # Return the current output and hidden state
        return output, self.h
RNN Implementation in TensorFlow

tf.keras.layers.SimpleRNN(rnn_units)
Backpropagation Through Time (BPTT)
Recall: Backpropagation in Feed-Forward Models
Backpropagation algorithm:
1. Take the derivative (gradient) of the loss with respect to each parameter
2. Shift parameters in order to minimize loss
RNNs: Backpropagation Through Time
Forward pass: run the unrolled network across the sequence and compute the loss at each time step.
Backward pass: backpropagate errors from the total loss back through every time step, i.e. "through time", to update the shared weights.
(Mozer, Complex Systems 1989)
Standard RNN Gradient Flow
Backpropagating from h_t all the way to h_0 requires chaining gradients through every intermediate hidden state.
Computing the gradient with respect to h_0 involves many factors of W_hh (and repeated gradient computation)!
Standard RNN Gradient Flow: Exploding Gradients
Many values > 1: exploding gradients. Remedy: gradient clipping to scale big gradients.
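A hedged sketch of gradient clipping in TensorFlow/Keras; the clip threshold of 1.0 is an illustrative choice, not a value from the lecture:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)  # clip each gradient's norm

# Equivalent manual clipping inside a custom training step:
# grads = tape.gradient(loss, model.trainable_variables)
# grads = [tf.clip_by_norm(g, 1.0) for g in grads]
# optimizer.apply_gradients(zip(grads, model.trainable_variables))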
Standard RNN Gradient Flow: Vanishing Gradients
Many values < 1: vanishing gradients. Remedies:
1. Activation function
2. Weight initialization
3. Network architecture
The Problem of Long-Term Dependencies
Why are vanishing gradients a problem? Multiplying many small numbers together means that errors from time steps further back have smaller and smaller gradients, which biases the parameters towards capturing short-term dependencies.

"The clouds are in the ___": the relevant word ("clouds") appears only a few steps before the prediction, so this short-term dependency is easy to capture.

"I grew up in France, … and I speak fluent ___": the relevant information ("France") appears many steps earlier, so vanishing gradients make this long-term dependency hard to learn.
Trick #1: Activation Functions
Using ReLU prevents f' from shrinking the gradients when x > 0: the ReLU derivative is 1 for x > 0, whereas the tanh and sigmoid derivatives are always less than 1.
Trick #2: Parameter Initialization
Initialize the weights to the identity matrix and the biases to zero.
This helps prevent the weights from shrinking to zero.
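A combined sketch of Tricks #1 and #2 using Keras layer arguments; the unit count is an illustrative assumption:

import tensorflow as tf

rnn = tf.keras.layers.SimpleRNN(
    64,
    activation="relu",                 # Trick #1: ReLU keeps the gradient from shrinking when x > 0
    recurrent_initializer="identity",  # Trick #2: initialize the recurrent weights W_hh to the identity
    bias_initializer="zeros",          # Trick #2: initialize the biases to zero
)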
Solution #3: Gated Cells
Idea: use a more complex recurrent unit with gates to control what information is passed through (LSTM, GRU, etc.).
Long Short Term Memory (LSTM) networks rely on a gated cell to track information throughout many time steps.
Long Short Term Memory (LSTM) Networks
Standard RNN
In a standard RNN, repeating modules contain a simple computation node (a single tanh layer).
Long Short Term Memory (LSTMs)
LSTM modules contain computational blocks that control information flow: each repeating module combines several sigmoid and tanh layers rather than a single tanh node.
LSTM cells are able to track information throughout many time steps.
tf.keras.layers.LSTM(num_units)
(Hochreiter & Schmidhuber, Neural Computation 1997)
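A minimal usage sketch for the gated layer; the unit count and input shapes are illustrative assumptions:

import tensorflow as tf

num_units = 128
lstm_layer = tf.keras.layers.LSTM(num_units)
outputs = lstm_layer(tf.random.normal([4, 20, 64]))   # (batch, time, features) -> (batch, num_units)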
Long Short Term Memory (LSTMs)
Information is added or removed through structures called gates.
Gates optionally let information through, for example via a sigmoid neural net layer and pointwise multiplication.
Long Short Term Memory (LSTMs)
How do LSTMs work?
1) Forget: LSTMs forget irrelevant parts of the previous state.
2) Store: LSTMs store relevant new information into the cell state.
3) Update: LSTMs selectively update cell state values.
4) Output: the output gate controls what information is sent to the next time step.
(Olah, "Understanding LSTMs")
LSTM Gradient Flow
Uninterrupted gradient flow! The cell state provides a path through time along which gradients can flow without repeated multiplication by the recurrent weight matrix.
LSTMs: Key Concepts
1. Maintain a separate cell state from what is outputted
2. Use gates to control the flow of information
   • Forget gate gets rid of irrelevant information
   • Store relevant information from the current input
   • Selectively update cell state
   • Output gate returns a filtered version of the cell state
3. Backpropagation through time with uninterrupted gradient flow
RNN Applications
Example Task: Music Generation (6.S191 Lab!)
Input: sheet music
Output: next character in sheet music
(H. Suresh, 6.S191 2018)
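A character-level next-token sketch for this setup; the vocabulary size and layer sizes are assumptions for illustration (the lab uses its own values):

import tensorflow as tf

vocab_size = 83   # assumed number of distinct characters in the sheet-music text
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 256),
    tf.keras.layers.LSTM(512, return_sequences=True),
    tf.keras.layers.Dense(vocab_size),   # logits for the next character at every position
])
logits = model(tf.random.uniform([1, 100], maxval=vocab_size, dtype=tf.int32))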
Example Task: Sentiment Classification
Input: sequence of words (e.g. the tweet "I love this class!")
Output: probability of having positive sentiment

loss = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predicted)

(Socher+, EMNLP 2013)
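A many-to-one sentiment classifier sketch; the vocabulary size, layer sizes, and the binary cross-entropy loss are illustrative assumptions:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64),                        # many to one: keep only the final state
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of positive sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])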
Example Task: Machine Translation
"the dog eats" → Encoder (English) → Decoder (French) → "le chien mange"
The entire source sentence must be squeezed into a single encoding passed from the encoder to the decoder, creating an encoding bottleneck.
(H. Suresh, 6.S191 2018)
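A compact encoder-decoder sketch illustrating the single-vector bottleneck; the vocabularies, sizes, and random placeholder inputs are illustrative assumptions:

import tensorflow as tf

units = 256
encoder = tf.keras.layers.LSTM(units, return_state=True)
decoder = tf.keras.layers.LSTM(units, return_sequences=True)

src = tf.random.normal([1, 7, 64])    # embedded English sentence: (batch, time, features)
tgt = tf.random.normal([1, 9, 64])    # embedded French sentence, shifted by one step

_, state_h, state_c = encoder(src)    # the whole source sentence is compressed into (state_h, state_c)
dec_out = decoder(tgt, initial_state=[state_h, state_c])
logits = tf.keras.layers.Dense(5000)(dec_out)   # logits over an assumed French vocabulary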
Attention Mechanisms
Attention mechanisms in neural networks provide learnable memory access: rather than relying only on the final encoded state, the decoder learns to attend to the encoder's states when producing each output word.
(Sutskever+, NIPS 2014; Bahdanau+, ICLR 2015)
Trajectory Prediction: Self-Driving Cars
(Waymo)
Environmental Modeling
Sequence models applied to environmental data such as particulates, winds, SO2, and humidity.
(earth.nullschool.net)
Deep Learning for Sequence Modeling: Summary
1. RNNs are well suited for sequence modeling tasks
2. Model sequences via a recurrence relation
3. Training RNNs with backpropagation through time
4. Gated cells like LSTMs let us model long-term dependencies
5. Models for music generation, classification, machine translation, and more
6.S191: Introduction to Deep Learning
Lab 1: Introduction to TensorFlow and Music Generation with RNNs
Link to download labs: http://introtodeeplearning.com#schedule
1. Open the lab in Google Colab
2. Start executing code blocks and filling in the #TODOs
3. Need help? Find a TA or come to the front!