
Deep Sequence Modeling
Ava Soleimany
MIT 6.S191, January 27, 2020



6.S191 Introduction to Deep Learning | introtodeeplearning.com | @MITDeepLearning



Given an image of a ball, can you predict where it will go next? From a single snapshot alone, the answer is ambiguous (???). Given the ball's previous positions as well, the prediction becomes much more constrained: the history of the sequence is what makes the prediction possible.



Sequences in the Wild

Audio: a sound waveform is a sequence of samples over time.



Text: language can be viewed as a sequence of characters or as a sequence of words.



A Sequence Modeling Problem: Predict the Next Word

"This morning I took my cat for a walk."
Given these words, predict the next word.

(H. Suresh, 6.S191 2018.)



Idea #1: Use a Fixed Window

"This morning I took my cat for a walk."
Given these two words ("for a"), predict the next word.

One-hot feature encoding tells us what each word is; the window's one-hot vectors are concatenated and fed to the model:

[1 0 0 0 0 | 0 1 0 0 0]  →  prediction
   "for"       "a"

(H. Suresh, 6.S191 2018.)
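A minimal sketch of this fixed-window, one-hot encoding (the toy vocabulary and helper names below are illustrative, not from the lecture):

import numpy as np

vocab = ["this", "morning", "i", "took", "my", "cat", "for", "a", "walk"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # One-hot vector that tells us which word this is
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Fixed window of the two preceding words, concatenated into a single feature vector
window = ["for", "a"]
features = np.concatenate([one_hot(w) for w in window])
print(features.shape)   # (18,) -- this vector would be fed to a feed-forward model to predict the next word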



Problem #1: Can't Model Long-Term Dependencies

"France is where I grew up, but I now live in Boston. I speak fluent ___." (J'aime 6.S191! — "I love 6.S191!")

We need information from the distant past to accurately predict the correct word.

(H. Suresh, 6.S191 2018.)



Idea #2: Use Entire Sequence as Set of Counts

"This morning I took my cat for a" → "bag of words" count vector, e.g. [0 1 0 0 1 0 0 … 0 0 1 1 0 0 0 1] → prediction

(H. Suresh, 6.S191 2018.)
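A minimal bag-of-words sketch, using the same illustrative toy vocabulary as above:

import numpy as np

vocab = ["this", "morning", "i", "took", "my", "cat", "for", "a", "walk"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def bag_of_words(words):
    # Count how often each vocabulary word appears; word order is discarded
    counts = np.zeros(len(vocab))
    for w in words:
        counts[word_to_idx[w]] += 1.0
    return counts

print(bag_of_words("this morning i took my cat for a".split()))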



Problem #2: Counts Don't Preserve Order

"The food was good, not bad at all."
vs.
"The food was bad, not good at all."

The two sentences have identical word counts but opposite meanings.

(H. Suresh, 6.S191 2018.)



Idea #3: Use a Really Big Fixed Window

"This morning I took my cat for a walk."
Given these words, predict the next word.

[1 0 0 0 0  0 0 0 0 1  0 0 1 0 0  0 1 0 0 0  0 0 0 1 0  … ]  →  prediction
(each block of entries is the one-hot encoding of one word in the window)

(H. Suresh, 6.S191 2018.)



Problem #3: No Parameter Sharing

[1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 … ]   (this, morning, took, the, cat, …)

Each of these inputs has a separate parameter:

[0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 … ]   (… with "this morning" now appearing later in the window)

Things we learn about the sequence won't transfer if they appear elsewhere in the sequence.

(H. Suresh, 6.S191 2018.)



Sequence Modeling: Design Criteria

To model sequences, we need to:
1. Handle variable-length sequences
2. Track long-term dependencies
3. Maintain information about order
4. Share parameters across the sequence

Today: Recurrent Neural Networks (RNNs) as an approach to sequence modeling problems.



Recurrent Neural Networks (RNNs)



Standard Feed-Forward Neural Network

One to One: a "vanilla" neural network maps a single input x to a single output ŷ.



Recurrent Neural Networks for Sequence Modeling

One to One: "vanilla" neural network
Many to One: sentiment classification
Many to Many: music generation (6.S191 Lab!)
… and many other architectures and applications



Standard "Vanilla" Neural Network

input vector x_t → network → output vector ŷ_t



Recurrent Neural Network (RNN)

input vector x_t → RNN recurrent cell (internal state h_t) → output vector ŷ_t

The recurrent cell maintains a state h_t that is fed back into the cell at the next time step.



Recurrent Neural Network (RNN)

Apply a recurrence relation at every time step to process a sequence:

    h_t = f_W(h_{t-1}, x_t)

where h_t is the cell state, f_W is a function parameterized by weights W, h_{t-1} is the old state, and x_t is the input vector at time step t.

Note: the same function and set of parameters are used at every time step.



RNN Intuition

my_rnn = RNN()
hidden_state = [0, 0, 0, 0]

sentence = ["I", "love", "recurrent", "neural"]

for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

next_word_prediction = prediction  # >>> "networks!"



RNN State Update and Output

Input vector:          x_t
Update hidden state:   h_t = tanh(W_hh h_{t-1} + W_xh x_t)
Output vector:         ŷ_t = W_hy h_t
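A minimal NumPy sketch of one state update and one output computation, with small illustrative dimensions and random placeholder weights (not learned values):

import numpy as np

input_dim, rnn_units, output_dim = 3, 4, 2
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(rnn_units, input_dim))    # input-to-hidden weights
W_hh = rng.normal(size=(rnn_units, rnn_units))    # hidden-to-hidden weights
W_hy = rng.normal(size=(output_dim, rnn_units))   # hidden-to-output weights

h = np.zeros((rnn_units, 1))                      # previous hidden state h_{t-1}
x = rng.normal(size=(input_dim, 1))               # input vector x_t

h = np.tanh(W_hh @ h + W_xh @ x)                  # update hidden state: h_t
y_hat = W_hy @ h                                  # output vector: ŷ_t
print(h.shape, y_hat.shape)                       # (4, 1) (2, 1)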



RNNs: Computational Graph Across Time

Represent the RNN as a computational graph unrolled across time: at each time step t, the input x_t enters through W_xh, the previous state passes through W_hh, and the output ŷ_t is produced through W_hy.

Re-use the same weight matrices (W_xh, W_hh, W_hy) at every time step.

Forward pass: a loss L_t is computed from each output ŷ_t, and the individual losses are summed into the total loss L.



RNNs from Scratch

import tensorflow as tf

class MyRNNCell(tf.keras.layers.Layer):
    def __init__(self, rnn_units, input_dim, output_dim):
        super(MyRNNCell, self).__init__()

        # Initialize weight matrices
        self.W_xh = self.add_weight(shape=[rnn_units, input_dim])
        self.W_hh = self.add_weight(shape=[rnn_units, rnn_units])
        self.W_hy = self.add_weight(shape=[output_dim, rnn_units])

        # Initialize hidden state to zeros
        self.h = tf.zeros([rnn_units, 1])

    def call(self, x):
        # Update the hidden state: h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        self.h = tf.math.tanh(tf.matmul(self.W_hh, self.h) + tf.matmul(self.W_xh, x))

        # Compute the output: ŷ_t = W_hy h_t
        output = tf.matmul(self.W_hy, self.h)

        # Return the current output and hidden state
        return output, self.h
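A quick sanity check of the cell above on a single dummy input vector, assuming the MyRNNCell class defined above (shapes are illustrative):

import tensorflow as tf

cell = MyRNNCell(rnn_units=4, input_dim=3, output_dim=2)
x_t = tf.random.normal([3, 1])     # one input vector x_t
y_hat, h = cell(x_t)               # one step of the recurrence
print(y_hat.shape, h.shape)        # (2, 1) (4, 1)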



RNN Implementation in TensorFlow

tf.keras.layers.SimpleRNN(rnn_units)
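A minimal sketch of how this layer might sit inside a next-word prediction model; the vocabulary size, embedding size, and unit count are illustrative choices, not values from the lecture:

import tensorflow as tf

vocab_size, embedding_dim, rnn_units = 1000, 64, 128

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),   # map word indices to dense vectors
    tf.keras.layers.SimpleRNN(rnn_units),                   # process the sequence; return the final hidden state
    tf.keras.layers.Dense(vocab_size),                      # logits over the next word
])

model.build(input_shape=(None, None))   # (batch, time steps)
model.summary()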



Backpropagation Through Time (BPTT)



Recall: Backpropagation in Feed-Forward Models

Backpropagation algorithm:
1. Take the derivative (gradient) of the loss with respect to each parameter
2. Shift parameters in order to minimize loss



RNNs: Backpropagation Through Time

Forward pass: unroll the network across time and compute a loss L_t at each time step, summing into the total loss L.
Backward pass: gradients of the loss flow backwards through each individual time step and then across time steps, from the last hidden state all the way back to the beginning of the sequence.

(Mozer, Complex Systems 1989.)



Standard RNN Gradient Flow

Backpropagating from h_t to h_0 passes through every intermediate hidden state, so computing the gradient with respect to h_0 involves many factors of W_hh (and repeated gradient computation).



Standard RNN Gradient Flow: Exploding Gradients

Many values > 1: exploding gradients.
Remedy: gradient clipping to scale back big gradients.
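A hedged sketch of gradient clipping in TensorFlow (the clipping threshold of 1.0 is an arbitrary illustration):

import tensorflow as tf

# Option 1: let the optimizer clip gradient norms before applying updates
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Option 2: clip manually inside a custom training step
# grads = tape.gradient(loss, model.trainable_variables)
# grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
# optimizer.apply_gradients(zip(grads, model.trainable_variables))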



Standard RNN Gradient Flow: Vanishing Gradients

Many values < 1: vanishing gradients.
Remedies: 1. Activation function  2. Weight initialization  3. Network architecture



The Problem of Long-Term Dependencies

Why are vanishing gradients a problem? Multiplying many small numbers together means that errors due to time steps further back have smaller and smaller gradients, which biases the parameters toward capturing short-term dependencies.

"The clouds are in the ___": the relevant word is only a few steps back, so the dependency is short and easy to capture.

"I grew up in France, … and I speak fluent ___": the relevant information is many time steps back, and with vanishing gradients a standard RNN struggles to capture this long-term dependency.



Trick #1: Activation Functions

Using ReLU prevents the derivative f'(x) from shrinking the gradients when x > 0: the ReLU derivative is exactly 1 for x > 0, whereas the tanh and sigmoid derivatives are less than 1 over most of their domain.

(H. Suresh, 6.S191 2018.)



Trick #2: Parameter Initialization

Initialize weights to the identity matrix; initialize biases to zero.
This helps prevent the weights from shrinking to zero.

(H. Suresh, 6.S191 2018.)
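One way to express this initialization with Keras, as a hedged sketch (the unit count is illustrative):

import tensorflow as tf

layer = tf.keras.layers.SimpleRNN(
    units=128,
    recurrent_initializer=tf.keras.initializers.Identity(),   # hidden-to-hidden weights start as the identity matrix
    bias_initializer="zeros",                                  # biases start at zero
)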



Solution #3: Gated Cells

Idea: use a more complex recurrent unit with gates to control what information is passed through.

Gated cells: LSTM, GRU, etc.

Long Short Term Memory (LSTM) networks rely on a gated cell to track information throughout many time steps.

(H. Suresh, 6.S191 2018.)



Long Short Term Memory (LSTM) Networks



Standard RNN

In a standard RNN, repeating modules contain a simple computation node: a single tanh layer that combines h_{t-1} and x_t into h_t.



Long Short Term Memory (LSTMs)

LSTM modules contain computational blocks that control information flow. LSTM cells are able to track information throughout many time steps.

tf.keras.layers.LSTM(num_units)

(Hochreiter & Schmidhuber, Neural Computation 1997.)
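A brief sketch of two common ways this layer is used (shapes are illustrative):

import tensorflow as tf

x = tf.random.normal([32, 10, 8])   # (batch, time steps, features)

final_state = tf.keras.layers.LSTM(128)(x)                              # many-to-one: shape (32, 128)
per_step_states = tf.keras.layers.LSTM(128, return_sequences=True)(x)   # many-to-many: shape (32, 10, 128)
print(final_state.shape, per_step_states.shape)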



Long Short Term Memory (LSTMs)

Information is added or removed through structures called gates. Gates optionally let information through, for example via a sigmoid neural net layer and pointwise multiplication.



Long Short Term Memory (LSTMs): How do LSTMs work?

1) Forget: LSTMs forget irrelevant parts of the previous state.
2) Store: LSTMs store relevant new information into the cell state.
3) Update: LSTMs selectively update cell state values.
4) Output: the output gate controls what information is sent to the next time step.

(Hochreiter & Schmidhuber, Neural Computation 1997; Olah, "Understanding LSTMs".)



LSTM Gradient Flow

The cell state provides uninterrupted gradient flow: backpropagating along the chain of cell states avoids the repeated multiplications by W_hh that plague standard RNNs, mitigating the vanishing gradient problem.



LSTMs: Key Concepts

1. Maintain a separate cell state from what is outputted
2. Use gates to control the flow of information
   • Forget gate gets rid of irrelevant information
   • Store relevant information from the current input
   • Selectively update the cell state
   • Output gate returns a filtered version of the cell state
3. Backpropagation through time with uninterrupted gradient flow



RNN Applications



Example Task: Music Generation (6.S191 Lab!)

Input: sheet music
Output: next character in sheet music

(H. Suresh, 6.S191 2018.)
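A minimal sketch of this kind of next-character prediction model; the sizes below are illustrative, and this is not necessarily the lab's exact architecture:

import tensorflow as tf

num_chars, embedding_dim, rnn_units = 83, 256, 1024   # num_chars = distinct characters in the corpus

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_chars, embedding_dim),
    tf.keras.layers.LSTM(rnn_units, return_sequences=True),   # many-to-many: one hidden state per input character
    tf.keras.layers.Dense(num_chars),                          # logits over the next character at every position
])

# Trained with a cross-entropy loss between the predicted logits and the true next characters;
# at generation time, characters are sampled one at a time and fed back in as input.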



Example Task: Sentiment Classification

Input: sequence of words (e.g. "I love this class!")
Output: probability of having positive sentiment

loss = tf.nn.softmax_cross_entropy_with_logits(y, predicted)

Example: Tweet sentiment classification.

(Socher et al., EMNLP 2013; H. Suresh, 6.S191 2018.)
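A minimal many-to-one sketch for this task, with illustrative sizes:

import tensorflow as tf

vocab_size, embedding_dim, rnn_units, num_classes = 10000, 64, 128, 2

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(rnn_units),        # many-to-one: only the final hidden state is used
    tf.keras.layers.Dense(num_classes),     # logits for negative / positive sentiment
])

# One way to compute the loss, mirroring the call shown on the slide:
# loss = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=model(x))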



Example Task: Machine Translation

"the dog eats" → Encoder (English) → Decoder (French) → "le chien mange"

The entire input sentence must be compressed into the encoder's final state, which creates an encoding bottleneck.

(H. Suresh, 6.S191 2018.)
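A minimal sketch of the encoder-decoder idea with LSTMs (vocabulary sizes and dimensions are illustrative). The whole source sentence is squeezed into the fixed-size states state_h and state_c, which is exactly the encoding bottleneck noted above:

import tensorflow as tf

src_vocab, tgt_vocab, embed_dim, units = 5000, 5000, 64, 128

# Encoder: read the English sentence and compress it into a fixed-size state
encoder_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, embed_dim)(encoder_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: generate the French sentence conditioned on the encoder state
decoder_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, embed_dim)(decoder_inputs)
dec_out = tf.keras.layers.LSTM(units, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
logits = tf.keras.layers.Dense(tgt_vocab)(dec_out)

model = tf.keras.Model([encoder_inputs, decoder_inputs], logits)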



Attention Mechanisms

Attention mechanisms in neural networks provide learnable memory access: rather than relying only on the encoder's final state, the decoder learns to attend to the relevant encoder states when producing each output word.

(Sutskever et al., NIPS 2014; Bahdanau et al., ICLR 2015.)



Trajectory Prediction: Self-Driving Cars

Predicting the future trajectories of surrounding agents from their past motion. (Waymo.)



Environmental Modeling

Modeling time-varying environmental quantities such as particulates, SO2, winds, and humidity. (earth.nullschool.net.)



Deep Learning for Sequence Modeling: Summary

1. RNNs are well suited for sequence modeling tasks
2. Model sequences via a recurrence relation
3. Train RNNs with backpropagation through time
4. Gated cells like LSTMs let us model long-term dependencies
5. Models for music generation, classification, machine translation, and more



6.S191: Introduction to Deep Learning

Lab 1: Introduction to TensorFlow and Music Generation with RNNs
Link to download labs: http://introtodeeplearning.com#schedule

1. Open the lab in Google Colab
2. Start executing code blocks and filling in the #TODOs
3. Need help? Find a TA or come to the front!