Recurrent Neural Network
Also check LSTM
This is a pure numpy implementation of character-level text generation using an RNN.
We're going to have our network learn how to predict the next character in a given text. This requires a recurrent architecture, since the network has to remember a sequence of characters: the order matters. A few thousand iterations and we'll have pronounceable English; the longer the training time, the better. You can feed it any text sequence (words, Python, HTML, etc.).
What is a Recurrent Network?
Feedforward networks are great for learning a pattern between a set of inputs and outputs.
- temperature & location
- height & weight
- car speed and brand
But what if the ordering of the data matters?
The alphabet, or the lyrics of a song: these are stored using conditional memory. You can only access an element if you have access to the previous elements (like a linked list).
Enter recurrent networks
We feed the hidden state from the previous time step back into the network at the next time step.
So instead of the data flow operation happening like this
input -> hidden -> output
it happens like this
(input + prev_hidden) -> hidden -> output
wait. Why not this?
(input + prev_input) -> hidden -> output
Hidden recurrence learns what to remember, whereas input recurrence is hard-wired to just remember the immediately previous datapoint.
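A minimal numpy sketch of the difference (toy sizes; the "previous input" weight matrix Wxx is hypothetical and not part of the network we build below):

import numpy as np

vocab, hidden = 4, 3
Wxh = np.random.randn(hidden, vocab) * 0.01   # input to hidden
Whh = np.random.randn(hidden, hidden) * 0.01  # hidden to hidden
Wxx = np.random.randn(hidden, vocab) * 0.01   # hypothetical previous-input weights
bh = np.zeros((hidden, 1))

x_prev = np.zeros((vocab, 1)); x_prev[0] = 1  # previous char (one-hot)
x = np.zeros((vocab, 1)); x[1] = 1            # current char (one-hot)
h_prev = np.zeros((hidden, 1))                # previous hidden state

# hidden recurrence: h_prev can carry information from the whole history
h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h_prev) + bh)

# input recurrence: only the immediately previous char is visible, older chars are lost
h_alt = np.tanh(np.dot(Wxh, x) + np.dot(Wxx, x_prev) + bh)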
RNN Formula
h(t) = f(h(t-1), x(t); θ)
It basically says that the current hidden state h(t) is a function f of the previous hidden state h(t-1) and the current input x(t). θ are the parameters of the function f. The network typically learns to use h(t) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t.
Loss function
The total loss for a given sequence of x values paired with a sequence of y values is just the sum of the losses over all the time steps. For example, if L(t) is the negative log-likelihood of y(t) given x(1), ..., x(t), then summing the L(t) over all time steps gives the loss for the whole sequence.
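As a minimal sketch (toy numbers, not from the notebook): if ps[t] is the predicted probability distribution at step t and targets[t] is the index of the true next char, the sequence loss is just the sum of the per-step negative log-likelihoods.

import numpy as np

ps = {0: np.array([[0.7], [0.2], [0.1]]),   # fake predictions over a 3-char vocabulary
      1: np.array([[0.1], [0.8], [0.1]])}
targets = [0, 1]                            # true next chars at steps 0 and 1

loss = 0
for t in range(len(targets)):
    loss += -np.log(ps[t][targets[t], 0])   # L(t) = -log p(t)[target(t)]
print loss                                  # -log(0.7) - log(0.8), about 0.58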
Our steps
- Initialize weights randomly
- Give the model a char pair (input char & target char; the target char is the char the network should guess, it's the next char in our sequence)
- Forward pass (we calculate the probability for every possible next char according to the state of the model, using the parameters)
- Measure error (the distance between the predicted probabilities and the target char)
- Calculate gradients for each of our parameters to see the impact they have on the loss (backpropagation through time)
- Update all parameters in the direction given by the gradients, to minimise the loss
- Repeat! Until our loss is small AF
What are some use cases?
- Time series prediction (weather forecasting, stock prices, traffic volume, etc. )
- Sequential data generation (music, video, audio, etc.)
Other Examples
- https://github.com/anujdutt9/RecurrentNeuralNetwork (binary addition)
What’s next?
1. LSTM networks
2. Bidirectional networks
3. Recursive networks
The code contains the following parts:
- Load the training data
  - Encode chars into vectors
- Define the recurrent network
- Define a loss function
  - Forward pass
  - Loss
  - Backward pass
- Define a function to create sentences from the model
- Train the network
  - Feed the network
  - Calculate the gradients and update the model parameters
  - Output a text to see the progress of the training
Load the training data
The network needs a big .txt file as input.
The content of the file will be used to train the network.
I use Metamorphosis by Kafka (public domain), because Kafka was one weird dude. I like it.
In [3]:
data = open('kafka.txt', 'r').read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print 'data has %d chars, %d unique' % (data_size, vocab_size)

data has 137629 chars, 81 unique
Encode/Decode char/vector
Neural networks operate on vectors (a vector is an array of floats), so we need a way to encode and decode a char as a vector.
We'll count the number of unique chars (vocab_size). That will be the size of the vector. The vector contains only zeros, except at the position of the char, where the value is 1.
First we create two dictionaries to encode and decode a char to and from an int:
In [5]:
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }
print char_to_ix

{'\n': 0, 'C': 31, '!': 3, ' ': 4, '"': 5, '%': 6, '$': 7, "'": 8, ')': 9, '(': 10, '*': 11, '-': 12, ',': 13, '/': 2, '.': 15, '1': 16, '0': 17, '3': 18, '2': 19, '5': 20, '4': 21, '7': 22, '6': 23, '9': 24, '8': 25, ';': 26, ':': 27, '?': 28, 'A': 29, '@': 30, '\xc3': 1, 'B': 32, 'E': 33, 'D': 34, 'G': 35, 'F': 36, 'I': 37, 'H': 38, 'K': 39, 'J': 40, 'M': 41, 'L': 42, 'O': 43, 'N': 44, 'Q': 45, 'P': 46, 'S': 47, 'R': 48, 'U': 49, 'T': 50, 'W': 51, 'V': 52, 'Y': 53, 'X': 54, 'd': 59, 'a': 55, 'c': 56, 'b': 57, 'e': 58, '\xa7': 14, 'g': 60, 'f': 61, 'i': 62, 'h': 63, 'k': 64, 'j': 65, 'm': 66, 'l': 67, 'o': 68, 'n': 69, 'q': 70, 'p': 71, 's': 72, 'r': 73, 'u': 74, 't': 75, 'w': 76, 'v': 77, 'y': 78, 'x': 79, 'z': 80}
Finally we create a vector from a char like this:
The dictionaries defined above allow us to create a vector of size 81 (the vocab_size) instead of 256.
Here is an example for the char 'a'.
The vector contains only zeros, except at position char_to_ix['a'], where we put a 1.
In [7]:
import numpy as np

vector_for_char_a = np.zeros((vocab_size, 1))
vector_for_char_a[char_to_ix['a']] = 1
print vector_for_char_a.ravel()

[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0.]
Definition of the network
The neural network is made of 3 layers:
- an input layer
- a hidden layer
- an output layer
All layers are fully connected to the next one: each node of a layer is connected to all nodes of the next layer. The hidden layer is also connected to itself: the values from one iteration are used in the next one.
To centralise the values that matter for the training (the hyperparameters) we also define the sequence length and the learning rate.
In [12]:
#hyperparameters
hidden_size = 100   # size of hidden layer of neurons
seq_length = 25     # number of chars the network is unrolled over
learning_rate = 1e-1

#model parameters
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01  # input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01  # hidden to output
bh = np.zeros((hidden_size, 1))  # hidden bias
by = np.zeros((vocab_size, 1))   # output bias
The model parameters are adjusted during the training.
- Wxh are the parameters that connect the input vector to the hidden layer.
- Whh are the parameters that connect the hidden layer to itself. This is the key of the RNN: the recurrence is done by injecting the previous output of the hidden layer back into it at the next iteration.
- Why are the parameters that connect the hidden layer to the output.
- bh contains the hidden bias.
- by contains the output bias.
You'll see in the next section how these parameters are used to create a sentence.
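As a quick sanity check (just an illustrative print, assuming the cell above was run with hidden_size = 100 and vocab_size = 81), the shapes line up with this description:

print Wxh.shape, Whh.shape, Why.shape, bh.shape, by.shape
# (100, 81) (100, 100) (81, 100) (100, 1) (81, 1)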
Define the loss function
The loss is a key concept in the training of all neural networks. It is a value that describes how good our model is.
The smaller the loss, the better our model is.
(A good model is a model where the predicted output is close to the training output)
During the training phase we want to minimize the loss.
The loss function calculates the loss but also the gradients (see backward pass):
- It performs a forward pass: it calculates the probabilities of the next char given a char from the training set.
- It calculates the loss by comparing the predicted char to the target char (the target char is the char that follows the input char in the training set).
- It performs the backward pass to calculate the gradients.
This function takes as input:
- a list of input chars
- a list of target chars
- and the previous hidden state
This function outputs:
- the loss
- the gradients for each parameter between layers
- the last hidden state
Forward pass
The forward pass uses the parameters of the model (Wxh, Whh, Why, bh, by) to calculate the next char given a char from the training set.
xs[t] is the vector that encodes the char at position t; ps[t] holds the probabilities for the next char:
hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
or, as dirty pseudocode, for each char:
hs = input*Wxh + last_value_of_hidden_state*Whh + bh
ys = hs*Why + by
ps = normalized(ys)
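Here is a standalone one-step sketch (toy sizes and separate variable names, so it does not touch the notebook's parameters) that makes the shapes of the forward pass concrete:

import numpy as np

V, H = 5, 4                                   # toy vocab and hidden sizes
Wxh_d = np.random.randn(H, V) * 0.01
Whh_d = np.random.randn(H, H) * 0.01
Why_d = np.random.randn(V, H) * 0.01
bh_d, by_d = np.zeros((H, 1)), np.zeros((V, 1))

x = np.zeros((V, 1)); x[2] = 1                # one-hot encoded input char
h_prev = np.zeros((H, 1))                     # previous hidden state

h = np.tanh(np.dot(Wxh_d, x) + np.dot(Whh_d, h_prev) + bh_d)  # (H, 1) hidden state
y = np.dot(Why_d, h) + by_d                                   # (V, 1) unnormalized scores
p = np.exp(y) / np.sum(np.exp(y))                             # (V, 1) probabilities
print h.shape, y.shape, p.shape, np.sum(p)                    # probabilities sum to 1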
Backward pass
The naive way to calculate all the gradients would be to recalculate the loss for a small variation of each parameter. This is possible but would be time consuming. There is a technique to calculate all the gradients for all the parameters at once: backpropagation (here, backpropagation through time).
Gradients are calculated in the opposite order of the forward pass, using the chain rule.
The goal is to calculate the gradients for the forward formula:
hs = input*Wxh + last_value_of_hidden_state*Whh + bh
ys = hs*Why + by
The loss for one datapoint is the softmax cross-entropy: L(t) = -log(ps[t][target]), where ps[t] = softmax(ys[t]).
How should the computed scores ys change to decrease the loss? We'll need to derive a gradient to figure that out.
Since all output units contribute to the error of each hidden unit, we sum up all the gradients calculated at each time step in the sequence and use them to update the parameters.
The first gradient of our loss with respect to the scores is dy = ps[t], with 1 subtracted at the target index. We'll backpropagate this via the chain rule.
The chain rule is a method for finding the derivative of composite functions, that is, functions made by combining two or more functions.
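As a small numeric sanity check (an illustration only, not part of the training code below): the gradient of the softmax cross-entropy loss with respect to the scores is just the probabilities with 1 subtracted at the target index, which is exactly what the dy[targets[t]] -= 1 line does. A finite difference agrees:

import numpy as np

y = np.random.randn(5, 1)                 # fake unnormalized scores over 5 chars
target = 2
p = np.exp(y) / np.sum(np.exp(y))         # softmax
loss = -np.log(p[target, 0])              # cross-entropy loss for the target char

dy = np.copy(p)
dy[target] -= 1                           # analytic gradient: dL/dy = p - 1 at target

eps = 1e-5
y_shift = np.copy(y); y_shift[0] += eps   # nudge one score
p_shift = np.exp(y_shift) / np.sum(np.exp(y_shift))
numeric = (-np.log(p_shift[target, 0]) - loss) / eps
print dy[0, 0], numeric                   # the two values should (almost) match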
In [14]:
def lossFun(inputs, targets, hprev):
    """
    inputs, targets are both lists of integers.
    hprev is a Hx1 array holding the initial hidden state.
    Returns the loss, the gradients on the model parameters, and the last hidden state.
    """
    # Store our inputs, hidden states, outputs, and probability values.
    # Each of these is a dict with one vector per time (seq) step:
    #   xs stores the one-hot encoded input chars (vocab_size x 1, seq_length times)
    #   hs stores the hidden state outputs (hidden_size x 1), plus a -1 indexed initial
    #      state used to calculate the hidden state at t = 0
    #   ys stores the unnormalized scores for the next char (vocab_size x 1)
    #   ps takes the ys and converts them to normalized probabilities for chars
    # We could have used lists BUT we need an entry at -1 to calc the 0th hidden state
    # (-1 as a list index would wrap around to the final element).
    xs, hs, ys, ps = {}, {}, {}, {}
    # init with the previous hidden state
    # Using "=" would create a reference; np.copy creates a whole separate copy.
    # We don't want hs[-1] to automatically change if hprev is changed.
    hs[-1] = np.copy(hprev)
    # init loss as 0
    loss = 0
    # forward pass
    for t in xrange(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1)) # encode in 1-of-k representation (we place a 0 vector as the t-th input)
        xs[t][inputs[t]] = 1 # inside that t-th input, use the integer in the "inputs" list to set the correct position to 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
        ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
        loss += -np.log(ps[t][targets[t], 0]) # softmax (cross-entropy loss)
    # backward pass: compute gradients going backwards
    # initialize vectors for the gradient values of each set of weights
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(xrange(len(inputs))):
        # output probabilities
        dy = np.copy(ps[t])
        # derive our first gradient: dy = ps[t] minus 1 at the target index
        dy[targets[t]] -= 1 # backprop into y
        # output gradient - dy times hidden state transpose.
        # When we apply the transposed weight matrix, we can think intuitively of this
        # as moving the error backward through the network, giving us some sort of
        # measure of the error at the output of the previous layer.
        dWhy += np.dot(dy, hs[t].T)
        # derivative of output bias
        dby += dy
        # backpropagate!
        dh = np.dot(Why.T, dy) + dhnext # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
        dbh += dhraw # derivative of hidden bias
        dWxh += np.dot(dhraw, xs[t].T) # derivative of input to hidden layer weights
        dWhh += np.dot(dhraw, hs[t-1].T) # derivative of hidden to hidden layer weights
        dhnext = np.dot(Whh.T, dhraw)
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
Create a sentence from the model
In [17]:
#prediction, one full forward pass
def sample(h, seed_ix, n):
    """
    sample a sequence of integers from the model
    h is memory state, seed_ix is seed letter for first time step
    n is how many characters to predict
    """
    #create an input vector
    x = np.zeros((vocab_size, 1))
    #customize it for our seed char
    x[seed_ix] = 1
    #list to store generated chars
    ixes = []
    #for as many characters as we want to generate
    for t in xrange(n):
        #a hidden state at a given time step is a function
        #of the input at the same time step modified by a weight matrix,
        #added to the hidden state of the previous time step
        #multiplied by its own hidden state to hidden state matrix
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        #compute output (unnormalised)
        y = np.dot(Why, h) + by
        #probabilities for next chars
        p = np.exp(y) / np.sum(np.exp(y))
        #sample a char index from that probability distribution
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        #create a new input vector
        x = np.zeros((vocab_size, 1))
        #customize it for the sampled char
        x[ix] = 1
        #add it to the list
        ixes.append(ix)
    #decode the generated indexes and print the text
    txt = ''.join(ix_to_char[ix] for ix in ixes)
    print '----\n %s \n----' % (txt, )

hprev = np.zeros((hidden_size, 1)) # reset RNN memory
#predict the 200 next characters given 'a'
sample(hprev, char_to_ix['a'], 200)

----
HD')JsuLNgDw!4ejckLaA6zAt*(d'�5O;p5DoglJ!u-RN'v/:XaByA1Q"BWcO*Tip:AO/O%Wl/o)�Q1(DC.�3ztLPh/9RJ'
?W"p
hpXoCIQgN3!j0-@49z9Hit5Kc$mn'L8
:T.T;2.bKd(GS$7 ID8@?iitb2"OSubD7wJyh3@(:5hEmKX4T?i;;Ceq,BmY2yoPHx
----
Training
This last part of the code is the main training loop:
- Feed the network with a portion of the file. The size of each chunk is seq_length chars.
- Use the loss function to:
  - do a forward pass to calculate the loss for the given input/target pair
  - do a backward pass to calculate all the gradients
- Print a sentence from a seed char using the parameters of the network
- Update the model using the Adaptive Gradient technique (Adagrad)
Feed the loss function with inputs and targets
We create two arrays of chars from the data file; the targets array is shifted by one compared to the inputs array.
For each char in the inputs array, the targets array gives the char that follows.
In [15]:
p=0
inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
print "inputs", inputs
targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
print "targets", targetsinputs [43, 69, 58, 4, 66, 68, 73, 69, 62, 69, 60, 13, 4, 76, 63, 58, 69, 4, 35, 73, 58, 60, 68, 73, 4]
targets [69, 58, 4, 66, 68, 73, 69, 62, 69, 60, 13, 4, 76, 63, 58, 69, 4, 35, 73, 58, 60, 68, 73, 4, 47]
Adagrad to update the parameters
This is a type of gradient descent strategy
For plain gradient descent the step size is just the learning rate.
The simplest technique to update the parameters of the model is this:
param += -learning_rate * dparam
Adagrad is a more efficient technique where the effective step size gets smaller during the training.
It uses a memory variable that grows over time:
mem += dparam * dparam
and uses it to scale the step size:
step_size = learning_rate / np.sqrt(mem + 1e-8)
In short:
mem += dparam * dparam
param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update
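A tiny toy example (a sketch, not part of the notebook) shows the effect: with a constant gradient, the memory keeps growing and each Adagrad update gets smaller.

import numpy as np

param = np.array([1.0])
mem = np.zeros_like(param)
learning_rate = 1e-1
for i in range(5):
    dparam = np.array([2.0])              # pretend the gradient stays the same
    mem += dparam * dparam
    step = learning_rate * dparam / np.sqrt(mem + 1e-8)
    param += -step                        # adagrad update
    print 'update %d, step %f' % (i, step[0])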
Smooth_loss
Smooth_loss doesn’t play any role in the training. It is just a low pass filtered version of the loss:
smooth_loss = smooth_loss * 0.999 + loss * 0.001
It is a way to average the loss over the last iterations, to better track the training progress.
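A quick illustration (toy numbers, separate from the training loop) of how the filter behaves: a sudden drop in the raw loss only shows up gradually in the smoothed value.

smooth = 100.0
for loss in [100, 100, 50, 50, 50]:
    smooth = smooth * 0.999 + loss * 0.001
    print loss, round(smooth, 3)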
So finally
Here is the code of the main loop that does the training and generates some text from time to time:
In [18]:
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
while n <= 1000*100:
    # prepare inputs (we're sweeping from left to right in steps seq_length long)
    # check "Feed the loss function with inputs and targets" above to see how this part works
    if p+seq_length+1 >= len(data) or n == 0:
        hprev = np.zeros((hidden_size, 1)) # reset RNN memory
        p = 0 # go from start of data
    inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
    targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

    # forward seq_length characters through the net and fetch gradient
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    smooth_loss = smooth_loss * 0.999 + loss * 0.001

    # sample from the model now and then
    if n % 1000 == 0:
        print 'iter %d, loss: %f' % (n, smooth_loss) # print progress
        sample(hprev, inputs[0], 200)

    # perform parameter update with Adagrad
    for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                  [dWxh, dWhh, dWhy, dbh, dby],
                                  [mWxh, mWhh, mWhy, mbh, mby]):
        mem += dparam * dparam
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

    p += seq_length # move data pointer
    n += 1 # iteration counter

iter 0, loss: 109.861223
----
/gQv8H'-656q@7o 0A
r2,GWqf:
l1
dPSqT18X!7g"*,M�-Gw�rM9e(M(KLW)Yt;C?3N*jWJx@i*BMdoU4K5MoeJOI/�4�-l
MA.*)HgUd,o4bW;'DK/')NE
$:z9b6xkaqA$PnVRTjRLtG�Cf gbv�DH�:os;ByVL$F(fxsSo(v86W.kt,8i miEOr6"y0o(5Mx �/
----
iter 1000, loss: 86.766971
----
hi k cehi. kposod w pit se oc hebint gi d bis ,he nn ne to- t mir ildtec- aovi s tof lica . tha bilt Ig mclinlf, t orir!ee peld wafrecathe fm doathioame mhass fd al w, eagoy s woyigs miievonr, iee W
----
iter 2000, loss: 69.478687
----
he ate leth hebuelyhe wGthremilkel hore cke inse mo highe Is hodont mlathem wise ceuG here le sime socleelldk nuSind d aS there'cecichind hither whe angave he ut soure herle tathes be nos, athlle sos
----
iter 3000, loss: 60.740861
----
ld oulile hisisced abdan shimttey ir ad hay leri gousse way ynas,rs simos himonac the way in ag lnere wouis sfat wad boon ooutilit; whare forey ruro ud tuuly he furny arghly lus bpers;nupdey hered bom
----
iter 4000, loss: 56.398800
----
her thonilgme heprailytinwok hle thit tootwis acks in aas fom songain towith waen him the waninty on the beinl win thitghed ning wspa nhat tweaw. bus hire moror'od hre in ang whithenghaw whoraliwiit b
----
iter 5000, loss: 61.150802
----
erinfpmeacho cuthath7rPUiitey tor Glfer lewd entin7 bo asche toOinLlen Groa ciwinreIed yoin
atho1le p1sthMyin clqot 1kee athmtina7froit acros.in kertho inm, at7iveeiGpBoelem mo erwor ast.ot oach mche
----
iter 6000, loss: 62.854070
----
asidyade sy uld Eafere whey cist om not cad hay cally' robd ffot sa bustepcomed. Beger chawgesmey coull thouns,e wouls ir co thomered it hit 1o rad Inat dof hit nf, are bl oo hew of woundall c0earacle
----
iter 7000, loss: 57.474113
----
ems bound ht hacteld has ar teis the his oume co Boom hinde piud tor, sis a ding nod ut, saprisdi ang
es oyoustacullaghs n6 fet havat thof was waabs angimy an atC ace rumtome thovi hef wad fafattenin
----
iter 8000, loss: 53.622344
----
aed the nised wat sthed anr the woulch not sued On in the eat qurecantt avem shipke faed il reln t, at have the hime hlecr hid coobd, beghem aly en andand sidf and lers ceeve ild on in ad Gred oipnea
Final words:
Follow me as I write about Machine Learning, Data Science and Blockchain. You can follow me and my code on GitHub.
Have a nice day !!!