GRU units

Dimitri Fichou

2022-07-12

Feed forward pass

\[ r_t = sigmoid(h_{t-1} * W_r + x_t * U_r) \]

\[ z_t = sigmoid(h_{t-1} * W_z + x_t * U_z) \]

\[ g_t = tanh((h_{t-1} \cdot r_t) * W_g + x_t * U_g) \]

\[ h_t = y_t = h_{t-1} \cdot (1 - z_t) + (z_t \cdot g_t) \]
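The four equations above can be sketched in a few lines of NumPy. This is only an illustration, not the package's implementation: the function name `gru_forward`, the shapes (one sample per row) and the random weights are assumptions, and the hidden state is placed on the left of each weight matrix, as in the \(r_t\) and \(z_t\) equations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_forward(x_t, h_prev, W_r, U_r, W_z, U_z, W_g, U_g):
    """One GRU time step. `@` is the matrix product (*), `*` is point wise."""
    r_t = sigmoid(h_prev @ W_r + x_t @ U_r)          # reset gate
    z_t = sigmoid(h_prev @ W_z + x_t @ U_z)          # update gate
    g_t = np.tanh((h_prev * r_t) @ W_g + x_t @ U_g)  # candidate state
    h_t = h_prev * (1.0 - z_t) + z_t * g_t           # new hidden state, also the output y_t
    return r_t, z_t, g_t, h_t

# Hypothetical sizes: 2 samples, 3 inputs, 4 hidden units.
rng = np.random.default_rng(0)
batch, n_in, n_hidden = 2, 3, 4
x_t = rng.normal(size=(batch, n_in))
h_prev = np.zeros((batch, n_hidden))                 # no history at the first time step
W_r, W_z, W_g = (rng.normal(size=(n_hidden, n_hidden)) for _ in range(3))
U_r, U_z, U_g = (rng.normal(size=(n_in, n_hidden)) for _ in range(3))

r_t, z_t, g_t, h_t = gru_forward(x_t, h_prev, W_r, U_r, W_z, U_z, W_g, U_g)
print(h_t.shape)  # (2, 4)
```

The gates and the candidate state are returned along with \(h_t\) because the back propagation pass needs them again.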

Back propagation pass

To perform BPTT with a GRU unit, we need the error coming from the top layer (\(\delta 1\)) and the error coming from the future hidden state (\(\delta 2\)). We also need the states that were stored at each time step during the feed forward pass. The error from the future hidden state is simply set to zero if it has not been calculated yet. By convention, \(\cdot\) denotes point wise multiplication, while \(*\) denotes matrix multiplication.

The back propagation rules come from this post.

\[\delta 3 = \delta 1 + \delta 2 \]

\[\delta 4 = (1 - z_t) \cdot \delta 3 \]

\[\delta 5 = \delta 3 \cdot h_{t-1} \]

\[\delta 6 = - \delta 5 \]

\[\delta 7 = \delta 3 \cdot g_t \]

\[\delta 8 = \delta 3 \cdot z_t \]

\[\delta 9 = \delta 6 + \delta 7 \]

\[\delta 10 = \delta 8 \cdot tanh'(g_t) \]

\[\delta 11 = \delta 9 \cdot sigmoid'(z_t) \]
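The notation \(sigmoid'\) and \(tanh'\) is not spelled out here. One common reading, assumed in this note, is that it denotes the derivative of the activation written in terms of its stored output, so that only \(r_t\), \(z_t\) and \(g_t\) need to be kept from the feed forward pass:

\[ sigmoid'(z_t) = z_t \cdot (1 - z_t) \]

\[ sigmoid'(r_t) = r_t \cdot (1 - r_t) \]

\[ tanh'(g_t) = 1 - g_t \cdot g_t \]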

\[\delta 12 = \delta 10 * W_g^T \] \[\delta 13 = \delta 10 * U_g^T \] \[\delta 14 = \delta 11 * W_z^T \] \[\delta 15 = \delta 11 * U_z^T \]

\[\delta 16 = \delta 12 \cdot h_{t-1} \] \[\delta 17 = \delta 12 \cdot r_t \]

\[\delta 18 = \delta 16 \cdot sigmoid'(r_t) \]

\[\delta 19 = \delta 17 + \delta 4 \]

\[\delta 20 = \delta 18 * W_r^T \] \[\delta 21 = \delta 18 * U_r^T \]

\[\delta 22 = \delta 20 + \delta 14 \]

\[\delta 23 = \delta 19 + \delta 22 \]

\[\delta 24 = \delta 13 + \delta 15 + \delta 21 \]

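The errors \(\delta 23\) and \(\delta 24\) are passed on to the rest of the network: \(\delta 23\) is the error on \(h_{t-1}\), used as \(\delta 2\) at the previous time step, and \(\delta 24\) is the error on \(x_t\), used by the layer below. Once all those errors are available, it is possible to calculate the weight update.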
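One way to check a chain of errors like this is to code a single time step and compare it with a finite-difference estimate. The sketch below is an assumption-laden illustration rather than the package's code: it follows the feed forward convention (\(W\) multiplies \(h_{t-1}\), \(U\) multiplies \(x_t\)), names variables by their role rather than by \(\delta\) number, and takes \(sigmoid'\) and \(tanh'\) as derivatives expressed through the stored outputs. The names `gru_backward` and `p` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_forward(x_t, h_prev, p):
    r_t = sigmoid(h_prev @ p["W_r"] + x_t @ p["U_r"])
    z_t = sigmoid(h_prev @ p["W_z"] + x_t @ p["U_z"])
    g_t = np.tanh((h_prev * r_t) @ p["W_g"] + x_t @ p["U_g"])
    h_t = h_prev * (1.0 - z_t) + z_t * g_t
    return r_t, z_t, g_t, h_t

def gru_backward(d_top, d_future, x_t, h_prev, r_t, z_t, g_t, p):
    """Errors on h_{t-1} and x_t for one GRU time step."""
    d_h = d_top + d_future                            # total error on h_t (delta 3)
    d_g = (d_h * z_t) * (1.0 - g_t ** 2)              # error before the tanh
    d_z = (d_h * (g_t - h_prev)) * z_t * (1.0 - z_t)  # error before the update gate sigmoid
    d_hr = d_g @ p["W_g"].T                           # error on the product h_{t-1} . r_t
    d_r = (d_hr * h_prev) * r_t * (1.0 - r_t)         # error before the reset gate sigmoid
    d_h_prev = (d_h * (1.0 - z_t) + d_hr * r_t
                + d_z @ p["W_z"].T + d_r @ p["W_r"].T)
    d_x = d_g @ p["U_g"].T + d_z @ p["U_z"].T + d_r @ p["U_r"].T
    return d_h_prev, d_x, (d_r, d_z, d_g)

# Hypothetical sizes and random values, for the check only.
rng = np.random.default_rng(1)
n_in, n_hidden = 3, 4
p = {name: rng.normal(size=(n_hidden if name.startswith("W") else n_in, n_hidden))
     for name in ("W_r", "U_r", "W_z", "U_z", "W_g", "U_g")}
x_t = rng.normal(size=(2, n_in))
h_prev = rng.normal(size=(2, n_hidden))
d_top = rng.normal(size=(2, n_hidden))

r_t, z_t, g_t, h_t = gru_forward(x_t, h_prev, p)
d_h_prev, d_x, _ = gru_backward(d_top, np.zeros_like(d_top),
                                x_t, h_prev, r_t, z_t, g_t, p)

# Central finite difference on one entry of h_prev, for the loss sum(h_t * d_top).
eps = 1e-6
h_plus, h_minus = h_prev.copy(), h_prev.copy()
h_plus[0, 0] += eps
h_minus[0, 0] -= eps
loss = lambda h: np.sum(gru_forward(x_t, h, p)[3] * d_top)
numeric = (loss(h_plus) - loss(h_minus)) / (2 * eps)
print(numeric, d_h_prev[0, 0])  # the two values should agree closely
```

The future error is passed as zeros here, which corresponds to the last time step of a sequence.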

\[\delta W_r = \delta W_r + h_{t-1}^T * \delta 18 \] \[\delta U_r = \delta U_r + x_{t}^T * \delta 18 \]

\[\delta W_z = \delta W_z + h_{t-1}^T * \delta 11 \] \[\delta U_z = \delta U_z + x_{t}^T * \delta 11 \]

\[\delta W_g = \delta W_g + (h_{t-1} \cdot r_t)^T * \delta 10 \] \[\delta U_g = \delta U_g + x_{t}^T * \delta 10 \]
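Because the same weights are used at every time step, each update is a running sum over the sequence. A minimal sketch of that accumulation, assuming `d_r`, `d_z` and `d_g` are the errors taken just before the activation of the reset gate, the update gate and the candidate state; the function name `accumulate_gru_gradients` and the all-ones values are made up for illustration.

```python
import numpy as np

def accumulate_gru_gradients(grads, x_t, h_prev, r_t, d_r, d_z, d_g):
    """Add one time step's contribution to the running weight gradients."""
    grads["W_r"] += h_prev.T @ d_r
    grads["U_r"] += x_t.T @ d_r
    grads["W_z"] += h_prev.T @ d_z
    grads["U_z"] += x_t.T @ d_z
    grads["W_g"] += (h_prev * r_t).T @ d_g   # input of W_g is the gated hidden state
    grads["U_g"] += x_t.T @ d_g
    return grads

# Hypothetical sizes: 2 samples, 3 inputs, 4 hidden units.
batch, n_in, n_hidden = 2, 3, 4
grads = {name: np.zeros((n_hidden if name.startswith("W") else n_in, n_hidden))
         for name in ("W_r", "U_r", "W_z", "U_z", "W_g", "U_g")}

# Two time steps with made-up stored states and errors (all ones, reset gate at 0.5).
x_t = np.ones((batch, n_in))
h_prev = np.ones((batch, n_hidden))
r_t = np.full((batch, n_hidden), 0.5)
d = np.ones((batch, n_hidden))
for _ in range(2):
    grads = accumulate_gru_gradients(grads, x_t, h_prev, r_t, d, d, d)

print(grads["W_r"][0, 0], grads["W_g"][0, 0])  # 4.0 2.0
```

Each gradient has the same shape as the weight matrix it belongs to, so it can be scaled by the learning rate and applied once the whole sequence has been processed.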