GRU
- class continual.GRU(input_size, hidden_size, num_layers=1, bias=True, dropout=0.0, device=None, dtype=None, *args, **kwargs)
Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.
For each element in the input sequence, each layer computes the following function:

$$
\begin{array}{ll}
r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t = \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})) \\
h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{array}
$$

where $h_t$ is the hidden state at time $t$, $x_t$ is the input at time $t$, $h_{t-1}$ is the hidden state of the layer at time $t-1$ or the initial hidden state at time $0$, and $r_t$, $z_t$, $n_t$ are the reset, update, and new gates, respectively. $\sigma$ is the sigmoid function, and $\odot$ is the Hadamard product.
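As a concrete check of these equations, the sketch below computes a single step by hand and compares it against torch.nn.GRUCell, which uses the same gate definitions and stacked (r, z, n) parameter layout. The sizes are arbitrary illustration values, and GRUCell serves only as a reference implementation, not as the continual API.

import torch

# Arbitrary illustration sizes; torch.nn.GRUCell is used purely as a
# reference implementation of the gate equations above.
input_size, hidden_size = 10, 20
cell = torch.nn.GRUCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)
h_prev = torch.randn(1, hidden_size)

# Split the stacked parameters into the (r, z, n) gate order used by PyTorch.
W_ir, W_iz, W_in = cell.weight_ih.chunk(3, dim=0)
W_hr, W_hz, W_hn = cell.weight_hh.chunk(3, dim=0)
b_ir, b_iz, b_in = cell.bias_ih.chunk(3)
b_hr, b_hz, b_hn = cell.bias_hh.chunk(3)

r_t = torch.sigmoid(x_t @ W_ir.T + b_ir + h_prev @ W_hr.T + b_hr)       # reset gate
z_t = torch.sigmoid(x_t @ W_iz.T + b_iz + h_prev @ W_hz.T + b_hz)       # update gate
n_t = torch.tanh(x_t @ W_in.T + b_in + r_t * (h_prev @ W_hn.T + b_hn))  # new gate
h_t = (1 - z_t) * n_t + z_t * h_prev                                    # next hidden state

assert torch.allclose(h_t, cell(x_t, h_prev), atol=1e-6)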
In a multilayer GRU, the input $x^{(l)}_t$ of the $l$-th layer ($l \ge 2$) is the hidden state $h^{(l-1)}_t$ of the previous layer multiplied by dropout $\delta^{(l-1)}_t$, where each $\delta^{(l-1)}_t$ is a Bernoulli random variable which is $0$ with probability dropout.
- Parameters:
input_size (int) – The number of expected features in the input x
hidden_size (int) – The number of features in the hidden state h
num_layers (int) – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two GRUs together to form a stacked GRU, with the second GRU taking in outputs of the first GRU and computing the final results. Default: 1
bias (bool) – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
dropout (float) – If non-zero, introduces a Dropout layer on the outputs of each GRU layer except the last layer, with dropout probability equal to dropout. Default: 0
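For instance, these parameters combine as follows (argument values here are arbitrary):

import continual as co

# Two stacked layers without bias terms; dropout of 0.5 is applied
# between the first and second layer (not after the last layer).
gru = co.GRU(input_size=10, hidden_size=20, num_layers=2, bias=False, dropout=0.5)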
- Inputs: input, h_0
input: tensor of shape $(N, H_{in}, L)$ containing the features of the input sequence. The input can also be a packed variable-length sequence. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.
h_0: tensor of shape $(\text{num\_layers}, N, H_{out})$ containing the initial hidden state for each element in the batch. Defaults to zeros if not provided.
where:

$$
\begin{aligned}
N &= \text{batch size} \\
L &= \text{sequence length} \\
H_{in} &= \text{input\_size} \\
H_{out} &= \text{hidden\_size}
\end{aligned}
$$
- Outputs: output, h_n
output: tensor of shape $(N, H_{out}, L)$ containing the output features ($h_t$) from the last layer of the GRU, for each $t$. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.
h_n: tensor of shape $(\text{num\_layers}, N, H_{out})$ containing the final hidden state for each element in the batch.
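The shapes above can be verified end to end; a small sketch (batch size, lengths, and feature sizes are arbitrary, and the layout follows the batch-first, channel-next, temporal-last convention noted below):

import torch
import continual as co

gru = co.GRU(input_size=10, hidden_size=20, num_layers=2)

x = torch.randn(4, 10, 16)   # (N, H_in, L)
h0 = torch.randn(2, 4, 20)   # (num_layers, N, H_out)

output, hn = gru(x, h0)
assert output.shape == (4, 20, 16)  # (N, H_out, L)
assert hn.shape == (2, 4, 20)       # (num_layers, N, H_out)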
- Variables:
weight_ih_l[k] – the learnable input-hidden weights of the $k$-th layer (W_ir|W_iz|W_in), of shape (3*hidden_size, input_size) for k = 0. Otherwise, the shape is (3*hidden_size, hidden_size)
weight_hh_l[k] – the learnable hidden-hidden weights of the $k$-th layer (W_hr|W_hz|W_hn), of shape (3*hidden_size, hidden_size)
bias_ih_l[k] – the learnable input-hidden bias of the $k$-th layer (b_ir|b_iz|b_in), of shape (3*hidden_size)
bias_hh_l[k] – the learnable hidden-hidden bias of the $k$-th layer (b_hr|b_hz|b_hn), of shape (3*hidden_size)
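These shapes can be inspected directly, assuming the module exposes the same per-layer parameter attributes as torch.nn.GRU (an assumption here, not something this page states):

import continual as co

gru = co.GRU(input_size=10, hidden_size=20, num_layers=2)

assert gru.weight_ih_l0.shape == (60, 10)  # (3*hidden_size, input_size) for k = 0
assert gru.weight_ih_l1.shape == (60, 20)  # (3*hidden_size, hidden_size) for k > 0
assert gru.weight_hh_l0.shape == (60, 20)  # (3*hidden_size, hidden_size)
assert gru.bias_ih_l0.shape == (60,)       # (3*hidden_size,)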
Note
All the weights and biases are initialized from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$ where $k = \frac{1}{\text{hidden\_size}}$.
Note
Bidirectional GRUs are not supported.
Note
Contrary to the module version found in torch.nn, this module assumes batch first, channel next, and temporal dimension last.
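In practice this means a tensor in the default torch.nn.GRU layout $(L, N, H_{in})$ needs a single permute before it is passed to this module:

import torch

x_torch = torch.randn(16, 1, 10)        # (L, N, H_in): default torch.nn.GRU layout
x_continual = x_torch.permute(1, 2, 0)  # (N, H_in, L): layout expected by this module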
Examples:
import torch
import continual as co

gru = co.GRU(input_size=10, hidden_size=20, num_layers=2)

# B, C, T
x = torch.randn(1, 10, 16)

# torch API
h0 = torch.randn(2, 1, 20)
output, hn = gru(x, h0)

# continual inference API
gru.set_state(h0)
firsts = gru.forward_steps(x[:, :, :-1])
last = gru.forward_step(x[:, :, -1])

assert torch.allclose(firsts, output[:, :, :-1])
assert torch.allclose(last, output[:, :, -1])