Conv3d

class continual.Conv3d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', device=None, dtype=None, temporal_fill='zeros')

Continual 3D convolution over a spatio-temporal input signal.

Continual Convolutions were proposed by Hedegaard et al.: “Continual 3D Convolutional Neural Networks for Real-time Processing of Videos”, in ECCV (2022), https://arxiv.org/pdf/2106.00050.pdf (paper) https://www.youtube.com/watch?v=Jm2A7dVEaF4 (video).

Assuming an input of shape (B, C, T, H, W), it computes the convolution over one temporal instant t ∈ range(T) at a time and keeps an internal state between steps. Two forward modes are supported here.
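The step-wise computation with an internal state can be illustrated with a minimal pure-Python sketch. It collapses the batch, channel and spatial dimensions so that only the temporal kernel axis remains, and it assumes stride 1, dilation 1 and no temporal padding. The class `TemporalContinualConv`, the helper `regular_conv` and the choice to return `None` during warm-up are hypothetical and are not the library's implementation; the method names merely mirror the step-wise idea.

```python
from collections import deque
from typing import Deque, List, Optional, Sequence


class TemporalContinualConv:
    """Hypothetical single-channel continual convolution along time.

    Computes y[s] = bias + sum_i w[i] * x[s + i] (cross-correlation, as in
    PyTorch) one input frame at a time, caching partial sums for outputs
    whose temporal receptive field is not yet complete.
    """

    def __init__(self, weights: Sequence[float], bias: float = 0.0) -> None:
        self.w = list(weights)
        self.bias = bias
        self.k = len(self.w)
        # Partial sums of pending outputs, oldest first (k - 1 kept between steps).
        self.state: Deque[float] = deque()

    def forward_step(self, x_t: float) -> Optional[float]:
        # Open a new partial sum for the output whose window starts at this frame.
        self.state.append(0.0)
        n = len(self.state)
        # Entry j was opened n - 1 - j steps ago, so x_t meets weight w[n - 1 - j].
        for j in range(n):
            self.state[j] += self.w[n - 1 - j] * x_t
        if n == self.k:
            # The oldest window has now seen all k frames: emit it.
            return self.state.popleft() + self.bias
        return None  # assumption: nothing is emitted until k frames were seen

    def forward_steps(self, xs: Sequence[float]) -> List[float]:
        outs = (self.forward_step(x) for x in xs)
        return [y for y in outs if y is not None]


def regular_conv(xs: Sequence[float], w: Sequence[float], bias: float = 0.0) -> List[float]:
    """Clip-wise reference: the same cross-correlation over the whole sequence."""
    k = len(w)
    return [bias + sum(w[i] * xs[s + i] for i in range(k)) for s in range(len(xs) - k + 1)]


xs = [1, 2, 0, -1, 3]
w = [1, -2, 3]
conv = TemporalContinualConv(w)
step_out = conv.forward_steps(xs)
clip_out = regular_conv(xs, w)
print(step_out, clip_out)  # [-3.0, -1.0, 11.0] [-3.0, -1.0, 11.0]
```

Each incoming frame is multiplied by every kernel tap once and added to the pending partial sums, so the k - 1 cached values play the role of the running buffer of partial computations; the step-wise outputs match the clip-wise convolution exactly.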

Parameters:

in_channels (int) – Number of channels in the input image
out_channels (int) – Number of channels produced by the convolution
kernel_size (int or tuple) – Size of the convolving kernel
stride (int or tuple, optional) – Stride of the convolution. NB: stride > 1 over the first (temporal) dimension is not supported. Default: 1
padding (int or tuple, optional) – Padding added to both sides of each of the three dimensions of the input. NB: padding over the first (temporal) dimension is not supported. Default: 0
dilation (int or tuple, optional) – Spacing between kernel elements. NB: dilation > 1 over the first (temporal) dimension is not supported. Default: 1
groups (int, optional) – Number of blocked connections from input channels to output channels. Default: 1
bias (bool, optional) – If True, adds a learnable bias to the output. Default: True
temporal_fill (str, optional) – 'zeros' or 'replicate' (= “boring video”). temporal_fill determines how the state is initialised and which padding is applied along the temporal dimension during forward_steps. Default: 'zeros'
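The difference between the two temporal_fill options can be sketched by treating the fill as k - 1 virtual frames that precede the first real frame. This is an assumption made for illustration only: the helpers `fill_history` and `conv_with_fill` are hypothetical, work on a single channel along time, and do not reproduce the library's internal state handling or whether it emits outputs during warm-up.

```python
from typing import List, Sequence


def fill_history(xs: Sequence[float], k: int, temporal_fill: str = "zeros") -> List[float]:
    """Prepend k - 1 virtual past frames according to the fill mode (hypothetical helper)."""
    if temporal_fill == "zeros":
        pad = [0.0] * (k - 1)
    elif temporal_fill == "replicate":
        # "Boring video": pretend the first frame has been shown for all of the past.
        pad = [float(xs[0])] * (k - 1)
    else:
        raise ValueError(f"unsupported temporal_fill: {temporal_fill!r}")
    return pad + [float(x) for x in xs]


def conv_with_fill(xs: Sequence[float], w: Sequence[float], temporal_fill: str = "zeros") -> List[float]:
    """Temporal cross-correlation over the filled sequence: one output per real frame."""
    k = len(w)
    padded = fill_history(xs, k, temporal_fill)
    return [sum(w[i] * padded[s + i] for i in range(k)) for s in range(len(padded) - k + 1)]


xs = [2, 4, 6]
w = [1, 1, 1]  # running sum over the last three frames
print(conv_with_fill(xs, w, "zeros"))      # [2.0, 6.0, 12.0]
print(conv_with_fill(xs, w, "replicate"))  # [6.0, 8.0, 12.0]
```

Under this reading, 'zeros' biases the first outputs towards zero, while 'replicate' yields early outputs that look as if the scene had been static before the first frame; once k real frames have been seen, both modes agree.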

Variables:

weight (Tensor) – the learnable weights of the module of shape $(\text{out\_channels}, \frac{\text{in\_channels}}{\text{groups}}, \text{kernel\_size[0]}, \text{kernel\_size[1]}, \text{kernel\_size[2]})$. The values of these weights are sampled from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$ where $k = \frac{groups}{C_\text{in} * \prod_{i=0}^{2}\text{kernel\_size}[i]}$
bias (Tensor) – the learnable bias of the module of shape (out_channels). If bias is True, then the values of the bias are sampled from $\mathcal{U}(-\sqrt{k}, \sqrt{k})$ where $k = \frac{groups}{C_\text{in} * \prod_{i=0}^{2}\text{kernel\_size}[i]}$
state (List[Tensor]) – a running buffer of partial computations from previous frames which are used for the calculation of subsequent outputs.
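The weight shape and the initialisation bound given for weight and bias can be checked with a small worked example. The helper `conv3d_weight_shape_and_bound` is hypothetical: it only evaluates the formulas stated above and does not create any tensors. The divisibility check mirrors the requirement that both channel counts be divisible by groups.

```python
import math
from typing import Tuple, Union


def conv3d_weight_shape_and_bound(
    in_channels: int,
    out_channels: int,
    kernel_size: Union[int, Tuple[int, int, int]],
    groups: int = 1,
) -> Tuple[Tuple[int, ...], float]:
    """Return the weight shape and the bound sqrt(k) of U(-sqrt(k), sqrt(k))."""
    if isinstance(kernel_size, int):
        kernel_size = (kernel_size,) * 3  # an int means the same size in T, H and W
    if in_channels % groups or out_channels % groups:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    shape = (out_channels, in_channels // groups, *kernel_size)
    k = groups / (in_channels * math.prod(kernel_size))
    return shape, math.sqrt(k)


shape, bound = conv3d_weight_shape_and_bound(4, 8, (3, 3, 3), groups=2)
print(shape)            # (8, 2, 3, 3, 3)
print(round(bound, 4))  # 0.1361
```

Here k = 2 / (4 * 27) = 1/54, so both the weights and, if present, the bias are drawn from roughly U(-0.136, 0.136).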