[link]
This paper tackles the challenge of action recognition by representing a video as space-time graphs: a **similarity graph** captures the relationship between correlated objects in the video, while a **spatio-temporal graph** captures the interaction between objects. The algorithm is composed of several modules:

![](https://i.imgur.com/DGacPVo.png)

1. **Inflated 3D (I3D) network**. In essence, this is a usual 2D CNN (e.g. ResNet-50) converted to a 3D CNN by copying the 2D weights along an additional dimension and renormalizing them. The network takes a *batch x 3 x 32 x 224 x 224* tensor as input and outputs *batch x 16 x 14 x 14*.
2. **Region Proposal Network (RPN)**. This is the same RPN used to predict initial bounding boxes in two-stage detectors like Faster R-CNN. Specifically, it predicts a predefined number of bounding boxes on every other frame of the input (the input is 32 frames, so 16 frames are used) to match the temporal dimension of the I3D network's output. Then, the I3D output features and the bounding boxes projected onto them are passed to RoIAlign to obtain temporal features for each object proposal. Conveniently, PyTorch comes with a [Faster R-CNN pretrained on MS-COCO](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html), which can easily be cut down to RPN-only functionality.
3. **Similarity Graph**. This graph represents feature similarity between different objects in a video. Given features $x_i$ extracted by RPN+RoIAlign for every bounding-box prediction in a video, the similarity between any pair of objects is computed as $F(x_i, x_j) = (w x_i)^T (w' x_j)$, where $w$ and $w'$ are learnable transformation weights. Softmax normalization is then applied over all edges connected to the current node $i$, so that its edge weights sum to 1. The graph convolutional network itself consists of several graph convolutional layers with ReLU activations in between.
Graph construction and convolutions can be conveniently implemented with [PyTorch Geometric](https://github.com/rusty1s/pytorch_geometric).
4. **Spatio-Temporal Graph**. This graph captures the spatial and temporal relationships between objects in neighboring frames. To construct the forward graph $G_{i,j}^{front}$, we iterate over every bounding box in frame $t$ and compute its Intersection over Union (IoU) with every object in frame $t+1$. The IoU value serves as the weight of the edge connecting nodes (RoI-aligned features from the RPN) $i$ and $j$. The edge weights are normalized so that the weights of edges connected to proposal $i$ sum to 1. In a similar manner, the backward graph $G_{i,j}^{back}$ is defined over frames $t$ and $t-1$.
5. **Classification Head**. The classification head takes two inputs. One comes from the average-pooled I3D features, a *1 x 512* tensor. The other is the pooled sum of features (also a *1 x 512* tensor) from the graph convolutional networks defined above. Both inputs are concatenated and fed to a fully-connected (FC) layer to perform the final multi-label (or multi-class) classification.

**Dataset**. The authors tested the proposed algorithm on the [Something-Something](https://20bn.com/datasets/somethingsomething) and [Charades](https://allenai.org/plato/charades/) datasets. For the first dataset a softmax loss is used, while the second uses a binary sigmoid loss to handle its multi-label nature. The input is sampled at 6 fps, covering about 5 seconds of video.

**My take**. I think this paper is a great engineering effort. While it is easy to understand at a high level, implementing it is much harder, partially due to unclear or misleading descriptions. I have challenged myself with [reproducing this paper](https://github.com/BAILOOL/VideosasSpaceTimeRegionGraphs). It is a work in progress, so be careful not to damage your PC and eyes :)
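As a rough NumPy sketch of the similarity-graph step above (function names and feature dimensions are my own, not from the paper; the projections would be learnable in practice, random here for illustration):

```python
import numpy as np

def similarity_adjacency(x, w, w_prime):
    """Build the similarity-graph adjacency for N object proposals.

    x: (N, d) RoI-aligned features; w, w_prime: (d, d') projection matrices.
    Returns an (N, N) matrix where row i holds the softmax-normalized
    edge weights F(x_i, x_j) = (w x_i)^T (w' x_j) for node i.
    """
    a = x @ w            # projected features (N, d')
    b = x @ w_prime      # second projection (N, d')
    scores = a @ b.T     # pairwise similarities (N, N)
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)  # softmax over edges of node i

# toy usage with small dimensions instead of the paper's 512-d features
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))
w = rng.standard_normal((16, 8))
w2 = rng.standard_normal((16, 8))
A = similarity_adjacency(x, w, w2)
```

Each row of `A` sums to 1, matching the per-node softmax normalization described above.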
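The forward spatio-temporal graph can be sketched the same way; this is a minimal NumPy version assuming boxes in `(x1, y1, x2, y2)` format (the helper names are hypothetical):

```python
import numpy as np

def iou(box_a, box_b):
    # Intersection over Union of two (x1, y1, x2, y2) boxes.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def front_graph(boxes_t, boxes_t1):
    """G^front: edge (i, j) weighted by IoU of box i in frame t with
    box j in frame t+1, each row normalized to sum to 1 (rows with no
    overlap are left as zeros)."""
    G = np.array([[iou(a, b) for b in boxes_t1] for a in boxes_t])
    row_sums = G.sum(axis=1, keepdims=True)
    return np.divide(G, row_sums, out=np.zeros_like(G), where=row_sums > 0)
```

The backward graph `G^back` is the same computation with frames `t` and `t-1`.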
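Finally, a toy sketch of one graph-convolution layer and the classification head, under my own assumptions (small dimensions instead of 512, mean pooling over graph nodes, and hypothetical function names):

```python
import numpy as np

def gcn_layer(A, X, W):
    # One graph-convolution layer: aggregate node features with the
    # row-normalized adjacency A, project with W, then apply ReLU.
    return np.maximum(A @ X @ W, 0.0)

def classify(i3d_feat, graph_feats, fc_w, fc_b):
    """Hypothetical head: concatenate the average-pooled I3D feature
    (1 x d) with pooled graph features (here mean-pooled to 1 x d) and
    apply a single fully-connected layer to get class logits."""
    pooled = graph_feats.mean(axis=0, keepdims=True)
    fused = np.concatenate([i3d_feat, pooled], axis=1)  # (1, 2d)
    return fused @ fc_w + fc_b

# toy usage: 4 object nodes, d=8 features, 3 classes
rng = np.random.default_rng(1)
A = np.full((4, 4), 0.25)                 # toy row-normalized adjacency
X = rng.standard_normal((4, 8))           # toy node features
Z = gcn_layer(A, X, rng.standard_normal((8, 8)))
logits = classify(rng.standard_normal((1, 8)), Z,
                  rng.standard_normal((16, 3)), np.zeros(3))
```

In the paper the two pooled inputs are each *1 x 512* and the loss on top of the logits is softmax (Something-Something) or binary sigmoid (Charades).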