In this paper, we propose a convolutional layer inspired by optical flow algorithms to learn motion representations. Our representation flow layer is a fully-differentiable layer designed to optimally capture the `flow' of any representation channel within a convolutional neural network. Its parameters for iterative flow optimization are learned end-to-end together with the other model parameters, maximizing action recognition performance. Furthermore, we introduce the concept of learning `flow of flow' representations by stacking multiple representation flow layers. We conducted extensive experimental evaluations, confirming the layer's advantages over previous recognition models that rely on traditional optical flow, in both computational speed and accuracy.
Motivated by standard optical flow estimation methods, we design a learnable CNN layer that iteratively computes a `representation flow' field from input CNN representations (see the paper for more details):
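As a rough illustration (not the paper's implementation), the layer's inner loop resembles the classic TV-L1 optical flow iterations, with the solver hyper-parameters (`theta`, `lam`, `tau` below) being the kind of quantities the layer learns end-to-end. The sketch below runs a fixed number of simplified, zero-linearized TV-L1 iterations in NumPy on a single feature channel; all names and constants are illustrative:

```python
import numpy as np

def grad(x):
    # Forward differences with zero padding at the border.
    gx = np.zeros_like(x); gy = np.zeros_like(x)
    gx[:, :-1] = x[:, 1:] - x[:, :-1]
    gy[:-1, :] = x[1:, :] - x[:-1, :]
    return gx, gy

def divergence(px, py):
    # Backward-difference divergence (adjoint of grad).
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]; dx[:, 1:] = px[:, 1:] - px[:, :-1]
    dy[0, :] = py[0, :]; dy[1:, :] = py[1:, :] - py[:-1, :]
    return dx + dy

def representation_flow(f1, f2, n_iter=20, theta=0.3, lam=0.15, tau=0.25):
    """Simplified TV-L1-style iterative flow between two feature maps.

    f1, f2: (H, W) activations of one channel at consecutive time steps.
    Returns a (2, H, W) array holding the horizontal/vertical flow.
    """
    u = np.zeros_like(f1); v = np.zeros_like(f1)
    p = np.zeros((4,) + f1.shape)          # dual variables for u and v
    Ix, Iy = grad(f2)                      # spatial gradients
    It = f2 - f1                           # temporal gradient
    grad_mag = Ix**2 + Iy**2 + 1e-12
    for _ in range(n_iter):
        # Data-term thresholding step on the residual rho.
        rho = It + u * Ix + v * Iy
        m1 = rho < -theta * lam * grad_mag
        m2 = rho > theta * lam * grad_mag
        m3 = ~(m1 | m2)
        u_ = u + theta*lam*Ix*m1 - theta*lam*Ix*m2 - (rho/grad_mag)*Ix*m3
        v_ = v + theta*lam*Iy*m1 - theta*lam*Iy*m2 - (rho/grad_mag)*Iy*m3
        # Smoothness (total variation) step via the dual variables.
        u = u_ + theta * divergence(p[0], p[1])
        v = v_ + theta * divergence(p[2], p[3])
        ux, uy = grad(u); vx, vy = grad(v)
        p[0] += tau/theta * ux; p[1] += tau/theta * uy
        p[2] += tau/theta * vx; p[3] += tau/theta * vy
        nu = np.maximum(1.0, np.sqrt(p[0]**2 + p[1]**2))
        nv = np.maximum(1.0, np.sqrt(p[2]**2 + p[3]**2))
        p[0] /= nu; p[1] /= nu; p[2] /= nv; p[3] /= nv
    return np.stack([u, v])
```

In the actual layer these iterations are unrolled as differentiable ops inside the network, so gradients flow through them and the solver parameters are trained jointly with the convolutional weights.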
We can place the representation flow layer within any CNN, and train it end-to-end to optimize for activity recognition:
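To make the wiring concrete, here is a minimal NumPy sketch of where such a layer sits in the network. `flow_layer` is a hypothetical stand-in (a simple per-channel temporal difference) for the actual iterative flow solver, and all shapes are illustrative:

```python
import numpy as np

def flow_layer(feats):
    """Hypothetical per-channel motion layer applied to CNN activations.

    feats: (T, C, H, W) activations from an intermediate conv block.
    As a stand-in for the iterative flow solver, this stub uses the
    temporal difference of each channel; the real layer would instead
    run learned flow iterations between consecutive frames.
    """
    return feats[1:] - feats[:-1]          # (T-1, C, H, W)

T, C, H, W = 8, 32, 28, 28
feats = np.random.rand(T, C, H, W).astype(np.float32)  # e.g. mid-network activations
motion = flow_layer(feats)                 # motion representation, fed to the
                                           # remaining conv blocks and classifier
fof = flow_layer(motion)                   # stacking a second layer gives the
                                           # `flow of flow' representation
```

Because the layer is just another differentiable op on activations, the recognition loss backpropagates through it into the earlier convolutional blocks.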
| Method | Kinetics (%) | HMDB (%) | Run-time |
|---|---|---|---|
| (2+1)D + Rep-Flow | 75.5 | 77.1 | 622 ms |
| (2+1)D + Flow-of-flow | 77.1 | 81.1 | 654 ms |
Examples of representation flows for various actions. The representation flow is computed after the 3rd residual block and captures some semantic motion information. At this point, the representations are spatially small (28x28).
Examples of representation flows for different channels for "clapping." Some channels capture the hand motion, while other channels focus on different features/motion patterns not present in this clip.
Examples of representation flows for different channels for "hand-stand."
Examples of representation flows for different channels for "running."
Examples of flow-of-flows for various actions. The flow-of-flow captures an acceleration-like motion feature (change in motion over time). Due to the additional conv layer between the two flow layers, it also produces a smoother flow estimate. These representations are less semantically interpretable, as the underlying features are more abstract; regardless, we can still see distinct motion patterns matching the video.