Video Representation: More than Feature Mean

Xiang Xiang, Department of Computer Science, Whiting School of Engineering, Johns Hopkins University


We need both the mean and the variation to represent a data distribution, and the mean alone is not a robust statistic. Nevertheless, feature averaging remains the straightforward, conventional way to represent a sequence, as in recent work on video captioning and activity recognition. We argue that the frame-wise feature mean cannot characterize the variation that exists among frames. For instance, if the feature is to represent the subject identity of a video, we had better preserve the overall pose diversity. Disregarding factors other than identity and pose, identity is then the only source of variation across videos, since pose varies even within a single video. Under a simple metric, the correlation between two video features measures how likely the two videos are to show the same person. Following this idea of disentangling variation, we present a pose-robust face verification algorithm that uses deeply learned CNN features. Instead of simply using all frames, the algorithm centers on key-frame selection based on pose distances to K-means centroids, which reduces the number of feature vectors from hundreds to K while still preserving the overall diversity. On the official 5000 video pairs of the YouTube Face dataset, our algorithm achieves performance comparable to the state-of-the-art approach that averages over the features of all frames.
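The key-frame selection and correlation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the pose of each frame is assumed to be given as a small vector (for example yaw, pitch, roll), the K-means routine is a plain Lloyd iteration written here for self-containment, and the element-wise max used to pool the K selected features into one video feature is a hypothetical choice that the abstract does not specify. The names `select_key_frames`, `video_feature`, and `similarity` are illustrative only.

```python
import numpy as np


def kmeans_centroids(points, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means on the rows of `points`; returns the centroids."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points.
    init = rng.choice(len(points), size=k, replace=False)
    centroids = points[init].astype(float)
    for _ in range(n_iter):
        # Distance of every point to every centroid: shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids


def select_key_frames(poses, k, seed=0):
    """Indices of the frames whose pose lies closest to each K-means centroid."""
    k = min(k, len(poses))
    centroids = kmeans_centroids(poses, k, seed=seed)
    dists = np.linalg.norm(poses[:, None, :] - centroids[None, :, :], axis=2)
    # One nearest frame per centroid; duplicates are merged, so at most k indices.
    return np.unique(dists.argmin(axis=0))


def video_feature(frame_features, key_idx):
    """Pool the selected frame features (assumption: element-wise max)."""
    return frame_features[key_idx].max(axis=0)


def similarity(feat_a, feat_b):
    """Pearson correlation between two video features."""
    return float(np.corrcoef(feat_a, feat_b)[0, 1])
```

Given per-frame pose vectors `poses` of shape `(n_frames, 3)` and CNN features `feats` of shape `(n_frames, d)`, a video would be summarized as `video_feature(feats, select_key_frames(poses, k))`, and two such summaries compared with `similarity`. A decision threshold on that score would have to be chosen on held-out pairs.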