Description:
The discriminative approaches for hand pose estimation from depth images usually require dense annotated data to train a supervised network. Additionally, generative methods depend on temporal information in generating candidate poses which can be trapped due to local minima during the optimization process. Different from these methods, we propose a hybrid two-stage deep predictive neural network approach that performs predictive coding of image sequences of hand poses in order to capture latent features underlying a given image. Firstly, we train a deep convolutional neural network (CNN) for direct regression of hand joints position. Secondly, we add an unsupervised error term as a part of the recurrent architecture connected with predictive coding portion. An error regression term (ERT) ensures minimal residual errors of the estimated values while the predictive coding portion allows training of the network without the supervision of image sequences, so no dense annotation of data is required. We conduct a complete experiment using two challenging public datasets, ICVL and NYU. Using the ICVL datasets, our approach improved accuracy over the current state of the art methods with an average error joint of 7.5mm. We also achieve 12.2mm average error joint on NYU dataset which is the smallest error to be achieved on all state-of-art approaches.