We propose a new method for single-camera, real-world 3D human pose estimation. Our method combines multi-task training with iterative pose refinement based on a novel conditional attention mechanism. For iterative pose refinement, the output of each convolutional layer is conditioned on the latest pose estimate, using a Conditioned Squeeze-and-Excitation network architecture that incorporates novel feedback connections. Multi-task training on both an in-the-wild 2D pose dataset and a controlled 3D pose dataset enables real-world 3D pose estimation without requiring a large-scale in-the-wild 3D pose dataset, which is not available. We perform experiments on several real-world datasets, as well as on the Human3.6M and HumanEva-I datasets, to show that the combination of the attention mechanism, the iterative refinement scheme and multi-task training achieves robust and competitive performance with only a simple network architecture. In addition, we show that our method is efficient enough to run on commodity hardware, producing pose estimates in real time.
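
To make the conditioning idea concrete, the following is a minimal NumPy sketch of a squeeze-and-excitation gate whose channel weights depend on both the pooled features and the current pose estimate, applied inside a short refinement loop. It is an illustration under stated assumptions, not the network described in this paper: the function names (`conditioned_se`, `refine_pose`), the concatenation of pose and pooled features, the residual pose update, the random weights and all layer sizes are hypothetical choices made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conditioned_se(features, pose, w1, b1, w2, b2):
    """Rescale the channels of `features` (C, H, W) with a gate computed
    from the pooled features and the current pose estimate.

    Assumption: the pose is injected by concatenating it with the
    squeezed feature vector; other conditioning schemes are possible.
    """
    squeezed = features.mean(axis=(1, 2))             # squeeze: (C,)
    cond = np.concatenate([squeezed, pose.ravel()])   # append flattened pose
    hidden = np.maximum(0.0, w1 @ cond + b1)          # bottleneck + ReLU
    gate = sigmoid(w2 @ hidden + b2)                  # per-channel gate in (0, 1)
    return features * gate[:, None, None]             # excite: rescale channels


def refine_pose(features, init_pose, se_params, head_w, num_iters=3):
    """Feed the latest pose back into the gate for a few iterations.

    Assumption: each iteration adds a linear correction predicted from
    the gated, pooled features (a residual update).
    """
    pose = init_pose
    for _ in range(num_iters):
        gated = conditioned_se(features, pose, *se_params)
        pooled = gated.mean(axis=(1, 2))
        pose = pose + (head_w @ pooled).reshape(pose.shape)
    return pose


# Hypothetical sizes: 8 channels, 4x4 feature map, 17 joints in 3D.
rng = np.random.default_rng(0)
C, H, W, J, hidden_dim = 8, 4, 4, 17, 4
features = rng.standard_normal((C, H, W))
init_pose = np.zeros((J, 3))
se_params = (
    0.1 * rng.standard_normal((hidden_dim, C + J * 3)),
    np.zeros(hidden_dim),
    0.1 * rng.standard_normal((C, hidden_dim)),
    np.zeros(C),
)
head_w = 0.1 * rng.standard_normal((J * 3, C))

gated = conditioned_se(features, init_pose, *se_params)
pose = refine_pose(features, init_pose, se_params, head_w)
print(gated.shape, pose.shape)
```

Because the gate is recomputed from the updated pose at every iteration, the same feature map is re-weighted differently each time, which is the feedback behaviour the sketch is meant to illustrate.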