Since the introduction of the CUDA programming model, GPUs are considered a viable platform for accelerating non-graphical applications. Many cryptographic algorithms have been reported to achieve remarkable performance speedups, especially block ciphers. For stream ciphers, however, the lack of reported GPU acceleration endeavors is due to their inherent iterative structures that prohibit parallelization. In this paper, we propose an efficient implementation methodology for data-parallel cryptographic functions in a batch processing fashion on modern GPUs in general and optimizations for Salsa20 in particular. We present an autotuning framework to reach the most optimized set of device and application parameters for Salsa20 kernel variants with throughput maximization as a figure of merit. The peak performance achieved by our implementation for Salsa20/12 is 2.7 GBps and 43.44 GBps with and without memory transfers respectively on NVIDIA GeForce GTX 590. These figures beat the fastest reported GPU implementation of any stream cipher in the eSTREAM portfolio including Salsa20/12, as well as the block cipher AES optimized by hand-tuning, and thus, to the best of our knowledge set a new speed record.