Field-programmable gate arrays are ideal hosts to custom accelerators for signal, image, and data processing but de- mand manual register transfer level design if high performance and low cost are desired. High-level synthesis reduces this design burden but requires manual design of complex on-chip and off-chip memory architectures, a major limitation in applications such as video processing. This paper presents an approach to resolve this shortcoming. A constructive process is described that can derive such accelerators, including on- and off-chip memory storage from a C description such that a user-defined throughput constraint is met. By employing a novel statement-oriented approach, dataflow intermediate models are derived and used to support simple ap- proaches for on-/off-chip buffer partitioning, derivation of custom on-chip memory hierarchies and architecture transformation to ensure user-defined throughput constraints are met with minimum cost. When applied to accelerators for full search motion estima- tion, matrix multiplication, Sobel edge detection, and fast Fourier transform, it is shown how real-time performance up to an order of magnitude in advance of existing commercial HLS tools is enabled whilst including all requisite memory infrastructure. Further, op- timizations are presented that reduce the on-chip buffer capacity and physical resource cost by up to 96% and 75%, respectively, whilst maintaining real-time performance.