Content-based image retrieval (CBIR) is an expanding field in which new approaches to ever more effective retrieval are frequently proposed, yet relatively little attention has so far been paid to evaluating the effectiveness of CBIR methods. Most reported evaluations use standard IR evaluation methodologies with little consideration of their statistical significance or their appropriateness for CBIR, which makes it difficult to assess the precise impact of individual methods. In this paper, we present a new approach to evaluating CBIR systems that is both efficient and statistically sound. The approach is based on stratified sampling and provides a significant improvement over existing evaluation approaches. Comprehensive experiments using our approach to evaluate a range of CBIR methods show that it reduces not only the estimation error but also the size of the test data set required to achieve a given estimation error level.
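To illustrate the general idea of stratified sampling in this setting, the following is a minimal sketch of the textbook stratified estimator of a mean performance score, together with its estimated variance. It is not the evaluation procedure of this paper: the function name `stratified_estimate`, the strata labels, the per-query scores, and the proportional allocation rule are all hypothetical choices made for illustration.

```python
import random
import statistics


def stratified_estimate(strata, sample_fraction, rng):
    """Estimate the population mean score from a stratified sample.

    strata: dict mapping a stratum label to a list of per-query scores
            (hypothetical data; how strata are formed is an assumption).
    sample_fraction: fraction of each stratum to sample (proportional allocation).
    rng: a random.Random instance, passed in for reproducibility.
    Returns (estimated mean, estimated variance of that estimate).
    """
    total = sum(len(scores) for scores in strata.values())
    mean_est = 0.0
    var_est = 0.0
    for scores in strata.values():
        N_h = len(scores)
        # Sample at least 2 items where possible so a variance can be computed.
        n_h = min(N_h, max(2, round(sample_fraction * N_h)))
        sample = rng.sample(scores, n_h)
        W_h = N_h / total  # stratum weight
        s2 = statistics.variance(sample) if n_h > 1 else 0.0
        mean_est += W_h * statistics.fmean(sample)
        # Finite population correction (1 - n_h / N_h) shrinks the variance
        # as the sample approaches the whole stratum.
        var_est += W_h ** 2 * (1 - n_h / N_h) * s2 / n_h
    return mean_est, var_est


# Hypothetical per-query scores grouped into two illustrative strata.
example_strata = {
    "easy": [0.9, 0.8, 0.85, 0.95],
    "hard": [0.1, 0.2, 0.15, 0.05],
}
mean_half, var_half = stratified_estimate(example_strata, 0.5, random.Random(0))
mean_full, var_full = stratified_estimate(example_strata, 1.0, random.Random(0))
```

When scores within each stratum are more homogeneous than scores across the whole collection, the variance of this estimator is smaller than that of a simple random sample of the same size, which is the usual motivation for stratifying a test set.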