In human listeners, the temporal voice areas (TVAs) are regions of the superior temporal gyrus and sulcus that respond more to vocal sounds than to a range of nonvocal control sounds, including scrambled voices, environmental noises, and animal cries. One interpretation of the TVAs' selectivity is based on low-level acoustic cues: compared to control sounds, vocal sounds may have stronger harmonic content or greater spectrotemporal complexity. Here, we show that the right TVA remains selective for the human voice even when a variety of acoustical cues are accounted for. Using fMRI, we contrasted single vowel stimuli with single notes of musical instruments with balanced harmonic-to-noise ratios and pitches. We also used “auditory chimeras”, which preserved subsets of the acoustical features of the vocal sounds. The right TVA was preferentially activated only by the natural human voice. In particular, the TVA did not respond more to artificial chimeras preserving the exact spectral profile of voices. Additional acoustic measures, including temporal modulations and spectral complexity, could not account for the increased activation. These observations rule out simple acoustical cues as a basis for voice selectivity in the TVAs.