Synthesizing singing: what’s the buzz?

Sten Ternström* and David Howard**

*Dept of Speech, Music and Hearing, Kungliga Tekniska Högskolan, Stockholm, Sweden
**Dept of Electronics, University of York, UK

The voice quality of synthesizers based on source-filter modeling is often perceived as too mechanical and lacking in naturalness. Some of this criticism can be ascribed to phonetic shortcomings such as inappropriate prosody and improbable renderings of transitions between phonemes. Even on sustained vowels, however, source-filter formant synthesis is often found wanting, for example with regard to appropriate perturbations of fundamental frequency (F0), realistic aspirative noise, and source spectrum control. In particular, vowels from source-filter synthesizers tend strongly to sound buzzy and metallic, to a degree rarely found in the output of live speakers or singers. In this investigation, we attempt to identify some acoustic features that cue the perception of buzziness.

Guided by informal experimentation, we hypothesize the following: perceived buzziness will increase when (a) there is more energy in a frequency band between 5 and 8 kHz; and/or (b) the F0 is more stationary.

Listening tests are underway in which subjects rate the buzziness of a number of stimuli. The stimuli, both natural and synthetic, each belong to one of six categories, as follows:

  1. A recording of a singer (trained baritone sustaining the vowel [a] on C4).
  2. The output of a synthetic vocal tract, spectrally matched to the singer in (1), and excited by the time derivative of the simultaneously recorded electroglottographic (EGG) signal.
  3. A synthesized source pulse train with constant F0, used to excite the same synthetic vocal tract.
  4. As in (3) but with random F0 flutter of 20 cents RMS.
  5. As in (3) but with a sinusoidal vibrato matched to the real singer’s vibrato (rate: 5.9 Hz, extent: 43 cents).
  6. As in (3) but with both flutter (4) and vibrato (5) added.
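
The F0 contours for categories 3 to 6 can be illustrated with a minimal sketch. It is not the synthesis procedure used in the study. Several details are assumptions not stated in the abstract: C4 is taken as equal-tempered with A4 = 440 Hz, the vibrato extent of 43 cents is read as the peak deviation from the mean, the flutter is modeled as white Gaussian noise per frame (rescaled to exactly 20 cents RMS), and the function name `f0_contour`, the frame rate, and the duration are hypothetical.

```python
import math
import random

# Assumption: C4 in equal temperament with A4 = 440 Hz (about 261.63 Hz).
C4_HZ = 440.0 * 2.0 ** (-9.0 / 12.0)


def cents_to_ratio(cents):
    """Convert a pitch interval in cents to a frequency ratio."""
    return 2.0 ** (cents / 1200.0)


def f0_contour(base_hz, duration_s, frame_rate_hz=200.0,
               vibrato_rate_hz=0.0, vibrato_extent_cents=0.0,
               flutter_rms_cents=0.0, seed=0):
    """Return one F0 value (Hz) per frame.

    Vibrato is a sinusoid whose peak deviation is vibrato_extent_cents
    (assumed interpretation of "extent"). Flutter is white Gaussian noise
    scaled to flutter_rms_cents (assumed; the flutter spectrum is not
    given in the abstract).
    """
    n = int(round(duration_s * frame_rate_hz))
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise_rms = math.sqrt(sum(x * x for x in noise) / n)
    contour = []
    for i in range(n):
        t = i / frame_rate_hz
        vibrato = vibrato_extent_cents * math.sin(
            2.0 * math.pi * vibrato_rate_hz * t)
        flutter = flutter_rms_cents * noise[i] / noise_rms
        contour.append(base_hz * cents_to_ratio(vibrato + flutter))
    return contour


# Contours for the four synthetic-source categories (duration is hypothetical).
contours = {
    3: f0_contour(C4_HZ, 2.0),
    4: f0_contour(C4_HZ, 2.0, flutter_rms_cents=20.0),
    5: f0_contour(C4_HZ, 2.0, vibrato_rate_hz=5.9,
                  vibrato_extent_cents=43.0),
    6: f0_contour(C4_HZ, 2.0, vibrato_rate_hz=5.9,
                  vibrato_extent_cents=43.0, flutter_rms_cents=20.0),
}
```

Working in cents and converting to a frequency ratio at the end keeps the perturbations symmetric on a logarithmic pitch scale, which is how the flutter and vibrato figures above are expressed.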

For each category, filtered versions of the stimulus tones were generated, with the sound level in a frequency band around 6 kHz varied in five steps. All stimuli were matched in equivalent level. To reduce listener boredom and fatigue, additional redundant stimuli with different pitches and vowels were interspersed among the test stimuli.

The results will show to what extent high-frequency content is a predictor of buzziness, and whether stationarity of the fundamental frequency exhibits a significant interaction effect. The responses to stimuli in categories 1 and 2 may indicate the possible relevance to buzziness of factors other than F0 stationarity.
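
The band-level variation and level matching described above can be sketched on a simplified harmonic spectrum. This is only an illustration under stated assumptions, not the filtering used in the study: the band edges (5 to 8 kHz, borrowed from hypothesis (a)), the five gain steps, the spectral tilt of the hypothetical source, and all function names are invented for the example. For a stationary sum of sinusoids at distinct frequencies, the equivalent level is taken here as the level of the mean power, i.e. the sum of a²/2 over partials.

```python
import math

F0_HZ = 261.63                     # approximately C4 (assumed tuning)
BAND_HZ = (5000.0, 8000.0)         # assumed band edges
GAINS_DB = [-12.0, -6.0, 0.0, 6.0, 12.0]   # hypothetical five steps


def harmonic_spectrum(f0_hz, max_hz=10000.0, tilt_db_per_octave=-12.0):
    """Hypothetical spectrum: (frequency, amplitude) pairs with fixed tilt."""
    partials = []
    k = 1
    while k * f0_hz <= max_hz:
        level = tilt_db_per_octave * math.log2(k)
        partials.append((k * f0_hz, 10.0 ** (level / 20.0)))
        k += 1
    return partials


def level_db(partials):
    """Level in dB (arbitrary reference) of the mean power, sum of a^2 / 2."""
    power = sum(a * a / 2.0 for _, a in partials)
    return 10.0 * math.log10(power)


def scale_band(partials, lo_hz, hi_hz, gain_db):
    """Apply gain_db to partials inside [lo_hz, hi_hz]; leave others as is."""
    g = 10.0 ** (gain_db / 20.0)
    return [(f, a * g if lo_hz <= f <= hi_hz else a) for f, a in partials]


def match_level(partials, target_db):
    """Scale all partials uniformly so the overall level equals target_db."""
    g = 10.0 ** ((target_db - level_db(partials)) / 20.0)
    return [(f, a * g) for f, a in partials]


def band_level_db(partials, lo_hz, hi_hz):
    """Level in dB of only the partials inside the band."""
    return level_db([(f, a) for f, a in partials if lo_hz <= f <= hi_hz])


base = harmonic_spectrum(F0_HZ)
versions = [match_level(scale_band(base, BAND_HZ[0], BAND_HZ[1], g), 0.0)
            for g in GAINS_DB]

for g, v in zip(GAINS_DB, versions):
    print(f"band gain {g:+5.1f} dB: overall {level_db(v):6.2f} dB, "
          f"band {band_level_db(v, *BAND_HZ):7.2f} dB")
```

Matching the overall level after the band gain is applied means that the five versions differ in the share of energy in the band rather than in overall level, which is the contrast the listening test is meant to isolate.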

Work supported by STINT, the Swedish Foundation for International
Cooperation in Research and Higher Education, contract IG2002-2049.
