With voice assistants on the rise, you have probably heard of SSML as a way to make your voice assistant sound less robotic and more humanlike. To use SSML effectively, it really helps to be familiar with some of the core concepts of speech. So I’ve put together a primer of voice and speech terms. Here’s part 1; more will follow soon.


Speech is the act of establishing communication between a sender and a receiver using sound. Speech is not written text.

‘Duh…’, I hear you say.

Yet, this is an important difference to note, because it influences the way you design your conversation. In written dialogue, a turn can be as long as you wish: your keyboard doesn’t object, and as long as your customers don’t mind a bit of reading, you can afford the extra text bubble.

In speech, not quite so. A turn in speech has a very different -and rather finite- boundary: breath.


We cannot speak more words than our breath can carry. And we usually stop speaking well before that point. If we carry on too long, it gets physically uncomfortable, both for the speaker and for the listener.

That’s right, for the listener as well. And that’s thanks to our mirror neurons: when we listen to someone else, we unconsciously tune in to their breathing. So if your conversation partner rambles on and on, chances are that you’re holding your breath until they’ve finished.

Fortunately, we humans have a natural delimiter: there’s a point in time where we have to breathe in again. But what if your conversation partner is this Dutch Google Assistant?

This is how many Dutch voice actions sound at the moment. And without wanting to discredit all those voice enthusiasts who are cradling voice, giving it wings and making it fly, it’s not quite the listening experience that makes me want to keep interacting with my smart speaker. Yet.

So when you design your voice action, make extra, extra sure to use spoken sample dialog, to make sure that you’re actually designing speech, rather than text. Create turns that pass the one-breath test, and use basic SSML to give your voice assistant some room to breathe.
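To make the one-breath test and the "room to breathe" idea a bit more concrete, here’s a minimal Python sketch. It rests on a few assumptions: the limit of 20 words per sentence is an arbitrary placeholder (not a measured value), the 400ms pause is just a starting point, and the helper names are hypothetical. The only actual SSML in it is the `<speak>` root and the `<break time="…"/>` element.

```python
import re
from xml.sax.saxutils import escape

# Assumption: a rough word count that fits comfortably in one breath.
# This is a placeholder; tune it by reading your dialogue aloud.
MAX_WORDS_PER_BREATH = 20


def split_sentences(text):
    """Split text after sentence-final punctuation (., ! or ?)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def passes_one_breath_test(sentence, max_words=MAX_WORDS_PER_BREATH):
    """Return True if the sentence stays within the assumed word limit."""
    return len(sentence.split()) <= max_words


def to_ssml_with_breaks(text, pause="400ms"):
    """Wrap text in <speak> and put a <break> between sentences."""
    pause_tag = f'<break time="{pause}"/>'
    body = f" {pause_tag} ".join(escape(s) for s in split_sentences(text))
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    turn = "Welcome back. What would you like to listen to today?"
    for sentence in split_sentences(turn):
        print(passes_one_breath_test(sentence), sentence)
    print(to_ssml_with_breaks(turn))
```

A word count is only a crude stand-in for breath, so treat a failed check as a nudge to read the turn aloud, not as a verdict.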


Prosody refers to sound properties of entire utterances, rather than single words or sounds within a word. Examples of prosodic properties are intonation (the sentence melody), stress (where typical accent marks are placed) and the pitch (relative tone) of an utterance.

Prosody is a good term to remember, because this is what you’ll be working with most when you apply SSML tagging.
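In SSML, these properties are mostly set through the `<prosody>` element, which takes attributes such as `rate`, `pitch` and `volume`. Here’s a small sketch of a helper that wraps a phrase in that element. The function name `prosody` and the example values are my own choices for illustration, and which values a given voice platform actually honours is something to check in its documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate=None, pitch=None, volume=None):
    """Wrap text in an SSML <prosody> element, skipping unset attributes."""
    attrs = {"rate": rate, "pitch": pitch, "volume": volume}
    attr_text = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    return f"<prosody{attr_text}>{escape(text)}</prosody>"


if __name__ == "__main__":
    # Example values only: a slower, slightly lower delivery.
    phrase = prosody("Take your time.", rate="slow", pitch="-2st")
    print(f"<speak>{phrase}</speak>")
```

Keeping the attributes optional means you can change one property at a time and listen to what each one does on its own.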


Intensity is basically the volume of a voice; whether it’s loud or weak, near or far away.

Image: Women and voice, by Maaike Groenewege



Pitch is what we perceive as a high or a deep voice. Pitch is determined by physical characteristics; in technical terms, it’s known as the perceived fundamental frequency (‘grondtoon’ in Dutch). Women’s vocal folds are typically shorter and thinner than men’s, so they vibrate faster. Faster vibrating vocal folds lead to higher pitched voices.
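Since pitch comes down to frequency, it can help to see what a pitch shift does in numbers. SSML lets you write relative pitch changes in semitones (for example `pitch="+2st"`), and in equal temperament one semitone multiplies the frequency by 2^(1/12). The sketch below applies that formula; the function name is hypothetical and the base frequencies are illustrative values, not measurements of any particular voice.

```python
def shift_semitones(base_hz, semitones):
    """Return the frequency after shifting base_hz by a number of semitones.

    Uses the equal-temperament relation: f = base * 2 ** (semitones / 12).
    """
    return base_hz * 2 ** (semitones / 12)


if __name__ == "__main__":
    # Twelve semitones is one octave, so the frequency doubles.
    print(shift_semitones(220.0, 12))  # 440.0

    # Illustrative base of 200 Hz, shifted two semitones down and up.
    for step in (-2, 2):
        print(step, round(shift_semitones(200.0, step), 1))
```

How a speech engine applies such a shift internally may differ, so take this as a way to build intuition about the numbers only.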

Interesting fact: research into voice characteristics of transgender people has shown that pitch isn’t the main characteristic that people use to recognise whether a voice is male or female. According to this research, a person’s speech style also plays a role in how listeners judge a speaker’s gender.

So, a high voice is not necessarily perceived as female if the speaker’s speech style shows characteristics that are associated with male gender, or the other way round. This is interesting news for voice designers, because it opens up a world of possibilities to craft a wider, richer and more gradual palette of voices that do justice to the variety of people and voices that we meet all around us.


Confusingly, tone isn’t about pitch. Or musical notes. But in a way it is. Chinese. Roermonds. Tone language. Meh, let’s skip this, and move to the fun part!


Intonation is what we recognise as the melody of someone’s speech. It’s also what makes languages so recognisable and characteristic. Each language has its own unique intonation pattern.

When I studied English, way back in time, one of my tutors said that you really don’t have to worry about the individual sounds of a foreign language: as long as you get the intonation right, you’re virtually there.

To illustrate this, here are some examples from Google Assistant. In the first pair, the English voice pronounces two sentences: an English one and a Dutch one.

English sentence spoken by an English voice (And yes, that’s the original ConvoCat)
Dutch sentence spoken by an English voice

Aside from the rather exotic individual sounds in the Dutch sentence, you might notice that, overall, both sentences sound quite similar. That’s because they both have the same, three-steps-down sentence melody, or intonation.

(For Dutch speakers: note the high-pitched accent on ‘spoken’, which in Dutch is slightly misplaced on ‘ge-’.)

Image: English 3-step intonation pattern

So let’s turn things around and have a Dutch voice say something in English.

English sentence, spoken by a Dutch voice.

You’ll probably agree that she doesn’t sound very English at all. Whereas this one, though sounding slightly bored with the world, is pretty acceptable for a Dutch audience.

Dutch sentence, spoken by a Dutch voice.

And again, it’s the intonation pattern that’s the key differentiator here. Dutch moves in a smaller pitch range than English, so we tend to sound a bit monotonous to English-speaking ears.

Image: Dutch intonation pattern

Here’s a beautiful example of the Dutch intonation pattern in action.

To be continued…

Congratulations, you made it up to here! Did you find this overview useful? Anything else you want to know? I’d love to hear from you in the comments.

If you enjoyed this, make sure to watch out for part II of this primer, which will be about phonemes, meter, stress and rhythm.