PhoneMyBot provides a way to control how text-to-speech (TTS) engines interpret the text coming from a chatbot, using a standard mechanism called Speech Synthesis Markup Language (SSML).
This includes not only specifying how a word should be pronounced (for instance, a foreign name) and how acronyms, telephone numbers, etc. should be rendered, but also ways of enhancing the speech "emotionally", such as adding emphasis to certain phrases and controlling volume, pitch, etc.
You can find the complete specs of SSML here, but PhoneMyBot simplifies the task by managing the creation of the SSML document: you only have to include the relevant tags in the text that the chatbot sends to PhoneMyBot.
TIP: SSML tags can be used both in the text of the message sent to PhoneMyBot via Direct API and in the message sent by a chatbot integrated via Adaptor.
These are the tags that PhoneMyBot supports, in alphabetical order. We don't cover the full SSML tag set, because PhoneMyBot uses several TTS engines and each supports its own subset. Instead, we cover the tags that are common to all of them, which are also the most useful ones.
<audio> lets you insert a recorded audio file within the text that the chatbot sends. You can use it, for instance, to play music between two sentences or to add sound effects.
Example: "This is a beep <audio src="http://mydomain.com/beep.wav" />"
Here "src" is the URL from where PhoneMyBot will fetch the audio file.
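If the chatbot builds its messages in code, a small helper can keep the attribute quoting correct. The following is a minimal Python sketch; the function name audio_tag is hypothetical and not part of PhoneMyBot, and the URL is the placeholder from the example above.

```python
from xml.sax.saxutils import quoteattr


def audio_tag(src: str) -> str:
    """Build an SSML <audio> element for the given URL (hypothetical helper)."""
    # quoteattr escapes &, < and > and wraps the value in matching quotes,
    # so the attribute cannot be left unterminated.
    return f"<audio src={quoteattr(src)}/>"


# Text that the chatbot would send to PhoneMyBot.
message = "This is a beep " + audio_tag("http://mydomain.com/beep.wav")
print(message)
```

Building the tag this way avoids the most common mistake with hand-written tags, a missing closing quote on the attribute value.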
<break> inserts a pause of a specified duration in the speech. You can specify the duration either with a number in seconds/milliseconds (maximum 5000ms), or with the following keywords:
Example: "Hey you! <break time="300ms"/> How are you?"
Another example: "Come here. <break strength="strong"/> I am talking to you!"
Here "time" and "strength" are the attributes that set the duration of the pause.
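As an illustration, the 5000ms limit can be checked before the message is sent. This is a sketch only: break_tag is a hypothetical helper, not a PhoneMyBot function, and it covers just the "time" attribute expressed in milliseconds.

```python
MAX_BREAK_MS = 5000  # maximum pause duration stated above


def break_tag(time_ms: int) -> str:
    """Build an SSML <break> element with a duration in milliseconds."""
    if not 0 <= time_ms <= MAX_BREAK_MS:
        raise ValueError(f"break time must be between 0 and {MAX_BREAK_MS} ms")
    return f'<break time="{time_ms}ms"/>'


message = "Hey you! " + break_tag(300) + " How are you?"
print(message)
```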
<emphasis> requests that the indicated text be spoken with a certain level of emphasis, which could be strengthening or weakening the pronunciation. The levels are:
How these levels are rendered depends on the TTS engine and the language, so please test the final effect before releasing it.
Example: "Oh my, this is a <emphasis level="strong"> huge </emphasis> deal!"
<p> and <s> indicate a paragraph and a sentence respectively. The practical effect of using them is inserting automatic breaks after the sentence or paragraph, so they are simply a more convenient way to insert periods of silence in a longer text.
Different TTS engines insert different silence durations after <p> and <s> blocks, so you should experiment with them before releasing.
Example:
<p>
<s>This is the first sentence of the paragraph.</s>
<s>Here's another sentence.</s>
</p>
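When the sentences are already available as a list, the wrapping can be automated. The sketch below assumes a hypothetical helper named paragraph; it escapes characters that would otherwise be read as markup and wraps each sentence in <s> inside a single <p>.

```python
from xml.sax.saxutils import escape


def paragraph(sentences: list[str]) -> str:
    """Wrap each sentence in <s> and the whole group in <p>."""
    # escape() replaces &, < and > so plain text cannot be mistaken for tags.
    body = "".join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return f"<p>{body}</p>"


message = paragraph([
    "This is the first sentence of the paragraph.",
    "Here's another sentence.",
])
print(message)
```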
The <phoneme> tag forces the TTS engine to pronounce certain terms in a specific way, defined using a phonetic alphabet. Typical cases are surnames, city names, company names, or any foreign word that plain TTS may mispronounce in the language the chatbot is conversing in.
The phonetic alphabet supported by PhoneMyBot is "IPA", entered using "x-sampa", an ASCII-based notation for IPA. IPA is maintained by the International Phonetic Association. See for instance https://en.wikipedia.org/wiki/X-SAMPA for more information about x-sampa.
As an example, suppose you want the TTS engine to pronounce the name of the Italian actor Roberto Benigni. This is difficult because the sound "gni" in Italian has no equivalent in English. You can do it as follows:
"The actor and director Roberto <phoneme alphabet='x-sampa' ph='bE"nI:JI:'>Benigni </phoneme> won the Oscar as best actor for the movie Life is Beautiful."
The <prosody> tag controls the pitch, speaking rate and volume of the speech output. The attributes are:
Examples:
"I am saying this <prosody pitch="x-high"> really high </prosody>"
"Medicine commercials have <prosody rate="2"> very fast disclaimers </prosody>"
"I receive you <prosody volume="loud"> loud and clear </prosody>"
The <say-as> tag instructs the TTS engine to pronounce a word or phrase in a certain way, for instance by spelling it out.
The attributes are: