mirror of https://github.com/jambonz/next-static-site.git (synced 2025-12-19 04:47:44 +00:00)
Release/0.8.2 (#59)
* add custom speech docs
* add support for speech api
* minor updates
* minor docs tweak
* custom speech docs
@@ -119,6 +119,13 @@ navi:
  -
    path: command
    title: command
  -
    path: speech-api
    title: Speech API
    pages:
      -
        path: overview
        title: Custom Speech API
      -
        path: client-sdks
        title: Client SDKs
17  markdown/docs/speech-api/overview.md  Normal file
@@ -0,0 +1,17 @@
# Custom speech API

> Added in 0.8.2

jambonz natively supports a large number of [STT](/docs/webhooks/recognizer) and [TTS](/docs/webhooks/say) speech providers out of the box, but if you want to use a speech provider that we don't support: no problem! You can also add support for new speech vendors using our Speech APIs:

- [custom STT API](/docs/supporting-articles/custom-speech-stt) (websocket-based)
- [custom TTS API](/docs/supporting-articles/custom-speech-tts) (http-based)

Adding support for a new STT or TTS vendor requires you to implement a server process that implements our APIs as described above and integrates on the other side with your chosen vendor. Check out [this example code](https://github.com/jambonz/custom-speech-example), which shows how to integrate [AssemblyAI](https://www.assemblyai.com/) as a custom STT vendor, among others.

Once you have added support for a new vendor, you can start using it immediately by logging into the jambonz portal and adding a new Speech service for that vendor. You will specify the URL(s) that your server exposes, as well as an api key that you create to secure the endpoint:



Then, in the Application view, simply select your custom vendor:


79  markdown/docs/supporting-articles/custom-speech-stt.md  Normal file
@@ -0,0 +1,79 @@
# Speech-to-text API

> Added in 0.8.2

jambonz provides native support for lots of speech recognition vendors, but if you want to integrate with a vendor we don't yet support, you can easily do so by writing to our API.

The STT API is based on Websockets.

jambonz opens a Websocket connection towards a URL that you specify, and sends audio (as binary frames) as well as JSON control messages (as text frames) to your server. Your server is responsible for implementing the interface to your chosen speech vendor and returning results in JSON format back over the Websocket connection to jambonz.

Want to look at some working code? Check out [these examples](https://github.com/jambonz/custom-speech-example).

## Authentication

An Authorization header is sent by jambonz on the HTTP request that creates the Websocket connection. The Authorization header contains an api key, e.g.

```http
Authorization: Bearer <apiKey>
```

When you create a custom speech vendor in the jambonz portal you will specify an api key, which is then provided in the Authorization header whenever that custom speech vendor is used in your application.

In the example below, we create a custom speech service for [AssemblyAI](https://www.assemblyai.com/docs) and add an apiKey of 'foobarbazzle'.

> Note: this is *not* the API key that you may get from AssemblyAI to use their service.



## Control messages sent by jambonz

Control messages are sent as JSON text frames. Audio is sent as binary frames containing linear16 pcm-encoded audio at 8kHz sampling.

The first message that you will receive from jambonz after accepting and upgrading the HTTP request to a Websocket connection is a "start" control message, followed by binary audio frames.

### Start control message

| property | type | description |
| ---------|------|-------------|
| type | String | "start" |
| language | String | ISO language code (e.g. "en-US") |
| format | String | Defines the audio format. Currently will always be "raw" |
| encoding | String | Defines how the audio is encoded. Currently will always be "LINEAR16" |
| interimResults | Boolean | Whether or not interim (partial) results are being requested |
| sampleRateHz | Number | Sample rate of the audio. Currently will always be 8000 |
| options | Object | Any options that the application is passing on to the recognizer. This object may be empty |
| options.hints | Array or Object | Any dynamic hints provided by the application |
| options.hintsBoost | Number | A boost value to apply to the provided hints |
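
For illustration, a "start" message assembled from the properties above might look like the following (the hint values are made up for the example):

```json
{
  "type": "start",
  "language": "en-US",
  "format": "raw",
  "encoding": "LINEAR16",
  "interimResults": true,
  "sampleRateHz": 8000,
  "options": {
    "hints": ["jambonz", "assemblyai"],
    "hintsBoost": 20
  }
}
```
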
### Stop control message

jambonz sends a "stop" message when it is time to stop speech recognition.

| property | type | description |
| ---------|------|-------------|
| type | String | "stop" |
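
Since the message carries no other properties, a stop frame is simply:

```json
{
  "type": "stop"
}
```
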
## Control messages sent to jambonz

Your server is responsible for sending transcriptions, as well as any errors, to jambonz.

### Transcription control message

| property | type | description |
| ---------|------|-------------|
| type | String | "transcription" |
| is_final | Boolean | Indicates whether this is a final or an interim transcription |
| alternatives | Array | An ordered list of alternative transcriptions (must contain at least one) |
| alternatives[n].transcript | String | A transcript of the speaker's utterance |
| alternatives[n].confidence | Number | A confidence probability, between 0 and 1 |
| language | String | The language that was recognized |
| channel | Number | The channel number (only relevant if diarization is being performed; defaults to 1) |
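
Putting the table together, a final transcription with a single alternative might be sent as (values illustrative):

```json
{
  "type": "transcription",
  "is_final": true,
  "alternatives": [
    {
      "transcript": "hello, I would like to check my order status",
      "confidence": 0.94
    }
  ],
  "language": "en-US",
  "channel": 1
}
```
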
### Error control message

| property | type | description |
| ---------|------|-------------|
| type | String | "error" |
| error | String | A detailed error message |
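
To tie the pieces together, here is a minimal sketch of an STT server skeleton, assuming the Node `ws` package; the port, the api key value, and the vendor hand-off are placeholders for your own integration (see the examples repo above for complete implementations):

```js
// Minimal custom STT server skeleton (assumes the Node 'ws' package).
const { WebSocketServer } = require('ws');

const API_KEY = 'foobarbazzle'; // the key configured in the jambonz portal

const wss = new WebSocketServer({
  port: 8080, // assumed port
  // reject upgrade requests that lack the expected Authorization header
  verifyClient: ({ req }) => req.headers.authorization === `Bearer ${API_KEY}`
});

wss.on('connection', (ws) => {
  ws.on('message', (data, isBinary) => {
    if (isBinary) {
      // binary frame: linear16 pcm audio at 8kHz; forward it to your vendor
      return;
    }
    const msg = JSON.parse(data.toString());
    if (msg.type === 'start') {
      // open a session with your vendor using msg.language, msg.sampleRateHz,
      // msg.interimResults and msg.options
    } else if (msg.type === 'stop') {
      // flush any pending results and tear down the vendor session
      ws.close();
    }
  });
});

// When your vendor produces a result, relay it to jambonz over the same
// connection, e.g.:
//   ws.send(JSON.stringify({
//     type: 'transcription',
//     is_final: true,
//     alternatives: [{ transcript, confidence }],
//     language: 'en-US',
//     channel: 1
//   }));
```
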
41  markdown/docs/supporting-articles/custom-speech-tts.md  Normal file
@@ -0,0 +1,41 @@
# Text-to-speech API

> Added in 0.8.2

jambonz provides native support for lots of text-to-speech vendors, but if you want to integrate with a vendor we don't yet support, you can easily do so by writing to our API.

The TTS API is a simple http-based api.

jambonz sends an HTTP POST containing the text to be synthesized and associated properties such as language and voice. Your server is responsible for implementing the interface to your chosen speech vendor and returning an mp3 file containing the audio.

Want to look at some working code? Check out [these examples](https://github.com/jambonz/custom-speech-example).

## Authentication

An Authorization header is sent by jambonz on the HTTP request. The Authorization header contains an api key, e.g.

```http
Authorization: Bearer <apiKey>
```

When you create a custom speech vendor in the jambonz portal you will specify an api key, which is then provided in the Authorization header whenever that custom speech vendor is used in your application.

## Request body attributes

| property | type | description |
| ---------|------|-------------|
| language | String | ISO language code (e.g. "en-US") |
| voice | String | Name of the voice to use |
| type | String | "text" or "ssml" |
| text | String | Text to be synthesized (if type=ssml it should be enclosed in `<speak>` tags) |
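
As an illustration, a POST body for plain-text synthesis might look like this (the voice name is whatever your vendor expects):

```json
{
  "language": "en-US",
  "voice": "my-vendor-voice",
  "type": "text",
  "text": "Your account balance is forty-two dollars."
}
```
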
## Response body attributes

Your server should return a 200 OK containing a body with the synthesized speech in the case of success, or an HTTP error code in the case of failure. The format of the returned audio must be indicated in the Content-Type header; the following values are allowed:

- audio/mpeg (or audio/mp3) - the content should be mp3 audio (this is the preferred format to return)
- audio/wav (or audio/x-wav) - the content should be linear PCM audio with a wave header
- audio/l16;rate=8000 - the content should be linear16 audio with 8kHz sampling
- audio/l16;rate=16000 - the content should be linear16 audio with 16kHz sampling
- audio/l16;rate=24000 - the content should be linear16 audio with 24kHz sampling
- audio/l16;rate=32000 - the content should be linear16 audio with 32kHz sampling
- audio/l16;rate=48000 - the content should be linear16 audio with 48kHz sampling
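
A minimal sketch of such an endpoint, assuming Express; the route path, port, and `synthesizeWithVendor()` helper are hypothetical placeholders for your own integration:

```js
const express = require('express');

const API_KEY = 'foobarbazzle'; // the key configured in the jambonz portal
const app = express();
app.use(express.json());

// stand-in for your vendor integration: must resolve with a Buffer of mp3 audio
async function synthesizeWithVendor({ language, voice, type, text }) {
  throw new Error('not implemented: call your speech vendor here');
}

app.post('/synthesize', async (req, res) => {
  // validate the api key sent by jambonz
  if (req.headers.authorization !== `Bearer ${API_KEY}`) {
    return res.sendStatus(401);
  }
  const { language, voice, type, text } = req.body;
  try {
    const mp3 = await synthesizeWithVendor({ language, voice, type, text });
    res.set('Content-Type', 'audio/mpeg').send(mp3);
  } catch (err) {
    res.sendStatus(500); // jambonz treats any HTTP error code as a failure
  }
});

app.listen(3000); // assumed port
```
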
@@ -1,20 +1,22 @@
-The `recognizer` property is used in multiple verbs (gather, transcribe, etc). It selects and configures the speech recognizer. It is an object containing the following properties:
+The `recognizer` property is used in multiple verbs ([gather](/docs/webhooks/gather), [transcribe](/docs/webhooks/transcribe), [dial](/docs/webhooks/dial)). It selects and configures the speech recognizer. It is an object containing the following properties:
 
 | option | description | required |
 | ------------- |-------------| -----|
-| vendor | Speech vendor to use (google, aws, microsoft, deepgram, nuance, nvidia, and ibm are currently supported) | no |
+| vendor | Speech vendor to use (google, aws, microsoft, deepgram, nuance, nvidia, and ibm are supported, along with any others you add via the [custom speech API](/docs/speech-api/overview/)) | no |
 | language | Language code to use for speech detection. Defaults to the application-level setting | no |
 | interim | If true, interim transcriptions are sent | no (default: false) |
 | hints | (google, microsoft, deepgram, nvidia) Array of words or phrases to assist speech detection. See [examples](#hints) below. | no |
 | hintsBoost | (google, nvidia) Number indicating the strength to assign to the configured hints. See examples below. | no |
 | profanityFilter | (google, deepgram, nuance, nvidia) If true, filter profanity from the speech transcription. Default: no | no |
 | singleUtterance | (google) If true, return only a single utterance/transcript | no (default: true for gather) |
 | vad.enable | If true, delay connecting to the cloud recognizer until speech is detected | no |
 | vad.voiceMs | If vad is enabled, the number of milliseconds of speech required before connecting to the cloud recognizer | no |
 | vad.mode | If vad is enabled, this setting governs the sensitivity of the voice activity detector; the value must be between 0 and 3 inclusive, with lower numbers meaning more sensitive | no |
 | separateRecognitionPerChannel | If true, recognize both caller and called party speech using separate recognition sessions | no |
 | altLanguages | (google, microsoft) An array of alternative languages that the speaker may be using | no |
 | punctuation | (google) Enable automatic punctuation | no |
 | model | (google) Speech recognition model to use | no (default: phone_call) |
 | enhancedModel | (google) Use enhanced model | no |
 | words | (google) Enable word offsets | no |
 | diarization | (google) Enable speaker diarization | no |
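
For orientation, a `recognizer` object configured with these options might appear inside a [gather](/docs/webhooks/gather) verb roughly as follows (the vendor and hint values are illustrative):

```json
{
  "verb": "gather",
  "input": ["speech"],
  "recognizer": {
    "vendor": "google",
    "language": "en-US",
    "interim": false,
    "hints": ["sales", "support"],
    "hintsBoost": 50
  }
}
```
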
@@ -1,6 +1,6 @@
 # say
 
-The say command is used to send synthesized speech to the remote party. The text provided may be either plain text or may use SSML tags. The following vendors are supported: google, microsoft, aws, nuance, nvidia, ibm, and wellsaid,
+The say command is used to send synthesized speech to the remote party. The text provided may be either plain text or may use SSML tags. The following vendors are supported: google, microsoft, aws, nuance, nvidia, ibm, and wellsaid; along with any others you add via the [custom speech API](/docs/supporting-articles/custom-speech-tts).
 
 ```json
 {
@@ -18,7 +18,7 @@ You can use the following options in the `say` action:
 | option | description | required |
 | ------------- |-------------| -----|
 | text | text to speak; may contain SSML tags | yes |
-| synthesizer.vendor | speech vendor to use| no |
+| synthesizer.vendor | speech vendor to use (google, aws, microsoft, nuance, nvidia, and ibm are supported, along with any others you add via the [custom speech API](/docs/speech-api/overview/)) | no |
 | synthesizer.language | language code to use. | no |
 | synthesizer.gender | (Google only) MALE, FEMALE, or NEUTRAL. | no |
 | synthesizer.voice | voice to use. Note that the voice list differs depending on whether you are using aws or Google. Defaults to application setting, if provided. | no |
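
For example, a say verb selecting a specific synthesizer might look roughly like this (the text and voice are illustrative):

```json
{
  "verb": "say",
  "text": "Thanks for calling. Goodbye!",
  "synthesizer": {
    "vendor": "google",
    "language": "en-US",
    "voice": "en-US-Standard-C"
  }
}
```
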
BIN  public/images/creating-custom-stt-vendor.png  Normal file (binary file not shown; after: 189 KiB)
BIN  public/images/using-custom-speech.png  Normal file (binary file not shown; after: 166 KiB)