add dub verb (#84)

* add dub verb

* wip

* wip

* wip

* wip

* wip

* more docs

* wip
Dave Horton
2024-03-25 08:07:45 -04:00
committed by GitHub
parent 33a04e42d9
commit 1cd472be2c
11 changed files with 183 additions and 11 deletions


@@ -27,6 +27,9 @@ navi:
-
path: dtmf
title: dtmf
-
path: dub
title: dub
-
path: enqueue
title: enqueue


@@ -3,7 +3,7 @@ Continuous ASR (automatic speech recognition) is a feature that allows the speec
As an example, consider someone speaking a 5-digit customer pin. It might be common for them to pause while speaking as they struggle to remember the full digit sequence. They might say, for instance, "four eight two ...pause... five nine". In this situation, normal STT will be triggered by the pause and will return "482" as the full utterance. In a case where the jambonz application is submitting the user input to an AI bot, the input will be invalid.
Continuous ASR applies to the [gather](/docs/webhooks/gather) verb and provides the ability to specify some additional options that help ensure the collection of the full customer pin in the example above. Two additional options are provided:
- `asrTimeout`: this is a duration of silence, in seconds, to wait after a transcript is received from the STT vendor before returning the result. If another transcript is received before this timeout elapses, then the transcripts are combined and recognition continues. The combined transcripts are returned once a timeout between utterances exceeds this value or a specified dtmf termination key is detected (see below)
- `asrDtmfTerminationDigit`: a DTMF key which, if entered, will also terminate the gather operation and immediately return the collected results.
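As a sketch (the hook path and vendor are illustrative; the two continuous-ASR properties are supplied under the gather's `recognizer` options), a gather collecting the pin above might look like:

```json
{
  "verb": "gather",
  "input": ["speech"],
  "actionHook": "/pin-collected",
  "recognizer": {
    "vendor": "google",
    "asrTimeout": 2,
    "asrDtmfTerminationDigit": "#"
  }
}
```

With these settings, a pause shorter than 2 seconds simply extends the utterance, while pressing `#` returns the combined transcript immediately.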


@@ -0,0 +1,47 @@
# Using dub tracks
jambonz allows you to insert additional audio tracks into the conversation; i.e., to "dub in" additional tracks using the [dub](/docs/webhooks/dub) verb.
One common usage is to play background ambient noise to simulate an office environment, but there are many possible ways to use this feature.
```js
const app = new WebhookResponse();
app
  .dub({
    action: 'addTrack',
    track: 'background-music',
    play: 'https://djfyg.xyz/office-sounds.mp3',
    loop: true,
    gain: '-10 dB'
  });
// ...continue with other verbs
res.status(200).json(app);
```
The `dub` verb is non-blocking: the audio is started (or stopped, as the case may be) and execution continues immediately. A dub track can also be silenced, or removed entirely, at any time during the call. Dub tracks are identified by a user-specified name, e.g. "office sounds".
When a dub track is silenced, the audio is stopped and the audio source is removed from the track, but the track itself is left in place even though it is not generating any audio. To restart audio in the track, simply issue another `playOnTrack` or `sayOnTrack` dub command. Remember that you must specify the audio source (or text) again when you restart it, even if you are using the same source as before.
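For example, expressed as raw verb payloads (reusing the track name and url from the example above), silencing a track and later restarting it might look like:

```json
[
  {
    "verb": "dub",
    "action": "silenceTrack",
    "track": "background-music"
  },
  {
    "verb": "dub",
    "action": "playOnTrack",
    "track": "background-music",
    "play": "https://djfyg.xyz/office-sounds.mp3"
  }
]
```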
Often, when playing audio into a dub track you will want to decrease the volume (or, less frequently, increase it). You can do this using the `gain` option, which specifies a number of decibels by which to increase or decrease the volume. Decibels are a logarithmic scale; as a rule of thumb, -6 dB reduces the audio signal strength by roughly half. You can also loop the audio continuously or play it once.
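The rule of thumb above can be checked numerically: a gain of g dB scales the signal amplitude by a factor of 10^(g/20). This small standalone snippet (not jambonz-specific) shows the math:

```javascript
// Convert a gain in decibels to a linear amplitude ratio: ratio = 10^(dB / 20)
function dbToAmplitude(db) {
  return Math.pow(10, db / 20);
}

// -6 dB is roughly half amplitude; -10 dB (as in the example above) is about a third
console.log(dbToAmplitude(-6).toFixed(3));  // ~0.501
console.log(dbToAmplitude(-10).toFixed(3)); // ~0.316
```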
Sometimes, while playing audio into a dub track you may also want to adjust the audio signal strength in the main track. You can do so using the `config.boostAudioSignal` action:
```js
const app = new WebhookResponse();
app
  .config({
    boostAudioSignal: '+1 dB'
  })
  .dub({
    action: 'addTrack',
    track: 'background-music',
    play: 'https://djfyg.xyz/office-sounds.mp3',
    loop: true,
    gain: '-10 dB'
  });
```
Dub tracks are automatically removed when the call ends, so there is no need to explicitly issue a dub verb with a `removeTrack` action during the call unless you are completely done with playing audio into the track for that call.
The `dial` verb also allows a nested `dub` verb, which causes the party answering the call to hear the dubbed audio track. Note that dub track audio is only ever sent to one party, so in the case of a dial where dubbed audio is sent to the called party, the calling party will not hear that audio track.
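A sketch of a `dial` with a nested `dub` (the target number and url are illustrative; see the [dial](/docs/webhooks/dial) verb for the full set of dial properties):

```json
{
  "verb": "dial",
  "target": [{"type": "phone", "number": "15551234567"}],
  "dub": {
    "action": "addTrack",
    "track": "background-music",
    "play": "https://djfyg.xyz/office-sounds.mp3",
    "loop": true,
    "gain": "-10 dB"
  }
}
```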
Finally, note that the purpose of a dub track is to blend audio into the call more or less continuously, so do not use the `dub` verb for something like playing a typing sound or other "thinking noise" while an app or AI is processing a user response. For that, use the [filler noise](/docs/supporting-articles/using-filler-noise) feature instead.


@@ -0,0 +1,49 @@
# Using filler noise
Sometimes in conversational AI scenarios there may be significant latency while the remote application processes a user response and determines the next action to take. In these scenarios it is common to play a typing sound or other audio cue to let the caller know that the system is processing the response, that the agent is "thinking" or retrieving information, etc.
Support for "filler noise" can be enabled either at the session level using the `config.fillerNoise` property or at the individual `gather` level using the same property. In the example below, we set a session-wide setting for filler noise (in the form of a typing sound) to kick in after waiting 2 seconds for the remote app to respond to user input.
```js
/* websocket application */
session
  .config({
    fillerNoise: {
      enable: true,
      url: 'https://dygys.xyz/keyboard-typing.mp3',
      startDelaySecs: 2
    }
  })
  .gather({
    say: {text: 'How can I help you today?'},
    input: ['speech'],
    // ...other gather options
  })
  .send();
```
Later in the app, we may decide to start the filler noise immediately because we know that processing this particular user response could be time-consuming.
```js
/* websocket application */
session
  .config({
    fillerNoise: {
      enable: true,
      url: 'https://dygys.xyz/keyboard-typing.mp3',
      startDelaySecs: 2
    }
  })
  .gather({
    say: {text: 'OK, would you like me to go ahead and book the flight for you?'},
    input: ['speech'],
    fillerNoise: {
      enable: true,
      startDelaySecs: 0
    },
    // ...other gather options
  })
  .send();
```
Note that we could also have overridden the url at the gather level, but in this case we chose only to override the delay (setting it to zero) and use the session-level typing sound.


@@ -27,17 +27,22 @@ You can use the following attributes in the `config` command:
| option | description | required |
| ------------- |-------------| -----|
|amd|enable answering machine detection; see [answering machine detection](/docs/supporting-articles/answering-machine-detection) for details.|no|
|bargein|this object contains properties that are used to instantiate a 'background' [gather verb](/docs/webhooks/gather).|no|
| bargeIn.enable| if true, begin listening for speech or dtmf input while the session is executing other verbs. This is known as a "background gather" and allows an application to capture user input outside of a [gather verb](/docs/webhooks/gather). If false, stop any background listening task that is in progress.| no|
| bargeIn.sticky | If true and bargeIn.enable is true, then when the background gather completes with speech or dtmf detected, it will automatically start another background gather.|no|
| bargeIn.actionHook | A webhook to call if user input is collected from the background gather.| no |
| bargeIn.input |Array, specifying allowed types of input: ['digits'], ['speech'], or ['digits', 'speech']. | yes |
| bargeIn.finishOnKey | Dtmf key that signals the end of dtmf input. | no |
| bargeIn.numDigits | Exact number of dtmf digits expected to gather. | no |
| bargeIn.minDigits | Minimum number of dtmf digits expected to gather. Defaults to 1. | no |
| bargeIn.maxDigits | Maximum number of dtmf digits expected to gather. | no |
| bargeIn.interDigitTimeout | Amount of time to wait between digits after minDigits have been entered.| no |
|boostAudioSignal| A string or integer value indicating the number of decibels to boost or reduce the strength of the outgoing audio signal to the caller/called party, e.g. "-6 dB". Note this applies to the main track only, not to any [dub](/docs/webhooks/dub) tracks.| no |
| fillerNoise | play audio to the caller while the remote application is processing gather transcriptions. This is a session-wide setting that may be overridden at the [gather verb](/docs/webhooks/gather) level. See [Using filler noise](/docs/supporting-articles/using-filler-noise) for more details.| no |
| fillerNoise.enable | boolean, whether to enable or disable filler noise | yes |
| fillerNoise.url | http(s) audio to play as filler noise | yes |
| fillerNoise.startDelaySecs | integer value specifying number of seconds to wait for a response from the remote application before playing filler noise | no (default: play immediately after sending results) |
| listen | a nested [listen](/docs/webhooks/listen) action, which allows recording of the call from this point forward by streaming the audio to a remote server over a websocket connection | no |
| notifyEvents | boolean, whether to enable event notifications (verb:status messages) over websocket connections. Verbs that are sent over the websocket must also contain an "id" property to activate this feature.|no|
|onHoldMusic | string, provides the URL to a remote music source to use when a call is placed on hold|no|
@@ -47,6 +52,10 @@ You can use the following attributes in the `config` command:
|record.action|"startCallRecording", "stopCallRecording", "pauseCallRecording", or "resumeCallRecording"|yes|
|record.siprecServerURL|sip uri for SIPREC server|required if action is "startCallRecording"|
|record.recordingID|user-supplied string to identify the recording|no|
|transcribe| a nested [transcribe](/docs/webhooks/transcribe) action, which allows a transcription of the call to be sent in the background | no |
|transcribe.enable| boolean, if true start the transcribe, if false stop it | yes |
|transcribe.transcriptionHook| the webhook/websocket identifier to send transcriptions to | yes if enabling transcription |
|transcribe.recognizer| [recognizer](/docs/webhooks/recognizer) options | no |
|sipRequestWithinDialogHook|object or string, a webhook to call when a sip request is received within the dialog (e.g. an INFO, NOTIFY, or REFER)|no|
| synthesizer | change the session-level default text-to-speech settings. See [the say verb](/docs/webhooks/say) for details on the `synthesizer` property.| no |
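Pulling a few of these attributes together, an illustrative `config` payload (the urls, hook paths, and values are made up for this sketch) might be:

```json
{
  "verb": "config",
  "boostAudioSignal": "-3 dB",
  "fillerNoise": {
    "enable": true,
    "url": "https://example.com/typing.mp3",
    "startDelaySecs": 2
  },
  "transcribe": {
    "enable": true,
    "transcriptionHook": "/transcripts",
    "recognizer": {"vendor": "google", "language": "en-US"}
  }
}
```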


@@ -61,6 +61,7 @@ You can use the following attributes in the `dial` command:
| dialMusic | url that specifies a .wav or .mp3 audio file of custom audio or ringback to play to the caller while the outbound call is ringing. | no |
| dtmfCapture | an array of strings that represent dtmf sequence which, when detected, will trigger a mid-call notification to the application via the configured `dtmfHook` | no |
| dtmfHook | a webhook to call when a dtmfCapture entry is matched. This is a notification only -- no response is expected, and any desired actions must be carried out via the REST updateCall API. | no|
| dub | a nested [dub](/docs/webhooks/dub) verb to add additional audio tracks into the outbound call. | no |
| headers | an object containing arbitrary sip headers to apply to the outbound call attempt(s) | no |
| listen | a nested [listen](/docs/webhooks/listen) action, which will cause audio from the call to be streamed to a remote server over a websocket connection | no |
| referHook | webhook to invoke when an incoming SIP REFER is received on a dialed call. If the application wishes to accept and process the REFER, the webhook application should simply return an HTTP status code 200 with no body, and jambonz will send a SIP 202 Accepted. Otherwise, any HTTP non-success status will cause jambonz to send a SIP response to the REFER with the same status code. <br/><br/>Note that jambonz will send the 202 Accepted and do nothing further. It is the responsibility of the third-party application to then outdial a new call and bridge the other leg, presumably by using the REST API. See [this example app](https://github.com/jambonz/sip-blind-transfer) for more details.| no|


@@ -19,5 +19,5 @@ You can use the following options in the `dtmf` action:
<p class="flex">
<a href="/docs/webhooks/dialogflow">Prev: dialogflow</a>
<a href="/docs/webhooks/dub">Next: dub</a>
</p>


@@ -0,0 +1,60 @@
![Dub](/images/dubbing.png)
> Added in v0.8.6
The `dub` verb adds one or more additional audio tracks into the conversation (currently, a max of two additional audio tracks may be added). Audio can then be inserted into these tracks and it will be blended with the `play` or `say` content being sent to the caller/called party. The source of the audio content can be either text to speech or mp3 audio accessible via http(s).
Additionally, the volume (gain) of the inserted audio may be adjusted up or down. The [config.boostAudioSignal](/docs/webhooks/config) property likewise allows the volume of the main conversation channel to be adjusted.
```json
[
  {
    "verb": "dub",
    "action": "addTrack",
    "track": "ambient-noise"
  },
  {
    "verb": "dub",
    "action": "playOnTrack",
    "track": "ambient-noise",
    "play": "https://example.com/sounds/office-hubbub.mp3"
  }
]
```
Verb properties for the `dub` command:
| option | description | required |
| ------------- |-------------| -----|
| action | one of 'addTrack', 'removeTrack', 'silenceTrack', 'playOnTrack', or 'sayOnTrack' | yes |
| track | label for the track | yes |
| play | an http(s) url to an mp3 file to play into the track | no |
| say | text to convert to audio and play into the track| no |
| loop | boolean; if true, loop the mp3 | no |
| gain | a string value in the format "-6 dB" specifying decibels to boost or reduce the strength of the audio signal (note: simple integer values are accepted as well). The value supplied must be between -50 and +50 dB.| no |
The `action` values are:
- `addTrack` adds an audio track to the conversation; once added, the `play` or `say` command may be used to inject audio into the track
- `removeTrack` removes an audio track from the conversation
- `silenceTrack` silences an audio track but leaves it in place
- `playOnTrack` plays audio from an http(s) mp3 url into the audio track
- `sayOnTrack` generates text-to-speech into the audio track
Note: all tracks are automatically removed when the call completes, so if using an additional track for the entire conversation there is no need to explicitly remove it when the call ends.
Note: for convenience the `addTrack` and `playOnTrack` operations may be combined into a single `addTrack` verb; e.g.:
```json
{
"verb": "dub",
"action": "addTrack",
"track": "ambient-noise",
"play": "https://example.com/sounds/office-hubbub.mp3",
"loop": true,
"gain": "-10 dB"
}
```
See [Using dub tracks](/docs/supporting-articles/using-dub-tracks) for more information.
<p class="flex">
<a href="/docs/webhooks/dtmf">Prev: dtmf</a>
<a href="/docs/webhooks/enqueue">Next: enqueue</a>
</p>


@@ -43,6 +43,6 @@ The *waitHook* webhook will contain the following additional parameters:
You can also optionally receive [queue webhook notifications](/docs/webhooks/queue-notifications) any time a members joins or leaves a queue.
<p class="flex">
<a href="/docs/webhooks/dub">Prev: dub</a>
<a href="/docs/webhooks/gather">Next: gather</a>
</p>


@@ -45,7 +45,10 @@ You can use the following options in the `gather` command:
| numDigits | Exact number of dtmf digits expected to gather | no |
| partialResultHook | Webhook to send interim transcription results to. Partial transcriptions are only generated if this property is set. | no |
| play | nested [play](#play) Command that can be used to prompt the user | no |
| fillerNoise | play audio to the caller while the remote application is processing gather transcriptions. See [Using filler noise](/docs/supporting-articles/using-filler-noise) for more details.| no |
| fillerNoise.enable | boolean, whether to enable or disable filler noise | yes |
| fillerNoise.url | http(s) audio to play as filler noise | yes |
| fillerNoise.startDelaySecs | integer value specifying number of seconds to wait for a response from the remote application before playing filler noise | no (default: play immediately after sending results) |
| [recognizer](/docs/webhooks/recognizer) | Speech recognition options | no |
In the case of speech input, the actionHook payload will include a `speech` object with the response from Google speech; an illustrative payload (values made up) looks like this:
```json
{
  "speech": {
    "is_final": true,
    "alternatives": [
      {
        "confidence": 0.92,
        "transcript": "I'd like to book a flight"
      }
    ]
  }
}
```

BIN public/images/dubbing.png (new binary file, 125 KiB)