* wip

* wip

* wip

* attempt to fix bug in workflow

* minor
Dave Horton
2024-11-19 12:44:29 -05:00
committed by GitHub
parent 3478906024
commit 24f11fa429
4 changed files with 193 additions and 4 deletions


@@ -15,10 +15,12 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '20'
      - name: Cache node modules
        uses: actions/cache@v3
        with:
@@ -26,9 +28,23 @@ jobs:
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-
      - name: Cache Cypress binary
        uses: actions/cache@v3
        with:
          path: ~/.cache/Cypress
          key: ${{ runner.os }}-cypress-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-cypress-
      - name: Install Dependencies
        run: npm install
      - name: Install Cypress
        run: npx cypress install
      - name: Build and Start Next.js
        run: npm run build && (npm run start&) > /dev/null
      - name: Run Tests
        run: npm run test


@@ -0,0 +1,47 @@
# Controlling the media path during a call
When using the [dial verb](/docs/webhooks/hangu) to create a bridged call there are a few ways the media can be routed:
- **full media** - we call it "full media" if the audio continues to be routed through the feature server. From the perspective of the caller, their audio is routed to the jambonz SBC, through the feature server, and back out through the jambonz SBC on its outward path to the called party. This media path is necessary if an application wants to perform operations like transcribe, listen, or record the call.
- **partial media** - we call it "partial media" if the audio path for the bridged call is released from the feature server and only traverses the SBC(s) on its journey from caller to called party.
- **no media** - we call it "no media" if the audio is completely released from jambonz. In this case, the audio path runs directly from the caller's SBC to the far-end SBC or SIP trunk.
**Note**: creating a "no media" route depends entirely on the ability of the originating and terminating SBCs/gateways to accept a re-INVITE with a change of media source. Because enterprises and providers apply different policies for whitelisting media IP addresses, you should not expect this to work in all cases; test this feature in advance with your specific providers.
## Dial verb
### Default behavior
By default, when the `dial` verb is used jambonz will examine what options are used on the call and, if possible, it will release the media to the SBC (i.e. "partial media"). If, for instance, the application is recording, transcribing, or using answering machine detection, the call cannot be released to the SBC and will continue to be routed through the feature server.
The following options can be used to override this behavior:
### I don't want to release the media.
Even if the media _could_ be released to the SBC, you may for your own reasons prefer to continue routing it through the feature server. To do so, simply set the `dial.anchorMedia` property to true. This creates a "full media" bridged call.
### I want to release the media completely.
If you want to release the media completely from the jambonz system, set the `dial.exitMediaPath` property to true. This will attempt to create a "no media" bridged call - but keep in mind the warning above: this depends on the third-party providers/SBCs you are connecting to.
Note also that the 'no media' path is only possible if the `JAMBONES_ENABLE_FULL_MEDIA_RELEASE` environment variable has been set for both the sbc-inbound and sbc-outbound applications.
## Dynamically changing the media path during a call
The instructions above let you set a preferred media path at the start of the `dial` verb. You can also change the media path dynamically during a call, either by making a live call control REST API request or by sending the equivalent command over a websocket application connection.
For instance, using the [@jambonz/node-client-ws](https://www.npmjs.com/package/@jambonz/node-client-ws) library you could switch an existing dialed call in progress to a "no media" call so that an agent could take a PCI compliant credit card transaction by doing the following:
```js
session.injectCommand('media:path', 'no-media');
```
Later, once the transaction has been completed, you could switch the call back to a "full media" call:
```js
session.injectCommand('media:path', 'full-media');
```
Due to the switching of the audio path, both parties will hear a brief loss of audio while the audio path is re-established.
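Putting the two commands together, here is a minimal sketch of the PCI flow described above. The `session` object is assumed to come from @jambonz/node-client-ws as in the earlier snippets; `takePayment` is a hypothetical placeholder for your own hand-off logic:

```javascript
// Sketch: release media for a PCI-sensitive segment, then re-anchor it.
// `session` is assumed to be a live session from @jambonz/node-client-ws;
// `takePayment` is a hypothetical placeholder for your agent/IVR hand-off.
async function runPciSegment(session, takePayment) {
  // Re-INVITE jambonz out of the media path; audio now flows SBC-to-SBC.
  session.injectCommand('media:path', 'no-media');
  try {
    await takePayment();
  } finally {
    // Restore full media so transcribe/listen/record work again.
    // Both parties will hear a brief audio gap during the switch.
    session.injectCommand('media:path', 'full-media');
  }
}
```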


@@ -0,0 +1,117 @@
An increasingly common use case is to want to stream text tokens from an LLM and have jambonz play them out in realtime or as close to realtime as possible. Currently, jambonz supports text-to-speech using either the `say` verb or the bidirectional `listen` verb and neither quite meets this need.
The `say` verb is synchronous, in the sense that you have to feed it a full, standalone piece of text; the `listen` verb handles bidirectional audio, but it can be somewhat complicated to synchronize that playout stream with the other verbs your app might be sending over the application webhook/websocket connection.
This article contains a proposal for supporting TTS streaming from LLMs over the application websocket, making it easier and more natural to implement in jambonz applications.
> Note: Use of the [websocket api](https://www.jambonz.org/docs/ws/overview/) is required to access this feature; given the asynchronous nature of the use case it is not possible to easily support using webhooks.
# Proposed changes
We will break down the changes as follows:
- Changes to the `say` verb
- Changes to the websocket api
We will then also show proposed changes to the [node-client-ws](https://github.com/jambonz/node-client-ws) library to support this, and examine what a sample application written to take advantage of tts streaming will look like.
## Proposed changes to the "say" verb
We propose to make very simple changes to the "say" verb:
- Instead of providing the `text` property you can alternatively provide a `stream` property with value "true".
- Additionally, when doing so you can optionally provide a `closeOnStreamEmpty` property which, when set to true, will cause the `say` verb to end if all streamed tokens have been played out and the token input buffer is empty. The default value of this property shall be true.
For example, in a `gather` with a nested say you could simply do:
```js
session
.gather({
say: {stream: true},
input: ['speech'],
actionHook: '/echo',
timeout: 15,
})
```
This can be either a gather that is running in the foreground or in the background.
When a `say` with streaming enabled is used, as in the above example, the text is provided by the application as a series of commands over the websocket. The text tokens provided will then be streamed to the configured TTS vendor.
> Note: use of this feature requires selection of a TTS vendor that supports streaming. The initial vendors we intend to support and test with include Deepgram and PlayHT.
The behavior of the `say` verb when using the `stream` property will generally be unchanged, other than the fact that tts streaming is used. In particular, your app can still expect that:
- If the user barges in by speaking, the audio will be killed.
- Your app can receive a `verb:status` message over the websocket with "speech-bargein-detected" when the user barges in.
- The say verb will end when the audio completes playing out.
- If the say verb is killed the audio will be stopped.
We do not see any other changes needed to the jambonz verbs to support this feature, and we see value in keeping things this simple in terms of the changes required of application developers.
## Proposed changes to the websocket api
The changes to the websocket api will involve some new messages sent by jambonz to the application and some new messages sent by the application to jambonz.
### jambonz -> application
One new message type will be sent from jambonz to the application: `tts:streaming-event`.
|message type|sent by|usage|
|---|---|---|
|tts:streaming-event|jambonz|will include an `event_type` property as described below|
The message will include one of the following event_types:
- `stream_open`: any text tokens sent from this time on will be immediately processed and played out
- `stream_closed`: any text tokens sent from this time on will be queued for processing once the stream is open again
- `stream_paused`: if the application has provided too many tokens, jambonz will ask the application to throttle by sending this event.
- `stream_resumed`: if the stream has been paused per above, after enough queued tokens have been processed this event will be sent to indicate that the application may again send tokens.
- `stream_error`: reports a processing error of some kind to the application.
- `audio_playback_start`: the first byte of audio received from the TTS vendor has been played out. This event will follow the `stream_open` event and will be sent only once, after that preceding event, to notify the application that the user is now hearing the audio.
- `audio_playback_done`: while the stream is open this event will be sent if all received and queued text tokens have been played out.
- `user_bargein`: the user has barged in and the audio playback has been stopped and any queued audio has been flushed.
The meanings of stream_open and stream_closed deserve a bit more explanation.
When an application first starts, it may begin sending text tokens at any time. However, if jambonz is not currently executing a `say` verb those tokens will be queued. When jambonz navigates to a `say` verb using the `stream` property, then a `stream_open` event type will be sent to the application (and any queued text tokens will be processed). When a `say` verb ends, the `stream_closed` event type will be sent, indicating that any text tokens received at this point will be queued until another streaming `say` is executed.
When jambonz sends an event_type of `stream_open` it will include a property indicating the number of queued words that it is now processing, if any.
This notion of "stream" refers conceptually to the stream between the application and jambonz; it does not mean, for instance, that we will be connecting and disconnecting from the TTS provider during a call session. Rather, when jambonz first executes a `say` with `stream` it will connect to the TTS provider and maintain that connection for the remainder of the call.
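An application might track these event types with a small state machine. The sketch below is illustrative only (plain objects stand in for parsed websocket messages, and `send` stands in for whatever writes a `tts:tokens` frame): it buffers outbound tokens while the stream is closed or paused and flushes them when jambonz signals that sending may resume.

```javascript
// Illustrative flow control for tts:streaming-event messages.
// `send` is whatever writes a tts:tokens frame to the websocket.
function makeTokenGate(send) {
  let open = false;      // stream_open received and not paused/closed
  const pending = [];    // tokens buffered while we may not send

  return {
    // Called for each parsed tts:streaming-event message from jambonz.
    onStreamingEvent(evt) {
      switch (evt.event_type) {
        case 'stream_open':
        case 'stream_resumed':
          open = true;
          while (pending.length) send(pending.shift());
          break;
        case 'stream_closed':
        case 'stream_paused':
          open = false;
          break;
        case 'stream_error':
          open = false;
          pending.length = 0;  // drop buffered tokens on error
          break;
      }
    },
    // Called by the app whenever it has new text tokens to play out.
    sendTokens(tokens) {
      if (open) send(tokens);
      else pending.push(tokens);
    }
  };
}
```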
### application -> jambonz
The websocket server application may send the following new messages to jambonz:
|message type|sent by|usage|
|---|---|---|
|tts:tokens|application|text tokens that should be played out, see below for additional properties|
|tts:flush|application|a command to tell jambonz to kill the audio as well as any queued tokens|
The application can send text tokens at any time to jambonz using the new `tts:tokens` command. The payload shall include a `tokens` property containing the text to stream.
Additionally, the payload must include an `id` property that uniquely identifies this particular set of tokens. The id may be a number or a string. This id will be returned to the application in the `last_processed_id` property of a `stream_paused` or `stream_resumed` event sent by jambonz so that the application is able to synchronize the pause/resume of a token stream when necessary.
```js
{
type: "tts:tokens",
  id: 100,
  tokens: "It was the year 1500, an important time for Portugal,"
}
```
> Note that you cannot specify a tts vendor, language or voice in the `tts:tokens` command. That is still done in the same way as before: using either application defaults or overriding with `say.synthesizer`.
## Proposed changes to the npm client
The `session` class will have the following new methods:
- `sendTextTokens`: used to send a `tts:tokens` message. This can be used at any time to asynchronously send text tokens for tts streaming.
- `flushTextTokens`: used to send a `tts:flush` message, also can be sent asynchronously.
and the `session` class will emit the following new events:
- `tts:stream_open`
- `tts:stream_closed`
- `tts:stream_paused`
- `tts:stream_resumed`
- `tts:stream_error`
- `tts:audio_playback_start`
- `tts:audio_playback_done`
- `tts:user_bargein`
These events will have the meaning described above.
## Example use cases
A good way to implement LLM integration would be to have an app that does a background gather with a nested say using the `stream` property. As the LLM streams text tokens you simply pipe them on to jambonz using the `tts:tokens` message type. If you get an indication that the caller barged in via the `user_bargein` message type, you accordingly tell the LLM to cancel processing. When you receive the next user transcript you then reengage the LLM, send a completion request and begin streaming text tokens again.
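That pipeline could be sketched as follows using the proposed `session` methods and events. The LLM stream is mocked here as an async iterable of text chunks (real code would consume your LLM client's streaming response), and the exact payload shape accepted by `sendTextTokens` is an assumption based on the `tts:tokens` message described above:

```javascript
// Pipe streamed LLM tokens to jambonz, stopping if the caller barges in.
// `session` is assumed to expose the proposed sendTextTokens/flushTextTokens
// methods and emit the proposed tts:* events; the payload shape is assumed.
async function pipeLlmToJambonz(session, llmStream) {
  let cancelled = false;
  let id = 0;
  session.on('tts:user_bargein', () => {
    cancelled = true;             // caller spoke; stop feeding tokens
    session.flushTextTokens();    // and discard anything already queued
  });
  for await (const chunk of llmStream) {
    if (cancelled) break;
    session.sendTextTokens({ id: ++id, tokens: chunk });
  }
  return cancelled;
}
```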


@@ -53,15 +53,17 @@ You can use the following attributes in the `dial` command:
| option | description | required |
| ------------- |-------------| -----|
| actionHook | webhook to invoke when the call ends. The webhook will include [properties](#dial-action-properties) describing the outcome of the call attempt.| no |
| actionHook | webhook to invoke when the call ends. The webhook will include [properties](#h5-actionhook-properties) describing the outcome of the call attempt.| no |
|amd|enable answering machine detection; see [answering machine detection](/docs/supporting-articles/answering-machine-detection) for details|no|
| anchorMedia | if true, jambonz will not release the media from freeswitch for the bridged call. [See here](/docs/supporting-articles/controlling-media-path-during-call) for details. Default: false | no |
| answerOnBridge | If set to true, the inbound call will ring until the number that was dialed answers the call, and at that point a 200 OK will be sent on the inbound leg. If false, the inbound call will be answered immediately as the outbound call is placed. <br/>Defaults to false. | no |
| callerId | The inbound caller's phone number, which is displayed to the number that was dialed. The caller ID must be a valid E.164 number. <br/>Defaults to caller id on inbound call. | no |
| confirmHook | webhook for an application to run on the callee's end after the dialed number answers but before the call is connected. This allows the caller to provide information to the dialed number, giving them the opportunity to decline the call, before they answer the call. Note that if you want to run different applications on specific destinations, you can specify the 'url' property on the nested [target](#target-types) object. | no |
| confirmHook | webhook for an application to run on the callee's end after the dialed number answers but before the call is connected. This allows the caller to provide information to the dialed number, giving them the opportunity to decline the call, before they answer the call. Note that if you want to run different applications on specific destinations, you can specify the 'url' property on the nested [target](#h5-target-types) object. | no |
| dialMusic | url that specifies a .wav or .mp3 audio file of custom audio or ringback to play to the caller while the outbound call is ringing. | no |
| dtmfCapture | an array of strings that represent dtmf sequence which, when detected, will trigger a mid-call notification to the application via the configured `dtmfHook` | no |
| dtmfHook | a webhook to call when a dtmfCapture entry is matched. This is a notification only -- no response is expected, and any desired actions must be carried out via the REST updateCall API. | no|
| dub | a nested [dub](/docs/webhooks/dub) verb to add additional audio tracks into the outbound call. | no |
| exitMediaPath | if true, jambonz will attempt to re-invite itself completely out of the media path for the call; [see below](#h5-exitmediapath) for details. Defaults to false| no |
| headers | an object containing arbitrary sip headers to apply to the outbound call attempt(s) | no |
| listen | a nested [listen](/docs/webhooks/listen) action, which will cause audio from the call to be streamed to a remote server over a websocket connection | no |
| referHook | webhook to invoke when an incoming SIP REFER is received on a dialed call. If the application wishes to accept and process the REFER, the webhook application should simply return an HTTP status code 200 with no body, and jambonz will send a SIP 202 Accepted. Otherwise, any HTTP non-success status will cause jambonz to send a SIP response to the REFER with the same status code. <br/><br/>Note that jambonz will send the 202 Accepted and do nothing further. It is the responsibility of the third-party application to then outdial a new call and bridge the other leg, presumably by using the REST API. See [this example app](https://github.com/jambonz/sip-blind-transfer) for more details.| no|
@@ -70,7 +72,7 @@ You can use the following attributes in the `dial` command:
| timeout | ring no answer timeout, in seconds. <br/>Defaults to 60. | no |
| transcribe | a nested [transcribe](#transcribe) action, which will cause the call to be transcribed | no |
<h5 id="target-types">target types</h5>
##### target types
*PSTN number*
@@ -118,7 +120,7 @@ The `confirmHook` property that can be optionally specified as part of the targe
This allows you to easily implement call screening applications (e.g. "You have a call from so-and-so. Press 1 to decline").
<h5 id="dial-action-properties">actionHook properties</h5>
##### actionHook properties
The actionHook that is invoked when the dial command ends will include the following properties:
@@ -137,6 +139,13 @@ The webhook that is invoked when amd property is included and jambonz has either
| event | one of 'amd', 'beep', or 'silence' |
| amd_type| 'human' or 'machine', only provided when event = 'amd'|
##### exitMediaPath
> Added in 0.9.3
The purpose of the `exitMediaPath` property is to support use cases where the media path must not touch the jambonz system at all. The common use case is transferring a call to a human agent or a credit card system where the caller will give their credit card details over the phone. For a PCI-compliant transaction, that conversation must not be recorded, stored, or in any way reach the jambonz system. Performing the `dial` verb with the `exitMediaPath` property ensures this.
For more details on controlling the media path during a call, please see [Controlling media path during call](/docs/supporting-articles/controlling-media-path-during-call).
<p class="flex">
<a href="/docs/webhooks/dequeue">Prev: dequeue</a>