An increasingly common use case is streaming text tokens from an LLM and having jambonz play them out in realtime, or as close to realtime as possible. Currently, jambonz supports text-to-speech using either the `say` verb or the bidirectional `listen` verb, and neither quite meets this need.
The "say" verb is synchronous, in the sense that you have to feed it a full, standalone piece of text and while the "listen" verb handles bidirectional audio it can be somewhat complicated to synchronize that playout stream with the other verbs that your app might be sending over the application webhook/websocket connection.
This article contains a proposal for supporting TTS streaming from LLMs over the application websocket, making it easier and more natural to implement in jambonz applications.
Note: Use of the websocket api is required to access this feature; given the asynchronous nature of the use case, it is not feasible to support it over webhooks.
Proposed changes
We will break down the changes as follows:
- Changes to the `say` verb
- Changes to the websocket api
We will then also show proposed changes to the node-client-ws library to support this, and examine what a sample application written to take advantage of tts streaming will look like.
Proposed changes to the "say" verb
We propose to make very simple changes to the "say" verb:
- Instead of providing the `text` property, you can alternatively provide a `stream` property with value `true`.
- Additionally, when doing so you can optionally provide a `closeOnStreamEmpty` property which, when set to `true`, will cause the `say` verb to end once all streamed tokens have been played out and the token input buffer is empty. The default value of this property shall be `true`.
For example, in a `gather` with a nested `say` you could simply do:
```js
session
  .gather({
    say: {stream: true},
    input: ['speech'],
    actionHook: '/echo',
    timeout: 15,
  })
```
The `gather` can be running either in the foreground or in the background.
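Presumably a standalone streaming `say` would work the same way; a minimal sketch, assuming the proposed `stream` and `closeOnStreamEmpty` properties:
```js
// a streaming say outside of a gather; setting closeOnStreamEmpty to false
// keeps the verb alive even after the token buffer drains
session
  .say({stream: true, closeOnStreamEmpty: false})
```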
When a `say` with streaming enabled is used, as in the example above, the text to be spoken is provided by the application as a series of commands over the websocket. The text tokens provided are then synthesized using the configured TTS vendor.
Note: Use of this feature requires selection of a TTS vendor that supports streaming. The initial vendors we intend to support and test with include Deepgram and PlayHT.
The behavior of the `say` verb when using the `stream` property will generally be unchanged, other than the fact that tts streaming is used. In particular, your app can still expect that:
- If the user barges in by speaking, the audio will be killed.
- Your app can receive a `verb:status` message over the websocket with "speech-bargein-detected" when the user barges in.
- The `say` verb will end when the audio completes playing out.
- If the `say` verb is killed, the audio will be stopped.
We do not see any other changes needed to the jambonz verbs to support this feature, and we see value in keeping the changes this simple for application developers.
Proposed changes to the websocket api
The changes to the websocket api involve some new messages sent by jambonz to the application and some new messages sent by the application to jambonz.
jambonz -> application
One new message type will be sent from jambonz to the application: `tts:streaming-event`.
| message type | sent by | usage |
|---|---|---|
| `tts:streaming-event` | jambonz | will include an `event_type` property as described below |
The message will include one of the following `event_type` values:
- `stream_open`: any text tokens sent from this time on will be immediately processed and played out
- `stream_closed`: any text tokens sent from this time on will be queued for processing once the stream is open again
- `stream_paused`: if the application has provided too many tokens, jambonz will ask the application to throttle by sending this event
- `stream_resumed`: if the stream has been paused per the above, this event will be sent once enough queued tokens have been processed, indicating that the application may again send tokens
- `stream_error`: reports a processing error of some kind to the application
- `audio_playback_start`: the first byte of audio received from the TTS vendor has been played out. This event will follow the `stream_open` event and will be sent only once after that preceding event, to notify the application that the user is now hearing the audio
- `audio_playback_done`: while the stream is open, this event will be sent once all received and queued text tokens have been played out
- `user_bargein`: the user has barged in; audio playback has been stopped and any queued audio has been flushed
The meaning of `stream_open` and `stream_closed` deserves a bit more explanation.
When an application first starts, it may begin sending text tokens at any time. However, if jambonz is not currently executing a `say` verb those tokens will be queued. When jambonz reaches a `say` verb using the `stream` property, a `stream_open` event will be sent to the application (and any queued text tokens will be processed). When the `say` verb ends, a `stream_closed` event will be sent, indicating that any text tokens received from that point will be queued until another streaming `say` is executed.
When jambonz sends an event_type of `stream_open` it will include a property indicating the number of queued words, if any, that it is now processing.
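For illustration, a `stream_open` event might look like the following; the payload shape is sketched here, and in particular `num_queued_words` is a hypothetical name for the queued-word count property, which this proposal does not name:
```js
{
  type: "tts:streaming-event",
  event_type: "stream_open",
  num_queued_words: 12  // hypothetical property name; the proposal does not specify it
}
```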
This notion of "stream" refers conceptually to the stream between the application and jambonz, it does not mean for instance that we will be connecting and disconnecting from the TTS provider during a call session. Rather, when jambonz first executes a say with stream it shall connect to the TTS provider and it maintain that connection to the TTS provider for the remainder of the call.
application -> jambonz
The websocket server application may send the following new messages to jambonz:
| message type | sent by | usage |
|---|---|---|
| `tts:tokens` | application | text tokens that should be played out; see below for additional properties |
| `tts:flush` | application | a command telling jambonz to kill the audio as well as any queued tokens |
The application can send text tokens to jambonz at any time using the new `tts:tokens` command. The payload shall include a `tokens` property containing the text to stream.
Additionally, the payload must include an `id` property that uniquely identifies this particular set of tokens. The id may be a number or a string. This id will be returned to the application in the `last_processed_id` property of a `stream_paused` or `stream_resumed` event sent by jambonz, so that the application is able to synchronize the pause/resume of a token stream when necessary.
```js
{
  type: "tts:tokens",
  id: 100,
  tokens: "It was the year 1500, an important time for Portugal,"
}
```
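Should jambonz later pause the stream, the event would echo that id back; a sketch, assuming the event carries the same `tts:streaming-event` shape described above:
```js
{
  type: "tts:streaming-event",
  event_type: "stream_paused",
  last_processed_id: 100  // the id of the most recently processed tts:tokens message
}
```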
Note that you cannot specify a TTS vendor, language, or voice in the `tts:tokens` command. That is still done in the same way as before: using either application defaults or overriding with `say.synthesizer`.
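The `tts:flush` command is described above with no additional properties, so presumably its payload is simply:
```js
{
  type: "tts:flush"
}
```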
Proposed changes to the npm client
The session class will have the following new methods:
- `sendTextTokens`: used to send a `tts:tokens` message. This can be called at any time to asynchronously send text tokens for tts streaming.
- `flushTextTokens`: used to send a `tts:flush` message; this can also be called asynchronously.
and the session class will emit the following new events:
- `tts:stream_open`
- `tts:stream_closed`
- `tts:stream_paused`
- `tts:stream_resumed`
- `tts:stream_error`
- `tts:audio_playback_start`
- `tts:audio_playback_done`
- `tts:user_bargein`
These events will have the meaning described above.
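Taken together, an application might wire these up roughly as follows; this is a sketch, and the argument shapes of `sendTextTokens` and `flushTextTokens` are assumptions, since the proposal does not pin them down:
```js
// assumed signatures: the proposal does not specify how id/tokens are passed
session.on('tts:stream_open', () => {
  // the streaming say is active; tokens sent now are played out immediately
  session.sendTextTokens({id: 1, tokens: 'Hello! How can I help you today? '});
});
session.on('tts:stream_paused', (evt) => {
  // jambonz is asking us to throttle; stop sending until tts:stream_resumed.
  // evt.last_processed_id identifies the last set of tokens jambonz processed.
});
session.on('tts:user_bargein', () => {
  // playback was stopped and queued audio flushed; drop any queued tokens too
  session.flushTextTokens();
});
```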
Example use cases
A good way to implement LLM integration is to have an app that runs a background `gather` with a nested `say` using the `stream` property. As the LLM streams text tokens, you simply pipe them on to jambonz using the `tts:tokens` message type. If you get an indication that the caller barged in, via the `user_bargein` event, you tell the LLM to cancel processing. When you receive the next user transcript you re-engage the LLM, send a new completion request, and begin streaming text tokens again.
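A sketch of that flow is below. The jambonz method and event names are those proposed above; the OpenAI client is just one illustrative choice of LLM, and the hook wiring (`session.on('/transcript', ...)`) follows typical node-client-ws conventions:
```js
const OpenAI = require('openai');
const openai = new OpenAI();  // reads OPENAI_API_KEY from the environment

let llmStream;   // the in-flight completion stream, if any
let tokenId = 0; // unique id for each tts:tokens message

// a gather (foreground or background) with a nested streaming say
session
  .gather({
    say: {stream: true},
    input: ['speech'],
    actionHook: '/transcript',
  })
  .send();

// on each user transcript, send a completion request and pipe tokens to jambonz
session.on('/transcript', async (evt) => {
  llmStream = await openai.chat.completions.create({
    model: 'gpt-4o',  // illustrative model choice
    messages: [{role: 'user', content: evt.speech.alternatives[0].transcript}],
    stream: true,
  });
  try {
    for await (const chunk of llmStream) {
      const text = chunk.choices[0]?.delta?.content;
      if (text) session.sendTextTokens({id: ++tokenId, tokens: text});
    }
  } catch (err) {
    // the stream was aborted (e.g. on barge-in) or errored; stop sending
  }
});

// if the caller barges in, cancel the in-flight completion and flush
session.on('tts:user_bargein', () => {
  if (llmStream) llmStream.controller.abort();
  session.flushTextTokens();
});
```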