An increasingly common use case is streaming text tokens from an LLM and having jambonz play them out in realtime, or as close to realtime as possible. Currently, jambonz supports text-to-speech using either the `say` verb or the bidirectional `listen` verb, and neither quite meets this need.
The "say" verb is synchronous, in the sense that you have to feed it a full, standalone piece of text and while the "listen" verb handles bidirectional audio it can be somewhat complicated to synchronize that playout stream with the other verbs that your app might be sending over the application webhook/websocket connection.
This article contains a proposal for supporting TTS streaming from LLMs over the application websocket, making it easier and more natural to implement in jambonz applications.
Note: Use of the websocket api is required to access this feature; given the asynchronous nature of the use case, it is not feasible to support it over webhooks.
Proposed changes
We will break down the changes as follows:
- Changes to the `say` verb
- Changes to the websocket api
We will then also show proposed changes to the node-client-ws library to support this, and examine what a sample application written to take advantage of tts streaming will look like.
Proposed changes to the "say" verb
We propose to make very simple changes to the "say" verb:
- Instead of providing the `text` property, you can alternatively provide a `stream` property with value `true`.
- Additionally, when doing so you can optionally provide a `closeOnStreamEmpty` property which, when set to `true`, will cause the `say` verb to end once all streamed tokens have been played out and the token input buffer is empty. The default value of this property shall be `true`.
For example, in a `gather` with a nested `say` you could simply do:
```js
session
  .gather({
    say: {stream: true},
    input: ['speech'],
    actionHook: '/echo',
    timeout: 15,
  })
```
The `gather` can be running either in the foreground or in the background.
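Presumably a standalone streaming `say` would work the same way; a minimal sketch, assuming the proposed `stream` and `closeOnStreamEmpty` properties:
```js
// a streaming say outside of a gather; setting closeOnStreamEmpty to false
// keeps the verb alive even after the token buffer drains
session
  .say({stream: true, closeOnStreamEmpty: false})
```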
When a `say` with streaming enabled is used, as in the example above, the text to be spoken is provided by the application as a series of commands over the websocket. The text tokens provided are then synthesized using the configured TTS vendor.
Note: Use of this feature requires selection of a TTS vendor that supports streaming. The initial vendors we intend to support and test with include Deepgram and PlayHT.
The behavior of the `say` verb when using the `stream` property will generally be unchanged, other than the fact that tts streaming is used. In particular, your app can still expect that:
- If the user barges in by speaking, the audio will be killed.
- Your app can receive a `verb:status` message over the websocket with "speech-bargein-detected" when the user barges in.
- The `say` verb will end when the audio completes playing out.
- If the `say` verb is killed, the audio will be stopped.
We do not see any other changes needed to the jambonz verbs to support this feature, and we see value in keeping the changes this simple for application developers.
Proposed changes to the websocket api
The changes to the websocket api involve some new messages sent by jambonz to the application and some new messages sent by the application to jambonz.
jambonz -> application
One new message type will be sent from jambonz to the application: `tts:streaming-event`.
| message type | sent by | usage |
|---|---|---|
| `tts:streaming-event` | jambonz | will include an `event_type` property as described below |
The message will include one of the following `event_type` values:
- `stream_open`: any text tokens sent from this time on will be immediately processed and played out
- `stream_closed`: any text tokens sent from this time on will be queued for processing once the stream is open again
- `stream_paused`: if the application has provided too many tokens, jambonz will ask the application to throttle by sending this event
- `stream_resumed`: if the stream has been paused per the above, this event will be sent once enough queued tokens have been processed, indicating that the application may again send tokens
- `stream_error`: reports a processing error of some kind to the application
- `audio_playback_start`: the first byte of audio received from the TTS vendor has been played out. This event will follow the `stream_open` event and will be sent only once after that preceding event, to notify the application that the user is now hearing the audio
- `audio_playback_done`: while the stream is open, this event will be sent once all received and queued text tokens have been played out
- `user_bargein`: the user has barged in; audio playback has been stopped and any queued audio has been flushed
The meaning of `stream_open` and `stream_closed` deserves a bit more explanation.
When an application first starts, it may begin sending text tokens at any time. However, if jambonz is not currently executing a `say` verb those tokens will be queued. When jambonz reaches a `say` verb using the `stream` property, a `stream_open` event will be sent to the application (and any queued text tokens will be processed). When the `say` verb ends, a `stream_closed` event will be sent, indicating that any text tokens received from that point will be queued until another streaming `say` is executed.
When jambonz sends an event_type of `stream_open` it will include a property indicating the number of queued words, if any, that it is now processing.
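For illustration, a `stream_open` event might look like the following; the payload shape is sketched here, and in particular `num_queued_words` is a hypothetical name for the queued-word count property, which this proposal does not name:
```js
{
  type: "tts:streaming-event",
  event_type: "stream_open",
  num_queued_words: 12  // hypothetical property name; the proposal does not specify it
}
```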
This notion of "stream" refers conceptually to the stream between the application and jambonz, it does not mean for instance that we will be connecting and disconnecting from the TTS provider during a call session. Rather, when jambonz first executes a say with stream it shall connect to the TTS provider and it maintain that connection to the TTS provider for the remainder of the call.
application -> jambonz
The websocket server application may send the following new messages to jambonz:
| message type | sent by | usage |
|---|---|---|
| `tts:tokens` | application | text tokens that should be played out; see below for additional properties |
| `tts:flush` | application | a command telling jambonz to kill the audio as well as any queued tokens |
The application can send text tokens to jambonz at any time using the new `tts:tokens` command. The payload shall include a `tokens` property containing the text to stream.
Additionally, the payload must include an `id` property that uniquely identifies this particular set of tokens. The id may be a number or a string. This id will be returned to the application in the `last_processed_id` property of a `stream_paused` or `stream_resumed` event sent by jambonz, so that the application is able to synchronize the pause/resume of a token stream when necessary.
```js
{
  type: "tts:tokens",
  id: 100,
  tokens: "It was the year 1500, an important time for Portugal,"
}
```
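Should jambonz later pause the stream, the event would echo that id back; a sketch, assuming the event carries the same `tts:streaming-event` shape described above:
```js
{
  type: "tts:streaming-event",
  event_type: "stream_paused",
  last_processed_id: 100  // the id of the most recently processed tts:tokens message
}
```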
Note that you cannot specify a TTS vendor, language, or voice in the `tts:tokens` command. That is still done in the same way as before: using either application defaults or overriding with `say.synthesizer`.
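The `tts:flush` command is described above with no additional properties, so presumably its payload is simply:
```js
{
  type: "tts:flush"
}
```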
Proposed changes to the npm client
The session class will have the following new methods:
- `sendTextTokens`: used to send a `tts:tokens` message. This can be called at any time to asynchronously send text tokens for tts streaming.
- `flushTextTokens`: used to send a `tts:flush` message; this can also be called asynchronously.
and the session class will emit the following new events:
- `tts:stream_open`
- `tts:stream_closed`
- `tts:stream_paused`
- `tts:stream_resumed`
- `tts:stream_error`
- `tts:audio_playback_start`
- `tts:audio_playback_done`
- `tts:user_bargein`
These events will have the meaning described above.
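Taken together, an application might wire these up roughly as follows; this is a sketch, and the argument shapes of `sendTextTokens` and `flushTextTokens` are assumptions, since the proposal does not pin them down:
```js
// assumed signatures: the proposal does not specify how id/tokens are passed
session.on('tts:stream_open', () => {
  // the streaming say is active; tokens sent now are played out immediately
  session.sendTextTokens({id: 1, tokens: 'Hello! How can I help you today? '});
});
session.on('tts:stream_paused', (evt) => {
  // jambonz is asking us to throttle; stop sending until tts:stream_resumed.
  // evt.last_processed_id identifies the last set of tokens jambonz processed.
});
session.on('tts:user_bargein', () => {
  // playback was stopped and queued audio flushed; drop any queued tokens too
  session.flushTextTokens();
});
```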
Example use cases
A good way to implement LLM integration is to have an app that runs a background `gather` with a nested `say` using the `stream` property. As the LLM streams text tokens, you simply pipe them on to jambonz using the `tts:tokens` message type. If you get an indication that the caller barged in, via the `user_bargein` event, you tell the LLM to cancel processing. When you receive the next user transcript you re-engage the LLM, send a new completion request, and begin streaming text tokens again.
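A sketch of that flow is below. The jambonz method and event names are those proposed above; the OpenAI client is just one illustrative choice of LLM, and the hook wiring (`session.on('/transcript', ...)`) follows typical node-client-ws conventions:
```js
const OpenAI = require('openai');
const openai = new OpenAI();  // reads OPENAI_API_KEY from the environment

let llmStream;   // the in-flight completion stream, if any
let tokenId = 0; // unique id for each tts:tokens message

// a gather (foreground or background) with a nested streaming say
session
  .gather({
    say: {stream: true},
    input: ['speech'],
    actionHook: '/transcript',
  })
  .send();

// on each user transcript, send a completion request and pipe tokens to jambonz
session.on('/transcript', async (evt) => {
  llmStream = await openai.chat.completions.create({
    model: 'gpt-4o',  // illustrative model choice
    messages: [{role: 'user', content: evt.speech.alternatives[0].transcript}],
    stream: true,
  });
  try {
    for await (const chunk of llmStream) {
      const text = chunk.choices[0]?.delta?.content;
      if (text) session.sendTextTokens({id: ++tokenId, tokens: text});
    }
  } catch (err) {
    // the stream was aborted (e.g. on barge-in) or errored; stop sending
  }
});

// if the caller barges in, cancel the in-flight completion and flush
session.on('tts:user_bargein', () => {
  if (llmStream) llmStream.controller.abort();
  session.flushTextTokens();
});
```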