Enhance TTS sentence boundary detection for Arabic and Japanese (#1464)

Update sentenceEndRegex to treat the following as sentence boundaries: ASCII .!? followed by whitespace or end-of-text; Arabic question mark (؟) and full stop (۔) with the same rule; Japanese 。, !, ? treated as boundaries regardless of following character; and double newlines (\n\n). This improves streaming chunking for mixed-language content.
This commit is contained in:
Vinod Dharashive
2025-12-08 21:14:20 +05:30
committed by GitHub
parent 7802822773
commit 1ad0261336

View File

@@ -437,7 +437,15 @@ class TtsStreamingBuffer extends Emitter {
const findSentenceBoundary = (text, limit) => {
// Look for punctuation or double newline that signals sentence end.
const sentenceEndRegex = /[.!?](?=\s|$)|\n\n/g;
// Includes:
// - ASCII: . ! ?
// - Arabic: ؟ (question mark), ۔ (full stop)
// - Japanese: 。 (full stop), , (full-width exclamation/question)
//
// For languages that use spaces between sentences, we still require
// whitespace or end-of-string after the mark. For Japanese (no spaces),
// we treat the punctuation itself as a boundary regardless of following char.
const sentenceEndRegex = /[.!?؟۔](?=\s|$)|[。!?]|\n\n/g;
let lastSentenceBoundary = -1;
let match;
while ((match = sentenceEndRegex.exec(text)) && match.index < limit) {