
improve the responsiveness of onecore voices and sapi voices #13284

Closed
king-dahmanus opened this issue Jan 27, 2022 · 37 comments · Fixed by #17592
Labels: blocked/needs-info, component/speech, component/speech-synth-drivers, enhancement, p4, performance, triaged
Milestone: 2025.1

Comments

@king-dahmanus

Is your feature request related to a problem? Please describe.

I'm always frustrated when SAPI voices and OneCore voices are slow and not responsive.

Describe the solution you'd like

The voices should be responsive enough that they can be mixed with voices for other languages without an undesirable lag, e.g. by using some hacks to expose OneCore voices through SAPI. Then they could be mixed, say a Latin-script voice with a non-Latin-script voice, for optimal reading of both languages. Currently it's unnecessarily slow and unresponsive, which I kindly suggest you fix.

Describe alternatives you've considered

Based on advice from a developer who has some experience with DSP: intercept the in-memory buffer that holds the audio, trim the silence at the beginning with a routine that analyses how much silence there is and trims it accordingly, then feed the result back to the audio device.

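A minimal sketch of that kind of trim, assuming 16-bit little-endian mono PCM and an arbitrary amplitude threshold (illustrative only, not NVDA code):

import struct

def trimLeadingSilence(pcm: bytes, threshold: int = 300) -> bytes:
    # Interpret the buffer as 16-bit little-endian mono samples.
    sampleCount = len(pcm) // 2
    samples = struct.unpack(f"<{sampleCount}h", pcm[:sampleCount * 2])
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            # Keep everything from the first audible sample onwards.
            return pcm[index * 2:]
    return b""  # the whole buffer was silence
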
Additional context

Nothing specific. Contact me if I can clarify some more. Please bear in mind that I'm not a programmer, I'm just a simple citizen. Thanks for your great help, NV Access! I'm sorry to say that I'm unable to support you monetarily. I wish for this project to keep helping blind people around the world like it always has.

@cary-rowen
Contributor

Yes, Windows SAPI5 is noticeably more responsive in some screen readers, e.g. ZDSR.

@king-dahmanus
Author

king-dahmanus commented Jan 28, 2022 via email

@mzanm
Contributor

mzanm commented Jan 28, 2022

I agree, SAPI5 and OneCore are somehow crazy fast in ZDSR.

@king-dahmanus
Author

king-dahmanus commented Jan 28, 2022 via email

@LeonarddeR
Collaborator

While I'm an eSpeak user and don't use OneCore very frequently, I find OneCore pretty responsive with NVDA. It would be helpful if findings about slow responsiveness were supported by measurable evidence.

@king-dahmanus
Author

king-dahmanus commented Jan 29, 2022 via email

@dpy013
Contributor

dpy013 commented Jan 31, 2022

This is an audio clip from AnyAudio; listen to it to get an idea of how well ZDSR supports the speed of the SAPI5 speech synthesizer.

@king-dahmanus
Author

king-dahmanus commented Jan 31, 2022 via email

@dpy013
Contributor

dpy013 commented Jan 31, 2022

the link is broken

http://anyaudio.net/listen?audio=TWu4HZNSSH0NTk

Thanks for the reminder; the above link has been re-edited.

@king-dahmanus
Author

king-dahmanus commented Jan 31, 2022 via email

@dpy013
Contributor

dpy013 commented Jan 31, 2022

Yeah, it doesn't take me there for some reason. No matter, let's concentrate on NVDA, because that's what we're working with, right?


yes

@king-dahmanus

This comment was marked as resolved.

@seanbudd added the enhancement, p5, triaged and p4 labels and removed the p5 label Jun 9, 2022
@Adriani90
Collaborator

I did some tests and found the following after some minutes of use:

  • eSpeak: <1 ms to 5 ms between key press and reporting, very rarely 8 ms or more
  • OneCore: 5 to 10 ms between key press and reporting, mostly 8 ms or more, rarely below 8 ms
  • SAPI5: at least 8 ms between key press and reporting, usually 10 to 15 ms, rarely less than 10 ms

Taking eSpeak as the reference, the expected behavior is to have all synths at the same performance level.

I tested with NVDA alpha-28179,345154a6 (2023.2.0.28179), WASAPI enabled, by using arrow keys in browse mode in Google Chrome 112, which is very responsive.
My 64-bit Asus ROG Strix machine has the following configuration:
Processor: 12th Gen Intel(R) Core(TM) i9-12900H, 2500 MHz, 14 cores, 20 logical threads
Installed physical RAM: 32.0 GB
Intel(R) Iris graphics: total capacity 16 GB, VRAM = 128 MB
NVIDIA GeForce RTX 3070 Ti graphics card: total capacity 24 GB, VRAM = 8 GB

As you can see, even on this machine there is a noticeable performance difference, so on low-end machines the performance gap between synths is likely to be much more obvious.

cc: @jcsteh, @michaelDCurran

@jcsteh
Contributor

jcsteh commented May 4, 2023

While it's possible there is some silence at the start of the audio buffer returned by these voices, it's also possible (I'd guess more likely) that these voices just take longer to synthesise speech. In that case, there's really nothing that can be done; the performance optimisation would need to happen in the voice itself.

For OneCore at least, if you already have a way to measure the time between key press and actual audio output, I'd suggest comparing with Narrator. That will give you an indication of whether this is something specific to NVDA or whether the voice itself is slow to respond.

@cary-rowen
Contributor

Narrator's performance is worse than NVDA's.
I recommend comparing NVDA with ZDSR; ZDSR's response speed is significantly better than NVDA's, even when both use SAPI5.

@jcsteh
Contributor

jcsteh commented May 4, 2023

Is that true for OneCore with ZDSR even with the latest responsiveness and WASAPI changes in alpha?

SAPI5 is a different case, as NVDA uses SAPI5's own audio output rather than NVDA's audio output. It's possible that switching to nvwave + WASAPI for SAPI5 might improve responsiveness, but I'm not sure.

@seanbudd
Member

Are there any responsiveness issues remaining now that NVDA uses WASAPI?

@jcsteh
Contributor

jcsteh commented Nov 22, 2023

Note that NVDA still doesn't use nvwave for SAPI5, so there won't be a change for SAPI5 now in terms of audio. However, the other responsiveness changes in the last few months might have some impact.

@cary-rowen
Contributor

Frankly, there are no noticeable changes.
I do think there's a lot of room for improvement in NVDA's responsiveness.

@jcsteh
Contributor

jcsteh commented Nov 22, 2023

Given that there has been at least a measurable 10 to 30 ms improvement in responsiveness in NVDA in the last few months, not accounting for WASAPI, the fact that you're seeing "no noticeable changes" would suggest you're seeing a delay which is significantly larger than 30 ms with OneCore. That certainly doesn't match my experience, nor does it match #13284 (comment). That further suggests that there is a significant difference on your system as compared to mine and others.

As it stands, this issue isn't actionable. To get any further here, we're going to need precise information about which OneCore voice you're using, the rate it's configured at, probably audio recordings demonstrating the performance issue you're seeing, etc.

@seanbudd added the blocked/needs-info label Nov 22, 2023
@beqabeqa473
Contributor

Hello. I can confirm that SAPI5 in NVDA is not as performant as in other places, and yes, this is because SAPI5 outputs the sound itself. I am sure this will improve if SAPI5 audio goes through NVDA itself.

@shenguangrong

Regarding performance improvements for the SAPI5 speech synthesizer, I've attempted a solution to obtain the audio data directly:

  1. Create the necessary SAPI objects via the COM interface:
    • Create an SpVoice object for speech synthesis
    • Create an SpMemoryStream object to capture the audio stream
    • Create an SpAudioFormat object to control the audio format
  2. The core approach is to redirect the TTS output to memory:
    • Configure the SpAudioFormat audio parameters
    • Set the SpMemoryStream as SpVoice's output destination
    • Obtain the raw audio data directly from the memory stream

It's important to note that this method retrieves the entire audio data at once, rather than streaming it. This presents several challenges:
    • Appropriate text segmentation needs to be considered
    • Strategies for segmented synthesis and playback may need to be implemented
    • Further research is required for optimization

This is just an initial implementation approach, and more in-depth research and improvements will be needed; a minimal sketch of the idea is shown below.
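
A minimal sketch of that approach using comtypes, assuming only the standard SAPI automation ProgIDs (SAPI.SpVoice, SAPI.SpMemoryStream, SAPI.SpAudioFormat); this is illustrative, not NVDA driver code:

import comtypes.client

# Create the SAPI objects described above via their standard ProgIDs.
voice = comtypes.client.CreateObject("SAPI.SpVoice")
stream = comtypes.client.CreateObject("SAPI.SpMemoryStream")
fmt = comtypes.client.CreateObject("SAPI.SpAudioFormat")

# SpeechAudioFormatType value 22 corresponds to 22 kHz, 16 bit, mono.
fmt.Type = 22
stream.Format = fmt

# Redirect synthesis into memory instead of the default audio device.
voice.AudioOutputStream = stream

# Synthesize synchronously; the audio for the whole utterance ends up in the stream.
voice.Speak("Hello world")

# Retrieve the raw audio data (a byte array), which could then be trimmed
# and fed to a player.
pcmData = stream.GetData()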

@cary-rowen
Contributor

Hi @jcsteh
Regarding the performance of different speech synthesizers, @gexgd0419 has conducted detailed tests in the comment linked below, which unfortunately is written in Chinese. @gexgd0419 might be able to share the test methods or code if needed.
gexgd0419/NaturalVoiceSAPIAdapter#1 (comment)

@gexgd0419
Contributor

This project might help you measure the latency during each step when using an SAPI5 voice.

The included TestTTSEngine can create voices that forward data to your installed SAPI5 voices and trim the leading silence part before outputting the audio. You can check how much this can improve the responsiveness.

If you use the TestTTSClient.exe, then you can see the log generated during speaking, and check the latency of each step.

The code I used to test the delay between keypress and audio output is not included yet. But I plan to include it later.

@gexgd0419
Contributor

gexgd0419 commented Dec 31, 2024

In the documentation of ISpAudio, Microsoft says:

In order to prevent multiple TTS voices or engines from speaking simultaneously, SAPI serializes output to objects which implement the ISpAudio interface. To disable serialization of outputs to an ISpAudio object, place an attribute called "NoSerializeAccess" in the Attributes folder of its object token.

You can notice the "serialization" performed by SAPI if you open two TTS clients and make them speak at the same time: only one of them can speak, and the other has to wait. Cross-process serialization might increase the delay.

Below are what I found using that test program on my system.

I found that if you let SAPI output audio to a memory stream, the "serialization" is bypassed, and the delay between when the client calls Speak and when it receives the first chunk of audio data is usually less than 10 ms. But if you let SAPI output audio to the default device, the delay can increase to about 50~100 ms.

As for the leading silence duration, if you are using one of the built-in voices, at normal rate it's about 100 ms, and at the maximum rate it decreases to about 30~50 ms.

For example, this is a log I got when outputting to the default device.

Log (output to default device)
  Total/ms  Delta/ms Step
     31.27     31.27 Engine output format set, target: (null), final: PCM 16 kHz 16 bits Mono
     31.70      0.44 Client Speak start
     46.49     14.79 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
    857.38    810.89 Client StartStream event
    857.39      0.01 Engine Speak start
    860.38      2.99 Engine audio data written, 20.00 ms / 640 bytes silence skipped
    860.96      0.57 Engine Speak end
  1,273.37    412.41 Client EndStream event
  1,290.59     17.22 Client Speak end
  1,290.59      0.00 Client Speak start
  1,305.11     14.52 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
  1,416.71    111.60 Client StartStream event
  1,416.72      0.01 Engine Speak start
  1,419.98      3.26 Engine audio data written, 31.62 ms / 1012 bytes silence skipped
  1,420.41      0.44 Engine Speak end
  1,773.15    352.74 Client EndStream event
  1,790.41     17.26 Client Speak end
  1,790.41      0.00 Client Speak start
  1,804.96     14.55 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
  1,917.89    112.93 Client StartStream event
  1,917.90      0.01 Engine Speak start
  1,921.22      3.32 Engine audio data written, 31.69 ms / 1014 bytes silence skipped
  1,921.72      0.50 Engine Speak end
  2,325.35    403.63 Client EndStream event
  2,343.40     18.05 Client Speak end
  2,343.41      0.01 Client Speak start
  2,358.35     14.94 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
  2,473.57    115.22 Client StartStream event
  2,473.58      0.01 Engine Speak start
  2,476.68      3.10 Engine audio data written, 31.69 ms / 1014 bytes silence skipped
  2,477.20      0.52 Engine Speak end
  2,900.30    423.10 Client EndStream event
  2,917.03     16.73 Client Speak end
  2,917.04      0.00 Client Speak start
  2,932.07     15.03 Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
  3,045.02    112.95 Client StartStream event
  3,045.02      0.01 Engine Speak start
  3,048.18      3.16 Engine audio data written, 31.44 ms / 1006 bytes silence skipped
  3,048.72      0.53 Engine Speak end
  3,482.50    433.78 Client EndStream event
  3,500.74     18.25 Client Speak end

You can see that there's more than 100 ms delay before each StartStream event, when the audio output hasn't even begun. The TTS engine starts synthesizing the voice after Engine Speak start happens.

If you output to a memory stream, the extra delay will be gone.

Log (output to memory stream)
  Total/ms  Delta/ms Step
      0.50      0.50 Engine output format set, target: (null), final: PCM 16 kHz 16 bits Mono
      1.11      0.61 Client Speak start
      1.13      0.02 Engine output format set, target: PCM 48 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
      1.14      0.01 Client StartStream event
      1.15      0.01 Engine Speak start
      4.08      2.93 Client audio data received
      4.09      0.00 Engine audio data written, 20.00 ms / 640 bytes silence skipped
      5.46      1.37 Engine Speak end
      5.46      0.00 Client EndStream event
      5.47      0.01 Client Speak end
      5.49      0.03 Client Speak start
      5.51      0.01 Engine output format set, target: PCM 48 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
      5.51      0.00 Client StartStream event
      5.52      0.01 Engine Speak start
      8.34      2.82 Client audio data received
      8.35      0.00 Engine audio data written, 31.62 ms / 1012 bytes silence skipped
      9.44      1.10 Engine Speak end
      9.45      0.00 Client EndStream event
      9.45      0.01 Client Speak end
      9.47      0.02 Client Speak start
      9.49      0.01 Engine output format set, target: PCM 48 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
      9.49      0.00 Client StartStream event
      9.50      0.01 Engine Speak start
     12.67      3.17 Client audio data received
     12.68      0.01 Engine audio data written, 31.69 ms / 1014 bytes silence skipped
     13.88      1.20 Engine Speak end
     13.88      0.00 Client EndStream event
     13.89      0.01 Client Speak end
     13.91      0.03 Client Speak start
     13.93      0.02 Engine output format set, target: PCM 48 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
     13.94      0.01 Client StartStream event
     13.95      0.01 Engine Speak start
     17.03      3.09 Client audio data received
     17.04      0.01 Engine audio data written, 31.69 ms / 1014 bytes silence skipped
     18.32      1.28 Engine Speak end
     18.32      0.00 Client EndStream event
     18.33      0.01 Client Speak end
     18.35      0.03 Client Speak start
     18.37      0.02 Engine output format set, target: PCM 48 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
     18.38      0.01 Client StartStream event
     18.38      0.01 Engine Speak start
     21.28      2.90 Client audio data received
     21.29      0.00 Engine audio data written, 31.44 ms / 1006 bytes silence skipped
     22.65      1.36 Engine Speak end
     22.65      0.00 Client EndStream event
     22.66      0.01 Client Speak end

And that was fast.

But yes, the audio has to be output to an audio device in order to be heard, and the output process introduces more delay, so the final delay won't be that good. We can only hope that WASAPI introduces less delay than the WinMM output which SAPI5 uses internally.

EDIT: Tried outputting to the default device again after my computer fan started spinning, and the delay became smaller! So this can be affected by many things, including the active power plan and the resource usage of other applications. But there was still about 80 ms delay.

@gexgd0419
Contributor

To make SAPI 5 voices able to use NVDA's own wave player (which uses WASAPI), we can try the following steps.

First, write a class implementing the COM interface IStream; it will receive the audio data, which can be processed here and then fed to the player.

import config  # NVDA's configuration module, for the configured output device
import nvwave  # NVDA's audio output module, providing WavePlayer
from comtypes import COMObject
from objidl import IStream

class AudioStream(COMObject):
    _com_interfaces_ = [IStream]

    def __init__(self, fmt):
        self._writtenBytes = 0
        wfx = fmt.GetWaveFormatEx()  # SpWaveFormatEx
        self._player = nvwave.WavePlayer(
            channels=wfx.Channels,
            samplesPerSec=wfx.SamplesPerSec,
            bitsPerSample=wfx.BitsPerSample,
            outputDevice=config.conf["speech"]["outputDevice"],
        )

    def ISequentialStream_RemoteWrite(self, this, pv, cb, pcbWritten):
        # audio processing...
        self._player.feed(pv, cb)
        self._writtenBytes += cb
        if pcbWritten:
            pcbWritten[0] = cb
        return 0

    def IStream_RemoteSeek(self, this, dlibMove, dwOrigin, plibNewPosition):
        if dwOrigin == 1 and dlibMove.QuadPart == 0:
            # SAPI is querying the current position.
            if plibNewPosition:
                plibNewPosition[0].QuadPart = self._writtenBytes
                return 0
        return 0x80004001  # E_NOTIMPL is returned in other cases

Other methods of IStream can be left unimplemented.

Then, when initializing the SpVoice object, create a SAPI.SpCustomStream object to wrap your IStream implementation and the wave format for the stream.

# ... After setting the voice:
self.tts.AudioOutput = self.tts.AudioOutput  # Reset the audio and its format parameters
fmt = self.tts.AudioOutputStream.Format
stream = comtypes.client.CreateObject("SAPI.SpCustomStream")  # might be different for MSSP voices
stream.BaseStream = AudioStream(fmt)  # set the IStream being wrapped
stream.Format = fmt
self.tts.AudioOutputStream = stream  # Set the stream (wrapper) as the output target

Now you will be able to hear the voices. Not everything is processed properly in the code above, but I hope that you can get the idea.

One of the problems is that continuous reading will be broken, because the Bookmark events become out of sync with the audio stream. We will need to synchronize them ourselves.

@gexgd0419
Contributor

gexgd0419 commented Jan 1, 2025

Now this latency tester project supports measuring the delay between keypress and audio output, so I did some tests.

Used version:
NVDA: Run from source at current master branch
Narrator: on Win 11 23H2
ZDSR (ZhengDu Screen Reader): Public Welfare version (公益版)

Modifications:
Trimmed: Used the "forwarded" voice created by TestTTSEngine, so the leading silence is removed.
WASAPI: Used my modified version of NVDA, which sends the audio data via NVDA's WavePlayer.

Voice: Microsoft Huihui (Chinese, Simplified)

Results:

Client Voice Delay
NVDA eSpeak NG 73ms
NVDA Huihui OneCore 97ms
NVDA Huihui SAPI5 176ms
NVDA Huihui SAPI5 trimmed 139ms
NVDA Huihui SAPI5 WASAPI 114ms
NVDA Huihui SAPI5 WASAPI trimmed 77ms
Narrator Huihui OneCore 76ms
Narrator Huihui SAPI5 133ms
Narrator Huihui SAPI5 trimmed 118ms
ZDSR 1.5.8.2 Huihui SAPI5 131ms
ZDSR 1.5.8.2 Huihui SAPI5 trimmed 94ms
ZDSR 1.7.0.0 Huihui OneCore 55ms
ZDSR 1.7.0.0 Huihui SAPI5 70ms
ZDSR 1.7.0.0 Huihui SAPI5 trimmed 57ms

@cary-rowen
Contributor

Cool, it looks like @gexgd0419 has made some real progress on this and has provided test results.

So far I'd be interested to hear what NV Access has to say about this, or any pointers on the way forward.

cc @gerald-hartig @seanbudd

Also, @jcsteh's comments are valuable; could you speak to them?

I'm excited about the improved responsiveness.

@gexgd0419
Contributor

For example, here's the NVDA log from when I pressed the S key; the detected audio latency was 118.11 ms, with the original SAPI5 implementation but with the leading silence trimmed.

IO - inputCore.InputManager.executeGesture (09:27:28.958) - winInputHook (30300):
Input: kb(desktop):s
IO - speech.speech.speak (09:27:28.998) - MainThread (30216):
Speaking [CharacterModeCommand(True), LangChangeCommand ('zh_CN'), 's', EndUtteranceCommand()]
DEBUG - synthDrivers.sapi5.SapiSink.EndStream (09:27:29.509) - MainThread (30216):
TestTTSEngine logged items:
0.00	NVDA Speak preparing
0.04	NVDA Speak start
14.17	Engine output format set, target: PCM 16 kHz 16 bits Mono, final: PCM 16 kHz 16 bits Mono
41.59	Engine Speak start
45.99	Engine audio data written, 39.38 ms / 1260 bytes silence skipped
46.72	Engine Speak end
49.67	NVDA StartStream
506.67	NVDA EndStream

From the timestamps in the log, we can get the following timeline:

Total Delta Step
0ms - Received keyboard input
40ms 40ms Issued Speak command
44ms 4ms synth.speak() called
86ms 42ms Engine Speak start
90ms 4ms Engine audio data written
94ms 4ms NVDA StartStream
118ms 24ms Audio detected
551ms 433ms NVDA EndStream

There's a 40 ms delay between receiving the keyboard input and issuing the Speak command, and a 20~30 ms delay between writing the audio data and outputting the audio. So the minimum possible delay of NVDA on my system would be about 70 ms, which could be achieved using the eSpeak NG voice, or the Huihui SAPI5 voice via WASAPI with leading silence trimmed.

@jcsteh
Contributor

jcsteh commented Jan 3, 2025

There's 40ms delay between receiving the keyboard input and issuing the Speak command

This is a little tangential, but 40 ms is unexpectedly high there. I would expect something more like 20 ms or less, though it may be worse if you're running on battery.

Also, this raises another problem: handling typed characters seems to be pretty slow. If I do this in input help, I see 2 ms or less there. If I do this using the left or right arrow keys in the Run dialog edit field, I get 10 ms or less. This is a result of the optimisation work I did in #14928 and #14708. However, speaking typed characters doesn't appear to benefit from this. This might be improved if we tweak eventHandler so that it always uses an immediate pump for typedCharacter events, just like we do for gainFocus events.

@gexgd0419
Contributor

I tried to implement WASAPI on SAPI5 (and maybe SAPI4) further, but I think I need some help.

The problem is how I can synchronize the bookmark events with the audio stream.

IStream receives "streamed" audio data in chunks, rather than waiting for synthesis to complete and receiving all audio data. This approach might reduce startup delay, but here's a problem. Let's say NVDA needs to speak two sentences, A and B, with a bookmark in between. SAPI framework will stream in the audio data for A first, then the bookmark event, then audio data for B. This seems normal, until I found that WavePlayer.feed calls the callback after this audio chunk is played. But the program cannot know that the bookmark exists before it receives the audio for A! So is there a way to tell WavePlayer that I want to insert a callback to be called after the last fed chunk, but without feeding actual audio data?

Even worse, there's no guarantee that the bookmark event will happen right between audio for A and B, because audio and events are sent in different threads. But maybe this can be fixed by using ISpEventSource directly to get the events rather than using the automation compatible event interface.

Or maybe there's another way.

As the current implementation of OneCore voices is already using WavePlayer (and WASAPI), I checked the code and it seemed that all wave data are retrieved at once, instead of being "streamed" in chunks.

Related OneCore speech C++ code
winrt::fire_and_forget
speak(
    void* originToken,
    winrt::hstring text,
    std::shared_ptr<winrtSynth> synth,
    std::function<ocSpeech_CallbackT> cb
) {
    try {
        co_await winrt::resume_background();

        SpeechSynthesisStream speechStream{ nullptr };
        try {
            // Wait for the stream to complete
            speechStream = co_await synth->SynthesizeSsmlToStreamAsync(text);
        }
        catch (winrt::hresult_error const& e) {
            LOG_ERROR(L"Error " << e.code() << L": " << e.message().c_str());
            protectedCallback_(originToken, std::optional<SpeakResult>(), cb);
            co_return;
        }
        const std::uint32_t size = static_cast<std::uint32_t>(speechStream.Size());
        std::optional<SpeakResult> result(SpeakResult{
            Buffer(size),
            createMarkersString_(speechStream.Markers())  // send all markers (bookmarks) in a string
            }
        );
        try {
            // Read all data and send it to callback function in one go
            co_await speechStream.ReadAsync(result->buffer, size, InputStreamOptions::None);
            protectedCallback_(originToken, result, cb);
            co_return;
        }
        catch (winrt::hresult_error const& e) {
            LOG_ERROR(L"Error " << e.code() << L": " << e.message().c_str());
            protectedCallback_(originToken, std::optional<SpeakResult>(), cb);
            co_return;
        }
    }
    // ... catch blocks ...
}

Although asynchronous, the audio and all the markers for this entire utterance will be ready when the callback function is called.

This is more like @shenguangrong 's approach above, which uses SpMemoryStream to store all the audio data first. This might add some delay, but if utterances are short, the extra delay can be very short, only a few milliseconds. You can check the log in this comment when outputting to a memory stream. Engine Speak end (or all audio written) happened only 1~2 ms after the first audio chunk was written.

If the delay of OneCore voices is acceptable, then this approach is also feasible.

@jcsteh
Contributor

jcsteh commented Jan 4, 2025

Let's say NVDA needs to speak two sentences, A and B, with a bookmark in between. SAPI framework will stream in the audio data for A first, then the bookmark event, then audio data for B. This seems normal, until I found that WavePlayer.feed calls the callback after this audio chunk is played. But the program cannot know that the bookmark exists before it receives the audio for A! So is there a way to tell WavePlayer that I want to insert a callback to be called after the last fed chunk, but without feeding actual audio data?

Not currently, though it might be possible to add it. However, you should be able to manufacture this already. One way would be to have a dict which maps from chunk id to bookmark id. Chunk id could be a simple counter which you increment for every chunk you feed or it could be something you easily get from SAPI; e.g. a stream position. After you call feed, keep track of the last chunk id in an instance variable. When you get the bookmark event, set map[lastChunkId] = bookmarkId.
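
A rough sketch of that bookkeeping, purely for illustration (the class and method names here are hypothetical, not NVDA's actual driver code):

class BookmarkTracker:
    def __init__(self):
        self._chunkToBookmark = {}  # maps chunk id -> bookmark id
        self._lastChunkId = 0

    def chunkFed(self):
        # Call right after each WavePlayer.feed; returns the chunk id so the
        # feed's onDone callback can report it back once the chunk has played.
        self._lastChunkId += 1
        return self._lastChunkId

    def bookmarkReceived(self, bookmarkId):
        # The bookmark belongs right after the most recently fed chunk.
        self._chunkToBookmark[self._lastChunkId] = bookmarkId

    def chunkDone(self, chunkId, notify):
        # Called from the player's onDone callback once that chunk has played.
        bookmarkId = self._chunkToBookmark.pop(chunkId, None)
        if bookmarkId is not None:
            notify(bookmarkId)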

Even worse, there's no guarantee that the bookmark event will happen right between audio for A and B, because audio and events are sent in different threads.

Yeah, this does seem like a source of intermittent timing problems.

As the current implementation of OneCore voices is already using WavePlayer (and WASAPI), I checked the code and it seemed that all wave data are retrieved at once, instead of being "streamed" in chunks.

That's correct. OneCore doesn't provide a streaming interface, unfortunately.

If the delay of OneCore voices is acceptable, then this approach is also feasible.

I don't think it is. We just don't have another choice. It causes unnecessary latency. Segmenting the text could help that, but it's not a true fix, just a workaround. This should always be a last resort and would IMO be an unacceptable regression.

@jcsteh
Contributor

jcsteh commented Jan 4, 2025

So is there a way to tell WavePlayer that I want to insert a callback to be called after the last fed chunk, but without feeding actual audio data?

Actually, you should be able to do this: player.feed(None, size=0, onDone=someCallback)
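
For instance, the driver's bookmark handling could look roughly like this (a sketch assuming the feed(None, size=0, onDone=...) call above; onBookmark and _notifyBookmarkReached are hypothetical names):

def onBookmark(self, bookmarkId):
    # Queue an empty chunk so the callback fires only once everything fed so
    # far has actually been played, without pushing any new audio data.
    self._player.feed(
        None,
        size=0,
        onDone=lambda: self._notifyBookmarkReached(bookmarkId),
    )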

@gexgd0419
Contributor

I opened a pull request #17592 as my attempt to fix this.

Here are the build artifact files, which include an installer exe for installing this alpha version. Does this improve the responsiveness of SAPI5 voices, or does it introduce new bugs?

Also, I need a way to test the audio ducking feature, but audio ducking requires the UIAccess privilege, which requires the program to be installed and signed. How can I test audio ducking with an alpha build that isn't signed?

@cary-rowen
Contributor

Hi @gexgd0419
Great work!

Glad to see this PR; I will test it later.
You can see the doc for creating a self-signed build here.

@github-actions github-actions bot added this to the 2025.1 milestone Jan 10, 2025
@jcsteh
Contributor

jcsteh commented Jan 10, 2025

Note that this issue is described as covering both SAPI5 and OneCore, but I don't think #17592 does anything regarding OneCore.

@gexgd0419
Contributor

The author said:

I'm focusing here on SAPI5. I mentioned OneCore because I used a program called SAPI Unifier to port the OneCore voices into SAPI5.

OneCore voices are already using WASAPI, so their responsiveness cannot be improved using the same method.
