Custom Voice
For customers who wish to bring their own (BYO) voice to UneeQ, integration is supported via a simple voice orchestration service, similar to the conversation integration, which can be achieved via Synapse or your own custom solution. As with the conversation integration, a customer-hosted endpoint should be created that allows external requests from the UneeQ platform. UneeQ will POST the following JSON payload to your BYO service:

```json
{
  "apiKey": "<API KEY>",
  "preset": "<PRESET/VOICE>",
  "text": "<TEXT TO SPEAK>"
}
```

Returning Audio

The BYO integration API expects single-channel raw (no header) PCM audio at 16kHz. A 2xx status code will be treated as a success, and any resultant response body will be treated as audio: it will be forwarded to the avatar for rendering and played to the user. Samples must be returned as 16-bit linearly encoded signed integers with little-endian byte ordering. It is suggested that content is returned as type application/octet-stream.

To summarise, the API returns:

- 200 OK
- 16kHz audio
- Mono
- Samples as little-endian 16-bit signed integers
- Raw PCM, i.e. no WAV header or any encoding or compression
- application/octet-stream

Returning Feedback

Errors may be returned by setting a non-2xx status code. Those error codes will be counted but, at the time of writing, the error body is not captured. Responses with non-2xx status codes are not played out to the user.

Sample Implementation

The listing below implements a BYO TTS application that calls out to the Google Cloud TTS API.

If using the Python sample below, you'll need to have google-cloud-texttospeech installed. See Google's documentation (https://cloud.google.com/text-to-speech/docs/quickstart-protocol) for how to obtain Google application credentials. Note: the first line in the sample uses pip to install the required dependencies.

If using the Node.js sample below, you'll need a valid subscription to the Microsoft Azure Cognitive Services API, and values for the other environment variables you will find in the code. Note that there are two separate files in the Node.js code sample: you should create a new Express.js app, define the routes using the orchestration handler sample, and define the interface to Microsoft's Cognitive Speech service using the Microsoft services handler.

We expect that if you are reading this documentation, you are comfortable creating and hosting custom services. If you require assistance, please contact us at help@uneeq.com.

```python
# >$ pip3 install google-cloud-texttospeech
#!/usr/local/bin/python3.6
import http.server
import socketserver
import json
import os
from google.cloud import texttospeech

# This file must exist if you are using Google's Text-to-Speech service
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./googleAuth.json"


class Handler(http.server.SimpleHTTPRequestHandler):

    def do_GET(self):
        # This is a health API, and doesn't render text to speech
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write("{\"status\": \"ok\"}".encode('utf-8'))
        return

    def do_POST(self):
        # The actual TTS render API
        content_len = int(self.headers.get('Content-Length'))
        post_body = self.rfile.read(content_len).decode('utf-8')
        # Body contains apiKey and preset, but we are not using them here
        print('Body: ' + post_body)
        body_json = json.loads(post_body)
        text = body_json['text']
        print('Text: ' + text)

        client = texttospeech.TextToSpeechClient()
        synthesis_input = texttospeech.SynthesisInput(text=text)
        voice = texttospeech.VoiceSelectionParams(
            language_code='en-US',
            name='en-US-Wavenet-C',
            ssml_gender=texttospeech.SsmlVoiceGender.FEMALE)
        # Note: sample rate 16kHz, and audio encoding 16-bit linear
        audio_config = texttospeech.AudioConfig(
            sample_rate_hertz=16000,
            audio_encoding=texttospeech.AudioEncoding.LINEAR16)
        response = client.synthesize_speech(
            input=synthesis_input, voice=voice, audio_config=audio_config
        )

        # Construct a server response; note the 200 status code and content type
        self.send_response(200)
        self.send_header('Content-Type', 'application/octet-stream')
        self.end_headers()
        self.wfile.write(response.audio_content)
        return


print('Server listening on port 3130...')
httpd = socketserver.TCPServer(('0.0.0.0', 3130), Handler)
httpd.serve_forever()
```
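Once the service is running, it can be worth smoke-testing the endpoint before handing it over to UneeQ. The sketch below is a minimal client, assuming the Python sample above is listening locally on port 3130; the apiKey and preset values are placeholders (the sample ignores them), and the size check simply uses the documented format (16kHz mono, 2 bytes per sample = 32,000 bytes per second):

```python
# >$ pip3 install requests
import requests

payload = {
    "apiKey": "test-key",             # placeholder; the sample above ignores it
    "preset": "en-US-Wavenet-C",      # placeholder; the sample hardcodes its voice
    "text": "Hello from my custom voice service.",
}
resp = requests.post("http://localhost:3130/", json=payload, timeout=30)
resp.raise_for_status()

pcm = resp.content
# 16kHz mono 16-bit PCM = 32,000 bytes per second of speech
print(f"Got {len(pcm)} bytes, roughly {len(pcm) / 32000:.2f}s of audio")
assert len(pcm) % 2 == 0, "16-bit samples should give an even byte count"
```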
```javascript
// Orchestration handler
var express = require('express');
var router = express.Router();
const sdk = require("microsoft-cognitiveservices-speech-sdk");
const { textToSpeech, textToSpeechSsml } = require('./azure-cognitiveservices-speech');

const useSsml = process.env.USE_SSML;

router.get('/', function (req, res, next) {
  res.send('respond with a resource');
});

router.post('/', async function (req, res, next) {
  let audioStream;
  console.log(req.body.text);
  if (useSsml === 'true') {
    audioStream = await textToSpeechSsml(req.body.text);
  } else {
    audioStream = await textToSpeech(req.body.text);
  }
  res.set({
    'Content-Type': 'application/octet-stream',
    'Transfer-Encoding': 'chunked'
  });
  audioStream.pipe(res);
});

module.exports = router;
```

```javascript
// Microsoft services handler
// azure-cognitiveservices-speech.js
const sdk = require('microsoft-cognitiveservices-speech-sdk');
const { Buffer } = require('buffer');
const { PassThrough } = require('stream');
const { OutputFormat } = require('microsoft-cognitiveservices-speech-sdk');

const azureApiKey = process.env.AZURE_API_KEY;
const azureRegion = process.env.AZURE_REGION;
const azureSpeakingStyle = process.env.AZURE_SPEAKING_STYLE;
const azureOutputFormat = process.env.AZURE_OUTPUT_FORMAT;
const azureVoice = process.env.AZURE_VOICE;
const prosodySpeed = process.env.PROSODY_SPEED;
const prosodyPitch = process.env.PROSODY_PITCH;
const useSpeakingStyle = process.env.USE_SPEAKING_STYLE;
const nlpOutputsSsml = process.env.NLP_OUTPUTS_SSML;

/**
 * Node.js server code to convert text to speech
 * @returns stream
 * @param {*} key your resource key
 * @param {*} region your resource region
 * @param {*} text text to convert to audio/speech <== we only use text
 * @param {*} filename optional; best for long text; temp file for converted speech/audio
 */
const textToSpeech = async (text) => {
  // Convert callback function to promise
  return new Promise((resolve, reject) => {
    const speechConfig = sdk.SpeechConfig.fromSubscription(azureApiKey, azureRegion);
    speechConfig.speechSynthesisOutputFormat = 14; // Raw16Khz16BitMonoPcm
    speechConfig.speechSynthesisVoiceName = azureVoice;

    let audioConfig = null;
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);

    synthesizer.speakTextAsync(
      stripUneeqTags(text),
      result => {
        const { audioData } = result;
        synthesizer.close();
        const bufferStream = new PassThrough();
        bufferStream.end(Buffer.from(audioData));
        resolve(bufferStream);
      },
      error => {
        synthesizer.close();
        reject(error);
      }
    );
  });
};

const textToSpeechSsml = async (text) => {
  return new Promise((resolve, reject) => {
    const speechConfig = sdk.SpeechConfig.fromSubscription(azureApiKey, azureRegion);
    speechConfig.speechSynthesisOutputFormat = 14; // Raw16Khz16BitMonoPcm
    speechConfig.speechSynthesisVoiceName = azureVoice;

    let audioConfig = null;
    // console.log(speechConfig);
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);

    let ssmlText = "";
    if (nlpOutputsSsml === "true") {
      ssmlText = stripUneeqTags(text);
    } else {
      ssmlText = '<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">';
      ssmlText += `<voice name="${azureVoice}">`;
      if (prosodySpeed.length > 0) {
        ssmlText += `<prosody rate="${prosodySpeed}" pitch="${prosodyPitch}">`;
      }
      if (useSpeakingStyle === 'true') {
        ssmlText += `<mstts:express-as style="${azureSpeakingStyle}">`;
      }
      ssmlText += stripUneeqTags(text);
      if (useSpeakingStyle === 'true') {
        ssmlText += '</mstts:express-as>';
      }
      if (prosodySpeed.length > 0) {
        ssmlText += '</prosody>';
      }
      ssmlText += '</voice>';
      ssmlText += '</speak>';
    }

    synthesizer.speakSsmlAsync(
      ssmlText,
      result => {
        console.log(ssmlText);
        const { audioData } = result;
        synthesizer.close();
        const bufferStream = new PassThrough();
        bufferStream.end(Buffer.from(audioData));
        resolve(bufferStream);
      },
      error => {
        console.log(error);
        synthesizer.close();
        reject(error);
      }
    );
  });
};

// This will strip all SSML (anything XML-style) and leave only the text string itself
const stripUneeqTags = (text) => {
  text = text.replace(/<uneeq:\w+>/gm, "");
  text = text.replace(/<\/uneeq:\w+>/gm, "");
  text = text.replace(/<[^>]+>/g, "");
  return text;
};

module.exports = { textToSpeech, textToSpeechSsml, stripUneeqTags };
```

UneeQ Configuration

Once your service has been deployed and confirmed reachable, please provide the following details to UneeQ:

- The fully qualified URL of the service
- An API key for your service (optional)
- Voice file / name (common to providers like Google, Amazon and Azure; your service may not require this detail, or you may choose to hardcode the value within your service)

UneeQ will configure your digital human to use the custom voice service that you have created. The next time you start a session with your digital human, you will hear the custom voice!
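Because the service returns headerless PCM, a saved response body can't simply be opened in a media player. As a local debugging aid (not part of the integration itself), a script like the sketch below can wrap a saved response in a WAV header using the documented format (16kHz, mono, 16-bit); the file names are illustrative:

```python
# Wrap a saved raw PCM response in a WAV header so it can be auditioned locally.
# File names are illustrative.
import wave

with open("response.raw", "rb") as f:
    pcm = f.read()

with wave.open("response.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)  # 16kHz
    w.writeframes(pcm)

print("Wrote response.wav - if it sounds like static, check byte order and encoding")
```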
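If you do supply an API key, your service should verify it on each request; note that the samples above ignore the apiKey field entirely. Below is a minimal sketch of such a check, assuming a shared secret agreed with UneeQ and held in an environment variable (the variable and function names are illustrative):

```python
import json
import os

EXPECTED_API_KEY = os.environ.get("BYO_TTS_API_KEY")  # illustrative variable name


def extract_text(post_body: str) -> str:
    """Validate the apiKey field of the documented payload, then return the text to speak."""
    body = json.loads(post_body)
    if EXPECTED_API_KEY and body.get("apiKey") != EXPECTED_API_KEY:
        # Map this to a non-2xx response; error codes are counted, bodies are not captured
        raise PermissionError("unrecognised apiKey")
    return body["text"]
```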
Troubleshooting

This section lists common problems and errors that may occur during configuration or rollout of a BYO TTS endpoint. If you don't see your error here, you might consider coming back and adding it once you figure it out.

The digital human is talking really slowly / quickly

This is likely a sample rate problem; check that you are returning 16kHz audio.

I can hear a click at the start of speech

This is likely because there are unexpected bytes at the start of the response body. Previously this has been identified as a WAV file header; if one is present, it needs to be stripped before the audio is returned.

All I hear is nasty violent static

You have returned the audio in the wrong format. This can happen simply by swapping the byte order from little- to big-endian, or it may be something more involved, such as returning MP3 or some other audio format. Our application expects linear PCM; check the "Returning Audio" section above.

Not playing well with Node.js

When you're using Node.js, you might default to res.send(file) from Express, but when streaming back a binary file it's better to use res.sendFile(file), as this automatically handles a variety of things.

res.sendFile(file):

- Is specifically designed to send files as the response. It automatically sets the appropriate headers, including Content-Type based on the file extension, and handles the streaming of the file efficiently.
- Takes care of buffering and sending the file in chunks, which is more memory-efficient for large files.
- Supports conditional requests (If-Modified-Since, If-None-Match) and ranges (Range, Accept-Ranges) out of the box.
- Provides better security by preventing the serving of files outside the specified root directory, using path normalization.

res.send(file):

- Is a generic method used to send various types of responses, including files; however, it doesn't handle files in the same optimized manner as res.sendFile(file).
- Requires manually setting the appropriate headers, including Content-Type and Content-Disposition, and handling the file streaming logic yourself.
- May load the entire file into memory before sending it, which can be memory-intensive for large files and may impact server performance.

In summary, res.sendFile(file) is the recommended approach when streaming files back as application/octet-stream in Express. It provides better performance, memory efficiency and security, and handles file-related functionality more effectively.
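Several of the problems above (the startup click, the static) can be caught by inspecting the first bytes of a response before it ever reaches UneeQ. The sketch below is a rough checker using well-known magic bytes as heuristics; it is illustrative only, not an exhaustive format validator:

```python
def check_pcm_response(body: bytes) -> None:
    """Heuristic checks for the common BYO audio mistakes described above."""
    if body[:4] == b"RIFF":
        print("Looks like a WAV header (typically 44 bytes) - strip it, or the "
              "listener will hear a click at the start of speech.")
    elif body[:3] == b"ID3" or body[:2] in (b"\xff\xfb", b"\xff\xf3"):
        print("Looks like MP3 - return raw LINEAR16 PCM instead.")
    if len(body) % 2 != 0:
        print("Odd byte count - 16-bit samples should always give an even length.")
```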