Google Text-to-Speech

Using Google Text-to-Speech with Python without saving to a file.


The Google Text-to-Speech engine is still, to my ear, the most natural-sounding voice available. There are many examples of how to use the TTS library with Python, but all the ones I found assume that you want to save the resulting audio clip to an .mp3 file and then play it.

Saving a file, playing it, and then deleting it doesn't seem like a big deal, but on a Raspberry Pi running from an SD card with a limited number of write cycles, it doesn't sound like a great idea. I was therefore looking for a way to play the audio stream directly from memory, and this is the most reliable way I found to do so.
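As a minimal sketch of the idea, Python's io.BytesIO wraps a bytes object in a file-like object that lives entirely in RAM, so nothing is written to the SD card. The helper name to_memory_stream and the placeholder bytes are made up for illustration; in the script further down, the bytes come from the Text-to-Speech response.

```python
import io


def to_memory_stream(audio_bytes: bytes) -> io.BytesIO:
    """Wrap raw audio bytes in a seekable, file-like object held in RAM."""
    # Passing the bytes to the constructor leaves the read position at 0,
    # so no separate write() and seek(0) calls are needed.
    return io.BytesIO(audio_bytes)


if __name__ == "__main__":
    fake_audio = b"ID3" + bytes(16)  # placeholder bytes, not a real MP3
    stream = to_memory_stream(fake_audio)
    print(stream.tell())                # current read position
    print(stream.read() == fake_audio)  # the same bytes come back out
```

Any function that accepts an open file object, rather than a file name, can usually take such a stream instead.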

This code assumes that you have already registered for the Google Cloud Text-to-Speech service and obtained the .json credentials file that GOOGLE_APPLICATION_CREDENTIALS points to. Refer to the Google documentation on how to do so.
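Before calling the service, it can help to confirm that the credentials file is actually reachable. The following is a small sketch using only the standard library; the function name find_credentials is hypothetical, and it only checks that the file exists and parses as JSON, not that Google will accept it.

```python
import json
import os


def find_credentials(env_var: str = "GOOGLE_APPLICATION_CREDENTIALS") -> str:
    """Return the credentials path from the environment, or raise if unusable."""
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(f"{env_var} is not set")
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
    with open(path, encoding="utf-8") as fh:
        json.load(fh)  # raises json.JSONDecodeError if the file is malformed
    return path
```

Calling find_credentials() once at startup turns a confusing authentication error later on into an immediate, readable one.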

The script also uses the Pygame library to play the audio; refer to its documentation on how to install it on your system.
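A quick way to see whether both dependencies are importable is sketched below. The helper missing_modules is a made-up name, and the check only looks for the modules; it does not verify versions.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of module names that cannot be found for import."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except ModuleNotFoundError:
            # Raised for dotted names when a parent package is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means both libraries used by the script were found.
    print(missing_modules(["pygame", "google.cloud.texttospeech"]))
```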

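# Note: texttospeech.types and texttospeech.enums, used below, exist only in
# google-cloud-texttospeech releases before 2.0. Newer releases expose these
# classes directly on texttospeech and expect keyword arguments in synthesize_speech().
from google.cloud import texttospeech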
import pygame
import io
import os
os.putenv('DISPLAY', ':0.0')
os.environ["SDL_VIDEODRIVER"] = "dummy"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.expanduser("~/<your_credential_file>.json")

# Instantiates a client
client = texttospeech.TextToSpeechClient()
# response = client.list_voices()
# print(response)

# Set the text input to be synthesized
synthesis_input = texttospeech.types.SynthesisInput(text="Knock knock! Why did the duck cross the road?")

# Build the voice request: select the language code ("en-US"), the voice name,
# and the SSML voice gender
voice = texttospeech.types.VoiceSelectionParams(
    language_code='en-US',
    name='en-US-Wavenet-F',
    ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE)

# Select the type of audio file you want returned
audio_config = texttospeech.types.AudioConfig(
    audio_encoding=texttospeech.enums.AudioEncoding.MP3)

# Perform the text-to-speech request on the text input with the selected
# voice parameters and audio file type
response = client.synthesize_speech(synthesis_input, voice, audio_config)

# # The response's audio_content is binary.
# with open('output.mp3', 'wb') as out:
#     # Write the response to the output file.
#     out.write(response.audio_content)
#     print('Audio content written to file "output.mp3"')

freq = 24000    # sample rate in Hz (24 kHz, not CD quality, which is 44.1 kHz)
bitsize = -16   # signed 16-bit samples (a negative size means signed)
channels = 2    # 1 is mono, 2 is stereo
buffer = 2048   # number of samples (experiment to get right sound)
pygame.mixer.init(freq, bitsize, channels, buffer)
pygame.mixer.music.set_volume(1.0)
pygame.display.set_mode((1, 1))

pygame.init()  # this is needed for pygame.event.* and needs to be called after mixer.init() otherwise no sound is played
with io.BytesIO() as f:  # use a memory stream
    f.write(response.audio_content)
    f.seek(0)
    pygame.mixer.music.load(f)
    pygame.mixer.music.set_endevent(pygame.USEREVENT)
    pygame.event.set_allowed(pygame.USEREVENT)
    pygame.mixer.music.play()
    pygame.event.wait()  # play() is asynchronous, so block until an event arrives instead of closing f right away
    while pygame.mixer.music.get_busy():  # the event may not be the end event, so also poll until playback stops
        pygame.time.wait(100)  # sleep 100 ms between checks

pygame.mixer.music.fadeout(1000)
pygame.mixer.music.stop()
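Because play() returns immediately, the script has to wait for playback to finish before the memory stream is closed. The polling loop can be factored into a small helper with a timeout, so a stuck stream cannot hang the program forever. This is only a sketch: wait_until_idle and its parameters are hypothetical names, and the default timeout is an arbitrary choice.

```python
import time


def wait_until_idle(is_busy, poll_interval=0.1, timeout=60.0):
    """Poll is_busy() until it returns False; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while is_busy():
        if time.monotonic() >= deadline:
            return False  # gave up while the source was still busy
        time.sleep(poll_interval)
    return True  # the source went idle before the deadline
```

In the script above, wait_until_idle(pygame.mixer.music.get_busy) would take the place of the while loop, still inside the with block so that the stream stays open during playback.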