Generating speech with different rhythms and pauses makes it sound more human-like, according to an assessment of an artificial intelligence trained on speech taken from YouTube and podcasts.
Most artificial intelligence text-to-speech systems are trained on data sets of acted speech, which can lead to output that sounds stilted and one-dimensional. Natural speech, by contrast, displays a wide range of rhythms and pauses that convey different meanings and emotions.
Now, Alexander Rudnicky at Carnegie Mellon University in Pittsburgh, Pennsylvania, …