Emotions in the Machine – Part I: Testing Dia by Nari Labs for Game Voiceovers

An illustration of a futuristic and holographic computer interface.

As we approach the release of The Green Spurt, our immersive escape room-style game for Apple Vision Pro, we have been working on the voiceover for the game's ending: a 30-40 second audio clip that needs emotion, strangeness, and the kind of unpredictable texture only a human voice usually delivers.

The Problem with Current Text-to-Speech (TTS) Models

So far, we have been using ElevenLabs for our voiceover needs. It's fast, stable, packed with voices, and supports a wide range of languages. And while it does a good job of responding to emotional cues in text, it's very difficult to get those little imperfections and quirks that make a voice really land.

Enter Dia: A new open-source TTS model with raw emotion

I first heard about Dia through the Near Future Laboratory newsletter, and it immediately caught my attention:

Dia is a 1.6B parameter text to speech model created by Nari Labs. Dia directly generates highly realistic dialogue from a transcript. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communications like laughter, coughing, clearing throat, etc. - Nari Labs on GitHub

Looking into it further, I found that they offer a demo page comparing Dia to ElevenLabs and Sesame CSM-1B. I usually take those comparisons with a grain of salt, but it was impressive enough to make me want to try it.

If you are curious about the team behind Dia, how they pulled it off in three months, and how the world reacted to it, you can check out the thread on X by Toby Kim, one of the creators.


The Script and The Requirements

⚠️
The Green Spurt ending spoiler alert.

This short voiceover comes at the very end of The Green Spurt, after players (in the role of Watchers) successfully restore the RELEAF BioVault’s air filtration system. It's voiced by the Chief Engineer, who can’t help but take a little credit for their success. This is one of two possible endings of the game, depending on the players’ final decision.

I won’t go into the full backstory, but here is what matters in terms of requirements: the voiceover needs a neutral-sounding voice that carries a mix of dramatic flair and light comedic timing, something you’d expect from a real actor playing it. Here is the script I used for the test:

“Hey everyone, RELEAF Chief Engineer here—coming to you live from the Tropico BioVault.
Quick room tour!
Sooo… the one time I take a vacation after five years, of course there is a critical air filtration failure.
But don’t worry, it’s all sorted now! The Watchers totally crushed it—thanks to my excellent training and remote support the entire time.
Anyway, clean air again. You’re welcome!
Back to my coconuts.
Byeee.”

Test Run: Dia vs. ElevenLabs

Dia:

[Audio: Chief Engineer Dia test, about 25 seconds]

Max New Tokens (Audio Length): 3072; CFG Scale (Guidance Strength): 3; Temperature (Randomness): 1.5; Top P (Nucleus Sampling): 0.95; CFG Filter Top K: 30; Speed Factor: 0.94. No Audio Prompt.
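For readers unfamiliar with the Temperature and Top P settings listed above, here is a minimal sketch of what those two knobs generally do in a sampling-based model. It is a toy illustration in plain Python, not Dia's actual code: the function names and the four made-up scores are assumptions chosen for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely candidates until their cumulative probability
    reaches top_p, then renormalize the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


# Made-up scores for four candidate tokens (hypothetical values).
toy_logits = [4.0, 2.0, 1.0, -1.0]

for temperature in (1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, temperature)
    kept = top_p_filter(probs, top_p=0.95)
    print(temperature, [round(p, 3) for p in probs], sorted(kept))
```

With these toy numbers, raising the temperature from 1.0 to 1.5 spreads probability away from the top candidate, so more candidates survive the 0.95 cutoff. That is the general intuition behind a high temperature producing more varied, less predictable output.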

ElevenLabs:

[Audio: Chief Engineer ElevenLabs test, about 30 seconds]

Voice: Ariah; Model: Eleven Multilingual v2; Speed: 1; Stability: 30%; Similarity: 60%; Style: 100%; Speaker boost: Enabled.

Okay, so first impressions of Dia. The generated voice speaks too quickly; it reminds me of those super-fast disclaimers at the end of TV ads. The output repeats part of the script, skips the final word entirely, and ends with a few seconds of silence. The voice also didn't quite match what I had in mind, and there is currently no library of voices to choose from. That said, Dia nailed the emotional delivery. It felt more alive, with real variation and unpredictability.

On the other hand, ElevenLabs was far more stable. I could pick a voice, and what came out was clean and predictable. But in this case, maybe too predictable. As expected, the deal breaker is that it lacks the emotional nuance I am looking for. Having some experience with it, I know there are tricks to coax more emotion out of it, but nothing comes close to what Dia delivers right away.

Given the needs of this voiceover, and the spirit of The Green Spurt itself, which began as a playground for trying out new technologies, I decided to dive deeper into Dia. I wanted to understand its limits better and figure out how to work around them.
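One of the issues noted above, the few seconds of silence at the end of the clip, is the kind of thing that can be cleaned up in post-processing. The snippet below is only a sketch of that idea, not the method used for the final voiceover: the function name, the 0.01 threshold, and the sample values are all assumptions, and real audio would first need to be loaded into a list of amplitude values between -1.0 and 1.0.

```python
def trim_trailing_silence(samples, threshold=0.01, keep_tail=0):
    """Return a copy of samples with the quiet tail removed.

    samples:   sequence of floats in [-1.0, 1.0]
    threshold: absolute amplitude below which a sample counts as silence
               (an assumed value; tune it for real recordings)
    keep_tail: number of samples to keep after the last audible one
    """
    last_loud = -1
    # Walk backwards until the first sample that is loud enough.
    for index in range(len(samples) - 1, -1, -1):
        if abs(samples[index]) >= threshold:
            last_loud = index
            break
    end = min(len(samples), last_loud + 1 + keep_tail)
    return list(samples[:end])


# Hypothetical clip: some audible samples followed by silence.
clip = [0.0, 0.4, -0.3, 0.2, 0.0, 0.0, 0.0]
trimmed = trim_trailing_silence(clip)
print(len(clip), len(trimmed))
```

A small keep_tail value leaves a short natural pause at the end instead of cutting off right after the last audible sample.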

In Part II, I will walk you through how I generated the final voiceover and share some usage tips, insights from the Dia Discord community, and more. Stay tuned!

🚀
The Green Spurt is currently in development.
The Beta release is now available and you can register your interest for early access here.

With immersive regards from my digital persona!

Roxana Nagy

Co-Founder & Creative Technologist at Reality check