Voice-driven document interactions with OpenAI

Nick Winder

November 15, 2024

Voice-driven document interactions with OpenAI

Imagine the frustration of sifting through countless pages, looking for a crucial piece of information — a feeling familiar to anyone who has ever dealt with overwhelming volumes of documents.

What if I told you that current technology could help solve this issue? Not only that, but you can ask for information using your voice, just like you’d ask Jeff from accounting about the numbers.

AI is transforming the way we interact with information. It began with us mindlessly chatting with our favorite chatbots, but now AI is moving toward intuitive, voice-driven interaction that promises to simplify and enhance our work — often making it more fun.

OpenAI’s latest voice model and Realtime API(opens in a new tab) is at the forefront of this shift, offering capabilities such as understanding nuanced emotions and contextual meaning, which promise to change our relationship with documents forever.

Gone are the days of time-consuming manual search and summarization. We’ve entered a new, dynamic, and intuitive communication era where you can ask questions and receive relevant answers.

Enhancing accessibility and convenience

For many, voice interaction offers a more natural way of expressing thoughts because it closely mirrors everyday human communication. Speaking allows thoughts to flow freely without the need to pause and think about spelling, grammar, or punctuation. It’s a direct path from thought to expression, making it easier to convey complex ideas and emotions.

Voice-driven interfaces also transform the computing experience for individuals with impairments, allowing them to complete tasks more quickly and efficiently, further expanding the accessibility of tools beyond the use of a mouse and keyboard.

With the advancement of technology, we can imagine useful voice-driven document workflows. Users should be able to ask questions about content and receive responses in natural language tailored to their preferred format, style, and vibe. This approach can significantly shorten feedback loops and boost productivity across various settings, allowing users to quickly access the information they need without losing the time typically necessary to adapt to an interface. Instead, the interface adapts to them.

Enough talk — Let me see a demo

You didn’t read all that for nothing, did you?

Here at Nutrient, we’ve been experimenting with the future of voice-driven document interaction. It’s an exciting yet tricky solution to perfect, but you deserve to see the progress.

We’ve been able to drive the model to demonstrate a strong understanding of both a document’s context and content. Through careful design, we’ve managed to provide relevant information precisely when needed while minimizing unnecessary context, ensuring more targeted responses.

It’s not just about understanding though — we need to give credit where it’s due: OpenAI’s Realtime API introduces a new level of human-like interaction and efficiency. Features like voice activity detection (VAD) enable users to interject during a response, refining their questions or guiding the conversation in the desired direction. While similar adjustments are possible with text interaction, they often require multiple clicks and extensive typing. Voice input, on the other hand, is faster and more direct, allowing users to reach solutions more quickly and efficiently.

Another standout feature that sets voice interaction apart from text is the model’s ability to interpret tone, speed, and nonverbal cues. When frustration creeps into your voice, the responses can adjust accordingly, offering a more empathetic and adaptive interaction. While text can convey emotion (like typing in ALL CAPS), vocal tone and non-verbal signals add a richer, more nuanced dimension to the exchange. The concept is simple: Speak, act, and express yourself as you want to be responded to. Just be mindful of what you wish for! :)

That all sounds amazing, doesn’t it?

We agree! However, we’re not yet convinced it’s ready for mainstream adoption.

Price is the limiting factor

So, what’s holding us back from releasing this feature today?

The simple answer is cost — it’s simply too expensive.

In our tests, the costs reached around $30 per hour due to extensive tool usage and continuous voice streaming. In contrast, using our current AI Assistant product with a model like GPT-4o mini for basic interactions could cost as little as $0.50 per hour. To make OpenAI’s Realtime API effective, you’d need a compelling and highly profitable use case.

Advanced models like these require significant computational resources, driving costs up. However, as we’ve seen with other innovations, the trajectory is often one of rapid cost reduction over time. In the coming months and years, these expenses could decrease significantly, paving the way for broader adoption and making voice interaction with documents a reality for everyone.

But now, it’s over to you. Do you have a use case where, despite the high cost of the service, the return on investment is so substantial that it’s well worth the price?

Well, maybe you can encourage us to push this further. If you’re interested, contact our Sales team.