AI Voice Assistant

May 2025

Objective

Tools & Technologies

React (frontend UI)
Vercel (frontend hosting)
Firebase Functions (backend logic)
Firestore (data storage and referencing)
OpenAI API (natural language response generation)
Deepgram (real-time audio transcription)
Open-source VAD (voice activity detection and capture)
Browser-native TTS (text-to-speech output)

Challenge

The core challenge of building this AI voice assistant was bringing together several complex technologies, each with its own quirks, into a seamless, human-like conversation experience inside the browser.

From the start, the idea was ambitious: allow users to "call" a virtual customer support agent, speak naturally, and receive intelligent, voiced replies in real time, as if they were speaking to a real person. Achieving that required solving problems at every level of the stack.

First came audio capture. I'd never worked with voice activity detection (VAD) before, so detecting when a user had finished speaking, without manual button presses, involved stitching together open-source VAD components and testing how different browsers and devices handle microphone streams. On top of that, I had to ensure the microphone didn't pick up the AI's own responses, which would otherwise trap the agent in an endless loop of replying to itself.
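
To give a sense of how that fits together, here is a minimal sketch of the capture loop. The write-up only says "open source VAD," so the @ricky0123/vad-web package and its MicVAD API are assumptions, and handleUserUtterance is a hypothetical hand-off into the speech-to-text step.

    import { MicVAD } from "@ricky0123/vad-web";

    // Hypothetical hand-off into the speech-to-text pipeline (next sketch).
    declare function handleUserUtterance(audio: Float32Array): void;

    // The VAD fires once the user stops talking, so no push-to-talk button is needed.
    const vad = await MicVAD.new({
      onSpeechEnd: (audio: Float32Array) => handleUserUtterance(audio),
    });
    vad.start();

    // Called when TTS playback starts and ends. Pausing detection while the
    // assistant talks is what keeps the mic from hearing the AI's own voice
    // and looping the agent back on itself.
    export function setAssistantSpeaking(speaking: boolean) {
      if (speaking) vad.pause();
      else vad.start();
    }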

Then came speech-to-text (STT). I used Deepgram's API for its low-latency streaming transcription. But raw transcripts aren't enough for an intelligent system: the LLM has to process the entire conversation up to that point, generate a list of pending tasks, and execute them with accurate responses. Designing that transcript pipeline to support follow-ups, multiple requests, and natural back-and-forth was a key part of the system's intelligence.
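
As a rough illustration of that pipeline, the sketch below appends each finalized Deepgram transcript to a running message history and sends the whole history to the LLM, so follow-up questions resolve against earlier turns. The names (ChatMessage, conversationHistory, generateReply, speak) are illustrative rather than the actual code.

    type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

    const conversationHistory: ChatMessage[] = [
      { role: "system", content: "You are a helpful customer support agent." },
    ];

    // Defined in the next sketch: the LLM call and browser TTS playback.
    declare function generateReply(history: ChatMessage[]): Promise<string>;
    declare function speak(text: string): void;

    // Called for each finalized transcript segment coming back from Deepgram.
    async function onFinalTranscript(transcript: string) {
      if (!transcript.trim()) return;
      conversationHistory.push({ role: "user", content: transcript });

      // The full history goes to the model so "what about the second option?"
      // style follow-ups still make sense several turns later.
      const reply = await generateReply(conversationHistory);
      conversationHistory.push({ role: "assistant", content: reply });
      speak(reply);
    }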

Generating responses with GPT-3.5 Turbo was familiar territory, but delivering those responses as spoken audio was brand new. I used the browser's native text-to-speech (TTS) for speed and simplicity, but coordinating TTS playback with the rest of the conversation state required careful flow control and state management in React.
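
A simplified version of that step might look like the sketch below, which calls the OpenAI chat completions endpoint with gpt-3.5-turbo and speaks the reply through the Web Speech API. ChatMessage and setAssistantSpeaking carry over from the earlier sketches, and OPENAI_API_KEY is illustrative only; in the real app the OpenAI request sits behind a Firebase Function rather than in the browser.

    // Illustrative only: a real deployment keeps the key server-side.
    declare const OPENAI_API_KEY: string;

    async function generateReply(history: ChatMessage[]): Promise<string> {
      const res = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${OPENAI_API_KEY}`,
        },
        body: JSON.stringify({ model: "gpt-3.5-turbo", messages: history }),
      });
      const data = await res.json();
      return data.choices[0].message.content;
    }

    // Speak the reply with browser-native TTS and keep the VAD paused while
    // audio plays, so the assistant never transcribes itself (earlier sketch).
    function speak(text: string) {
      const utterance = new SpeechSynthesisUtterance(text);
      setAssistantSpeaking(true);
      utterance.onend = () => setAssistantSpeaking(false);
      window.speechSynthesis.speak(utterance);
    }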

I decided to do all of this without a traditional backend, so I leaned on Firebase Functions for on-demand logic and Firestore for lightweight session memory, logging, and user-specific document storage. This laid the groundwork for real workflows like emailing documents or referencing user files mid-conversation.
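
On the serverless side, a callable function plus a Firestore write is enough to persist each turn. This is a sketch assuming the Firebase Functions v2 callable syntax and the Admin SDK; the collection and field names (sessions, turns, role, content) are made up for illustration.

    import { onCall } from "firebase-functions/v2/https";
    import { initializeApp } from "firebase-admin/app";
    import { getFirestore, FieldValue } from "firebase-admin/firestore";

    initializeApp();
    const db = getFirestore();

    // Persist one conversation turn so later turns, or workflows like emailing
    // a document, can reference the session's history and the user's files.
    export const logTurn = onCall(async (request) => {
      const { sessionId, role, content } = request.data;
      await db
        .collection("sessions")
        .doc(sessionId)
        .collection("turns")
        .add({ role, content, createdAt: FieldValue.serverTimestamp() });
      return { ok: true };
    });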

Throughout the project, the biggest challenge wasn't just integrating APIs; it was designing a UX that felt conversational, responsive, and human despite the technical complexity happening behind the scenes. Stitching audio, transcripts, logic, and playback together into something fluid and helpful was the real engineering test, and incredibly rewarding to pull off.