It was 2022. While most people were still trying to wrap their heads around ChatGPT, two young minds were already deep in a Google Discord chat, discussing GPUs and AI infrastructure. What started as a casual conversation about hardware would eventually birth India's first speech-to-speech foundational model.
Tanusri, then in her third year at Osmania University, had already caught the entrepreneurial bug. She wasn't just another computer science student: she was overhauling entire sales processes for businesses, implementing AI systems that transformed customer interactions. Within two weeks of launching her first AI venture, she had generated $2,500 in monthly recurring revenue.
On the other side of that Discord chat was Vansh, a second-year student at IIT Jodhpur with an extraordinary track record. His story begins at age 12, when he wanted to use WhatsApp but didn't have his own phone number. Instead of waiting, he built his own social platform—Vispark Chat launched on January 1, 2020. By 2021, he was training AI models that could have real conversations, launching Vision AI on Telegram before most people had heard of generative AI.
As their conversations evolved, Tanusri and Vansh noticed something that bothered them deeply: every voice AI system felt broken. You'd talk to a voice bot, wait two seconds, get a robotic response, and accidentally interrupt because the timing was wrong.
The problem was structural. Every system relied on three steps: convert speech to text, process with a language model, then convert back to speech. Each step added delay and stripped away human nuance: the emotion and subtle tonality that make conversations feel alive.
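The compounding delay in that cascade can be sketched in a few lines. The stage names and latency figures below are illustrative assumptions, not measurements from any real system: the point is simply that each stage must finish before the next begins, so delays add up.

```python
# Illustrative sketch (not AIVoco's code): why a cascaded voice pipeline
# feels slow. Latency numbers are made-up placeholders.

from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_ms: int  # hypothetical processing delay for this stage


def cascaded_latency(stages):
    """Total response delay is the sum of every stage's delay,
    because each stage waits for the previous one to finish."""
    return sum(s.latency_ms for s in stages)


# The conventional three-stage pipeline described above.
pipeline = [
    Stage("speech-to-text", 300),
    Stage("language model", 800),
    Stage("text-to-speech", 400),
]

print(cascaded_latency(pipeline), "ms")  # delays compound stage by stage
```

A direct speech-to-speech model collapses the cascade into a single stage, which is also why the text bottleneck, where prosody and tone are discarded, disappears.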
Vansh, who had been training speech models since 2021 and building multimodal foundational AI models while tech giants were still racing to figure out the basics, knew there had to be a better way. Still an undergraduate, he had already built AI systems that could seamlessly understand voice, text, and visual inputs, foundational work that would prove crucial. "We realized that the third dimension of voice—the tonality, the emotional texture, the rhythm—was being lost in translation. That's when we knew we had to build something completely different."
What happened next defied conventional wisdom. While tech giants threw billions and thousands of engineers at similar problems, these two students built a speech-to-speech foundational model from scratch—a 250-billion-parameter system that understands and generates speech directly, without converting to text.
“We are outpacing OpenAI, Google, and Meta with a better model, built on far less compute and at a fraction of the cost.”
But they had something the big companies didn't: necessity and constraint. Working with limited resources in India, they had to be efficient and creative. Vansh's technical genius combined with Tanusri's business acumen created something unique—a model that didn't just work, but learned from every conversation.
By July 2025, they had achieved the impossible: a speech-to-speech model that holds natural, emotional conversations in over 200 languages. Their model, code-named Rose, captures the art of human conversation—knowing when to pause, show empathy, or be assertive. It learns from every interaction, becoming more human with each call.
"The moment we realized we had succeeded," Vansh explains, "was when we stopped thinking about technology and started thinking about conversations. Our model wasn't just responding—it was connecting."
"We're not just building voice AI, we're creating the third dimension of human-machine conversation. While others focus on words, we capture the soul of speech: the tonality, the pauses, the emotional nuances that make us human.
"Our AI doesn't just respond; it learns, gains experience, and grows wiser with every conversation, just like we do. We're not making machines sound human, we're making them understand what it means to be human."
— Tanusri & Vansh, Co-founders of AIVoco