Giving Machines A Human Ear: The Science Behind LEAP Lab's Voice-Driven AI Innovation
Sriram Ganapathy’s LEAP lab at IISc develops AI to interpret speech, advancing diagnostics, hearing aids, and human-machine communication.

By Anubha Jain
Published: June 18, 2025 at 7:54 PM IST
Updated: June 18, 2025 at 8:02 PM IST
Bengaluru: In a bustling lab at the Indian Institute of Science (IISc), Sriram Ganapathy is unravelling the complexities of human speech to redefine how machines understand us. Leading the Learning and Extraction of Acoustic Patterns (LEAP) lab, he and his team work on representation learning to develop compact, meaningful features from speech and acoustic signals, enabling AI to interpret voice, emotion, and biometric cues with human-like intuition.
From pioneering voice-based COVID-19 diagnostics with Coswara to shaping next-generation hearing aids and collaborating with tech giants like Google and Samsung, Sriram’s work is pushing the boundaries of human-machine communication. With a background in signal processing, Sriram pursued his PhD at Johns Hopkins University under Hynek Hermansky, developing techniques to extract reliable speech information in noisy environments. After working at the IBM T. J. Watson Research Center, he returned to India and joined IISc in 2016. A sabbatical at Google DeepMind deepened Sriram’s engagement with large language models (LLMs), prompting research into emotion detection and LLM reliability.
His journey, sparked by a curiosity about how we speak and listen, is now bridging cognitive science, artificial intelligence, and real-world impact. In an exclusive interview with ETV Bharat, Sriram Ganapathy talked about his work, discussing human-machine communication, challenges, limitations, the extent of AI learning, adaptation, and more.
Advancing Seamless Human-Machine Communication through Speech Recognition and AI
When asked about the key experiences or turning points that influenced his decision to specialise in speech and acoustic signal processing, Sriram said that his journey was shaped by a deep curiosity about human communication and how machines could be taught to understand it. His early exposure to signal processing during his MTech at IISc laid the foundation, but it was his time working with Hynek Hermansky—first at EPFL in Switzerland and later at Johns Hopkins University—that truly ignited his interest.
"Seventeen years ago, during my MTech final year, my research focused on speech recognition, aiming to replicate human-machine interactions akin to Siri or Alexa. At the time, this was a significant technical challenge. We primarily trained our systems on large datasets from American and British English speakers and then worked to adapt these models to the Indian accent. The core objective was to generalise the model’s understanding across different speech patterns and accents, particularly in Indian English, to enable more inclusive and effective speech-based interfaces.”
Speech: The Natural Interface of Communication
Elaborating on the scientific challenges of extracting meaningful patterns from human speech, Sriram said that human communication is predominantly verbal. As such, the next generation of intelligent systems must be capable of engaging in seamless spoken interactions with humans.
"Our research aims to build machines that are not only intelligent but also human-like in their ability to understand and respond to speech. This involves two primary processes—speech recognition (understanding spoken input) and speech synthesis (generating spoken responses)," Sriram said. "Beyond these, machines must incorporate contextual knowledge, understand the environment, and adapt to the background of each speaker. This requires leveraging large language models (LLMs) like ChatGPT, Claude, and others."
These models are primarily trained on text but are increasingly being used to bridge various communication modalities—text, voice, image, and even emotion—towards creating informed, adaptive, and conversational systems, he added.
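To make the two-step pipeline concrete, here is a minimal, illustrative sketch of a speech-in, speech-out loop in Python. It is not the LEAP lab's system: it assumes the open-source openai-whisper and pyttsx3 packages for recognition and synthesis, a local file named question.wav, and a placeholder query_llm function standing in for a model such as ChatGPT or Claude.

```python
# A minimal speech-in / speech-out loop: ASR -> LLM -> TTS.
# Assumes the open-source `openai-whisper` and `pyttsx3` packages are installed
# and that `question.wav` exists; `query_llm` is a stand-in for any LLM backend.

import whisper   # speech recognition (understanding spoken input)
import pyttsx3   # speech synthesis (generating spoken responses)

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM such as ChatGPT or Claude."""
    return f"You asked: {prompt}. Here is a stub answer."

def voice_turn(audio_path: str) -> None:
    # 1. Speech recognition: audio waveform -> text.
    asr_model = whisper.load_model("base")
    text = asr_model.transcribe(audio_path)["text"]

    # 2. Language understanding and response generation via an LLM.
    reply = query_llm(text)

    # 3. Speech synthesis: text -> audible response.
    tts = pyttsx3.init()
    tts.say(reply)
    tts.runAndWait()

if __name__ == "__main__":
    voice_turn("question.wav")
```

In a real system, each of these stages would need to be streaming, noise-robust, and adapted to the speaker's accent, environment, and context, which is where the research challenges Sriram describes lie.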
Challenges in Real-world Environments: Noisy, Personal, and Diverse
“AI systems are nearing human-like capabilities, but understanding speech remains a unique challenge,” Sriram explained. "Unlike written text, speech is shaped by pronunciation, language proficiency, emotional state, and environmental noise. These variations carry essential information but also pose significant challenges for machines."
"Furthermore, speech also encodes identity and health signals—voice can act as a biometric marker. This opens the door for AI systems to verify a speaker’s identity or detect health conditions based on vocal traits, such as respiratory issues or emotional stress," he added.
He further explained that this broader scope of speech analysis goes beyond content to include paralinguistics—analysing who is speaking, how they are speaking, and why. Elements like health, emotional state, regional dialect, and environmental conditions are critical. The research at the LEAP lab seeks to capture and process all these dimensions to build machines capable of truly human-like understanding and response.
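As a rough illustration of the kind of compact acoustic features such paralinguistic analysis builds on, the sketch below uses the open-source librosa library to summarise a recording by its MFCCs, pitch contour, and energy. This is an assumption-laden example rather than the LEAP lab's actual feature pipeline, and it presumes a local file named clip.wav.

```python
# An illustrative sketch (not the LEAP lab's actual pipeline) of extracting
# compact features that carry paralinguistic cues: who is speaking, how,
# and in what state. Assumes librosa and numpy are installed and `clip.wav` exists.

import numpy as np
import librosa

def paralinguistic_features(path: str) -> dict:
    y, sr = librosa.load(path, sr=16000)  # mono waveform at 16 kHz

    # Spectral envelope summary: MFCCs capture vocal-tract characteristics
    # useful for speaker identity and some health-related traits.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Pitch contour: intonation and pitch variability relate to emotion and stress.
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)

    # Short-term energy: loudness dynamics, another simple paralinguistic cue.
    rms = librosa.feature.rms(y=y)

    return {
        "mfcc_mean": mfcc.mean(axis=1),
        "pitch_mean_hz": float(np.nanmean(f0)),
        "pitch_std_hz": float(np.nanstd(f0)),
        "energy_mean": float(rms.mean()),
    }

if __name__ == "__main__":
    print(paralinguistic_features("clip.wav"))
```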

"At IISc, we collaborate with government agencies, startups, and global industry leaders like Google, Samsung, and Sony. One major challenge is the lack of labelled, real-world data in underrepresented languages or contexts. During the COVID-19 pandemic, for example, when no precedent data existed, our team developed a mobile-based system to collect voice data from around the world, guiding users remotely. This data, stored in the cloud, was analysed and shared with researchers and the public to contribute to early-stage COVID-19 voice-based diagnostics," he said.
Improving AI Learning
When asked how understanding the brain’s interpretation of speech informs AI development, Sriram explained that LEAP focuses on decoding human speech and communication signals. The lab draws insights from human cognition, especially in complex auditory scenes where humans naturally focus on one voice over another—something machines still struggle with. LEAP’s work explores these behavioural mechanisms, including how we differentiate and attend to specific voices, to improve AI listening systems.
Adapting to the Age of Foundation Models and LLMs
Discussing how his understanding of LLMs influenced the lab’s research priorities, Sriram said, “My sabbatical at Google DeepMind in 2022 marked a pivotal moment. Within months of joining, we witnessed the rapid evolution of LLMs. Initially just a growing curiosity, these models soon began revolutionising image generation, conversational AI, and language understanding. I observed firsthand how foundational models became the core of diverse applications—from chatbots to legal and medical advisory systems.”
LLM Limitations: Emotion Detection, Language Bias
Emotion detection from speech is notoriously difficult. Highlighting where LLMs still struggle, Sriram said that these models now aim to provide not just answers but confidence-based responses, which are crucial in high-stakes fields like healthcare, law, and finance. Yet challenges remain, particularly in emotion detection.
"Emotions are deeply personal and context-dependent, making it difficult for AI systems to interpret them correctly from isolated speech samples. Current models can detect broad emotional categories like 'positive' or 'negative', but fail to distinguish fine-grained states such as fear, frustration, or joy. At LEAP, we are working to build datasets and systems that can parse this complexity," he said.
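The gap between coarse and fine-grained labels can be illustrated with a toy classifier. The sketch below, which is not a LEAP system, trains a small scikit-learn model twice on placeholder (randomly generated) feature vectors: once for a two-way positive/negative task and once for a four-way fear/frustration/joy/neutral task. With random features the accuracies hover around chance, which is precisely the point: fine-grained emotion recognition needs well-labelled, representative speech data, not just more output classes.

```python
# A toy illustration (not a LEAP system) of coarse vs fine-grained emotion
# classification over acoustic feature vectors, trained here on randomly
# generated placeholder data to show the setup only.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder: 200 utterances, each summarised by a 16-dim acoustic feature vector
# (in practice these would be MFCC/pitch/energy statistics from real speech).
X = rng.normal(size=(200, 16))

# Coarse task: positive vs negative valence (2 classes).
y_coarse = rng.integers(0, 2, size=200)

# Fine-grained task: fear / frustration / joy / neutral (4 classes) --
# subtler distinctions that demand far more labelled data in practice.
y_fine = rng.integers(0, 4, size=200)

for name, y in [("coarse (pos/neg)", y_coarse), ("fine-grained", y_fine)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", round(clf.score(X_te, y_te), 2))
```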
Sriram mentioned that one of the major limitations of today’s LLMs is their language bias. Most are trained predominantly on English or Western data, and their performance degrades significantly in regional Indian languages. This poses a critical challenge in a multilingual country like India. Bridging this gap requires culturally and linguistically representative datasets, and models that can understand Indian languages and dialects with the same fluency as English. Modern LLMs also suffer from hallucinations, and the LEAP lab is now working to address these reliability issues as well.
The Path Ahead: From Commercial Chatbots to Biomarkers
Discussing long-term goals and the potential for similar acoustic diagnostic tools beyond COVID-19—such as in mental health or chronic respiratory conditions—he noted that while commercial voice assistants are widely used, the true future of voice technology lies in its diagnostic and therapeutic capabilities.
"Speech contains rich biological data. From vocal strain to rhythm and tone, voice traits can reveal cardiovascular, respiratory, and neurological conditions. Mental health indicators like depression and anxiety are also detectable in speech patterns. Our long-term vision includes using voice recordings not just for interaction, but also as biomarkers for health and wellness. Voice-based interfaces could become essential tools, especially for populations where typing or reading is a barrier," he said.
“We are at a transformational point where machines are approaching human-level capabilities in understanding and responding to speech. But the journey ahead involves more than refining algorithms—it involves deep integration of cognitive science, linguistics, health science, and human-centric design," he said. "At LEAP Lab, we’re not just building smarter machines; we’re aiming for empathetic, context-aware, and linguistically inclusive systems that truly understand the people they serve.”

