Mohd Mujtaba Akhtar
Research Associate @ IIIT-Delhi | mmakhtar.research@gmail.com
Hello! I am Mohd Mujtaba Akhtar. My core research interests revolve around speech and audio processing, with a particular focus on audio deepfake detection, emotional speech understanding, and the application of multimodal and foundation models in both behavioral and forensic domains.
I am currently working on audio deepfake detection using transfer learning from foundation models as part of my thesis work.
I am actively seeking research collaboration opportunities in areas such as speech/audio deepfake detection, audio-visual deepfake detection, speech emotion recognition, affective computing, and speech-driven healthcare applications.
🤝 I'm always interested in collaborating on exciting projects! If you have an idea or opportunity in mind, feel free to reach out. You can contact me at mmakhtar.research@gmail.com.
I am actively seeking PhD positions for Fall 2026 and welcome contact from faculty and research groups whose work aligns with my expertise.
I am passionate about pushing the boundaries of what is possible with speech and audio technology.
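To give a flavour of the transfer-learning setup behind my thesis work, the sketch below extracts frozen Wav2Vec 2.0 embeddings with HuggingFace Transformers and trains a small classifier to separate bona fide from spoofed audio. It is a minimal illustration only: the model name, file names, and labels are placeholders, not my actual experimental configuration.

```python
# Minimal sketch: transfer learning from a frozen speech foundation model
# for audio deepfake detection. Model name, file paths, and labels are
# placeholders, not the exact setup used in my thesis.
import torch
import torch.nn as nn
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

device = "cuda" if torch.cuda.is_available() else "cpu"
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").to(device).eval()

def embed(wav_path: str) -> torch.Tensor:
    """Mean-pool the frozen foundation model's last hidden states into one vector."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)  # mono, 16 kHz
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = backbone(inputs.input_values.to(device)).last_hidden_state  # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0).cpu()

# Small trainable head on top of the frozen representations.
head = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Placeholder (path, label) pairs: 0 = bona fide, 1 = spoofed.
train_files = [("bonafide_0001.wav", 0), ("spoof_0001.wav", 1)]
for path, label in train_files:
    logits = head(embed(path).unsqueeze(0))
    loss = loss_fn(logits, torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```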
Research Interests
- Geometric Deep Learning
- Computational Speech Analysis
- Deepfake Forensics
- Emotional Speech Understanding
- Healthcare Applications (Speech)
- Multimodal/Foundation Models
💡 News
- Oct 2025 2 papers accepted at IJCNLP 2025 as first author.
- Aug 2025 4 papers accepted at APSIPA ASC 2025 as first author.
- June 2025 7 papers accepted at INTERSPEECH 2025 (6 as first co-author).
- June 2025 2 papers accepted at EUSIPCO 2025 as first author.
- June 2025 1 paper accepted at ICASSP 2025 as second author.
Selected Publications
-
INTERSPEECH 2025
SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer
Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama, Sarthak Jain, Priyabrata Mallick, Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
INTERSPEECH 2025 PDF
As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content like violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in minimal frames to evade detection. While prior research has focused on visual cues and advanced such fine-grained detection, audio features remain underexplored. In this study, we combine audio cues with visual ones for fine-grained child harmful content detection and introduce SNIFR, a novel framework for effective alignment. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state-of-the-art. Index Terms: Child Unsafe Content, Multimodal Learning, Cross-Transformer
-
EUSIPCO 2025
Are Mamba-based Audio Foundation Models the Best Fit for Non-Verbal Emotion Recognition?
Mohd Mujtaba Akhtar, Orchid Chetia Phukan, Girish, Swarup Ranjan Behera, Ananda Chandra Nayak, Sanjib Kumar Nayak, Arun Balaji Buduru, Rajesh Sharma
EUSIPCO 2025 PDF
In this work, we focus on non-verbal vocal sounds emotion recognition (NVER). We investigate Mamba-based audio foundation models (MAFMs) for the first time for NVER and hypothesize that MAFMs will outperform attention-based audio foundation models (AAFMs) for NVER by leveraging their state-space modeling to capture intrinsic emotional structures more effectively. Unlike AAFMs, which may amplify irrelevant patterns due to their attention mechanisms, MAFMs will extract more stable and context-aware representations, enabling better differentiation of subtle non-verbal emotional cues. Our experiments with state-of-the-art (SOTA) AAFMs and MAFMs validate our hypothesis. Further, motivated by related research such as speech emotion recognition and synthetic speech detection, where fusion of foundation models (FMs) has shown improved performance, we also explore fusion of FMs for NVER. To this end, we propose RENO, which uses Rényi divergence as a novel loss function for effective alignment of the FMs. It also makes use of self-attention for better intra-representation interaction of the FMs. With RENO, through the heterogeneous fusion of MAFMs and AAFMs, we achieve the topmost performance in comparison to individual FMs and their fusion, also setting SOTA in comparison to previous SOTA work. Index Terms: Non-Verbal Emotion Recognition, Mamba-based Audio Foundation Models, Attention-based Audio Foundation Models
-
ICASSP 2025
Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition
Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S. R. Mahadeva Prasanna
ICASSP 2025 PDF
In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective for non-verbal sounds emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous to audio-only foundation models (AFMs). To validate our hypothesis, we extract representations from state-of-the-art (SOTA) MFMs and AFMs and evaluate them on benchmark NVER datasets. We also investigate the potential of combining selected foundation model (FM) representations to further enhance NVER, inspired by research in speech recognition and audio deepfake detection. To achieve this, we propose a framework called MATA (Intra-Modality Alignment through Transport Attention). Through MATA coupled with the combination of the MFMs LanguageBind and ImageBind, we report the topmost performance, with accuracies of 76.47%, 77.40%, 75.12% and F1-scores of 70.35%, 76.19%, 74.63% on the ASVP-ESD, JNV, and VIVAE datasets, against individual FMs and baseline fusion techniques, and report SOTA on the benchmark datasets. Index Terms: Non-Verbal Emotion Recognition, Multimodal Foundation Models, LanguageBind, ImageBind
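The RENO framework in the EUSIPCO entry above aligns foundation-model representations with a Rényi-divergence loss. As a rough illustration of that one ingredient, the snippet below applies the textbook Rényi divergence to softmax-normalized embeddings; it is a toy sketch with placeholder shapes and names, not the paper's implementation.

```python
# Toy illustration of a Rényi-divergence alignment term between two
# foundation-model embedding branches. Textbook definition only; not RENO's code.
import torch
import torch.nn.functional as F

def renyi_divergence(p_logits: torch.Tensor, q_logits: torch.Tensor,
                     alpha: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """D_alpha(P || Q) = 1/(alpha - 1) * log sum_i p_i^alpha * q_i^(1 - alpha)."""
    assert alpha > 0 and alpha != 1.0, "alpha must be positive and != 1"
    p = F.softmax(p_logits, dim=-1).clamp_min(eps)  # treat each embedding as a distribution
    q = F.softmax(q_logits, dim=-1).clamp_min(eps)
    per_item = torch.log((p.pow(alpha) * q.pow(1.0 - alpha)).sum(dim=-1)) / (alpha - 1.0)
    return per_item.mean()

# Hypothetical projected embeddings from two foundation models (placeholder shapes).
emb_mamba = torch.randn(4, 256)   # e.g., Mamba-based FM branch
emb_attn = torch.randn(4, 256)    # e.g., attention-based FM branch
alignment_loss = renyi_divergence(emb_mamba, emb_attn, alpha=0.5)
print(float(alignment_loss))
```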
Conference Publications
-
Curved Worlds Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces
Farhan Sheth*, Girish*, Mohd Mujtaba Akhtar*, Muskaan Singh
IJCNLP 2025 PDF
In this work, we address the challenge of generalizable audio deepfake detection (ADD) across diverse speech synthesis paradigms, including conventional text-to-speech (TTS) systems and modern diffusion or flow-matching (FM) based generators. Prior work has mostly targeted individual synthesis families and often fails to generalize across paradigms due to overfitting to generation-specific artifacts. We hypothesize that synthetic speech, irrespective of its generative origin, leaves behind shared structural distortions in the embedding space that can be aligned through geometry-aware modeling. To this end, we propose RHYME, a unified detection framework that fuses utterance-level embeddings from diverse pretrained speech encoders using non-Euclidean projections. RHYME maps representations into hyperbolic and spherical manifolds, where hyperbolic geometry excels at modeling hierarchical generator families, and spherical projections capture angular, energy-invariant cues such as periodic vocoder artifacts. The fused representation is obtained via Riemannian barycentric averaging, enabling synthesis-invariant alignment. RHYME outperforms individual PTMs and homogeneous fusion baselines, achieving top performance and setting a new state-of-the-art in cross-paradigm ADD.
-
Towards Attribution of Generators and Emotional Manipulation in Cross-Lingual Synthetic Speech using Geometric Learning
Girish*, Mohd Mujtaba Akhtar*, Farhan Sheth, Muskaan Singh
IJCNLP 2025 PDF
In this work, we address the problem of fine-grained traceback of emotional and manipulation characteristics from synthetically manipulated speech. We hypothesize that combining semantic-prosodic cues captured by Speech Foundation Models (SFMs) with fine-grained spectral dynamics from auditory representations can enable more precise tracing of both emotion and manipulation source. To validate this hypothesis, we introduce MiCuNet, a novel multitask framework for fine-grained tracing of emotional and manipulation attributes in synthetically generated speech. Our approach integrates SFM embeddings with spectrogram-based auditory features through a mixed-curvature projection mechanism that spans hyperbolic, Euclidean, and spherical spaces, guided by a learnable temporal gating mechanism. Our proposed method adopts a multitask learning setup to simultaneously predict original emotions, manipulated emotions, and manipulation sources on the EmoFake dataset (EFD) across both English and Chinese subsets. MiCuNet yields consistent improvements, surpassing conventional fusion strategies. To the best of our knowledge, this work presents the first study to explore a curvature-adaptive framework specifically tailored for multitask tracing in synthetic speech.
-
PARROT: Synergizing Mamba and Attention-based SSL Pre-Trained Models via Parallel Branch Hadamard Optimal Transport for Speech Emotion Recognition
Orchid Chetia Phukan*, Mohd Mujtaba Akhtar*, Girish*, Swarup Ranjan Behera, Sai Kiran Patibandla, Arun Balaji Buduru, Rajesh Sharma
INTERSPEECH 2025 PDF
The emergence of Mamba as an alternative to attention-based architectures has led to the development of Mamba-based self-supervised learning (SSL) pre-trained models (PTMs) for speech and audio processing. Recent studies suggest that these models achieve comparable or superior performance to state-of-the-art (SOTA) attention-based PTMs for speech emotion recognition (SER). Motivated by prior work demonstrating the benefits of PTM fusion across different speech processing tasks, we hypothesize that leveraging the complementary strengths of Mamba-based and attention-based PTMs will enhance SER performance beyond the fusion of homogeneous attention-based PTMs. To this end, we introduce PARROT, a novel framework that integrates parallel branch fusion with Optimal Transport and the Hadamard Product. Our approach achieves SOTA results against individual PTMs, homogeneous PTM fusion, and baseline fusion techniques, thus highlighting the potential of heterogeneous PTM fusion for SER. Index Terms: Speech Emotion Recognition, Pre-Trained Models, Mamba-based Models, Attention-based Models
-
SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer
Orchid Chetia Phukan, Mohd Mujtaba Akhtar*, Girish*, Swarup Ranjan Behera, Abu Osama, Sarthak Jain, Priyabrata Mallick, Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
INTERSPEECH 2025 PDF
As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content like violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in minimal frames to evade detection. While prior research has focused on visual cues and advanced such fine-grained detection, audio features remain underexplored. In this study, we combine audio cues with visual ones for fine-grained child harmful content detection and introduce SNIFR, a novel framework for effective alignment. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state-of-the-art. Index Terms: Child Unsafe Content, Multimodal Learning, Cross-Transformer
-
HYFuse: Aligning Heterogeneous Speech Pre-Trained Representations in Hyperbolic Space for Speech Emotion Recognition
Orchid Chetia Phukan*, Girish*, Mohd Mujtaba Akhtar*, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
INTERSPEECH 2025 PDF
Compression-based representations (CBRs) from neural audio codecs such as EnCodec capture intricate acoustic features like pitch and timbre, while representation-learning-based representations (RLRs) from models pre-trained for speech representation learning, such as WavLM, encode high-level semantic and prosodic information. Previous research on Speech Emotion Recognition (SER) has explored both; however, the fusion of CBRs and RLRs hasn't been explored yet. In this study, we address this gap by investigating the fusion of RLRs and CBRs, hypothesizing that they will be more effective by providing complementary information. To this end, we propose HYFuse, a novel framework that fuses the representations by transforming them into hyperbolic space. With HYFuse, through the fusion of x-vector (RLR) and SoundStream (CBR), we achieve the top performance in comparison to individual representations as well as the homogeneous fusion of RLRs and CBRs, and report SOTA. Index Terms: Speech Emotion Recognition, Pre-Trained Models, Neural Audio Codec, Representations
-
Investigating the Reasonable Effectiveness of Speaker Pre-Trained Models and their Synergistic Power for SingMOS Prediction
Orchid Chetia Phukan*, Girish*, Mohd Mujtaba Akhtar*, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
INTERSPEECH 2025 PDF
In this study, we focus on Singing Voice Mean Opinion Score (SingMOS) prediction. Previous research has shown the performance benefit of using state-of-the-art (SOTA) pre-trained models (PTMs). However, it hasn't explored speaker recognition speech PTMs (SPTMs) such as x-vector and ECAPA, and we hypothesize that they will be the most effective for SingMOS prediction. We believe that their speaker recognition pre-training equips them to capture fine-grained vocal features (e.g., pitch, tone, intensity) from synthesized singing voices much better than other PTMs. Our experiments with SOTA PTMs, including SPTMs and music PTMs, validate this hypothesis. Additionally, we introduce a novel fusion framework, BATCH, that uses the Bhattacharyya distance for fusion of PTMs. Through BATCH with the fusion of speaker recognition SPTMs, we report the topmost performance in comparison to all individual PTMs and baseline fusion techniques, as well as setting SOTA. Index Terms: SingMOS, Pre-Trained Models, Speaker Recognition Pre-Trained Models
-
Towards Machine Unlearning for Paralinguistic Speech Processing
Orchid Chetia Phukan*, Girish*, Mohd Mujtaba Akhtar*, Shubham Singh, Swarup Ranjan Behera, Vandana Rajan, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma
INTERSPEECH 2025 PDF
In this work, we pioneer the study of Machine Unlearning (MU) for Paralinguistic Speech Processing (PSP). We focus on two key PSP tasks: Speech Emotion Recognition (SER) and Depression Detection (DD). To this end, we propose SISA++, a novel extension of the previous state-of-the-art (SOTA) MU method SISA, which merges models trained on different shards via weight-averaging. With such modifications, we show that SISA++ preserves performance better than SISA after unlearning on benchmark SER (CREMA-D) and DD (E-DAIC) datasets. Also, to guide future research and ease the adoption of MU for PSP, we present "cookbook recipes": actionable recommendations for selecting optimal feature representations and downstream architectures that can mitigate performance degradation after the unlearning process. Index Terms: Machine Unlearning, Paralinguistic Speech Processing, Speech Emotion Recognition, Depression Detection
-
Towards Fusion of Neural Audio Codec-based Representations with Spectral for Heart Murmur Classification via Bandit-based Cross-Attention Mechanism
Orchid Chetia Phukan*, Girish*, Mohd Mujtaba Akhtar*, Swarup Ranjan Behera, Priyabrata Mallick, Santanu Roy, Arun Balaji Buduru, Rajesh Sharma
INTERSPEECH 2025 PDF
In this study, we focus on heart murmur classification (HMC) and hypothesize that combining neural audio codec representations (NACRs), such as EnCodec, with spectral features (SFs), such as MFCC, will yield superior performance. We believe such fusion will trigger their complementary behavior: NACRs excel at capturing fine-grained acoustic patterns such as rhythm changes, while SFs focus on frequency-domain properties such as harmonic structure and spectral energy distribution, crucial for analyzing the complexity of heart sounds. To this end, we propose BAOMI, a novel framework built on a novel bandit-based cross-attention mechanism for effective fusion. Here, an agent assigns greater weight to the most important heads in the multi-head cross-attention mechanism and helps mitigate noise. With BAOMI, we report the topmost performance in comparison to individual NACRs, SFs, and baseline fusion techniques, setting a new state-of-the-art. Index Terms: Heart Murmur Classification, Neural Audio Codecs, Spectral Features
-
Towards Source Attribution of Singing Voice Deepfake with Multimodal Foundation Models
Orchid Chetia Phukan*, Girish*, Mohd Mujtaba Akhtar*, Swarup Ranjan Behera, Priyabrata Mallick, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
INTERSPEECH 2025 PDF
In this work, we introduce the task of singing voice deepfake source attribution (SVDSA). We hypothesize that multimodal foundation models (MMFMs) such as ImageBind and LanguageBind will be most effective for SVDSA, as their cross-modality pre-training better equips them to capture subtle source-specific characteristics, such as the unique timbre, pitch manipulation, or synthesis artifacts of each singing voice deepfake source. Our experiments with MMFMs, speech foundation models, and music foundation models verify the hypothesis that MMFMs are the most effective for SVDSA. Furthermore, inspired by related research, we also explore fusion of foundation models (FMs) for improved SVDSA. To this end, we propose a novel framework, COFFE, which employs the Chernoff Distance as a novel loss function for effective fusion of FMs. Through COFFE with the symphony of MMFMs, we attain the topmost performance in comparison to all individual FMs and baseline fusion methods. Index Terms: Source Attribution, Singing Voice Deepfake, Deepfake Detection
-
Are Mamba-based Audio Foundation Models the Best Fit for Non-Verbal Emotion Recognition?
Mohd Mujtaba Akhtar*, Orchid Chetia Phukan*, Girish*, Swarup Ranjan Behera, Ananda Chandra Nayak, Sanjib Kumar Nayak, Arun Balaji Buduru, Rajesh Sharma
EUSIPCO 2025 PDF
In this work, we focus on non-verbal vocal sounds emotion recognition (NVER). We investigate Mamba-based audio foundation models (MAFMs) for the first time for NVER and hypothesize that MAFMs will outperform attention-based audio foundation models (AAFMs) for NVER by leveraging their state-space modeling to capture intrinsic emotional structures more effectively. Unlike AAFMs, which may amplify irrelevant patterns due to their attention mechanisms, MAFMs will extract more stable and context-aware representations, enabling better differentiation of subtle non-verbal emotional cues. Our experiments with state-of-the-art (SOTA) AAFMs and MAFMs validate our hypothesis. Further, motivated by related research such as speech emotion recognition and synthetic speech detection, where fusion of foundation models (FMs) has shown improved performance, we also explore fusion of FMs for NVER. To this end, we propose RENO, which uses Rényi divergence as a novel loss function for effective alignment of the FMs. It also makes use of self-attention for better intra-representation interaction of the FMs. With RENO, through the heterogeneous fusion of MAFMs and AAFMs, we achieve the topmost performance in comparison to individual FMs and their fusion, also setting SOTA in comparison to previous SOTA work. Index Terms: Non-Verbal Emotion Recognition, Mamba-based Audio Foundation Models, Attention-based Audio Foundation Models
-
Source Tracing of Synthetic Speech Systems Through Paralinguistic Pre-Trained Representations
Girish*, Mohd Mujtaba Akhtar*, Orchid Chetia Phukan*, Drishti Singh*, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma
EUSIPCO 2025 PDF
In this work, we focus on source tracing of synthetic speech generation systems (STSGS). Each source embeds distinctive paralinguistic features, such as pitch, tone, rhythm, and intonation, into its synthesized speech, reflecting the underlying design of the generation model. While previous research has explored representations from speech pre-trained models (SPTMs), the use of representations from SPTMs pre-trained for paralinguistic speech processing, which excel at paralinguistic tasks like synthetic speech detection and speech emotion recognition, has not been investigated for STSGS. We hypothesize that representations from paralinguistic SPTMs will be more effective due to their ability to capture source-specific paralinguistic cues, attributable to their paralinguistic pre-training. Our comparative study of representations from various SOTA SPTMs, including paralinguistic, monolingual, multilingual, and speaker recognition models, validates this hypothesis. Furthermore, we explore fusion of representations and propose TRIO, a novel framework that fuses SPTMs using a gated mechanism for adaptive weighting, followed by canonical correlation loss for inter-representation alignment and self-attention for feature refinement. By fusing TRILLsson (paralinguistic SPTM) and x-vector (speaker recognition SPTM), TRIO outperforms individual SPTMs and baseline fusion methods, and sets a new SOTA for STSGS in comparison to previous works. Index Terms: Source Tracing, Paralinguistic Pre-Trained Models, Synthetic Speech Generators
-
Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport for Non-Verbal Emotion Recognition
Orchid Chetia Phukan, Mohd Mujtaba Akhtar*, Girish*, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S. R. Mahadeva Prasanna
ICASSP 2025 PDF
In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective for non-verbal sounds emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous to audio-only foundation models (AFMs). To validate our hypothesis, we extract representations from state-of-the-art (SOTA) MFMs and AFMs and evaluate them on benchmark NVER datasets. We also investigate the potential of combining selected foundation model (FM) representations to further enhance NVER, inspired by research in speech recognition and audio deepfake detection. To achieve this, we propose a framework called MATA (Intra-Modality Alignment through Transport Attention). Through MATA coupled with the combination of the MFMs LanguageBind and ImageBind, we report the topmost performance, with accuracies of 76.47%, 77.40%, 75.12% and F1-scores of 70.35%, 76.19%, 74.63% on the ASVP-ESD, JNV, and VIVAE datasets, against individual FMs and baseline fusion techniques, and report SOTA on the benchmark datasets. Index Terms: Non-Verbal Emotion Recognition, Multimodal Foundation Models, LanguageBind, ImageBind
-
Are Multimodal Foundation Models All That Is Needed for EmoFake Detection?
Mohd Mujtaba Akhtar*, Girish*, Orchid Chetia Phukan*, Swarup Ranjan Behera, Parabattina Bhagath, Pailla Balakrishna Reddy, Arun Balaji Buduru
APSIPA ASC 2025 PDF
In this work, we investigate multimodal foundation models (MFMs) for EmoFake detection (EFD) and hypothesize that they will outperform audio foundation models (AFMs). MFMs, due to their cross-modal pre-training, learn emotional patterns from multiple modalities, while AFMs rely only on audio. As such, MFMs can better recognize unnatural emotional shifts and inconsistencies in manipulated audio, making them more effective at distinguishing real from fake emotional expressions. To validate our hypothesis, we conduct a comprehensive comparative analysis of state-of-the-art (SOTA) MFMs (e.g., LanguageBind) alongside AFMs (e.g., WavLM). Our experiments confirm that MFMs surpass AFMs for EFD. Beyond individual foundation model (FM) performance, we explore FM fusion, motivated by findings in related research areas such as synthetic speech detection and speech emotion recognition. To this end, we propose SCAR, a novel framework for effective fusion. SCAR introduces a nested cross-attention mechanism, where representations from FMs interact at two sequential stages to refine information exchange. Additionally, a self-attention refinement module further enhances feature representations by reinforcing important cross-FM cues while suppressing noise. Through SCAR with the synergistic fusion of MFMs, we achieve SOTA performance, surpassing standalone FMs, conventional fusion approaches, and previous works on EFD.
-
Rethinking Cross-Corpus Speech Emotion Recognition Benchmarking: Are Paralinguistic Pre-Trained Representations Sufficient?
Orchid Chetia Phukan*, Mohd Mujtaba Akhtar*, Girish*, Swarup Ranjan Behera, Pailla Balakrishna Reddy, Ananda Chandra Nayak, Sanjib Kumar Nayak, Arun Balaji Buduru
APSIPA ASC 2025 PDF
Recent benchmarks evaluating pre-trained models (PTMs) for cross-corpus speech emotion recognition (SER) have overlooked PTMs pre-trained for paralinguistic speech processing (PSP), raising concerns about their reliability, since SER is inherently a paralinguistic task. We hypothesize that PSP-focused PTMs will perform better in cross-corpus SER settings. To test this, we analyze state-of-the-art PTM representations, including paralinguistic, monolingual, multilingual, and speaker recognition. Our results confirm that TRILLsson (a paralinguistic PTM) outperforms the others, reinforcing the need to consider PSP-focused PTMs in cross-corpus SER benchmarks. This study enhances benchmark trustworthiness and guides PTM evaluation for reliable cross-corpus SER.
-
Investigating Polyglot Speech Foundation Models for Learning Collective Emotion from Crowds
Orchid Chetia Phukan*, Girish*, Mohd Mujtaba Akhtar*, Panchal Nayak, Priyabrata Mallick, Swarup Ranjan Behera, Parabattina Bhagath, Pailla Balakrishna Reddy, Arun Balaji Buduru
APSIPA ASC 2025 PDF
This paper investigates polyglot (multilingual) speech foundation models (SFMs) for Crowd Emotion Recognition (CER). We hypothesize that polyglot SFMs, pre-trained on diverse languages, accents, and speech patterns, are particularly adept at navigating the noisy and complex acoustic environments characteristic of crowd settings, thereby offering a significant advantage for CER. To substantiate this, we perform a comprehensive analysis, comparing polyglot, monolingual, and speaker recognition SFMs through extensive experiments on a benchmark CER dataset across varying audio durations (1 sec, 500 ms, and 250 ms). The results consistently demonstrate the superiority of polyglot SFMs, outperforming their counterparts across all audio lengths and excelling even with extremely short-duration inputs. These findings pave the way for the adoption of SFMs in setting up new benchmarks for CER.
-
Beyond Speech and More: Investigating the Emergent Ability of Speech Pre-Trained Models for Classifying Physiological Time-Series Signals
Orchid Chetia Phukan*, Swarup Ranjan Behera*, Girish*, Mohd Mujtaba Akhtar*, Arun Balaji Buduru, Rajesh Sharma
APSIPA ASC 2025 PDF
Despite being trained exclusively on speech data, speech foundation models (SFMs) like Whisper have shown impressive performance in non-speech tasks such as audio classification. This is partly because speech shares some common traits with audio, enabling SFMs to transfer effectively. In this study, we push the boundaries by evaluating SFMs on a more challenging out-of-domain (OOD) task: classifying physiological time-series signals. We test two key hypotheses: first, that SFMs can generalize to physiological signals by capturing shared temporal patterns; second, that multilingual SFMs will outperform others due to their exposure to greater variability during pre-training, leading to more robust, generalized representations. Our experiments, conducted for stress recognition using ECG (Electrocardiogram), EMG (Electromyography), and EDA (Electrodermal Activity) signals, reveal that models trained on SFM-derived representations outperform those trained on raw physiological signals. Among all models, multilingual SFMs achieve the highest accuracy, supporting our hypothesis and demonstrating their OOD capabilities. This work positions SFMs as promising tools for new uncharted domains beyond speech.
-
NeuRO: An Application for Code-Switched Autism Detection in Children
Mohd Mujtaba Akhtar*, Girish*, Orchid Chetia Phukan*, Muskaan Singh*
INTERSPEECH 2024 Show & Tell PDF
Code-switching is a common communication phenomenon where individuals alternate between two or more languages or linguistic styles within a single conversation. Autism Spectrum Disorder (ASD) is a developmental disorder posing challenges in social interaction, communication, and repetitive behaviors. Detecting ASD in code-switched scenarios presents unique challenges. In this paper, we address this problem by building NeuRO, an application that aims to detect potential signs of autism in code-switched conversations, facilitating early intervention and support for individuals with ASD. Index Terms: speech recognition, human-computer interaction, computational paralinguistics
-
Speech-Based Alzheimer’s Disease Classification System with Noise-Resilient Features Optimization
Virender Kadyan, Puneet Bawa, Mohd Mujtaba Akhtar, Muskaan Singh
AICS 2023 PDF
Alzheimer's disease is a severe neurological disorder having a major influence on a substantial portion of the population. The prompt detection of this condition is crucial, and speech analysis may play a key role in facilitating efficient treatment and care. The main aim of this research has been to investigate the significance of timely identification of speech signal abnormalities associated with Alzheimer's disease in order to provide effective therapy interventions and improve disease management. The study used the Mel Frequency Cepstral Coefficients (MFCC) framework, a well-recognized technique for feature extraction known for its versatility across several domains. This research introduces an approach that uses both individuals diagnosed with dementia and control participants to detect two distinct types of cognitive impairment via the analysis of speech signals. The approach involves the extraction of acoustic properties from pre-processed speech data obtained from the Pitt Corpus of DementiaBank, using several feature sets that combine MFCC, prosodic, and statistical features. The study examines optimal feature selection in real and noise-enhanced speech environments using machine learning techniques. The integration of MFCC, statistical, and prosodic features has shown remarkable outcomes, exhibiting a superior accuracy rate of 98.3% and surpassing the performance of other feature combinations when using the Random Forest classifier. Index Terms: Alzheimer's Disease, MFCC Features, Prosodic Features, Statistical Features, Machine Learning, Classification
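For readers curious about the kind of pipeline the AICS 2023 paper above describes, here is a minimal sketch of MFCC-plus-statistics features fed to a Random Forest with scikit-learn. It is illustrative only: the file names, labels, and exact feature set are placeholders, not the paper's configuration.

```python
# Minimal sketch: MFCC + coarse prosody-style statistics -> Random Forest.
# File names, labels, and feature choices are placeholders, not the paper's setup.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def speech_features(path: str, sr: int = 16_000) -> np.ndarray:
    """MFCC means/stds plus a few pitch and energy statistics."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, frames)
    f0 = librosa.yin(y, fmin=60, fmax=300, sr=sr)        # rough pitch track
    rms = librosa.feature.rms(y=y).ravel()               # frame energies
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [f0.mean(), f0.std(), rms.mean(), rms.std()]])

# Placeholder recordings and labels (1 = dementia, 0 = control).
files = ["participant_001.wav", "participant_002.wav"]
labels = [1, 0]

X = np.stack([speech_features(f) for f in files])
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, labels)
print(clf.predict(X))
```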
Journal Publications
-
Optimizing Audio Encryption Efficiency: A Novel Framework Using Double DNA Operations and Chaotic Map-Based Techniques
Mohd Mujtaba Akhtar, Muskaan Singh, Virender Kadyan, Mohit Dua
Computers & Electrical Engineering 2025 PDF
Awards & Achievements
A list of my achievements throughout the years.
- Attended EUSIPCO 2025 (8–12 Sep, Palermo, Italy); engaged with leading researchers and the latest advances in signal processing. Sep 2025 CONFERENCE LINK
- Volunteer with ISCA-SAC: scripting, hosting discussions, and post-editing for the global speech community. 2024–Present SERVICE
- Our NEST lab ranked among Asia’s top research groups with seven papers at INTERSPEECH 2025. 2025 LAB
- Virtusa Engineering Excellence Scholarship — awarded in 2023 for outstanding all-round performance in academics and co-curricular activities; received ₹40,000 as the sole recipient in the college. DEC 2023 SCHOLARSHIP
- First Place — National Science Fair for designing an autonomous navigation system utilizing haptic feedback to assist visually impaired individuals in independent mobility. NOV 2022 WINNER
- Best Communicator Award (12th Grade). Jan 2020 SCHOOL
- Innovation in Extracurriculars — School Award (12th). Nov 2019 SCHOOL
- Best All-Rounder Student (12th, 2019–2020) with a sponsored domestic tour prize. Sep 2019 SCHOOL
- Top Scorer — National Mathematics Olympiad (11th Grade). Oct 2018 OLYMPIAD
- Most Punctual Student — Class 10 Recognition. Aug 2017 SCHOOL
- Sportsmanship Award — School Annual Sports Meet (9th). Dec 2016 SPORTS
News
All news in reverse chronological order.
- Oct 2025 IJCNLP–AACL 2025 — 2 papers accepted as a first author.
- Sept 2025 Travelled to Palermo, Italy, to attend the EUSIPCO 2025 conference.
- Aug 2025 4 papers accepted at APSIPA ASC 2025 as a first author.
- Jun 2025 7 papers accepted at INTERSPEECH 2025 (6 as first author).
- Jun 2025 2 papers accepted at EUSIPCO 2025 as a first author.
- Jun 2025 1 paper accepted at ICASSP 2025 as a second author.
- Jan 2025 Published an audio-encryption framework in Elsevier’s Computers & Electrical Engineering journal.
- Dec 2024 Reviewer for Neural Computing & Applications (journal), ICME 2025 (conference), and ICASSP 2026.
- Sep 2024 Volunteered with ISCA-SAC, scripting, hosting community discussions, and post-editing for the speech community.
- Sep 2024 Research Intern at Reliance Jio AICoE, developing MATA (Modality Alignment through Transport Attention) for non-verbal emotion recognition (NVER).
- Jun 2024 Computer Vision Intern at Suratec (Bangkok), developing real-time golf-swing phase detection with live UI feedback.
- May 2024 Began work as an ML Engineer at Artviewings (California), building multilingual AVQA datasets (8 languages) and developing MERA-series multimodal QA models.
- May 2024 Received my B.Tech (AI-ML) degree from UPES with First Class with Distinction.
- Feb 2024 Submitted multiple first-author and co-authored papers to INTERSPEECH, ICASSP, and EUSIPCO 2025.
- Jan 2024 Joined IIIT-Delhi as a Research Associate in the USG (Usable Security Group) Lab.
- Jun 2023 Completed a Software Developer Internship at IBM, building a 3D RL-based robot with vision, pick-and-place, and autonomous decision-making.