Mohd Mujtaba Akhtar

Research Associate @ IIIT-Delhi
mmakhtar.research@gmail.com

Hello! I am Mohd Mujtaba Akhtar. My core research interests revolve around speech and audio processing, with a particular focus on audio deepfake detection, emotional speech understanding, and the application of multimodal and foundation models in both behavioral and forensic domains.

I am currently working on audio deepfake detection using transfer learning from foundation models as part of my thesis work.

I am actively seeking research collaboration opportunities in areas such as speech/audio deepfake detection, audio-visual deepfake detection, speech emotion recognition, affective computing, and speech-driven healthcare applications.

🤝 I'm always interested in collaborating on exciting projects! If you have an idea or opportunity in mind, feel free to reach out. You can contact me at mmakhtar.research@gmail.com.

I am actively seeking PhD opportunities for Fall 2026 and welcome the chance to work with faculty and research groups whose interests align with my expertise.

I am passionate about pushing the boundaries of what is possible with speech and audio technology.

Research Interests

  • Geometric Deep Learning
  • Computational Speech Analysis
  • Deepfake Forensics
  • Emotional Speech Understanding
  • Healthcare Applications (Speech)
  • Multimodal/Foundation Models

Selected Publications

  • SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual
    Alignment with Cascaded Cross-Transformer

    Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama, Sarthak Jain, Priyabrata Mallick, Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

    INTERSPEECH 2025 PDF

    As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content like violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in minimal frames to evade detection. While prior research has focused on visual cues and advanced such fine-grained detection, audio features remain underexplored. In this study, we combine audio cues with visual cues for fine-grained child harmful content detection and introduce SNIFR, a novel framework for effective alignment. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state-of-the-art.
    Index Terms: Child Unsafe Content, Multimodal Learning, Cross-Transformer
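
A minimal PyTorch sketch of the intra-modality encoding followed by cascaded cross-attention described above; the feature dimension, layer counts, two-stage ordering, and classification head are illustrative assumptions, not the released SNIFR code.

```python
# Sketch: per-modality transformer encoders (intra-modality interaction), then
# cascaded cross-attention (inter-modality alignment). Shapes and head are assumed.
import torch
import torch.nn as nn

class CascadedCrossFusion(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # Intra-modality interaction: one transformer encoder per modality.
        self.audio_enc = nn.TransformerEncoder(make_layer(), num_layers=1)
        self.video_enc = nn.TransformerEncoder(make_layer(), num_layers=1)
        # Cascaded cross-attention: audio attends to video, then video attends
        # to the refined audio stream.
        self.cross_av = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_va = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio, video):                      # (B, Ta, dim), (B, Tv, dim)
        a = self.audio_enc(audio)
        v = self.video_enc(video)
        a2, _ = self.cross_av(query=a, key=v, value=v)    # audio aligned to video
        v2, _ = self.cross_va(query=v, key=a2, value=a2)  # video aligned to refined audio
        pooled = torch.cat([a2.mean(dim=1), v2.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

# Usage with dummy frame-level features:
logits = CascadedCrossFusion()(torch.randn(2, 50, 256), torch.randn(2, 16, 256))
```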

  • Are Mamba-based Audio Foundation Models the Best Fit for Non-Verbal Emotion Recognition?

    Mohd Mujtaba Akhtar, Orchid Chetia Phukan, Girish, Swarup Ranjan Behera, Ananda Chandra Nayak, Sanjib Kumar Nayak, Arun Balaji Buduru, Rajesh Sharma

    EUSIPCO 2025 PDF

    In this work, we focus on emotion recognition from non-verbal vocal sounds (NVER). We investigate Mamba-based audio foundation models (MAFMs) for the first time for NVER and hypothesize that MAFMs will outperform attention-based audio foundation models (AAFMs) for NVER by leveraging their state-space modeling to capture intrinsic emotional structures more effectively. Unlike AAFMs, which may amplify irrelevant patterns due to their attention mechanisms, MAFMs will extract more stable and context-aware representations, enabling better differentiation of subtle non-verbal emotional cues. Our experiments with state-of-the-art (SOTA) AAFMs and MAFMs validate our hypothesis. Further, motivated by related research in speech emotion recognition and synthetic speech detection, where fusion of foundation models (FMs) has shown improved performance, we also explore fusion of FMs for NVER. To this end, we propose RENO, which uses Rényi divergence as a novel loss function for effective alignment of the FMs and employs self-attention for better intra-representation interaction of the FMs. With RENO, through the heterogeneous fusion of MAFMs and AAFMs, we achieve the best performance in comparison to individual FMs and their fusion, setting a new SOTA over previous work.
    Index Terms: Non-Verbal Emotion Recognition, Mamba-based Audio Foundation Models, Attention-based Audio Foundation Models
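
A hedged sketch of a Rényi-divergence alignment term between pooled representations of two foundation models, as used conceptually in the fusion described above; the order alpha, softmax normalization, and symmetrization are assumptions for illustration, not the RENO implementation.

```python
# Illustrative Rényi-divergence alignment loss between a Mamba-based and an
# attention-based foundation-model embedding (alpha and normalization assumed).
import torch
import torch.nn.functional as F

def renyi_divergence(p, q, alpha=2.0, eps=1e-8):
    """D_alpha(p || q) = 1/(alpha - 1) * log sum_i p_i^alpha * q_i^(1 - alpha)."""
    p, q = p.clamp_min(eps), q.clamp_min(eps)
    return torch.log((p ** alpha * q ** (1.0 - alpha)).sum(dim=-1)) / (alpha - 1.0)

def alignment_loss(feat_mafm, feat_aafm, alpha=2.0):
    # Turn each embedding into a distribution before comparing.
    p = F.softmax(feat_mafm, dim=-1)
    q = F.softmax(feat_aafm, dim=-1)
    # Symmetrize so neither foundation model acts as the fixed reference.
    return (renyi_divergence(p, q, alpha) + renyi_divergence(q, p, alpha)).mean()

# Example: align 256-dimensional embeddings from the two foundation models.
loss = alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```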

  • Strong Alone, Stronger Together: Synergizing Modality-Binding Foundation Models with Optimal Transport
    for Non-Verbal Emotion Recognition

    Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Sishir Kalita, Arun Balaji Buduru, Rajesh Sharma, S. R. Mahadeva Prasanna

    ICASSP 2025 PDF

    In this study, we investigate multimodal foundation models (MFMs) for emotion recognition from non-verbal sounds. We hypothesize that MFMs, with their joint pre-training across multiple modalities, will be more effective in non-verbal sounds emotion recognition (NVER) by better interpreting and differentiating subtle emotional cues that may be ambiguous in audio-only foundation models (AFMs). To validate our hypothesis, we extract representations from state-of-the-art (SOTA) MFMs and AFMs and evaluate them on benchmark NVER datasets. We also investigate the potential of combining selected foundation model (FM) representations to further enhance NVER, inspired by research in speech recognition and audio deepfake detection. To achieve this, we propose a framework called MATA (Intra-Modality Alignment through Transport Attention). Through MATA coupled with the combination of the MFMs LanguageBind and ImageBind, we report the best performance, with accuracies of 76.47%, 77.40%, and 75.12% and F1-scores of 70.35%, 76.19%, and 74.63% on the ASVP-ESD, JNV, and VIVAE datasets respectively, outperforming individual FMs and baseline fusion techniques and setting SOTA on these benchmark datasets.
    Index Terms: Non-Verbal Emotion Recognition, Multimodal Foundation Models, LanguageBind, ImageBind
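
A minimal sketch of the general "transport attention" idea referenced above, where an optimal-transport plan over pairwise token costs acts as the attention map aligning one foundation model's representations to another's; the Euclidean cost, Sinkhorn iteration count, and temperature are assumptions, not the MATA implementation.

```python
# Illustrative Sinkhorn-based transport attention between two token sequences.
import torch

def sinkhorn(cost, n_iters=20, temperature=0.1):
    """Approximately doubly-stochastic transport plan from a cost matrix (log domain)."""
    log_plan = -cost / temperature
    for _ in range(n_iters):
        log_plan = log_plan - log_plan.logsumexp(dim=-2, keepdim=True)  # columns sum to 1
        log_plan = log_plan - log_plan.logsumexp(dim=-1, keepdim=True)  # rows sum to 1
    return log_plan.exp()

def transport_attention(x, y):
    """Re-express y's tokens in x's token order via the transport plan."""
    cost = torch.cdist(x, y, p=2)    # (B, N, M) pairwise Euclidean distances
    plan = sinkhorn(cost)            # soft matching between the two token sets
    return plan @ y                  # each x token gets a convex mix of y tokens

# Example: align token-level features from two multimodal foundation models.
aligned = transport_attention(torch.randn(4, 10, 768), torch.randn(4, 10, 768))
```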

IIIT-Delhi
Phase-III, New Delhi, 110020, India