Fujitsu's Dippu Singh on AI for Voice Data Analysis

Dippu Kumar Singh from Fujitsu outlines an AI-powered "VoiceOps" framework for contact centers, detailing its architecture, benefits, and future development.

AI Engineer

Dippu Kumar Singh, Leader of Emerging Technologies at Fujitsu North America Inc., presented a detailed look at "VoiceOps-fying Low-Latency Intelligence Extraction from Messy Audio Streams." The discussion focused on how artificial intelligence can be applied to the complex data found in customer service calls to extract actionable business intelligence with minimal human intervention.

Understanding Contact Center Challenges

Singh highlighted the operational challenges contact centers currently face: recruitment and training, quality and productivity, operational efficiency, and staff retention. Data indicates that over 50% of contact centers identify hiring and productivity as critical barriers to success.

The core mission is to shift focus from merely "handling calls" to "analyzing VOC (Voice of Customer)" for business growth. This requires transforming raw conversational audio into structured business intelligence.

The Proposed AI Solution Architecture

Fujitsu's solution is structured around a four-component pipeline:

  • Voice Capture: This initial step involves capturing raw, high-fidelity audio data. It includes audio intake with normalization and noise filtering to standardize audio levels and remove background chatter. A crucial security layer ensures secure streaming and early-stage sensitive data protection, including buffer management and PII masking.
  • Speech-To-Text (STT) Engine: This component converts speech into high-accuracy digital text. It comprises acoustic modeling, which interprets raw sound as linguistic units across dialects; language logic, which applies language-specific dictionaries for accuracy; and post-processing for tasks such as inverse text normalization and auto-punctuation.
  • Generative AI Core: This is the LLM-driven reasoning engine for intent, sentiment, and summary extraction. An orchestration layer uses prompt engines and few-shot libraries to guide the LLM. The reasoning step performs intent extraction and sentiment scoring to determine the "why" behind the call and the customer's emotion. A trust layer runs hallucination checks so that the summary stays factually grounded in the transcript, and also handles token optimization.
  • Customer Data Sync: This final component translates AI insights into enterprise system actions. It utilizes an API gateway with schema mappers and REST bridges to map AI fields to CRM database fields. Verification steps include field validation and agent confirmation to allow operators to review and approve auto-summaries. Business intelligence is then generated through VOC aggregation and FAQ generation, feeding categorized data into executive dashboards and FAQs.
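The Customer Data Sync step can be pictured with a small sketch. It is not Fujitsu's implementation: the field names in `FIELD_MAP`, the allowed sentiment labels, and the `agent_approved` flag are all hypothetical, chosen only to show schema mapping gated by field validation and agent confirmation.

```python
# Hypothetical mapping from AI output keys to CRM column names (assumed, not from the talk).
FIELD_MAP = {
    "intent": "case_reason",
    "sentiment": "customer_sentiment",
    "summary": "call_summary",
}

# Assumed label set for the sentiment field.
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def validate(insight: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the insight is usable."""
    errors = []
    for key in FIELD_MAP:
        if not str(insight.get(key, "")).strip():
            errors.append(f"missing field: {key}")
    sentiment = insight.get("sentiment")
    if sentiment and sentiment not in ALLOWED_SENTIMENTS:
        errors.append(f"unknown sentiment: {sentiment}")
    return errors


def to_crm_record(insight: dict, agent_approved: bool) -> dict:
    """Map AI fields to CRM fields, but only after validation and agent sign-off."""
    errors = validate(insight)
    if errors:
        raise ValueError("; ".join(errors))
    if not agent_approved:
        raise PermissionError("agent has not confirmed the auto-summary")
    return {crm: insight[ai] for ai, crm in FIELD_MAP.items()}
```

Keeping validation and approval ahead of the mapping means nothing reaches the CRM unless both checks pass, which mirrors the review-and-approve step described above.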

The system's overarching goal is to transform raw conversational audio into structured business intelligence with minimal human intervention.
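As a rough illustration of how the stages could chain together, here is a minimal sketch. Several things in it are assumptions rather than details from the talk: the STT engine and the LLM are passed in as plain callables (`transcribe`, `analyze`), PII masking is applied to the transcript text with two narrow regular expressions (the article places masking at the capture stage), and the "trust layer" is reduced to one simple check that flags any number in the summary that never appears in the transcript.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Deliberately narrow, illustrative patterns; real PII detection needs far more coverage.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


@dataclass
class CallInsight:
    transcript: str
    intent: str
    sentiment: str
    summary: str
    needs_review: bool  # True when the grounding check found something suspicious


def mask_pii(text: str) -> str:
    """Replace phone numbers and email addresses with placeholders."""
    return EMAIL.sub("[EMAIL]", PHONE.sub("[PHONE]", text))


def ungrounded_numbers(summary: str, transcript: str) -> set[str]:
    """Numbers mentioned in the summary that do not occur in the transcript."""
    return set(NUMBER.findall(summary)) - set(NUMBER.findall(transcript))


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],  # stand-in for the STT engine
    analyze: Callable[[str], dict],      # stand-in for the LLM call
) -> CallInsight:
    transcript = mask_pii(transcribe(audio))
    result = analyze(transcript)
    suspicious = ungrounded_numbers(result["summary"], transcript)
    return CallInsight(
        transcript=transcript,
        intent=result["intent"],
        sentiment=result["sentiment"],
        summary=result["summary"],
        needs_review=bool(suspicious),
    )
```

Passing the engines in as callables keeps the sketch runnable with stubs, so the flow can be tested without committing to any particular STT or LLM vendor.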

Call Time vs. After-Call Work (ACW)

A significant operational metric is the balance between call time and after-call work (ACW). Currently, the average call time is 6.6 minutes, with an average post-processing time of 6.3 minutes, resulting in a nearly 1:1 ratio. Summarization quality can vary by operator skill, leading to inconsistency.

Targeting ACW with AI can reduce post-processing time by an estimated 50%, from 6.3 minutes to roughly 3.1 minutes (half of 6.3 is 3.15). The expectation is widely shared: 79.2% of centers expect AI to deliver gains in this area.
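The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted above; the function name and the derived handle-time totals are illustrative additions.

```python
def acw_projection(call_min: float, acw_min: float, reduction: float) -> dict:
    """Project after-call work (ACW) and total handle time for a given reduction."""
    new_acw = acw_min * (1 - reduction)
    return {
        "acw_to_call_ratio": acw_min / call_min,
        "new_acw_min": new_acw,
        "handle_time_before": call_min + acw_min,
        "handle_time_after": call_min + new_acw,
    }


p = acw_projection(call_min=6.6, acw_min=6.3, reduction=0.50)
for name, value in p.items():
    print(f"{name}: {value:.2f}")
# ratio is about 0.95 (nearly 1:1); ACW falls to 3.15 min;
# total handle time per call goes from 12.90 to 9.75 min
```

On these numbers, halving ACW shortens total handle time per call by about 24%, since the call itself still takes 6.6 minutes.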

Key Outcomes of AI Implementation

The implementation of this AI-powered system yields several key outcomes:

  • ACW Time Reduction: A 50% reduction in ACW time, from 6.3 minutes to about 3.1 minutes per call.
  • Data Entry Quality: Improvement from variable/subjective manual operations to standardized, highly uniform AI-powered output.
  • Inquiry Categorization: A shift from skill-dependent manual categorization to logic-based, consistent VOC tagging.
  • Staff Turnover: A lighter workload for agents, who currently experience high, stress-linked turnover, which should help stabilize operations.

Key Constraints and Future Roadmap

Several key constraints were identified:

  • STT Accuracy: Summarization quality is directly tied to the accuracy of the initial Speech-To-Text conversion. Engines with >90% accuracy are recommended.
  • Initial Setup Cost: The initial consumption of API tokens and associated costs can be high during early adoption phases.
  • Security & Compliance: Handling Personally Identifiable Information (PII) requires robust masking and secure cloud environments, posing a complexity challenge.
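The article does not say how STT accuracy is measured. One common convention, assumed here, is word accuracy, defined as 1 minus the word error rate (WER), where WER is the word-level edit distance between a human reference transcript and the engine output, divided by the reference length. The sketch below checks a transcript pair against the >90% recommendation; the function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def meets_accuracy_target(reference: str, hypothesis: str, min_accuracy: float = 0.90) -> bool:
    """True when word accuracy (1 - WER) exceeds the target."""
    return 1.0 - word_error_rate(reference, hypothesis) > min_accuracy
```

In practice this would be averaged over a labeled sample of calls, ideally split by dialect, since a single aggregate figure can hide weak spots.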

The roadmap ahead involves several phases:

  • Phase 1: Explainable AI (XAI) to provide operators with post-call feedback to improve soft skills and accuracy.
  • Phase 2: Optimal Predictive Staffing, anticipating call volume spikes using advanced time-series analytics.
  • Phase 3: Combating customer harassment, using sentiment analysis to protect operator mental health.
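The talk does not specify which time-series methods Phase 2 would use. As a point of reference only, a seasonal-naive baseline is the kind of simple forecast that more advanced models are usually compared against: it predicts the next interval's call volume as the average of the same slot in previous cycles. The function names, the `season` length, and the 1.5x spike factor below are assumptions made for illustration.

```python
from statistics import mean


def seasonal_forecast(history: list[float], season: int, periods: int = 3) -> float:
    """Forecast the next value as the mean of the same slot in the last `periods` seasons."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    # The next point has index len(history); earlier same-slot points are `season` apart.
    same_slot = history[len(history) - season::-season][:periods]
    return mean(same_slot)


def is_spike(forecast: float, history: list[float], factor: float = 1.5) -> bool:
    """Flag a forecast that exceeds the historical average by the given factor."""
    return forecast > factor * mean(history)


# Calls per interval with a cycle of 4 intervals (made-up numbers).
calls = [10, 20, 30, 40, 12, 22, 32, 42, 14, 24, 34, 44]
next_volume = seasonal_forecast(calls, season=4)
print(next_volume, is_spike(next_volume, calls))
```

A flagged interval could then feed a staffing decision, though turning a volume forecast into headcount needs a queueing or workload model that is outside this sketch.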

Fujitsu is actively working to refine these components, particularly in optimizing the STT accuracy with various dialects and mitigating the cost and complexity associated with security and compliance measures.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.