Case Studies
5 min read

How a GCC Telco Built an Arabic Speech-to-Text Dataset from Call Archives

Enterprise AI
Author
Khalid Ghiboub

Key Takeaways

  1. Arabic Voice AI models trained on public Modern Standard Arabic (MSA) data will underperform on real customer speech in Gulf markets; the dialect gap is real and measurable.
  2. Most telcos already hold the right raw material in their call archives. The missing piece is the infrastructure to process it at scale.
  3. A high-accuracy Arabic STT layer combined with a specialized Arabic annotation capability can convert that archive from a storage cost into a strategic AI training asset.
  4. The pipeline is repeatable, meaning the dataset grows as the business does, without starting from scratch each time.

A GCC telecom operator transformed 10,000 archived customer calls into a high-quality Arabic speech-to-text training dataset using Munsit STT and expert annotation. The resulting Gulf dialect dataset improved intent-classification accuracy and created a scalable foundation for future AI model development.

The Challenge

Telcos building AI for customer-facing applications (intent classification, sentiment analysis, churn prediction, and virtual agents) need training data. For Arabic-language models, that data is hard to find. Publicly available Arabic speech datasets are mostly Modern Standard Arabic sourced from news broadcasts. They don't reflect how customers actually speak when calling a telco in the Gulf.

A GCC telco's data science team had a clear use case: fine-tune an intent classification model covering billing, technical support, plan changes, complaints, and service requests. The training data had to reflect real customer language: Gulf dialect, code-switching, and the specific product vocabulary their customers used every day.

They already had hundreds of thousands of call recordings in their archive. In theory, this was exactly the training source they needed. In practice, the recordings were unusable for training without transcription, diarization, and labeling.

The Data Pipeline Problem

Transcribing and labeling call recordings at scale is expensive and slow. The team had received a quote from a general transcription provider for 10,000 calls. The cost was high, the turnaround was weeks, and the provider had no specific capability in Gulf Arabic, meaning every transcript would need native speaker review before it could be used for training.

What the team needed was a pipeline that could handle three things:

  • Transcribe Arabic calls at scale with Gulf dialect accuracy
  • Segment transcripts by speaker so customer utterances could be extracted separately from agent responses
  • Pass the resulting text to a labeling workflow where customer intent could be classified

The Approach

CNTXT AI addressed both stages of the pipeline, transcription and labeling, directly. Munsit STT processed 10,000 call recordings from the telco's archive through its API in batch mode. Each call was returned as a speaker-diarized transcript, with customer utterances automatically extracted and separated from agent turns.
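
As a rough illustration of the extraction step, the sketch below pulls customer turns out of one diarized transcript. The transcript schema (`call_id`, `segments`, `speaker`, `text`), the `customer` role label, and the function name are assumptions made for this example; they are not Munsit's actual response format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    call_id: str
    turn_index: int
    text: str


def extract_customer_utterances(transcript: dict, customer_label: str = "customer") -> list[Utterance]:
    """Return the non-empty customer turns from one diarized transcript.

    Assumed (hypothetical) schema:
    {"call_id": str, "segments": [{"speaker": str, "text": str}, ...]}
    """
    utterances = []
    for index, segment in enumerate(transcript["segments"]):
        text = segment["text"].strip()
        # Keep only turns attributed to the customer and skip empty segments
        if segment["speaker"] == customer_label and text:
            utterances.append(Utterance(transcript["call_id"], index, text))
    return utterances


# Hypothetical example call
example = {
    "call_id": "call-0001",
    "segments": [
        {"speaker": "agent", "text": "مرحبا، كيف أقدر أساعدك؟"},
        {"speaker": "customer", "text": "أبغى أغير الباقة حقتي"},
        {"speaker": "agent", "text": "أكيد، لحظة من فضلك"},
        {"speaker": "customer", "text": "  "},
    ],
}

customer_turns = extract_customer_utterances(example)
```

Keeping the original turn index alongside each utterance makes it possible to trace a labeled example back to its position in the call.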

Those customer utterances were then passed to CNTXT AI's Arabic data annotation team for intent labeling. Annotators classified each utterance against a 28-category taxonomy built jointly with the telco's data science team, covering intent categories specific to their service context, not generic call centre categories. Quality control included double annotation on 15% of utterances, with inter-annotator agreement measured and disagreements resolved by a senior reviewer.
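
The quality-control step can be sketched as follows. The case study does not say which agreement metric was used; Cohen's kappa is assumed here as a common choice for two annotators, and the utterance IDs, labels, and function names are all hypothetical.

```python
import random
from collections import Counter


def sample_for_double_annotation(utterance_ids, fraction=0.15, seed=0):
    """Pick a reproducible random subset to send to a second annotator."""
    rng = random.Random(seed)
    k = round(len(utterance_ids) * fraction)
    return sorted(rng.sample(list(utterance_ids), k))


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same category
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def disagreements(ids, labels_a, labels_b):
    """Items to escalate to a senior reviewer."""
    return [i for i, a, b in zip(ids, labels_a, labels_b) if a != b]


ids = [f"utt-{n:03d}" for n in range(100)]
double_annotated = sample_for_double_annotation(ids)

# Hypothetical labels from two annotators on four utterances
first = ["billing", "billing", "complaint", "plan_change"]
second = ["billing", "complaint", "complaint", "plan_change"]
kappa = cohen_kappa(first, second)
to_review = disagreements(["u1", "u2", "u3", "u4"], first, second)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters with a 28-category taxonomy where a few intents are likely to dominate.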

The final output was a labeled Arabic speech dataset specific to the telco's customer interaction domain and formatted for direct use in the team's fine-tuning workflow.
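
The case study does not specify the delivery format. As one plausible shape, the sketch below writes labeled utterances as JSON Lines and assigns each call to a train or validation split by hashing its ID. The field names, the 10% validation share, and the JSONL format itself are assumptions for illustration.

```python
import hashlib
import json


def assign_split(call_id: str, validation_percent: int = 10) -> str:
    """Deterministically assign a whole call to train or validation.

    Hashing the call ID keeps every utterance from one call in the same
    split, so the validation set has no calls seen in training.
    """
    digest = hashlib.sha256(call_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "validation" if bucket < validation_percent else "train"


def to_jsonl(records) -> str:
    """Serialize labeled utterances as one JSON object per line."""
    lines = []
    for record in records:
        row = {
            "call_id": record["call_id"],
            "text": record["text"],
            "label": record["label"],
            "split": assign_split(record["call_id"]),
        }
        # ensure_ascii=False keeps Arabic text readable in the output
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical labeled utterances from a single call
records = [
    {"call_id": "call-0001", "text": "أبغى أغير الباقة", "label": "plan_change"},
    {"call_id": "call-0001", "text": "الفاتورة غلط", "label": "billing"},
]
jsonl = to_jsonl(records)
```

Splitting by call rather than by utterance avoids leaking one customer's phrasing across the train and validation sets.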

Results

The data science team had a usable training dataset within six weeks of project start. The manual route would have taken months.

On the telco's internal evaluation set, the intent classification model fine-tuned on this dataset outperformed the version trained on public Arabic data. The improvement was most visible in two areas:

  • Gulf dialect inputs, where public MSA training data consistently fell short
  • Product-specific terminology: vocabulary that appeared frequently in the telco's calls but was absent from broadcast Arabic datasets
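
One way to surface slice-level differences like these is to report accuracy per evaluation slice rather than a single overall number. The sketch below assumes each evaluation row carries a slice tag; the tags and rows are invented for illustration and are not the telco's results.

```python
from collections import defaultdict


def accuracy_by_slice(rows):
    """Compute accuracy for each slice tag in an evaluation set.

    Each row is (slice_tag, gold_label, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_tag, gold, predicted in rows:
        total[slice_tag] += 1
        correct[slice_tag] += int(gold == predicted)
    return {tag: correct[tag] / total[tag] for tag in total}


# Hypothetical rows for illustration only; these are not the telco's results
rows = [
    ("gulf_dialect", "billing", "billing"),
    ("gulf_dialect", "complaint", "billing"),
    ("product_terms", "plan_change", "plan_change"),
    ("product_terms", "plan_change", "plan_change"),
]
scores = accuracy_by_slice(rows)
```

Running the same function on predictions from both models gives a side-by-side view of where the fine-tuned model gains the most.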

Beyond the initial model improvement, the team now has a repeatable pipeline for expanding training data as the product portfolio grows and new intent categories emerge. That same pipeline is currently being used to build a sentiment analysis training set from a separate call sample.

FAQ

Why is public Arabic speech data not enough for telco AI models?
How does Munsit STT help build Arabic speech-to-text datasets for call centres?
What role does Arabic data annotation play in intent classification?
Can the same pipeline be used for other AI use cases beyond intent classification?

Last update: June 24, 2026

Author
Sarra Turki
Khalid Ghiboub
