Last updated: June 24, 2026

How a GCC Telco Built an Arabic Speech-to-Text Dataset from Call Archives


Author

سارة تركي

Khalid Ghiboub



Key Takeaways

Arabic Voice AI models trained on public MSA data will underperform on real customer speech in Gulf markets; the dialect gap is real and measurable.

Most telcos already hold the right raw material in their call archives. The missing piece is the infrastructure to process it at scale.

A high-accuracy Arabic STT layer combined with a specialized Arabic annotation capability can convert that archive from a storage cost into a strategic AI training asset.

The pipeline is repeatable, meaning the dataset grows as the business does, without starting from scratch each time.

A GCC telecom operator transformed 10,000 archived customer calls into a high-quality Arabic speech-to-text training dataset using Munsit STT and expert annotation. The resulting Gulf dialect dataset improved intent-classification accuracy and created a scalable foundation for future AI model development.

The Challenge

Telcos building AI for customer-facing applications (intent classification, sentiment analysis, churn prediction, and virtual agents) need training data. For Arabic-language models, that data is hard to find. Publicly available Arabic speech datasets are mostly Modern Standard Arabic (MSA) sourced from news broadcasts. They don't reflect how customers actually speak when calling a telco in the Gulf.

A GCC telco's data science team had a clear use case: fine-tune an intent classification model covering billing, technical support, plan changes, complaints, and service requests. The training data had to reflect real customer language: Gulf dialect, code-switching, and the specific product vocabulary their customers used every day.

They already had hundreds of thousands of call recordings in their archive. In theory, this was exactly the training source they needed. In practice, the recordings were unusable for training without transcription, diarization, and labeling.


The Data Pipeline Problem


Transcribing and labeling call recordings at scale is expensive and slow. The team had received a quote from a general transcription provider for 10,000 calls. The cost was high, the turnaround was measured in weeks, and the provider had no specific capability in Gulf Arabic, so every transcript would need native-speaker review before it could be used for training.

What the team needed was a pipeline that could handle three things:

Transcribe Arabic calls at scale with Gulf dialect accuracy
Segment transcripts by speaker so customer utterances could be extracted separately from agent responses
Pass the resulting text to a labeling workflow where customer intent could be classified
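
The three requirements above can be sketched as a small Python pipeline. This is a minimal illustration, not the telco's or CNTXT AI's implementation: the `Segment` and `LabeledUtterance` types, the `run_pipeline` function, and the stub transcriber and labeler are hypothetical names introduced here, and the role labels `"customer"` and `"agent"` are assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Segment:
    """One diarized stretch of speech (hypothetical schema)."""
    speaker: str  # assumed role label: "customer" or "agent"
    text: str


@dataclass
class LabeledUtterance:
    call_id: str
    text: str
    intent: str


# Stage signatures: any STT service or labeling workflow can be plugged in.
Transcriber = Callable[[str], List[Segment]]  # call id -> diarized segments
IntentLabeler = Callable[[str], str]          # utterance text -> intent label


def run_pipeline(
    call_ids: Iterable[str],
    transcribe: Transcriber,
    label: IntentLabeler,
) -> List[LabeledUtterance]:
    dataset: List[LabeledUtterance] = []
    for call_id in call_ids:
        for segment in transcribe(call_id):  # requirements 1 and 2
            if segment.speaker != "customer":
                continue  # keep customer utterances only, drop agent turns
            intent = label(segment.text)     # requirement 3
            dataset.append(LabeledUtterance(call_id, segment.text, intent))
    return dataset


# Stand-in stages so the sketch runs end to end. Real stages would call
# an STT service and a human annotation workflow instead.
def stub_transcribe(call_id: str) -> List[Segment]:
    return [
        Segment("agent", "How can I help you today?"),
        Segment("customer", "I was charged twice on my bill."),
    ]


def stub_label(text: str) -> str:
    return "billing" if "bill" in text.lower() else "other"


rows = run_pipeline(["call-001", "call-002"], stub_transcribe, stub_label)
```

Keeping the stages behind plain function signatures means the transcription provider or the labeling step can be swapped without touching the loop that assembles the dataset.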



The Approach


CNTXT AI addressed both stages of the pipeline directly. Munsit STT processed a batch of 10,000 call recordings from the telco's archive via the API in batch mode. Each call was returned as a speaker-diarized transcript, with customer utterances automatically extracted and separated from agent turns.
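
Extracting customer utterances from a speaker-diarized transcript can be sketched as follows. The segment schema (`speaker`, `start`, `text`) and the `customer_turns` function are assumptions made for illustration; the source does not describe the actual response format of Munsit STT, and the mapping from diarized speakers to customer and agent roles is assumed to have been done already.

```python
from itertools import groupby
from typing import Dict, List


def customer_turns(segments: List[Dict], customer_label: str = "customer") -> List[str]:
    """Merge consecutive same-speaker segments into turns and keep customer turns."""
    ordered = sorted(segments, key=lambda s: s["start"])  # chronological order
    turns: List[str] = []
    # groupby only groups adjacent items, which is what defines a turn here.
    for speaker, group in groupby(ordered, key=lambda s: s["speaker"]):
        if speaker == customer_label:
            turns.append(" ".join(s["text"].strip() for s in group))
    return turns


# Hypothetical diarized transcript for one call.
example_call = [
    {"speaker": "agent", "start": 0.0, "text": "Hello, how can I help?"},
    {"speaker": "customer", "start": 2.1, "text": "My bill is wrong"},
    {"speaker": "customer", "start": 4.0, "text": "for this month."},
    {"speaker": "agent", "start": 6.5, "text": "Let me check."},
    {"speaker": "customer", "start": 8.0, "text": "Thanks."},
]

utterances = customer_turns(example_call)
```

Merging adjacent segments before labeling gives annotators a complete customer turn rather than fragments split at pauses.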

Those customer utterances were then passed to CNTXT AI's Arabic data annotation team for intent labeling. Annotators classified each utterance against a 28-category taxonomy built jointly with the telco's data science team, covering intent categories specific to their service context rather than generic call centre categories. Quality control included double annotation on 15% of utterances, with inter-annotator agreement measured and disagreements resolved by a senior reviewer.
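
The quality-control step can be illustrated with a short Python sketch. The source does not say which agreement metric was used; Cohen's kappa is assumed here as one common choice for two annotators, and the function names and sample labels are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def sample_for_double_annotation(
    ids: Sequence[str], fraction: float = 0.15, seed: int = 0
) -> List[str]:
    """Pick a reproducible random subset of utterances for a second annotator."""
    rng = random.Random(seed)
    return rng.sample(list(ids), round(len(ids) * fraction))


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreements(ids: Sequence[str], a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Utterance ids to route to a senior reviewer."""
    return [i for i, x, y in zip(ids, a, b) if x != y]


# Hypothetical labels from two annotators on four utterances.
utt_ids = ["u1", "u2", "u3", "u4"]
first = ["billing", "billing", "support", "support"]
second = ["billing", "support", "support", "support"]

kappa = cohens_kappa(first, second)
to_review = disagreements(utt_ids, first, second)
double_annotated = sample_for_double_annotation([f"u{i}" for i in range(100)])
```

With a 28-category taxonomy, raw percent agreement can look high simply because a few categories dominate, which is why a chance-corrected measure is a reasonable choice.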

The final output was a labeled Arabic speech dataset specific to the telco's customer interaction domain and formatted for direct use in the team's fine-tuning workflow.



Results


The data science team had a usable training dataset within six weeks of project start. The manual route would have taken months.


The intent classification model fine-tuned on this dataset outperformed the version trained on public Arabic data on the telco's internal evaluation set. The improvement was most visible in two areas:

Gulf dialect inputs, where public MSA training data consistently fell short
Product-specific terminology, vocabulary that appeared frequently in the telco's calls but was absent from broadcast Arabic datasets
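
A per-slice comparison of this kind can be computed with a few lines of Python. This is only a sketch: the `accuracy_by_slice` function, the slice names, and the rows below are hypothetical placeholders for illustration, not the telco's evaluation data or results.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each row: (slice name, gold intent, predicted intent)
Row = Tuple[str, str, str]


def accuracy_by_slice(rows: List[Row]) -> Dict[str, float]:
    """Fraction of correct predictions within each evaluation slice."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for slice_name, gold, predicted in rows:
        total[slice_name] += 1
        correct[slice_name] += int(gold == predicted)
    return {name: correct[name] / total[name] for name in total}


# Made-up rows, used only to show the shape of the input.
toy_rows: List[Row] = [
    ("gulf_dialect", "billing", "billing"),
    ("gulf_dialect", "complaint", "billing"),
    ("product_terms", "plan_change", "plan_change"),
    ("product_terms", "plan_change", "plan_change"),
]

scores = accuracy_by_slice(toy_rows)
```

Running the same function over the predictions of both models on the same evaluation set gives one accuracy figure per slice and per model, which makes it easy to see where the fine-tuned model gains.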

Beyond the initial model improvement, the team now has a repeatable pipeline for expanding training data as the product portfolio grows and new intent categories emerge. That same pipeline is currently being used to build a sentiment analysis training set from a separate call sample.


