Review: Topic Modeling for Arabic Language


In this blog post, we review recent trends in topic modeling for short text via pre-trained language models. More specifically, we focus on modeling short text in non-Latin-script languages such as Arabic and Japanese. In this first part, we cover topic modeling for the Arabic language only.

Our code is available on GitHub, and for a quick start, try the interactive demo.

Introduction

Topic modeling is a task that makes data more interpretable by applying probabilistic learning methods in an unsupervised way. These probabilistic models use statistical techniques to build a short, condensed representation of large datasets and to highlight hidden relations (e.g., Latent Semantic Analysis, Latent Dirichlet Allocation, etc.). More specifically, topic models learn hidden relations between documents by capturing the probability distribution over terms within a document. Because topic modeling can capture latent patterns in an unsupervised way, it is a perfect tool for examining web text data such as Twitter data. In particular, by learning the emergence of topics and themes, a model can capture the interests of the average user, support opinion mining of the general public, and, most importantly, detect hidden trends. In this blog post, we discuss topic modeling for Arabic Twitter.

Varieties of the Arabic Language

According to Google, Arabic is spoken by more than 422 million people around the world. There are 30 modern varieties, including Modern Standard Arabic. The three main types of the Arabic language are:

Classical Arabic (CA) "ٱلْعَرَبِيَّةُ ٱلْفُصْحَىٰ". Classical Arabic is the standard form of the language that was used in the Middle Ages and is the language of the holy book, the "Qurʾān".

Modern Standard Arabic (MSA) "الُّلغة الَعَربَّية الحَديَثة". MSA is the language currently used for formal writing and speaking in the Arab world. Although MSA differs from Classical Arabic, its morphology and syntax have remained largely unchanged.

Dialectal Arabic "لهجة". The language used in everyday settings and on social media. Each region in the Arab world has a different dialect (i.e., the true mother tongue), as shown in Figure 1 below. However, in this blog we only examine Gulf Arabic (Figure 1, (17)).

Figure 1. Geographical distribution of Dialectal Arabic in the Arab world: (1) Hassaniyya, (2) Moroccan Arabic, (3) Algerian Saharan Arabic, (4) Algerian Arabic, (5) Tunisian Arabic, (6) Libyan Arabic, (7) Egyptian Arabic, (8) Eastern Egyptian Bedawi Arabic, (9) Saidi Arabic, (10) Chadian Arabic, (11) Sudanese Arabic, (12) Sudanese Creole Arabic, (13) Najdi Arabic, (14) Levantine Arabic, (15) North Mesopotamian Arabic, (16) Mesopotamian Arabic, (17) Gulf Arabic, (18) Baharna Arabic, (19) Hijazi Arabic, (20) Shihhi Arabic, (21) Omani Arabic, (22) Dhofari Arabic, (23) Sanaani Arabic, (24) Ta'izzi-Adeni Arabic, (25) Hadrami Arabic, (26) Uzbeki Arabic, (27) Tajiki Arabic, (28) Cypriot Arabic, (29) Maltese, (30) Nubi; the hatched area marks a minority scattered over the area, and the dotted area refers to a mixed dialect. Figure from Wikipedia.

Social media (e.g., Twitter, Facebook, etc.) and broadcast news platforms have served as rich sources for researchers to build dialect-specific tools (dialect detection, dialect gender classification, etc.). In this blog, we use the Gulf Arabic dialect for topic modeling and gender classification.

Text Mining in Arabic Language

Non-MSA Arabic text is found mostly on social media, more specifically Twitter. This short text comes in many varieties of Arabic targeting different populations (i.e., regional dialects), as shown in Figure 1. For worthwhile text processing such as topic modeling (this blog), sentiment analysis, etc., preprocessing is an essential step to filter out uninformative content (e.g., stopwords, emoji, etc.). We summarize the basic preprocessing steps for Arabic text mining as follows:

Stopword Removal. Remove high-frequency words such as conjunctions and definite articles, preventing them from appearing in topics.

Diacritics Removal and Normalization. Diacritics in the Arabic language represent short vowels (Tashkeel) that can be added to a word, similar to accents in Latin-script languages, e.g., Spanish (España es un país). Unlike in Spanish, Catalan, etc., removing the diacritics is not a problem for native speakers reading formal text such as newspapers and books. However, removing them can introduce word ambiguity, as shown in this example:

(1) I sat on the north side of the bridge.
جلستُ "شَمال" الجسر : شَمال أي إتجاه الشمال
(2) I sat on the left side of the bridge.
جلستُ "شِمال" الجسر: شِمال أي إتجاه اليسار

In the first sentence (1), the word "شَمال" means the "north" side of the bridge, while in the second sentence "شِمال" refers to the "left" side of the bridge. With diacritics kept, each form is treated as a different token, which inflates the vocabulary. Therefore, to reduce sparsity and complexity, the diacritics can be completely removed from the text.
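
As a concrete illustration, Arabic diacritics occupy a small Unicode range and can be stripped with a regular expression. This is a minimal sketch under our own assumptions; the exact character set to strip (e.g., whether to also drop the tatweel elongation character) depends on your corpus.

```python
import re

# Arabic diacritics (Tashkeel): fathatan..sukun (U+064B-U+0652),
# plus the tatweel/kashida elongation character (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

def remove_diacritics(text: str) -> str:
    """Strip short-vowel marks so that e.g. 'شَمال' and 'شِمال' map to the same token 'شمال'."""
    return DIACRITICS.sub("", text)

print(remove_diacritics("جلستُ شَمال الجسر"))  # -> جلست شمال الجسر
```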

Lemmatization and Tokenization

Lemmatization, or reduction to dictionary form, is the process of finding the original or base form of a word from its inflected forms, for example mapping a plural or conjugated form back to its singular or root form. Tokenization splits the text into the individual units (tokens) that the model operates on. A minimal preprocessing sketch combining these steps with stopword removal is shown below.
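
The sketch below chains naive whitespace tokenization with stopword removal, assuming NLTK's Arabic stopword list is available; a production pipeline would typically add a dedicated Arabic tokenizer and lemmatizer (e.g., Farasa or CAMeL Tools), which we omit here.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
AR_STOPWORDS = set(stopwords.words("arabic"))  # NLTK ships an Arabic stopword list

def preprocess(tweet: str) -> list[str]:
    """Tokenize and drop stopwords. Diacritics removal (remove_diacritics above)
    should be applied before this step; lowercasing is a no-op for Arabic."""
    tokens = tweet.split()  # naive whitespace tokenization
    return [t for t in tokens if t not in AR_STOPWORDS]

print(preprocess("و لله الحمد لم تسجل اي حالات اصابه بفيروس كورونا في الاردن"))
```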

Data

In this blog, we use the ArabGend dataset, a gender-based Twitter dataset. The dataset relies on the user's profile (e.g., gender, occupation (f/m), etc.) as the main source of labels for annotation, as shown in Figure 2 below:

Figure 2. ArabGend data extraction via profile gender and location. Figure from .

The dataset is significantly imbalanced, with roughly 85% of tweets written by men. For the purpose of this blog, we will only use men's tweets. In a real scenario, however, we could either collect more tweets written by women or change the loss function to account for the imbalance (e.g., Balanced Cross-Entropy or Focal Loss). Focal loss is a loss function used to address the class-imbalance problem in classification: it down-weights the easy examples (e.g., men's tweets) so that training focuses on the hard examples (e.g., women's tweets).
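
For reference, for a prediction that assigns probability \(p_t\) to the true class, the focal loss is

\[ \mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t), \]

where the focusing parameter \(\gamma \ge 0\) shrinks the loss of well-classified examples (large \(p_t\)) and \(\alpha_t\) is an optional per-class weight; with \(\gamma = 0\) and \(\alpha_t = 1\) it reduces to the standard cross-entropy loss.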

 Profile Name | Profile Label | Location | Gender
 Safia Alshehi | journalist (f) | Dubai | F
 صفيه الشحي | كاتبه (writer, f) | دبي (Dubai) | م
Table 1. Sample from the ArabGend dataset, showing the extracted profile name, profile label, and location (English transliteration and Arabic original). Example from .

The cleaned, pre-processed dataset (i.e., diacritics and emoji removed, lemmatization applied, etc.) contains 108,053 (~100K) lines and 3,175,755 (~3M) tokens. Table 2 below shows two tweet samples related to the coronavirus.

 Extracted Tweet (English translation / Arabic original)
 Thank you corona, you revealed the true character of so many people.
 شكرا كورونا ، كشفت معادن وايد من الناس
 Thank God, no coronavirus cases were recorded in Jordan.
 و لله الحمد لم تسجل اي حالات اصابه بفيروس كورونا في الاردن
Table 2. Sample of the pre-processed tweet dataset.

Topic Modeling with BERT

BERT. Unlike popular autoregressive next-word-prediction language models such as the one behind ChatGPT, BERT learns a bidirectional contextualized representation for each text fragment or token. This is done by selecting 15% of tokens, of which (1) 80% are masked, (2) 10% are randomly replaced, and (3) the remaining 10% are left untouched (original token). The model is trained to predict the masked words based on the surrounding context.

This first pretext-task objective of BERT is Masked Language Modeling (MLM). However, MLM is a token-level objective, and there is still a need to learn sentence-level representations. To learn these, Next Sentence Prediction (NSP) is proposed as the second objective used to train BERT. However, the authors of RoBERTa argue that the NSP objective does not help downstream tasks. Note that RoBERTa uses dynamic masking, regenerating the masking pattern while the model is training, unlike BERT, where the masks are prepared once during the pre-processing stage.
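
To make the MLM objective concrete, here is a minimal sketch that queries a pre-trained BERT through the Hugging Face fill-mask pipeline. The English bert-base-uncased checkpoint and the example sentence are our own illustrative choices; the same call works with an Arabic checkpoint.

```python
from transformers import pipeline

# MLM in action: ask a pre-trained BERT to recover a masked token from its context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("Topic modeling makes large text collections more [MASK]."):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
```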

Arabic BERT. ArabBERT is a pre-trained Arabic language model based on the same BERT architecture and objectives described above. The original BERT was trained on 3.3 billion words in total, with 2.5B from Wikipedia and 0.8B from BooksCorpus. However, since the Arabic Wikipedia is much smaller than the English one, the authors of ArabBERT used a combination of different Arabic corpora to train the model.

The final size of the collected dataset after pre-processing is 70 million sentences, corresponding to ~24GB of text.
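
As a sketch of how such a checkpoint can be used downstream, the snippet below loads an AraBERT model from the Hugging Face hub and mean-pools its last hidden states into one dense vector per tweet. The checkpoint name aubmindlab/bert-base-arabertv02 is our assumption; substitute whichever Arabic BERT variant you actually use.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # assumed AraBERT checkpoint on the HF hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(tweets: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden layer into one dense vector per tweet."""
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["شكرا كورونا ، كشفت معادن وايد من الناس"])
print(vectors.shape)  # (1, 768) for a base-sized model
```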

BERTopic. BERTopic is a framework that performs topic modeling on top of pre-trained BERT embeddings. The extracted embeddings (dense representations) are reduced in dimensionality and clustered using techniques such as UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) and HDBSCAN.

Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) is a dimensionality-reduction technique that can be used for visualization, similarly to t-SNE, but also for general non-linear dimension reduction. UMAP builds a neighborhood graph in the original space of the data and tries to find a similar graph in the low-dimensional space. The main idea is to construct a graph and then map this graph to a lower dimension.

Density-Based Clustering Based on Hierarchical Density Estimates (HDBSCAN) is a density-based clustering algorithm built on the idea that data points lying in a dense region belong to the same cluster. The algorithm starts with every data point in a cluster of its own and then iteratively merges the most similar clusters, where the similarity between two clusters is defined by the minimum distance between any two data points in the two clusters. HDBSCAN is an improved, hierarchical version of DBSCAN. DBSCAN relies on the concept of density to identify clusters and, unlike K-means clustering, does not require the number of clusters to be specified beforehand. In particular, DBSCAN relies on two parameters: (1) the radius of the circle around each core point (\(\epsilon\), epsilon) and (2) the minimum number of data points required inside that circle (minPoints). We explain this in more detail below (Figure 3).

Figure 3. DBSCAN with minPoints = 4. Point A has at least 4 points within its ε radius and is therefore a core point. Points B and C are not core points, but they lie within the ε radius of a core point, so they are called border points. Point N is a noise point: neither a core point nor a border point.
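
To connect the two components just described, here is a minimal, self-contained sketch of the reduce-then-cluster step on synthetic embeddings; the hyper-parameters (n_neighbors, n_components, min_cluster_size) are illustrative values of ours, not the configuration used for the results below.

```python
import numpy as np
import umap
import hdbscan

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))  # stand-in for 1,000 tweet embeddings

# 1) Non-linear dimensionality reduction: 768-d BERT embeddings -> 5-d space.
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=0)
X_low = reducer.fit_transform(X)

# 2) Density-based clustering in the reduced space; label -1 marks noise points.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean")
labels = clusterer.fit_predict(X_low)

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```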

Implementation

This blog post is inspired by a pilot study applying BERTopic to Modern Standard Arabic (MSA). Here, however, we apply BERTopic's clustering-based topic modeling techniques (UMAP, HDBSCAN) to the Arabic tweet dataset, as sketched below.
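
The following is a minimal sketch of such a pipeline. The sentence-transformer checkpoint, the UMAP/HDBSCAN hyper-parameters, and the load_tweets() helper are placeholders of ours, not the exact configuration behind the results reported in the next section.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

tweets = load_tweets()  # hypothetical helper returning the pre-processed Arabic tweets

# Multilingual embedding model (covers Arabic); an Arabic-specific model could be swapped in.
embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=0)
hdbscan_model = HDBSCAN(min_cluster_size=20, metric="euclidean", prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
)
topics, probs = topic_model.fit_transform(tweets)
print(topic_model.get_topic_info().head(10))  # top topics with their most frequent words
```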

Results and Discussion

 Topic Word (AR/EN) | Score
 عناصر (members) | 0.00574
 المتهم (accused) | 0.00552
 الشرطة (police) | 0.00396
 القضائية (judicial) | 0.00502
 اعتقال (arrest) | 0.00477
 المخدرات (drugs) | 0.00416
Table 4. Most frequent topic words generated with UMAP.

 Model | Coherence Score
 LDA | 0
 UMAP | 0.62195
 HDBSCAN | 0.63274
 UMAP+HDBSCAN | 0
Table 3. Comparison of coherence scores across the different models.
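
For completeness, coherence scores such as those reported in Table 3 can be estimated with gensim's CoherenceModel. The sketch below assumes tokenized_tweets (the tokenized corpus) and topic_model (the fitted BERTopic instance from the implementation sketch above) are available; it is not the exact evaluation script behind the table.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# tokenized_tweets: list[list[str]] -- the pre-processed, tokenized corpus (assumed available)
dictionary = Dictionary(tokenized_tweets)

# Top-10 words of each topic found by BERTopic (topic -1 is the outlier/noise topic).
topics = [
    [word for word, _ in topic_model.get_topic(topic_id)[:10]]
    for topic_id in topic_model.get_topic_info().Topic
    if topic_id != -1
]

coherence = CoherenceModel(
    topics=topics,
    texts=tokenized_tweets,
    dictionary=dictionary,
    coherence="c_v",
).get_coherence()
print(f"c_v coherence: {coherence:.5f}")
```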