Review: Topic Modeling for Arabic Language


In this blog post, we review recent trends in topic modeling for short text via pre-trained language models. More specifically, we focus on modeling short text in non-Latin-script languages such as Arabic and Japanese. In this first part, we cover topic modeling for the Arabic language only.

Our code is available on GitHub, and for a quick start, try the interactive demo.

Introduction

Topic modeling is a task that makes data more interpretable by applying probabilistic learning methods in an unsupervised way. These probabilistic models use statistical techniques to build a short, condensed representation of large datasets and to highlight hidden relations (e.g., Latent Semantic Analysis, Latent Dirichlet Allocation, etc.). More specifically, topic models learn hidden relations between documents by capturing the probability distribution over terms within a document. Because topic modeling can capture latent patterns in an unsupervised way, it is a perfect tool for examining web text data such as Twitter data. In particular, by learning the emergence of topics and themes, a model can capture the interests of the average user, support opinion mining of the general public, and, most importantly, detect hidden trends. In this blog post, we discuss topic modeling for Arabic Twitter.

Varieties of the Arabic Language

According to Google, Arabic is spoken by more than 422 million people around the world. There are 30 modern varieties, including Modern Standard Arabic. The three main types of the Arabic language are:

Classical Arabic (CA) "ٱلْعَرَبِيَّةُ ٱلْفُصْحَىٰ". Classical Arabic is the standard form of the language that was used in the Middle Ages and is the language of the holy book, the "Qurʾān".

Modern Standard Arabic (MSA) "الُّلغة الَعَربَّية الحَديَثة". MSA is the language currently used for formal writing and speaking in the Arab world. Although MSA differs from Classical Arabic, its morphology and syntax have remained largely unchanged.

Dialectal Arabic "لهجة". The language used in everyday settings and on social media. Each region in the Arab world has a different dialect (i.e., the true mother tongue), as shown in Figure 1 below. However, in this blog we only examine Gulf Arabic (Figure 1, (17)).

Figure 1. Geographical distribution of Dialectal Arabic in the Arab world: (1) Hassaniyya, (2) Moroccan Arabic, (3) Algerian Saharan Arabic, (4) Algerian Arabic, (5) Tunisian Arabic, (6) Libyan Arabic, (7) Egyptian Arabic, (8) Eastern Egyptian Bedawi Arabic, (9) Saidi Arabic, (10) Chadian Arabic, (11) Sudanese Arabic, (12) Sudanese Creole Arabic, (13) Najdi Arabic, (14) Levantine Arabic, (15) North Mesopotamian Arabic, (16) Mesopotamian Arabic, (17) Gulf Arabic, (18) Baharna Arabic, (19) Hijazi Arabic, (20) Shihhi Arabic, (21) Omani Arabic, (22) Dhofari Arabic, (23) Sanaani Arabic, (24) Ta'izzi-Adeni Arabic, (25) Hadrami Arabic, (26) Uzbeki Arabic, (27) Tajiki Arabic, (28) Cypriot Arabic, (29) Maltese, (30) Nubi; the hatched area marks a minority scattered over the area, and the dotted area refers to a mixed dialect. Figure from Wikipedia.

Social media (e.g., Twitter, Facebook, etc.) and broadcast news platforms have served as rich sources for researchers to build dialect-specific tools (dialect detection, dialect gender classification, etc.). In this blog, we use the Gulf Arabic dialect for topic modeling and gender classification.

Text Mining in Arabic Language

Non-MSA Arabic text is found mostly on social media, more specifically Twitter. This short text comes in many varieties of Arabic targeting different populations (i.e., regional dialects), as shown in Figure 1. For worthwhile text processing such as topic modeling (this blog), sentiment analysis, etc., preprocessing is an essential step to filter out uninformative content (e.g., stopwords, emoji, etc.). We summarize the basic preprocessing steps for Arabic text mining as follows:

Stopword Removal. Remove high-frequency words such as conjunctions and definite articles, preventing them from appearing in topics.

Diacritics Removal and Normalization. Diacritics in the Arabic language represent short vowels (Tashkeel) that can be added to a word, similar to accents in Latin-script languages, e.g., Spanish (España es un país). Unlike in Spanish, Catalan, etc., removing the diacritics is not a problem for native speakers reading formal text such as newspapers and books. However, removing them can introduce word ambiguity, as shown in this example:

(1) I sat on the north side of the bridge.
جلستُ "شَمال" الجسر : شَمال أي إتجاه الشمال
(2) I sat on the left side of the bridge.
جلستُ "شِمال" الجسر: شِمال أي إتجاه اليسار

In the first sentence (1), the word "شَمال" means the "north" side of the bridge, while in the second sentence "شِمال" refers to the "left" side of the bridge. With diacritics kept, each form is treated as a different token, which inflates the vocabulary. Therefore, to reduce sparsity and complexity, the diacritics can be completely removed from the text.
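
As a concrete illustration, Arabic diacritics occupy a small Unicode range and can be stripped with a regular expression. This is a minimal sketch under our own assumptions; the exact character set to strip (e.g., whether to also drop the tatweel elongation character) depends on your corpus.

```python
import re

# Arabic diacritics (Tashkeel): fathatan..sukun (U+064B-U+0652),
# plus the tatweel/kashida elongation character (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

def remove_diacritics(text: str) -> str:
    """Strip short-vowel marks so that e.g. 'شَمال' and 'شِمال' map to the same token 'شمال'."""
    return DIACRITICS.sub("", text)

print(remove_diacritics("جلستُ شَمال الجسر"))  # -> جلست شمال الجسر
```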

Lemmatization and Tokenization

Lemmatization, or reduction to dictionary form, is the process of finding the original or base form of a word from its inflected forms, for example mapping a plural or conjugated form back to its singular or root form. Tokenization splits the text into the individual units (tokens) that the model operates on. A minimal preprocessing sketch combining these steps with stopword removal is shown below.
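
The sketch below chains naive whitespace tokenization with stopword removal, assuming NLTK's Arabic stopword list is available; a production pipeline would typically add a dedicated Arabic tokenizer and lemmatizer (e.g., Farasa or CAMeL Tools), which we omit here.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
AR_STOPWORDS = set(stopwords.words("arabic"))  # NLTK ships an Arabic stopword list

def preprocess(tweet: str) -> list[str]:
    """Tokenize and drop stopwords. Diacritics removal (remove_diacritics above)
    should be applied before this step; lowercasing is a no-op for Arabic."""
    tokens = tweet.split()  # naive whitespace tokenization
    return [t for t in tokens if t not in AR_STOPWORDS]

print(preprocess("و لله الحمد لم تسجل اي حالات اصابه بفيروس كورونا في الاردن"))
```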

Data

In this blog, we use the ArabGend dataset, a gender-based Twitter dataset. The dataset relies on the user's profile (e.g., gender, occupation (f/m), etc.) as the main source of labels for annotation, as shown in Figure 2 below:

Figure 2. ArabGend data extraction via profile gender and location. Figure from .

The dataset is significantly imbalanced, with roughly 85% of tweets written by men. For the purpose of this blog, we will only use men's tweets. In a real scenario, however, we could either collect more tweets written by women or change the loss function to account for the imbalance (e.g., Balanced Cross-Entropy or Focal Loss). Focal loss is a loss function used to address the class-imbalance problem in classification: it down-weights the easy examples (e.g., men's tweets) so that training focuses on the hard examples (e.g., women's tweets).
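
For reference, for a prediction that assigns probability \(p_t\) to the true class, the focal loss is

\[ \mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t), \]

where the focusing parameter \(\gamma \ge 0\) shrinks the loss of well-classified examples (large \(p_t\)) and \(\alpha_t\) is an optional per-class weight; with \(\gamma = 0\) and \(\alpha_t = 1\) it reduces to the standard cross-entropy loss.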

 Profile Name | Profile Label | Location | Gender
 Safia Alshehi | journalist (f) | Dubai | F
 صفيه الشحي | كاتبه (writer, f) | دبي (Dubai) | م
Table 1. Sample from the ArabGend dataset, showing the extracted profile name, profile label, and location (English transliteration and Arabic original). Example from .

The cleaned, pre-processed dataset (i.e., diacritics and emoji removed, lemmatization applied, etc.) contains 108,053 (~100K) lines and 3,175,755 (~3M) tokens. Table 2 below shows two tweet samples related to the coronavirus.

 Extracted Tweet (English translation / Arabic original)
 Thank you corona, you revealed the true character of so many people.
 شكرا كورونا ، كشفت معادن وايد من الناس
 Thank God, no coronavirus cases were recorded in Jordan.
 و لله الحمد لم تسجل اي حالات اصابه بفيروس كورونا في الاردن
Table 2. Sample of the pre-processed tweet dataset.

Topic Modeling with BERT

BERT. Unlike popular autoregressive next-word-prediction language models such as the one behind ChatGPT, BERT learns a bidirectional contextualized representation for each text fragment or token. This is done by selecting 15% of tokens, of which (1) 80% are masked, (2) 10% are randomly replaced, and (3) the remaining 10% are left untouched (original token). The model is trained to predict the masked words based on the surrounding context.

This first pretext-task objective of BERT is Masked Language Modeling (MLM). However, MLM is a token-level objective, and there is still a need to learn sentence-level representations. To learn these, Next Sentence Prediction (NSP) is proposed as the second objective used to train BERT. However, the authors of RoBERTa argue that the NSP objective does not help downstream tasks. Note that RoBERTa uses dynamic masking, regenerating the masking pattern while the model is training, unlike BERT, where the masks are prepared once during the pre-processing stage.
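
To make the MLM objective concrete, here is a minimal sketch that queries a pre-trained BERT through the Hugging Face fill-mask pipeline. The English bert-base-uncased checkpoint and the example sentence are our own illustrative choices; the same call works with an Arabic checkpoint.

```python
from transformers import pipeline

# MLM in action: ask a pre-trained BERT to recover a masked token from its context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("Topic modeling makes large text collections more [MASK]."):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
```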

Arabic BERT. ArabBERT is a pre-trained Arabic language model based on the same BERT architecture and objectives described above. The original BERT was trained on 3.3 billion words in total, with 2.5B from Wikipedia and 0.8B from BooksCorpus. However, since the Arabic Wikipedia is much smaller than the English one, the authors of ArabBERT used a combination of different Arabic corpora to train the model.

The final size of the collected dataset after pre-processing is 70 million sentences, corresponding to ~24GB of text.
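
As a sketch of how such a checkpoint can be used downstream, the snippet below loads an AraBERT model from the Hugging Face hub and mean-pools its last hidden states into one dense vector per tweet. The checkpoint name aubmindlab/bert-base-arabertv02 is our assumption; substitute whichever Arabic BERT variant you actually use.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # assumed AraBERT checkpoint on the HF hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(tweets: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden layer into one dense vector per tweet."""
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["شكرا كورونا ، كشفت معادن وايد من الناس"])
print(vectors.shape)  # (1, 768) for a base-sized model
```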

BERTopic. BERTopic is a framework that performs topic modeling on top of pre-trained BERT embeddings. The extracted embeddings (dense representations) are reduced in dimensionality and clustered using techniques such as UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) and HDBSCAN.

Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) is a dimensionality-reduction technique that can be used for visualization, similarly to t-SNE, but also for general non-linear dimension reduction. UMAP builds a neighborhood graph in the original space of the data and tries to find a similar graph in the low-dimensional space. The main idea is to construct a graph and then map this graph to a lower dimension.

Density-Based Clustering Based on Hierarchical Density Estimates (HDBSCAN) is a density-based clustering algorithm built on the idea that data points lying in a dense region belong to the same cluster. The algorithm starts with every data point in a cluster of its own and then iteratively merges the most similar clusters, where the similarity between two clusters is defined by the minimum distance between any two data points in the two clusters. HDBSCAN is an improved, hierarchical version of DBSCAN. DBSCAN relies on the concept of density to identify clusters and, unlike K-means clustering, does not require the number of clusters to be specified beforehand. In particular, DBSCAN relies on two parameters: (1) the radius of the circle around each core point (\(\epsilon\), epsilon) and (2) the minimum number of data points required inside that circle (minPoints). We explain this in more detail below (Figure 3).

Figure 3. DBSCAN with minPoints = 4. Point A has at least 4 points within its ε radius and is therefore a core point. Points B and C are not core points, but they lie within the ε radius of a core point, so they are called border points. Point N is a noise point: neither a core point nor a border point.
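
To connect the two components just described, here is a minimal, self-contained sketch of the reduce-then-cluster step on synthetic embeddings; the hyper-parameters (n_neighbors, n_components, min_cluster_size) are illustrative values of ours, not the configuration used for the results below.

```python
import numpy as np
import umap
import hdbscan

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))  # stand-in for 1,000 tweet embeddings

# 1) Non-linear dimensionality reduction: 768-d BERT embeddings -> 5-d space.
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=0)
X_low = reducer.fit_transform(X)

# 2) Density-based clustering in the reduced space; label -1 marks noise points.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean")
labels = clusterer.fit_predict(X_low)

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```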

Implementation

This blog post is inspired by a pilot study applying BERTopic to Modern Standard Arabic (MSA). Here, however, we apply BERTopic's clustering-based topic modeling techniques (UMAP, HDBSCAN) to the Arabic tweet dataset, as sketched below.
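
The following is a minimal sketch of such a pipeline. The sentence-transformer checkpoint, the UMAP/HDBSCAN hyper-parameters, and the load_tweets() helper are placeholders of ours, not the exact configuration behind the results reported in the next section.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

tweets = load_tweets()  # hypothetical helper returning the pre-processed Arabic tweets

# Multilingual embedding model (covers Arabic); an Arabic-specific model could be swapped in.
embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=0)
hdbscan_model = HDBSCAN(min_cluster_size=20, metric="euclidean", prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
)
topics, probs = topic_model.fit_transform(tweets)
print(topic_model.get_topic_info().head(10))  # top topics with their most frequent words
```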

Results and Discussion

 Topic Word (AR/EN) | Score
 عناصر (members) | 0.00574
 المتهم (accused) | 0.00552
 الشرطة (police) | 0.00396
 القضائية (judicial) | 0.00502
 اعتقال (arrest) | 0.00477
 المخدرات (drugs) | 0.00416
Table 4. Most frequent topic words generated with UMAP.

 Model | Coherence Score
 LDA | 0
 UMAP | 0.62195
 HDBSCAN | 0.63274
 UMAP+HDBSCAN | 0
Table 3. Comparison of coherence scores across the different models.
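
For completeness, coherence scores such as those reported in Table 3 can be estimated with gensim's CoherenceModel. The sketch below assumes tokenized_tweets (the tokenized corpus) and topic_model (the fitted BERTopic instance from the implementation sketch above) are available; it is not the exact evaluation script behind the table.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# tokenized_tweets: list[list[str]] -- the pre-processed, tokenized corpus (assumed available)
dictionary = Dictionary(tokenized_tweets)

# Top-10 words of each topic found by BERTopic (topic -1 is the outlier/noise topic).
topics = [
    [word for word, _ in topic_model.get_topic(topic_id)[:10]]
    for topic_id in topic_model.get_topic_info().Topic
    if topic_id != -1
]

coherence = CoherenceModel(
    topics=topics,
    texts=tokenized_tweets,
    dictionary=dictionary,
    coherence="c_v",
).get_coherence()
print(f"c_v coherence: {coherence:.5f}")
```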