Arabic Text Clustering Methods and Suggested Solutions for Theme-Based Quran Clustering: Analysis of Literature

Bsoul Qusay; Salam Rosalina Abdul; Atwan Jaffar; Jawarneh Malik

doi:10.1633/JISTaP.2021.9.4.2

P-ISSN2287-9099
E-ISSN2287-4577
SCOPUS, KCI

Home

OA policy

ISSN : 2287-9099

Article Contents

1. INTRODUCTION
2. TEXT CLUSTERING FOR ARABIC LANGUAGE AND THE QURAN
3. TEXT CLUSTERING PROCESS
4. PROPOSED QURAN THEME CLUSTERING APPROACH
5. CONCLUSION
ACKNOWLEDGMENTS
CONFLICTS OF INTEREST
REFERENCES

Prev Next

e-Submission Special Issue

Vol.9 No.4

PubReader Citation Share

Arabic Text Clustering Methods and Suggested Solutions for Theme-Based Quran Clustering: Analysis of Literature

JOURNAL OF INFORMATION SCIENCE THEORY AND PRACTICE / JOURNAL OF INFORMATION SCIENCE THEORY AND PRACTICE, (P)2287-9099; (E)2287-4577

2021, v.9 no.4, pp.15-34

https://doi.org/10.1633/JISTaP.2021.9.4.2

Qusay Bsoul (Cybersecurity Department, Science and IT College, Irbid National University, Irbid, Jordan)

Rosalina Abdul Salam (Faculty of Science and Technology, Universiti Sains Islam Malaysia, Nilai, Negeri Sembilan, Malaysia)

Jaffar Atwan (Prince Abdullah Bin Ghazi Faculty of ICT, AL-Balqa Applied University, Salt, Jordan)

Malik Jawarneh (Faculty for Computing Sciences, Gulf College, Muscat, Sultanate of Oman)

Bsoul, Q., Salam, R. A., Atwan, J., & Jawarneh, M. (2021). Arabic Text Clustering Methods and Suggested Solutions for Theme-Based Quran Clustering: Analysis of Literature. , 9(4), 15-34, https://doi.org/10.1633/JISTaP.2021.9.4.2

copy

Abstract

Text clustering is one of the most commonly used methods for detecting themes or types of documents. Text clustering is used in many fields, but its effectiveness is still not sufficient to be used for the understanding of Arabic text, especially with respect to terms extraction, unsupervised feature selection, and clustering algorithms. In most cases, terms extraction focuses on nouns. Clustering simplifies the understanding of an Arabic text like the text of the Quran; it is important not only for Muslims but for all people who want to know more about Islam. This paper discusses the complexity and limitations of Arabic text clustering in the Quran based on their themes. Unsupervised feature selection does not consider the relationships between the selected features. One weakness of clustering algorithms is that the selection of the optimal initial centroid still depends on chances and manual settings. Consequently, this paper reviews literature about the three major stages of Arabic clustering: terms extraction, unsupervised feature selection, and clustering. Six experiments were conducted to demonstrate previously un-discussed problems related to the metrics used for feature selection and clustering. Suggestions to improve clustering of the Quran based on themes are presented and discussed.

keywords: text mining, Arabic text clustering algorithms, terms extraction, un-supervised feature selection, optimal initial centroid

1. INTRODUCTION

The use of text clustering as an analysis tool for detecting and understanding Arabic text enables people to understand Quran text. Users can start understanding the Quran by going through the words or verses. The Quran consists of thirty chapters and each chapter includes many Quranic verses. Each chapter belongs to more than one theme such as Zakat (alms), Inheritance, Prayers, and others. Verses of each chapter are required to be grouped based on the clustering of themes. Therefore, Arabic text clustering becomes very important as an analysis tool for everyone in understanding the Quran. This is the motivation for clustering the themes of the Quran. The main reason why text clustering has not yet been used to group the Quran text is the degraded poor performance of the three tasks: terms extraction, unsupervised feature selection. Some tasks use text mining, such as sentiment analysis (Touahri & Mazroui, 2021), Twitter topic detection (Mottaghinia et al., 2021), concept map construction (Qasim et al., 2013), and systematic review support (Ananiadou et al., 2009), but it is still of great significance to increase the performance of text clustering. The Quran is an essential guide to the life of all humankind, both Muslims and non-Muslims. It covers all aspects of human life, including biology, information technology, law, social order, politics, business, economics, and individual responsibilities (Rostam & Malim, 2021). In short, the Quran represents a sea of knowledge (Yauri et al., 2013).

The great knowledge incorporated in the Quran stresses the importance of exploring such holy texts using automated applications. In fact, the information technology is a system consisting of a medium, infrastructure, and methods for gaining, transferring, accessing, interpreting, saving, organizing, and using data meaningfully (Azad & Deepak, 2019). People who seek to understand Islam could find easier ways to find answers to their various queries (Abualkishik et al., 2015). However, understanding the Quran requires more than just reading its translation. Hence, making the content of the Quran available to everyone is still a big challenge. Therefore, it is important to take one’s level of experience into account when presenting the content of the Quran.

The aim of this study is to investigate the limitation of text clustering, because there is a need to increase its quality. Although text clustering has been studied for many years, it is still an important research domain and its methods require further improvements (Alghamdi & Selamat, 2019). The rest of the paper is structured as: Sections 2 and 3 provide a review of text clustering and the problems of its application. However, Section 4 presents possible research directions and the proposed solutions for extraction, unsupervised feature selection, and clustering. Finally, Section 5 concludes the paper.

2. TEXT CLUSTERING FOR ARABIC LANGUAGE AND THE QURAN

Text clustering is one of the most commonly used methods for detecting themes or types of documents (Alghamdi & Selamat, 2019). It involves three main processes (Bsoul et al., 2014, 2016a, 2016b): The first process is document pre-processing, which is needed for any type of document such as finance, medical, or law documents. It removes unimportant words and symbols from the documents. The second process is to extract the most important terms and weights from the features of the documents then calculate the similarities among them. Then, feature selection is employed to select the optimal features for the clustering algorithm. The last process is clustering. There are two main types of cluster techniques, hierarchical and partitioning, as shown in Fig. 1. The organization of the paper is shown in Fig. 2. This work focuses on the domain of Quran themes and the main weaknesses of Arabic text clustering.

Fig. 1

Text clustering structure. BOW, Bag of Words; NER, named entity recognition; TFIDF, term frequency–inverse document frequency; DNA, deoxyribonucleic acid; SOM, self-organizing map; AP, affinity propagation.

Fig. 2

Organization of the paper. ADDC, average distance of documents to the cluster centroid; MAD, mean absolute difference.

In fact, there are a number of aspects that are involved in understanding the Quran, such as the reasons of revelation, the knowledge of Makkee and Madanee (that is, the places where a chapter “surah” was revealed), the knowledge of the various forms that were revealed, understanding the Quran’s abrogated rulings and verses, the various classifications of its verses, the knowledge of the themes, and the analysis of its grammar (Harrag, 2014). Recently, many studies have explored how Quranic text can be stored, processed, and extracted to further understand its content. The Quranic ontology (Beirade et al., 2021) and other approaches can be found in Raharjo et al. (2020) and Farhan et al. (2020). Other studies have used information retrieval (IR) (Raharjo et al., 2020; Safee et al., 2016) or text classifiers to detect verses or chapters in the Quran (Safee et al., 2016). The main objective of this study is to provide people with better solutions to understand the Quran and to disseminate knowledge using recent methods and technology. Hence, this study focuses on employing Arabic text clustering to cluster the themes of the Quran based on the highest similarity between the verses. In the next section, the process of Arabic text clustering and its limitations are presented in detail.

3. TEXT CLUSTERING PROCESS

This section reviews comprehensive previous related work, namely the three phases of text clustering: terms extraction, unsupervised feature selection, and clustering.

3.1. Terms Extraction

The extraction of Arabic words is one of the processes undertaken in Arabic Cluster (Alghamdi & Selamat, 2019), which functions to extract the most important terms needed to retrieve pertinent information. Consequently, various techniques have been suggested in the literature for such purposes, which are categorized into three main categories: Bag of Words (BOW), Bag of Concepts (BOC), and Bag of Narratives (BON) (Saloot et al., 2016). Several researchers have focused on extracting information from the terms by employing named entity recognition (Mesmia et al., 2018), bag-of-words (BOW) (Hmeidi et al., 2008), n-grams (Al-Salemi & Aziz, 2011), and lemmatization algorithms (Al-Shammari & Lin, 2008; Al-Zoghby et al., 2018). To make the terms extraction stage become effective, researchers used ontology-based extraction (Al-Zoghby et al., 2018), semantic extraction (Al-Zoghby et al., 2018), Arabic word sense disambiguation (Elayeb, 2019; Salloum et al., 2018), semantic word embedding (El Mahdaouy et al., 2018), and semantic relationships (Benabdallah et al., 2017). However, text data often presents challenges because of their high dimensionality and ambiguous or overlapping word senses. Previous studies have proposed various terms extraction methods such as BOW, N-grams, and named entity recognition to address the high dimensionality of the terms. These methods are called syntactic extraction. Overlapping word senses like the semantic words found in Word-Net are called semantic extraction. In this paper, we will discuss syntactic extraction and why researchers are moving towards developing and using semantic extraction.

Al-Smadi et al. (2019) have worked specifically on extraction based on morphological, syntactic, and semantic extract Arabic terms using supervised machine learning, which is founded that the syntactic better than morphological, and semantic. Besides this, Harrag (2014) employed Named entity approach to extract the most important Arabic terms, this approach have limitation in distinguished between two similar sentences. For that limitation Helwe and Elbassuoni (2019) their approach relies only on word embedding from the Arabic name entity to made difference between two similar sentences.

Al-Salemi and Aziz (2011) have assessed the BOW method and three levels of N-gram (3, 4, and 5) for extraction, whereby the outcomes obtained via Naïve Bayes classifier have revealed the BOW to outperform all of the N-gram levels tested. Similarly, Mustafa (2005) has positioned a technique for Arabic text searching in Arabic IR by implementing two methods for information extraction, namely contiguous N-grams and hybrid N-grams; the hybrid N-grams are better than contiguous N-grams and both methods extracted the nouns. Furthermore, Al-Shammari and Lin (2008) have stemmed nouns and verbs by using Educated Text Stemmer (ETS) as the stemmer for the roots of Arabic words. The algorithm is aimed towards noun and verb selection from Arabic documents according to prepositions, as well as certain rules regarding other linguistic elements, such as the definite article “the.”

In their previous work on the second stage of BOC (Al-Smadi et al., 2019; Cambria & White, 2014; Menai, 2014), their models were employed to extract the semantic relations between Arabic words based on the stage of BOW. Al-Smadi et al. (2019) employed morphological, syntactic, and semantic processes as Arabic terms extraction and in their result they mentioned the worst result using semantic extraction. Another problem in terms extraction is that additional non-related words are extracted (Al-Smadi et al., 2019; Al-Zoghby et al., 2018; Menai, 2014); these extra words are termed as highly dimensional features and there is a need to reduce the number of features.

In contrast, using syntactic methods to extract terms can hide the core meaning that is the main meaning of the sentence. This leads to overlapping meanings among different terms such as “bank,” which has ten meanings: for example, a financial institution or sloped land. This overlap in meaning is the main problem of syntactic extraction. Semantic approach can overcome this weakness of syntactic extraction (Al-Zoghby et al., 2018; Elayeb, 2019; Salloum et al., 2018). El Mahdaouy et al. (2018) proposed a word embedding semantic extraction for IR that outperforms BOW. Benabdallah et al. (2017) proposed a method for ontology relation and complex term extraction. In their research a complex term consists of four words. The method was compared with BOW and their proposed method showed an improvement in recall and precision. However, the overall performance of the IR showed that BOW outperformed their method. Table 1 summarizes previous work on terms extraction.

Table 1

Previous works on extraction

Author	Dataset used	Baseline extraction	Outperform extraction	Domain
Al-Shammari & Lin (2008)	Dataset manual	Khoja and Larkey stemmers	Noun with verbs for stemmer	Arabic clustering
Al-Salemi & Aziz (2011)	TREC-2002	N-gram 3, 4, and 5 level	BOW	Arabic classifiers
Helwe & Elbassuoni (2019)	Arabic Wikipedia corpus	Name entity and their system	Name entity and their our system are equivalent	Arabic classification
El Mahdaouy et al. (2018)	Arabic TREC collection	BOW	Word embedding similarities	Information retrieval
Benabdallah et al. (2017)	Dataset manual	Ontology ‘synonym, antonym, hypernym’ and complex extraction	BOW	Marker learning algorithm

[i]

BOW, Bag of Words.

To improve Arabic clustering performance with a reduced number of features and extracting a subset of the disambiguated terms with their relations with high accuracyis highly desirable. Hence, the next sections discuss the extraction of the core features (feature selection) and the clustering problem.

3.2. Unsupervised Feature Selection

Feature selection is a way to alleviate the curse of dimensionality. It reduces dimensionality by eliminating unnecessary and redundant features from the problem, which in turn improves the learning performance. Feature subset selection is a common problem in text clustering (Alweshah et al., 2021; Mafarja & Mirjalili, 2017) because of the high dimensionality of text in documents. Therefore, feature selection is necessary to reduce text dimensionality and to select a reasonable number of high quality features that affect performance. The development of new approaches to handle feature selection and the curse of dimensionality is still an active area of research, particularly for text clustering. The purpose of feature selection includes performance improvement such as accuracy, data visualization and simplification for model selection, and dimensionality reduction to remove noise and irrelevant features (Mafarja & Mirjalili, 2017).

The selection of the features and distribution of the data have a high impact on the performance of clustering algorithms (Abualigah et al., 2016) and text clustering processes such as extraction (Harrag, 2014). The latter tends to obtain local minima rather than the global minimum. The obtained results are often very good, especially when the initial features are fairly far apart. This is because the algorithm can usually distinguish the main category or class in a given data. Moreover, a clustering algorithm’s main process and the quality of extraction methods are both affected by the initial feature selection. Thus, the initial features can enhance the quality of the results (Mafarja & Mirjalili, 2017). For instance, failure of the clustering algorithm to recognize the features of the main category in certain data sets is possible if the features are close or similar. This failure can also occur particularly if the feature selection algorithm is left uncontrolled such as in the filter method (Zhang et al., 2019). The feature selection is split into two parts, filter and wrapper. The filter method ignores feature dependencies (Abualigah et al., 2016; Ahmad et al., 2021; Mafarja & Mirjalili, 2017; Zhang et al., 2019). The wrapper method using optimization as feature selection can introduce a good initial feature selection and leads to a better performance when refining the features and finding the optimal feature selection (Zhang et al., 2019). Table 2 shows the summary of previous work on unsupervised feature selection.

Table 2

Previous work on unsupervised feature selection

Reference	Dataset used	Baseline algorithm	Outperform algorithm
Tabakhi et al. (2014)	UCI dataset	Laplacian score, Term variance, Random subspace method, Mutual correlation, and Relevance-redundancy feature selection	Ant colony optimization as an unsupervised feature selection
Abualigah et al. (2016)	Text dataset	Harmony Search, an unsupervised feature selection and without feature selection	Genetic algorithm an unsupervised feature selection
Zhang et al. (2019)	UCI datasets and text data	UFS, and ant colony optimization	Particle swarm optimization an unsupervised feature selection
Abualigah et al. (2016)	Text dataset	Genetic algorithm, and Harmony Search	Particle swarm optimization an unsupervised feature selection
Abualigah & Khader (2017)	Text dataset	Genetic algorithm, Harmony Search, particle swarm optimization	Hybrid particle swarm optimization an unsupervised feature selection

[i]

UCI, University of California, Irvine; UFS, unsupervised feature extraction.

However, only a few studies use optimization as unsupervised feature selection. Tabakhi et al. (2014) employed ant colony optimization as an unsupervised feature selection method for classifiers and used the absolute value (Bharti & Singh, 2014) as the similarity to optimize between features. The efficiency and effectiveness of the ant colony optimization was better than other feature selection methods used for English classifiers. Abualigah et al. (2016) employed a genetic algorithm for feature selection and used the mean absolute difference (MAD) (Bharti & Singh, 2014) as the similarity between features. They compared their proposed method with k-means clustering without feature selection and showed that their proposed method increases the performance of English text clustering. Subsequently, Zhang et al. (2019) have implemented particle swarm optimization unsupervised feature selection combined with filter approach; they compared the proposed approach with filter approach and the PSO. Their results mentioned that the particle swarm optimization with filter approach was better than others and reduced the number of features in English clustering. In contrast, employing the particle swarm optimisation (PSO) for feature selection by Abualigah et al. (2016) employed particle swarm optimization (PSO) for feature selection using MAD (Bharti & Singh, 2014). Their results showed that the proposed feature selection method outperforms other feature selection methods such as genetic and harmony search algorithms for English clustering. Finally, Abualigah and Khader (2017) employed a hybrid of a PSO algorithm and genetic algorithm for feature selection using MAD (Bharti & Singh, 2014). Their results showed that the proposed feature selection method outperforms PSO and other methods for English clustering. The main gap related to the previous work is the fitness function used, that is, the internal evaluation. The fitness function used by previous work does not have a relation between the internal evaluations metric, MAD, and the external evaluations metric, the F-measure. Theoretically the best score of the internal method should get the best score of the external method. However, in the experiments that we conducted and from the previous work, this does not happen.

The Arabic datasets used and BOW were applied as an extraction method using harmony search feature selection (HSFS) and then harmony search clustering (HSClust) to evaluate the F-measure. This was the first work that employed multi-objective optimization for two problems with the objective functions of MAD and best average distance of documents to the cluster centroid (ADDC), using HSClust with the best parameters from HSFS (Forsati et al., 2013) for Arabic clustering. The number of features are represented by the recommended algorithm in which all the features collection is codified in a vector of length m, where m represents the feature number, as demonstrated in Table 3. Every component of such vector is regarded as a label where the features are dropped or chosen. Table 3 shows the representation of features for three different solutions with their cost function of MAD. Solution 3 gave the highest score. In this example, twelve features {1, 2, 5, 9, and 11} are chosen while the others {3, 4, 6, 7, 8 10, 12} are dropped and so forth.

Table 3

Representation of features

Solutions	Features (m)												MAD cost function
Solutions	1	2	3	4	5	6	7	8	9	10	11	12	MAD cost function
Solution 1	1/S	1/S	0/N	0/N	1/S	0/N	0/N	0/N	1/S	0/N	1/S	0/N	0.59
Solution 2	0/N	0/N	1/S	1/S	0/N	1/S	1/S	1/S	0/N	1/S	1/S	0/N	0.55
Solution 3	1/S	1/S	0/N	1/S	0/N	0/N	1/S	0/N	1/S	1/S	1/S	0/N	0.66

[i]

MAD, mean absolute difference; 0, non-selected; 1, selected.

On the other hand, the Harmony Search (HS) as a cluster employs certain representations to code the document. All of the partition clusters, along with a vector of length n that denotes the number of documents, are presented in Table 4. For this vector, every element acts as the label that a single document belongs to. For instance, if the total number of clusters is represented by K, then each solution vector’s element gives an integer value that falls in the range of [K]={1,…, K}. In this example the cost function of ADDC was used. Four documents {2, 3, 8, and 10} originate from the cluster that has been assigned Label 1. The cluster assigned as Label 3 includes three documents {1, 7, 12} and the same goes for the other groups. The best score for ADDC cost function is the lowest value. The Pseudo code of harmony search as a feature selection is shown in Fig. 3 and the Pseudo code of harmony search as cluster is shown in Fig. 4.

Table 4

Representation of clusters as groups from 1 to 5

Solutions	Documents (n)												ADDC cost function
Solutions	1	2	3	4	5	6	7	8	9	10	11	12	ADDC cost function
Solution 1	3	1	1	4	5	2	3	1	4	1	4	3	0.11
Solution 2	2	3	4	5	4	5	2	3	3	2	5	1	0.33
Solution 3	1	3	2	3	3	4	1	2	1	4	3	2	0.29

[i]

ADDC, average distance of documents to the cluster centroid.

Fig. 3

Pseudo code of harmony search as a feature selection.

Fig. 4

Pseudo code of harmony search as a clustering.

The Arabic dataset was taken from (Al-Salemi & Aziz, 2011); more details about the dataset are in Section 4.1. Fig. 5 shows the results of 1,000 iterations for HSFS: each iteration uses new feature selections and uses HSClust to evaluate the F-measure. Based on our experiment, the internal MAD cost function is used as a metric that is the best one, and has the highest value. However, it does not represent the highest F-measure value. For instance, at iteration 1 the MAD metric in HSFS was around 0.102 and the value of F-measure is around 0.84. In contrast, at iteration 560, the MAD metric in HSFS was around 0.167, which is better than at iteration 1, but the F-measure value was 0.56, which is worse. The same experiment was conducted for ADDC as shown in Fig. 6. The internal ADDC metric that is the best one and has the lowest value does not represent the highest F-measure value.

Fig. 5

Comparison of (A) MAD and (B) the F-measure using harmony search for feature selection and clustering. MAD, mean absolute difference.

Fig. 6

Comparison of (A) ADDC and (B) the F-measure using harmony search for clustering. ADDC, average distance of documents to the cluster centroid.

3.3. Clustering Algorithms

The clustering process (Mehra et al., 2020) is comprised of grouping objects based on similarity. Such a method includes two essential kinds of clustering, namely, partitioning and hierarchical. The approaches of hierarchical clustering are believed to be better; nevertheless, at the start of the clustering process, they cannot usually recognize documents which are expected to be misclassified (Yahya, 2018). In addition, the complexity of time of hierarchical approaches is quadratic in terms of the number of objects of data (Wu et al., 2018). The approaches of partitioning clustering are better due to their fairly low computational complexity that enables them to deal with quite big datasets (Wu et al., 2018). The k-means approach has a crucial role in partitioning clustering (Bsoul et al., 2016), and it also has been utilized in partition-based clustering with linear time complexity (Wu et al., 2018). Furthermore, Hartigan (1981) stated that the essential aim of the k-means algorithm is that the documents mean assigned to such a cluster is used to symbolize each of the k clusters. Such a mean is named the cluster centroid. However, the k-means algorithm does not have sensitivity to the initialization and a priori clusters are needed. Moreover, the primary centroids play an essential role in the performance of clustering and might lead the algorithm to be stuck in a locally optimum solution (Selim & Ismail, 1984).

A number of researchers have conducted empirical research on the algorithms of clustering on different data sets while others have carried out research to identify the number of clusters or best clusters. One instance of identifying the best cluster is Dai et al. (2010a), which streamlined agglomerative hierarchical clustering by considering the significance of a document title. When the term existed in the title, it was assigned as a higher weight. The findings of their studies revealed that the proposed approach was efficient in clustering documents related to financial news. Nevertheless, some studies have investigated clustering themes. For instance, Bação et al. (2005) utilized self-organizing maps (SOMs) in order to cluster images “IRES,” for clustering data (as opposed to the clustering of text), and compared their approach with k-means. It was revealed that the SOM performance is better than that of k-means.

Other studies like Shahrivari and Jalili (2016) have made a comparison among complete single pass cluster, k-means, and others cluster algorithms in a data set. The findings of these studies revealed that k-means is more efficient than the other algorithms. Besides this, Bouras and Tsogkas (2010) utilized clustering approaches such as maximum, single, and centroid linkage hierarchical clustering, k-medians, regular k-means, and k-means++. The results of their study revealed that using k-means does not only produce the best results in terms of the internal metric of the clustering index function, but also provides better results on the experimentation of real users.

Other studies have also made a comparison among single-pass algorithms, k-means, and other algorithms for news topics clustering. For example, Jo (2009) found that k-means performed better than single-pass clustering. Besides this, Dai et al. (2010a) examined hybrid algorithms for clustering and presented a two-layer text clustering method which can detect the themes of retrospective news employing affinity propagation (AP) clustering. Furthermore, initial layer text clustering studies utilized AP clusters so as to produce the number of groups. Subsequently, they used common agglomerative hierarchical clustering to produce the final themes of news. Finally, such studies adopted classic k-means and usual agglomerative hierarchical clustering (AHC) as comparative approaches. The results showed that the proposed approach obtained the highest precision, followed by AP clustering, k-means clustering, and AHC, respectively. In terms of recall, it was revealed that the proposed approach as well as k-means achieved the highest results followed by the AHC as well as AP clustering approaches.

Velmurugan and Santhanan (2011) also made a comparison among three clustering algorithms on a geographic map data set. The findings of their study revealed that k-means was better with small data sets; k-medoids was better with huge data sets, and fuzzy c-means obtained qualitative results, which fell between those of k-means as well as k-medoids. Dueck and Frey (2007) also proposed AP clustering that could automatically produce many clusters. In fact, AP performs better than k-means in terms of precision, based on the findings of Dai et al. (2010a). However, an assessment of F-measure showed that k-means performs better than AP due to the fact that it has better recall. In addition, Qasim et al. (2013) made a comparison among four clusters utilizing 65 documents as a dataset. The findings of this study revealed that the best clustering approach is AP and then spectral, hierarchical, and ordinary k-means, respectively. Table 5 provides a summary of the studies on clustering.

Table 5

Previous work on cluster algorithms

Reference	Dataset used	Baseline algorithm	Outperform algorithm
Dai et al. (2010a)	Financial news	AHC	Enhance AHC
Bação et al. (2005)	IRIS	k-means	SOM
Shahrivari & Jalili (2016)	KDD	Scalable, complete k-means	k-means
Bouras & Tsogkas (2010)	Web	Single, maximum, centroid AHC, and k-medians, k-means++	k-means
Dueck & Frey (2007)	News	k-means, AHC, AP	APAHC
Velmurugan & Santhanan (2011)	Geographic map	Fuzzy c-means	k-means, k-medoids
Qasim et al. (2013)	65 documents news	Spectral, Hierarchical and K-means	AP

[i]

IRIS, flower dataset; KDD, knowledge discovery and data mining; AHC, agglomerative hierarchical clustering; AP, affinity propagation; SOM, self-organizing map; APAHC, affinity propagation agglomerative hierarchical clustering.

Furthermore, Silhouette Width (SW) (Alghamdi & Selamat, 2019; Campello & Hruschka, 2006) can be used to detect the numbers of clusters. A higher value of SW justifies better discrimination among clusters. Nevertheless, the biggest values of SW justify the best clustering (number of cluster). Numbers ranging between 2 and 10 are used to obtain the number of optimal cluster. When two clusters are recognized, the best SW value reached is 0.1710, but it is not helpful to discover the trend of two categories of crime because the number of clusters is insufficient to be analyzed. Thus, they utilized four clusters instead of two clusters.

Qasim et al. (2013) used AP clustering method to generate an optimal number of clustering and grouped 65 documents as datasets. However, they did not compare SW and Bisecting k-means with other algorithms, and they did not used big datasets to show the performance of AP on bigger datasets than 65 documents. We did three experiments to address the problem related to three clustering algorithms, namely: Spherical k-means, AP, and k-means cluster. The results of experiments for Arabic clustering are shown in Figs. 7, 8, and 9 and it illustrated the problem of local optima. The results showed that 1,000 independent runs can produce either good or bad performance. The performance of the k-means algorithm is mainly affected by the specified clusters number as well as the random selection of primary cluster centers as revealed in the previous experiments. The means of Spherical k-means were worse than AP and k-means while AP outperform the Spherical k-means and k-mean as illustrated in Table 6. Nevertheless, it does not work on huge data as noticed in Fig. 10. Thus, the present study focuses on dealing with the latter issue by proposing efficient algorithms so as to generate results that are less reliant on the selected primary cluster centers, and thus are more stabilized. Most of the existing work utilized Harmony Search optimization as clustering. Such studies are mainly conducted for English. However, similar studies have not been conducted for Arabic.

Table 6

Experiment to compare between five cluster algorithms using internal evaluation

Cluster algorithm	Best case	Worst case	Different
HSCLust	0.7815	0.783	0.002
HKM	0.4	0.4	0
K-means	0.401	0.478	0.077
SPK-means	0.484	0.522	0.038
AP	0.26825	0.26925	0.001

[i]

HSCLust, harmony search clustering; HKM, hadoop K-means; SPK-means, spherical K-means; AP, affinity propagation.

Fig. 7

BOW extraction using k-means for 1,000 independent runs. BOW, Bag of Words; ADDC, average distance of documents to the cluster centroid.

Fig. 8

BOW extraction using spherical k-means for 1,000 independent runs. BOW, Bag of Words; ADDC, average distance of documents to the cluster centroid.

Fig. 9

BOW extraction using affinity propagation for 1,000 independent runs. BOW, Bag of Words; ADDC, average distance of documents to the cluster centroid.

Fig. 10

The main drawback related to affinity propagation.

The choice of the primary cluster centers as well as the data distribution has a great effect on k-means and other algorithms. Therefore, it could get stuck in local search rather than global search. The achieved results are usually very good in most cases, particularly when the centroids of initial centers cluster are sufficiently selected far apart, where it would be usually possible to differentiate the major clusters in a certain data set. Moreover, the essential processes of k-means apart from the final partition of the dataset quality are both influenced by the cluster centroids initialization. Therefore, the first points have a great role in the quality of results. For example, the failure of the k-means algorithm to recognize the main clusters features in a certain data set is possible if they are similar or close and particularly if the k-means algorithm is left unsupervised. In addition, to be less reliant on initialization of a certain dataset, and to improve the performance of k-means algorithm, it is important to incorporate the k-means algorithm with an optimization. This will allow better initial clustering centroids, improve the number of the clustering centroids refinement, and recognize the centers of optimal clustering (Gbadoubissa et al., 2020; Mehra et al., 2020).

Meta-heuristics optimization, is widely known to be effective. There are two types of popular meta-heuristics optimization, namely, single-solution based meta-heuristics, also known as trajectory methods, and population-solution based meta-heuristics. Single-solution based meta-heuristics start with a single initial solution and navigate the search space as a trajectory. The main trajectory methods include simulated annealing, Tabu search, the Greedy randomized adaptive search (GRASP) method, variable neighborhood search, guided local search, iterated local search, and their variants. In contrast, population-solution based meta-heuristics do not handle a single solution but rather a set of solutions that is a population. The most widespread population-based approaches are related to evolutionary computation and swarm intelligence (SI). Darwin’s theory of evolution provided the concepts needed for evolutionary computation algorithms, because it includes recombination and mutation operators, which allow a population of individuals to be modified by exploiting simple analogues of social interaction, rather than purely individual cognitive abilities. The main aim of SI is to create computational intelligence. The use of nature-inspired meta-heuristic algorithms has been widely adopted in several domains, including computer science (Kushwaha & Pant, 2021), data mining (Grislin-Le Strugeon et al., 2021), texture (Paci et al., 2013), agriculture (Cisty, 2010), computer vision (Connolly et al., 2012), forecasting (Kaur & Sood, 2020), medicine and biology (Arle & Carlson, 2021), scheduling (Guo et al., 2009), economy (Zheng et al., 2020), and engineering (Sayed et al., 2018).

Through the literature it was found that existing clustering algorithms are still inadequate for the Quran themes clustering. Mainly, this is due to the Arabic clustering. In text clustering, similarity measures for documents are utilized and many terms, or features, are specified for each document. In contrast, distance measures between points in a space are used by data mining. The dissimilarity between text mining and data mining has also been mentioned by Hearst (1999). Note that data mining uses non-textual data as opposed to text mining, which utilizes textual data (Guo et al., 2009). However, a few works involve optimization as clustering. Mahdavi and Abolhassani (2009) employed harmony search as an English clustering technique and merged it with k-means on diverse datasets. They found the best ADDC measure to be harmony k-means, when compared with harmony search clustering, k-means, genetic k-means, the Mises-Fisher generative model-based algorithm, and PSO clustering. Forsati et al. (2013) generated three versions of harmony search with k-means and evaluated 13 scenarios to choose the best parameters related to harmony search for English clustering. The best proposed method was related to the combination of k-means with harmony search as cluster. Other works used other type of meta-heuristics. Cagnina et al. (2014) employed Practical Swarm Optimization as English clustering method on short text datasets and compared it with k-majorClust, CLUDIPSO, k-means, and CHAMELEON as a local search clustering. The best algorithm was an improved version of PSO.

In general, existing methods are summarized in Table 1 for the extraction of Arabic clustering and Table 6 for English text clustering, respectively. The limitations are as follows:

(1) Syntactic extraction is not adequate for extracting information from Quran themes. Semantic meanings are useful for Quran themes. As an example: Extracting nouns and verbs with their semantic meanings, where events have nouns to describe locations, names, themes, dates, and events also have verbs to describe themes, types, and reasons while simultaneously avoiding the extraction of unimportant features (Atwan et al., 2014).
(2) k-means can be the best clustering algorithm (Jain, 2010). However, in the case of Quran themes, it is not sufficient. k-means with a hybrid method with harmony search or spherical k-means could be considered. Furthermore, other researchers have provided a solution by integrating with other algorithms such as SW (Alghamdi & Selamat, 2019; Campello & Hruschka, 2006) to determine the number of clusters. However, their findings for SW (Alghamdi & Selamat, 2019; Campello & Hruschka, 2006) and integrated algorithms (Dai et al., 2010b) showed that their approach can be further improved in determining the correct number of clusters.
(3) Internal evaluations of ADDC for clustering and internal evaluation of MAD for unsupervised feature selection are acceptable to measure optimal centroids and optimal features. However, our experiments in Sections 3.2 and 3.3 showed that the F-measure of this internal evaluation needs to be increased for Arabic text clustering. This can be seen in the experiments shown in Fig. 5 and Fig. 6.
(4) All the clustering-based works solved the challenges of clustering documents by enhancing part of the process, where the output of each stage affects the accuracy of the next stage (Atwan et al., 2014). Section 4.2 shows the disadvantages of k-means clustering where it can affect the evaluation of the extraction methods. In our study this is discovered in Arabic clustering. Therefore, there is a need to evaluate the extraction methods using optimization algorithms that are less dependent on initial centroids. However, existing works do not used optimization-based clustering for Arabic text clustering to evaluate Arabic text extraction. The focus of our research is on the Arabic optimization clustering and optimization as feature selection for Arabic clustering. This is similar to work that has been done in English optimization clustering and optimization as a feature selection for English clustering.
(5) In addition, previous approaches do not employ multi-objective optimization as feature selection and as clustering using the same algorithm. In other words, it is necessary to employ an optimization method to simultaneously solve two problems: feature selection and clustering.

As stated before, there are three processes involved in Arabic clustering, from the low-level process of clustering up to the high-level process of extracting terms from the Quran themes. As demonstrated in our experiments in Section 4.2, we managed to identify the weakness of extracting information from the Quran themes when addressing the problem of the k-means clustering algorithm. In addition, it is very important to extract words such as nouns and verbs with their semantic meaning related to the themes within the Quran, whereby themes or types in the Quran have nouns to describe information such as names, themes, dates, and locations, while verbs can be used to describe information such as themes, types, and reasons.

Based on the limitations discussed, we can conclude that existing methods for detecting and identifying Arabic text clusters have weaknesses, either with respect to the method of extraction, unsupervised feature selection, and clustering algorithm, or the method of internal evaluation. These are such as ADDC for clustering and MAD for unsupervised feature selection. The problems are related to clustering algorithms and the need to verify the effectiveness of a clustering algorithm by evaluating the extraction process. Therefore, results of previous methods should often be suboptimal (Aouf et al., 2008) and the problem is related to extracting the most important terms for Quran themes (El Mahdaouy et al., 2018). In addition, experiments on Arabic clustering need to be conducted to show the effect of optimization methods such as harmony search on extraction method performance for both Arabic as well as English.

In this study, all the weaknesses mentioned are shown through experiments in the next section; furthermore, we propose solutions to assist the detection and identification of the groups of themes of the Quran using Arabic text optimization for feature selection and clustering.

4. PROPOSED QURAN THEME CLUSTERING APPROACH

This section consists of three subsections, each containing a future direction to be employed for Quran theme clustering. The first section suggests the establishment of new real-text Dataset. The dataset for this study were taken from Al-Salemi and Aziz (2011). The second section recommends extracting the most important terms, as discussed in Sections 2 and 3.1, with respect to the semantic meaning of nouns and verbs. Section 4.2 describes the selection of the most important features using the black hole (BH) method. In addition, we use the BH method to cluster the data to assign the optimal initial centroids for Quran themes. Using this concept, we propose multi-objective optimization for text clustering, especially for Arabic text clustering.

4.1. Data Collection and Performance Measure

In the conduced experiments, unedited, modern, and unmarked Arabic text was utilized, consisting of a sample of almost 1,680 documents gathered from a number of Arabic online resources. The initial dataset is comprised of four categories, namely, economics, art, sport, and politics articles. Each includes documents collected from Al-Salemi and Aziz (2011). Table 7 provides a summary of the dataset along with features number of every category, as well as the total of 13,464 words. However, the second dataset includes a set specifically designed to assess the extraction of Arabic text for three domains, namely, Arabic classifiers, Arabic clustering, and Arabic IR created as part of TREC 2001. The set has 383,872 Arabic documents, mostly newswire dispatches issued by Agence France Press between the years 1994 and 2000. Ground truth and standard TREC queries have been created for such collection: 25 queries were considered as part of TREC 2001, and the collection of queries has matching relevance judgments produced utilizing the technique of pooling. Based on this, part of TREC 2001 is defined for classifiers and clustering as revealed in Table 8, which include ten classes along with the number of words (a total of 19,508 words) as well as the number of related documents.

Table 7

Details of dataset for cluster and classifier domains taken from Al-Salemi and Aziz (2011)

Categories	# DOCUMENTS	# features
Art	420	2,951
Economics	420	3,859
Politics	420	3,172
Sport	420	3,482

Table 8

Description of document dataset categories taken from TREC 2001

Classes	# documents	# features
1	383	1,457
2	315	2,013
3	246	1,277
4	222	2,607
5	174	1,402
6	179	1,683
7	393	2,733
8	321	2,169
9	242	1,086
10	556	3,081

4.2. Proposed Term Extraction Method

As mentioned in Section 3.1 regarding the weakness of extraction methods, we suggest using Word-Net for noun and verb extraction to extract information that provides the most important features for describing specific Quran themes and their semantic meaning. This can be done by identifying whether a term could be a verb or noun by examination of the stemmed feature that exists in the Arabic Word-Net verb or noun database, as shown in Fig. 11. Quran themes could be identified and detected by looking at nouns as well as verbs along with their semantic meaning. In addition, the redundant terms extracted as nouns and verbs would be removed. In contrast, the problems related to the clusters used to evaluate the extraction methods, as mentioned before, lead us to verify this new problem experimentally.

Fig. 11

Proposed method to extract nouns and verbs.

As noted before, Table 1 shows a summary of the research for extractions. The main gap in this process is the evaluation, which uses clustering search that depends on the initial cluster centroids of each cluster’s group. Previous researchers have used independent runs of Arabic text clustering and then taken the average; however, that is not sufficient. In fact, they did not consider the effect of clustering algorithms on extraction to make the comparisons impartial and fair to prove this. BOW has been applied as extraction method using spherical k-means, AP, and k-means. To evaluate the four clusters, we used ADDC as the evaluation metric. Figs. 7, 8, and 9 show the results of 1,000 runs for spherical k-means, AP and k-means, and clustering, which have different results, either positive or negative. Good results are close to zero and this happened to all of them. However, the results of k-means showed that k-means are very dependent on initial cluster centroids. The dependency is less with spherical k-means and followed by AP. It can be seen that the worst result of k-means was close to the best result of spherical k-means. Moreover, AP’s results were better than the others, but on pairs of dataset there were “two classes of the dataset.” However, on big dataset, this is not possible. The AP method needs an impractically large memory to normalize the dataset, and this experiment is shown in Fig. 10.

4.3. Black Hole Optimization as Unsupervised Feature Selection and Clustering Method

In the last two decades, many meta-heuristic methods have been used for data mining, including simulated annealing (Güngör & Ünler, 2007), Tabu search (Kharrousheh et al., 2011), genetic algorithms (Liu et al., 2012), ant colony optimization (Niknam & Amiri, 2010), the neural gas algorithm (Qin & Suganthan, 2004), honey bee mating optimization (Fathian et al., 2007), differential evolution (Das et al., 2007), PSO (Izakian & Abraham, 2011), artificial bee colony optimization (Karaboga & Ozturk, 2011), gravitational search (Hatamlou et al., 2012), binary search (Hatamlou, 2012), firefly optimization (Senthilnath et al., 2011), big bang–big crunch (Hatamlou et al., 2011), harmony search k-means (Forsati et al., 2013), and BH optimization (Hatamlou, 2013). However, in this study we conducted an experiment to show the weaknesses of harmony search with k-means proposed for English clustering by Forsati et al. (2013). We used the best parameter settings obtained by Forsati et al. (2013), and the pseudo code of harmony search as a cluster with k-means shown in Figs. 3, 4, and 12. Comparisons of the results of harmony search clustering (Fig. 13) and one-step k-means (harmony k-means) are shown in Fig. 14, where the combined method was better than harmony search clustering. This can be seen in both Fig. 13 and Fig. 14. However, results shown in Figs. 7, 8, 9, and 13 showed that k-means clustering at times can produce better results than harmony k-means. Therefore, another optimization method can be proposed to overcome this weakness.

Fig. 12

One step k-means with harmony search.

Fig. 13

Harmony search for 1,000 iterations using BOW extraction. BOW, Bag of Words; ADDC, average distance of documents to the cluster centroid.

Fig. 14

One step k-means+harmony search for 1,000 iterations using BOW extraction. BOW, Bag of Words; ADDC, average distance of documents to the cluster centroid.

To address the weakness of harmony search, we suggest employing BH optimization as a cluster (Hatamlou, 2013) for Quran themes clustering. Owing to its few common features with other population-based methods, the BH optimization algorithm is in fact a population-based method. A population of candidate solutions to a given problem is generated and placed randomly within the search space, as in other population-based algorithms that utilize certain mechanisms to gradually improve the population to obtain the optimal solution. In genetic algorithms, for instance, mutation and crossover operations help to achieve gradual improvements. By moving the candidate solutions around in the search space, this progress can be achieved in PSO using the best locations found so far, which are updated when the candidates reach better locations. However, by shifting all the candidates towards the best candidate at each iteration, namely, as in BH, and considering newly generated candidates in the search space instead of those candidates included within the current set of the BH, the population in the BH algorithm can grow. First, to solve the BH problem, researchers tend to utilize benchmark functions (Zhang et al., 2008). However, a new mechanism, known as the BH, has been introduced into PSO because of the method suggested in Zhang et al. (2008). A new particle is created randomly near the best particle at each iteration in this method. After this, the algorithm updates the locations of the particles either using PSO or a new mechanism that is mainly based on two randomly created numbers. To summarize, PSO has paved the way for this method. BH is a new particle used to augment the convergence speed of the PSO and limit the convergence to local optima. Moreover, a BH can attract other particles under certain conditions. The theme horizon of the BH and the destruction of stars (candidates) has not been tackled in this approach, but such optimization has been treated in Hatamlou (2013). The best candidate at each iteration in the BH algorithm is considered to be a BH; however, all the remaining candidates are considered to be normal stars. A BH is not randomly created; it is created by one of the real candidates of the population. After this, depending on their current location and a random number, all the candidates are shifted towards the BH. The suggested BH algorithm is beneficial mainly because of its simple structure and easy implementation. Moreover, it is free from parameter tuning issues (Hatamlou, 2013).

The main notion of utilizing the BH for feature selection is basically to generate an area of space features which have a big amount of concentrated mass. Thus, the potential of a nearby object feature escaping its gravitational pull for significant features is reduced. Therefore, it is believed that anything falls into a BH, among them light, which is eternally gone from the universe. The recommended BH algorithm begins with employing a primary population of candidate solutions to an objective function which is estimated for them as well as to an optimization problem. Therefore, the best candidate is selected in each iteration as the BH whereas the others comprise the normal features or normal stars. Subsequently, the process of initialization is complete, and the BH begins with pulling the stars around its feature. If the resemblance between the BH centroid and star (feature) is high, it will be swallowed by the significant features of BH and is gone everlastingly. In this case, a fresh star (candidate solution) is created randomly and put in the search space, followed by a new search. The steps involved are as follows:

Loop based on the number of BHs that will be assigned.
For each star “feature,” evaluate the objective function for each BH using MAD.
Select the best stars “features” that have the best fitness value as the BH important features for all BHs.
Change the location of each star “features” according to x_i(t+1)=x_i(t)+rand×(x_BH–x_i(t)) ... i=1, 2, 3, ..., N
If a star “feature” reaches a location with higher similarity than the BH, exchange their locations.
If a star “feature” crosses the theme horizon of the BH, replace it with a new star “feature” in a random location in the search space.
When a termination criterion (a maximum number of iterations or a sufficiently good fitness) is met, exit the loop.

In the above equation, x_i(t) and x_i(t+1) are the i-th star of the locations at iterations t and t+1, respectively. In addition, x_BH is the location of the BH in the search space. Variable Rand is a random number in the interval [0, 1], and N is the number of stars (candidate solutions).

BH has been proposed as feature selection and currently we are introducing BH as clustering. The BH as cluster started with a fresh star (candidate solution). The candidate solution is created randomly and put in the search space, followed by a new search. The steps involved are as follows:

Loop based on the number of BHs that will be assigned.
For each star “number of group,” evaluate the objective function for each BH using ADDC.
Select the best stars “number of group” that have the best fitness value as the BH of each documents/versus belongto theme/group for all BHs.
Change the location of each star “number of group” according to x_i(t+1)=x_i(t)+rand×(x_BH–x_i(t)) ... i=1, 2, 3, ..., Number of groups
If a star “number of group” reaches a location with higher similarity than the BH, exchange their locations.
If a star “number of group” crosses the theme horizon of the BH, replace it with a new star “number of group” in a random location in the search space.
When a termination criterion (a maximum number of iterations or a sufficiently good fitness) is met, exit the loop.

Fig. 15 shows the proposed method using BH as feature selection, which is then followed by employing BH as a cluster for Quran themes clustering. The possibility of a nearby object that is the Quran theme document escaping its gravitational pull (i.e., that of the clusters) is minimized. The proposed BH with the combination of k-means are shown in Fig. 16. The proposed BH as a feature selection reduced the unimportant features and the proposed BH as a cluster will select the optimal initial centroid of each theme.

Fig. 15

The proposed method.

Fig. 16

The proposed Black Hole and k-means clustering method.

5. CONCLUSION

The ultimate aim of this paper was to employ Arabic text clustering to cluster Quran themes into clusters. Hence, this study reviewed the necessary improvements to Arabic text clustering, and suggested possible research directions for improving Arabic text clustering with respect to extraction, feature selection, and clustering. In this review paper, the limitations related to Arabic text clustering were discussed and the limitations of existing works were demonstrated and presented through our experiments. For instance, the k-means algorithm finds the best clusters compared to other algorithms as shown in Figs. 7, 8, 9, and 10. But the k-means weakness is in the initial centroids. Therefore, we proposed to use BH as a cluster to create the optimal Initial centroids. Next, we examined the weakness of term extraction methods and conducted a number of experiments to demonstrate the problems identified in existing approaches. For this problem we suggested using nouns and verbs with their semantic meaning as a new extraction method for Quran theme clustering. This can provide better performance compared to the current approach.

We identified a new problem related to extraction depending on the clustering algorithm used, and for that we suggest the use of BH as feature selection and as clustering to evaluate the proposed extraction. We also noted the limitations of using AP clustering on a big dataset, and our suggestion is to use BH as feature selection. For the dataset we have generated and published real Quran theme dataset and Arabic text dataset, which are freely available online.¹

ACKNOWLEDGMENTS

This paper presents a work that is supported by the Ministry of Higher Education (MOHE), under the Trans-disciplinary Research Grant Scheme (TRGS-MOHE) (TRGS/1/2015/USIM/01/1).

CONFLICTS OF INTEREST

No potential conflict of interest relevant to this article was reported.

REFERENCES

Abualigah, L. M., & Khader, A. T. (2017) Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering The Journal of Supercomputing, 73(11), 4773-4795 .

Abualigah, L. M., Khader, A. T., Al-Betar, M. A., Altamimi, A., & Almasalha, F. (Eds.) (2016, July 13-14) Proceedings of the 7th International Conference on Computer Science and Information Technology Institute of Electrical and Electronics Engineers Unsupervised feature selection technique based on genetic algorithm for improving the text clustering, 1-6,

Abualkishik, A. M., Omar, K., & Odiebat, G. A. (2015) QEFSM model and Markov Algorithm for translating Quran reciting rules into Braille code Journal of King Saud University - Computer and Information Sciences, 27(3), 238-247 .

Ahmad, R., Wazirali, R., Bsoul, Q., Abu-Ain, T., & Abu-Ain, W. (2021) Feature-selection and mutual-clustering approaches to improve DoS detection and maintain WSNs' lifetime Sensors (Basel, Switzerland), 21(14), 4821 . Article Id (pmcid)

Alghamdi, H. M., & Selamat, A. (2019) Arabic web page clustering: A review Journal of King Saud University - Computer and Information Sciences, 31(1), 1-14 .

Al-Salemi, B., & Aziz, M. J. A. (2011) Statistical Bayesian learning for automatic Arabic text categorization Journal of Computer Science, 7(1), 39-45 .

Al-Shammari, E., Lin, J., Lopresti, D., Roy, S., Schulz, K., & Subramaniam, L. V. (Eds.) (2008, July 24) Proceedings of the 2nd Workshop on Analytics for Noisy Unstructured Text Data Association for Computing Machinery A novel Arabic lemmatization algorithm, 113-118

Al-Smadi, M., Al-Ayyoub, M., Jararweh, Y., & Qawasmeh, O. (2019) Enhancing aspect-based sentiment analysis of Arabic hotels' reviews using morphological, syntactic and semantic features Information Processing & Management, 56(2), 308-319 .

Alweshah, M., Alkhalaileh, S., Albashish, D., Mafarja, M., Bsoul, Q., & Dorgham, O. (2021) A hybrid mine blast algorithm for feature selection problems Soft Computing, 25(1), 517-534 .

Al-Zoghby, A. M., Elshiwi, A., Atwan, A., Shaalan, K., Hassanien, A., & Tolba, F. (Eds.) (2018) Intelligent natural language processing: Trends and applications Springer Semantic relations extraction and ontology learning from Arabic texts-a survey, pp. 199-225

Ananiadou, S., Rea, B., Okazaki, N., Procter, R., & Thomas, J. (2009) Supporting systematic reviews using text mining Social Science Computer Review, 27(4), 509-523 .

Aouf, M., Liyanage, L., Hansen, S., Lee, V. C. S., Chen, J., Ng, W.-K., Ong, K.-L., & Tan, T. Y. (Eds.) (2008, June 30-July 2) Proceedings of the 2008 International Conference on Service Systems and Service Managament Institute of Electrical and Electronics Engineers Review of data mining clustering techniques to analyze data with high dimensionality as applied in gene expression data, 1-5,

Arle, J. E., & Carlson, K. W. (2021) Medical diagnosis and treatment is NP-complete Journal of Experimental & Theoretical Artificial Intelligence, 33(2), 297-312 .

Atwan, J., Mohd, M., Kanaan, G., Bsoul, Q., Jaafar, A., Ali, N. M., Noah, S. A. M., Smeaton, A. F., Bruza, P., Bakar, Z. A., Jamil, N., Mohd, T., & Sembok, T. (Eds.) (2014, December 3-5) Proceedings of the 10th Asia Information Retrieval Societies Conference Springer Impact of stemmer on Arabic text retrieval, 314-326

Azad, H. K., & Deepak, A. (2019) Query expansion techniques for information retrieval: A survey Information Processing & Management, 56(5), 1698-1735 .

Bação, F., Lobo, V., Painho, M., Sunderam, V. S., Albada, van, Sloot, P. M. A., & Dongarra, J. (Eds.) (2005, May 22-25) Proceedings of the 5th International Conference on Computational Science Springer Self-organizing maps as substitutes for K-means clustering, 476-483,

Beirade, F., Azzoune, H., & Zegour, D. E. (2021) Semantic query for Quranic ontology Journal of King Saud University - Computer and Information Sciences, 33(6), 753-760 .

Benabdallah, A., Abderrahim, M. A., & Abderrahim, M. E. A. (2017) Extraction of terms and semantic relationships from Arabic texts for automatic construction of an ontology International Journal of Speech Technology, 20(2), 289-296 .

Bharti, K. K., & Singh, P. K. (2014) A three-stage unsupervised dimension reduction method for text clustering Journal of Computational Science, 5(2), 156-169 .

Bouras, C., Tsogkas, V., Bellot, G. O., Sasaki, H., Ehmann, M., & Dini, C. (Eds.) (2010, May 9-15) Proceedings of the 5th International Conference on Internet and Web Applications and Services Institute of Electrical and Electronics Engineers Assigning web news to clusters, 1-6

Bsoul, Q., Al-Shamari, E., Mohd, M., Atwan, J., Jaafar, A., Ali, N. M., Noah, S. A. M., Smeaton, A. F., Bruza, P., Bakar, Z. A., Jamil, N., Mohd, T., & Sembok, T. (Eds.) (2014, December 3-5) Proceedings of the 10th Asia Information Retrieval Societies Conference Springer Distance measures and stemming impact on Arabic document clustering, 327-339,

Bsoul, Q., Salim, J., & Zakaria, L. Q. (2016) Document clustering approach to detect crime World Applied Sciences Journal, 34(8), 1026-1036 https://doi.org/10.5829/idosi.wasj.2016.34.8.109.

Cagnina, L., Errecalde, M., Ingaramo, D., & Rosso, P. (2014) An efficient particle swarm optimization approach to cluster short texts Information Sciences, 265, 36-49 .

Cambria, E., & White, B. (2014) Jumping NLP curves: A review of natural language processing research IEEE Computational Intelligence Magazine, 9(2), 48-57 .

Campello, R. J. G. B., & Hruschka, E. R. (2006) A fuzzy extension of the silhouette width criterion for cluster analysis Fuzzy Sets and Systems, 157(21), 2858-2875 .

Cisty, M., & Geem, Z. W. (Ed.) (2010) Recent advances in harmony search algorithm Springer Application of the harmony search optimization in irrigation, pp. 123-134

Connolly, J.-F., Granger, E., & Sabourin, R. (2012) An adaptive classification system for video-based face recognition Information Sciences, 192, 50-70 .

Dai, X., He, Y., Sun, Y., Wang, F. L., & Jin, T. (Eds.) (2010a, October 23-24) Proceedings of the Proceedings of the 2010 International Conference on Artificial Intelligence and Computational Intelligence Institute of Electrical and Electronics Engineers A two-layer text clustering approach for retrospective news event detection, 364-368,

Dai, X.-Y., Chen, Q.-C., Wang, X.-L., & Xu, J. (2010b, July 11-14) Proceedings of the 2010 International Conference on Machine Learning and Cybernetics Institute of Electrical and Electronics Engineers Online topic detection and tracking of financial news based on hierarchical clustering, 3341-3346,

Das, S., Abraham, A., & Konar, A. (2007) Automatic clustering using an improved differential evolution algorithm IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 38(1), 218-237 .

Dueck, D., & Frey, B. J. (2007, October 14-21) Proceedings of the 11th International Conference on Computer Vision Institute of Electrical and Electronics Engineers Non-metric affinity propagation for unsupervised image categorization, 1-8, , , Article Id (pmcid)

El Mahdaouy, A., El Alaoui Ouatik, S., & Gaussier, E. (2018) Improving Arabic information retrieval using word embedding similarities International Journal of Speech Technology, 21(1), 121-136 https://hal.archives-ouvertes.fr/hal-01706531.

Elayeb, B. (2019) Arabic word sense disambiguation: A review Artificial Intelligence Review, 52(4), 2475-2532 .

Farhan, Y. H., Mohd, M., & Noah, S. A. M. (2020) Survey of automatic query expansion for Arabic text retrieval Journal of Information Science Theory and Practice, 8(4), 67-86 https://doi.org/10.1633/JISTaP.2020.8.4.6.

Fathian, M., Amiri, B., & Maroosi, A. (2007) Application of honey-bee mating optimization algorithm on clustering Applied Mathematics and Computation, 190(2), 1502-1513 .

Forsati, R., Mahdavi, M., Shamsfard, M., & Reza Meybodi, M. (2013) Efficient stochastic algorithms for document clustering Information Sciences, 220, 269-291 .

Gbadoubissa, J. E. Z., Ari, A. A. A., & Gueroui, A. M. (2020) Efficient K-means based clustering scheme for mobile networks cell sites management Journal of King Saud University - Computer and Information Sciences, 32(9), 1063-1070 .

Grislin-Le Strugeon, E., Marcal, , de Oliveira, K., Thilliez, M., & Petit, D. (2021) A systematic mapping study on agent mining Journal of Experimental & Theoretical Artificial Intelligence .

Güngör, Z., & Ünler, A. (2007) K-harmonic means data clustering with simulated annealing heuristic Applied Mathematics and Computation, 184(2), 199-209 .

Guo, Y. W., Li, W. D., Mileham, A. R., & Owen, G. W. (2009) Applications of particle swarm optimisation in integrated process planning and scheduling Robotics and Computer-Integrated Manufacturing, 25(2), 280-288 .

Harrag, F. (2014) Text mining approach for knowledge extraction in Sahîh Al-Bukhari Computers in Human Behavior, 30, 558-566 .

Hartigan, J. A. (1981) Consistency of single linkage for high-density clusters Journal of the American Statistical Association, 76(374), 388-394 .

Hatamlou, A. (2012) In search of optimal centroids on data clustering using a binary search algorithm Pattern Recognition Letters, 33(13), 1756-1760 .

Hatamlou, A. (2013) Black hole: A new heuristic optimization approach for data clustering Information Sciences, 222, 175-184 .

Hatamlou, A., Abdullah, S., Hatamlou, M., Pichappan, P., Ahmadi, H., & Ariwa, E. (Eds.) Proceedings of the 1st International Conference on Innovative Computing Technology Springer Data clustering using big bang-big crunch algorithm, 383-388,

Hatamlou, A., Abdullah, S., & Nezamabadi-pour, H. (2012) A combined approach for clustering based on K-means and gravitational search algorithms Swarm and Evolutionary Computation, 6, 47-52 .

Hearst, M. A., Dale, R., & Church, K. (Eds.) Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics Association for Computational Linguistics Untangling text data mining, 3-10,

Helwe, C., & Elbassuoni, S. (2019) Arabic named entity recognition via deep co-learning Artificial Intelligence Review, 52(1), 197-215 .

Hmeidi, I., Hawashin, B., & El-Qawasmeh, E. (2008) Performance of KNN and SVM classifiers on full word Arabic articles Advanced Engineering Informatics, 22(1), 106-111 .

Izakian, H., & Abraham, A. (2011) Expert Systems with Applications: An International Journal, 38(3), 1835-1838 .

Jain, A. K. (2010) Data clustering: 50 Years beyond K-means Pattern Recognition Letters, 31(8), 651-666 .

Jo, T. Proceedings of the 1st International Conference on Networked Digital Technologies Institute of Electrical and Electronics Engineers Clustering news groups using inverted index based NTSO, 1-7, , , Article Id (pmcid)

Karaboga, D., & Ozturk, C. (2011) A novel clustering approach: Artificial Bee Colony (ABC) algorithm Applied Soft Computing, 11(1), 652-657 .

Kaur, A., & Sood, S. K. (2020) Cloud-Fog based framework for drought prediction and forecasting using artificial neural network and genetic algorithm Journal of Experimental & Theoretical Artificial Intelligence, 32(2), 273-289 .

Kharrousheh, A., Abdullah, S., & Nazri, M. Z. A. (2011) A modified Tabu search approach for the clustering problem Journal of Applied Sciences, 11(19), 3447-3453 .

Kushwaha, N., & Pant, M. (2021) Fuzzy electromagnetic optimisation clustering algorithm for collaborative filtering Journal of Experimental & Theoretical Artificial Intelligence, 33(4), 601-616 .

Liu, R., Jiao, L., Zhang, X., & Li, Y. (2012) Information Sciences: An International Journal, 204, 1-22 .

Mafarja, M. M., & Mirjalili, S. (2017) Hybrid whale optimization algorithm with simulated annealing for feature selection Neurocomputing, 260, 302-312 .

Mahdavi, M., & Abolhassani, H. (2009) Harmony K-means algorithm for document clustering Data Mining and Knowledge Discovery, 18(3), 370-391 .

Mehra, P. S., Doja, M. N., & Alam, B. (2020) Fuzzy based enhanced cluster head selection (FBECS) for WSN Journal of King Saud University - Science, 32(1), 390-401 .

Menai, M. E. B. (2014) Word sense disambiguation using evolutionary algorithms - application to Arabic language Computers in Human Behavior, 41(C), 92-103 .

Mesmia, F. B., Haddar, K., Friburger, N., Maurel, D., Shaalan, K., Hassanien, A., & Tolba, F. (Eds.) (2018) Intelligent natural language processing: Trends and applications Springer CasANER: Arabic named entity recognition tool, pp. 173-198

Mottaghinia, Z., Feizi-Derakhshi, M.-R., Farzinvash, L., & Salehpour, P. (2021) A review of approaches for topic detection in Twitter Journal of Experimental & Theoretical Artificial Intelligence, 33(5), 747-773 .

Mustafa, S. H. (2005) Information Processing and Management: An International Journal, 41(4), 819-827 .

Niknam, T., & Amiri, B. (2010) An efficient hybrid approach based on PSO, ACO and k-means for cluster analysis Applied Soft Computing, 10(1), 183-197 .

Paci, M., Nanni, L., & Severi, S. (2013) An ensemble of classifiers based on different texture descriptors for texture classification Journal of King Saud University - Science, 25(3), 235-244 .

Qasim, I., Jeong, J.-W., Heu, J.-U., & Lee, D.-H. (2013) Concept map construction from text documents using affinity propagation Journal of Information Science, 39(6), 719-736 .

Qin, A. K., & Suganthan, P. N. (2004) Robust growing neural gas algorithm with application in cluster analysis Neural Networks: The Official Journal of the International Neural Network Society, 17(8-9), 1135-1148 .

Raharjo, S., Wardoyo, R., & Putra, A. E. (2020) Detecting proper nouns in indonesian-language translation of the Quran using a guided method Journal of King Saud University - Computer and Information Sciences, 32(5), 583-591 .

Rostam, N. A. P., & Malim, N. H. A. H. (2021) Text categorisation in Quran and Hadith: Overcoming the interrelation challenges using machine learning and term weighting Journal of King Saud University - Computer and Information Sciences, 33(6), 658-667 .

Safee, M. A. M., Saudi, M. M., Pitchay, S. A., & Ridzuan, F. (2016) A systematic review analysis for Quran verses retrieval Journal of Engineering and Applied Sciences, 11(3), 629-634 https://doi.org/10.36478/jeasci.2016.629.634.

Salloum, S. A., Al Hamad, A. Q., Al-Emran, M., Shaalan, K., Shaalan, K., Hassanien, A. E., & Tolba, F. (Eds.) (2018) Intelligent natural language processing: Trends and applications Springer A survey of Arabic text mining, pp. 417-431

Saloot, M. A., Idris, N., Mahmud, R., Ja'afar, S., Thorleuchter, D., & Gani, A. (2016) Hadith data mining and classification: A comparative analysis Artificial Intelligence Review, 46(1), 113-128 .

Sayed, G. I., Darwish, A., & Hassanien, A. E. (2018) A new chaotic multi-verse optimization algorithm for solving engineering optimization problems Journal of Experimental & Theoretical Artificial Intelligence, 30(2), 293-317 .

Selim, S. Z., & Ismail, M. A. (1984) K-means-type algorithms: A generalized convergence theorem and characterization of local optimality IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(1), 81-87 .

Senthilnath, J., Omkar, S. N., & Mani, V. (2011) Clustering using firefly algorithm: Performance study Swarm and Evolutionary Computation, 1(3), 164-171 .

Shahrivari, S., & Jalili, S. (2016) Single-pass and linear-time k-means clustering based on MapReduce Information Systems, 60, 1-12 .

Tabakhi, S., Moradi, P., & Akhlaghian, F. (2014) An unsupervised feature selection algorithm based on ant colony optimization Engineering Applications of Artificial Intelligence, 32, 112-123 .

Touahri, I., & Mazroui, A. (2021) Studying the effect of characteristic vector alteration on Arabic sentiment classification Journal of King Saud University - Computer and Information Sciences, 33(7), 890-898 .

Velmurugan, T., & Santhanam, T. (2011) A survey of partition based clustering algorithms in data mining: An experimental approach Information Technology Journal, 10(3), 478-484 .

Wu, J., Dong, M., Ota, K., Li, J., & Guan, Z. (2018) Big data analysis-based secure cluster management for optimized control plane in software-defined networks IEEE Transactions on Network and Service Management, 15(1), 27-38 .

Yahya, A, A (2018) Centroid particle swarm optimisation for high-dimensional data classification Journal of Experimental & Theoretical Artificial Intelligence, 30(6), 857-886 .

Yauri, A. R., Kadir, R. A., Azman, A., & Murad, M. A. A. (2013) Quranic verse extraction base on concepts using OWL-DL ontology Research Journal of Applied Sciences, Engineering and Technology, 6(23), 4492-4498 .

Zhang, J., Liu, K., Tan, Y., & He, X. (2008, June 7-11) Proceedings of the 2008 International Conference on Neural Networks and Signal Processing Institute of Electrical and Electronics Engineers Random black hole particle swarm optimization and its application, 359-365

Zhang, Y., Li, H.-G., Wang, Q., & Peng, C. (2019) A filter-based bare-bone particle swarm optimization algorithm for unsupervised feature selection Applied Intelligence, 49(8), 2889-2898 .

Zheng, Z., Li, J., & Han, Y. (2020) An improved invasive weed optimization algorithm for solving dynamic economic dispatch problems with valve-point effects Journal of Experimental & Theoretical Artificial Intelligence, 32(5), 805-829 .

FOOTNOTE

¹ https://www.dropbox.com/s/yolg2s3qfj8zn7f/Arabic%20dataset.rar?dl=0

Submission Date: 2021-05-04

Revised Date: 2021-09-22

Accepted Date: 2021-10-10

바로가기메뉴

Article Contents

Vol.9 No.4

Arabic Text Clustering Methods and Suggested Solutions for Theme-Based Quran Clustering: Analysis of Literature

Abstract

1. INTRODUCTION

2. TEXT CLUSTERING FOR ARABIC LANGUAGE AND THE QURAN

Fig. 1

Fig. 2

3. TEXT CLUSTERING PROCESS

3.1. Terms Extraction

Table 1

3.2. Unsupervised Feature Selection

Table 2

Table 3

Table 4

Fig. 3

Fig. 4

Fig. 5

Fig. 6

3.3. Clustering Algorithms

Table 5

Table 6

Fig. 7

Fig. 8

Fig. 9

Fig. 10

4. PROPOSED QURAN THEME CLUSTERING APPROACH

4.1. Data Collection and Performance Measure

Table 7

Table 8

4.2. Proposed Term Extraction Method

Fig. 11

4.3. Black Hole Optimization as Unsupervised Feature Selection and Clustering Method

Fig. 12

Fig. 13

Fig. 14

Fig. 15

Fig. 16

5. CONCLUSION

ACKNOWLEDGMENTS

CONFLICTS OF INTEREST

REFERENCES

FOOTNOTE

JOURNAL OF INFORMATION SCIENCE THEORY AND PRACTICE