Impact of Parallel vs. Non Parallel Corpora on the Identification of Arabic Dialects

Abstract

In this paper, we evaluate the performance of statistical and neural methods for classifying Arabic Dialects (AD). The evaluation is based on two kinds of corpora. The first is PADIC (Parallel Arabic DIalectal Corpus), a multi-dialectal corpus covering six dialects: two Algerian dialects (those of the cities of Algiers and Annaba), Palestinian, Syrian, Tunisian, and Moroccan, in addition to Modern Standard Arabic (MSA). The second, AraDial, is a manually collected corpus that contains the same dialects and the same number of sentences as PADIC (6412 sentences per dialect). In our experiments, we used both statistical and neural classifiers, namely Gaussian Naive Bayes, Bernoulli Naive Bayes, kNN, Logistic Regression, SGD Classifier, Passive Aggressive Classifier, Perceptron, Linear Support Vector, and Convolutional Neural Network classifiers. We evaluated these classifiers in two setups: training on the parallel corpus (PADIC) and testing on the non-parallel corpus (AraDial), and vice versa. The results show that training our system on the non-parallel corpus gives better results, with a mean score of 92.08%.
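The two evaluation setups described above (train on one corpus, test on the other, then swap) can be sketched as follows. This is only an illustrative sketch, not the authors' code: the character-bigram presence features, the small pure-Python Bernoulli Naive Bayes, the helper names (`char_ngrams`, `cross_corpus_eval`), and the placeholder strings and labels are all assumptions made for the example, since the abstract does not specify features or implementation details.

```python
import math
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return the set of character n-grams in text (binary presence features)."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class BernoulliNB:
    """Tiny Bernoulli Naive Bayes with Laplace smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.vocab = set()
        self.doc_count = defaultdict(int)
        self.feat_count = defaultdict(lambda: defaultdict(int))
        for text, label in zip(texts, labels):
            feats = char_ngrams(text)
            self.vocab |= feats
            self.doc_count[label] += 1
            for f in feats:
                self.feat_count[label][f] += 1
        self.total = len(texts)
        return self

    def predict_one(self, text):
        # N-grams never seen in training are ignored.
        feats = char_ngrams(text) & self.vocab
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_count.items():
            score = math.log(n_docs / self.total)  # class prior
            for f in self.vocab:
                # Smoothed probability that feature f is present in this class.
                p = (self.feat_count[label][f] + 1) / (n_docs + 2)
                score += math.log(p if f in feats else 1 - p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the gold label."""
    predictions = model.predict(texts)
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)


def cross_corpus_eval(parallel, non_parallel):
    """Train on one corpus and test on the other, in both directions."""
    results = {}
    setups = {
        "train_parallel_test_non_parallel": (parallel, non_parallel),
        "train_non_parallel_test_parallel": (non_parallel, parallel),
    }
    for name, (train, test) in setups.items():
        model = BernoulliNB().fit(*zip(*train))
        results[name] = accuracy(model, *zip(*test))
    return results


# Placeholder data (hypothetical): stands in for PADIC and AraDial sentences.
parallel_toy = [
    ("abab baab", "dialect_A"), ("aabb abba", "dialect_A"),
    ("zyzy yzzy", "dialect_B"), ("zzyy zyyz", "dialect_B"),
]
non_parallel_toy = [
    ("bbaa abab", "dialect_A"), ("baba aab", "dialect_A"),
    ("yyzz zyzy", "dialect_B"), ("yzyz zzy", "dialect_B"),
]

scores = cross_corpus_eval(parallel_toy, non_parallel_toy)
for setup, acc in scores.items():
    print(f"{setup}: {acc:.2f}")
```

The two entries returned by `cross_corpus_eval` correspond to the two directions compared in the study; the toy scores say nothing about the reported results and only show how the comparison is structured.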

Keywords

Arabic, Dialect Identification, Non-Parallel Corpus, PADIC, Parallel Corpus
