Impact of Parallel vs. Non Parallel Corpora on the Identification of Arabic Dialects

Lichouri, Mohamed; Abbas, Mourad; Lounnas, Khaled

Files

Impact of Parallel vs. Non Parallel Corpora on the Identification of Arabic Dialects.pdf (222.22 KB)

Date

2023-01-31

Authors

Lichouri, Mohamed

Abbas, Mourad

Lounnas, Khaled

Abstract

In this paper, we evaluate the performance of statistical and neural methods for classifying Arabic dialects (AD). The evaluation is based on two kinds of corpora. The first is PADIC (Parallel Arabic DIalectal Corpus), a multi-dialectal corpus covering six dialects: two Algerian dialects (from the cities of Algiers and Annaba), Palestinian, Syrian, Tunisian, and Moroccan, in addition to Modern Standard Arabic (MSA). The second, AraDial, is a manually collected corpus that contains the same dialects and the same number of sentences as PADIC (6412 sentences per dialect). In our experiments, we used both statistical and neural classifiers, namely: Gaussian Naive Bayes, Bernoulli Naive Bayes, kNN, Logistic Regression, SGD Classifier, Passive Aggressive Classifier, Perceptron, Linear Support Vector Classifier, and a Convolutional Neural Network. We evaluated these classifiers in two setups: training on the parallel corpus (PADIC) and testing on the non-parallel corpus (AraDial), and vice versa. The results show that training our system on the non-parallel corpus gives better results, with a mean score of 92.08%.
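
The two-way setup described in the abstract (train on one corpus, test on the other, then swap) can be illustrated with a minimal sketch. This is not the authors' code: the character trigram features, the pure-Python perceptron, the function names, and the toy placeholder sentences are all assumptions made for illustration, and none of the strings below come from PADIC or AraDial.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a space-padded sentence."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class Perceptron:
    """Multiclass perceptron over binary character n-gram features (illustrative only)."""

    def __init__(self, epochs=10):
        self.epochs = epochs
        self.weights = defaultdict(lambda: defaultdict(float))  # label -> feature -> weight
        self.labels = []

    def score(self, feats, label):
        w = self.weights[label]
        return sum(w.get(f, 0.0) for f in feats)

    def predict(self, text):
        feats = char_ngrams(text)
        # Ties are broken in favour of the first label in sorted order.
        return max(self.labels, key=lambda lab: self.score(feats, lab))

    def fit(self, sentences, labels):
        self.labels = sorted(set(labels))
        for _ in range(self.epochs):
            for text, gold in zip(sentences, labels):
                guess = self.predict(text)
                if guess != gold:
                    # Standard perceptron update: reward the gold label, penalise the guess.
                    for f in char_ngrams(text):
                        self.weights[gold][f] += 1.0
                        self.weights[guess][f] -= 1.0
        return self


def accuracy(model, sentences, labels):
    hits = sum(model.predict(s) == y for s, y in zip(sentences, labels))
    return hits / len(labels)


def cross_corpus_eval(parallel, non_parallel):
    """Train on one corpus and test on the other, in both directions."""
    setups = {
        "parallel->non_parallel": (parallel, non_parallel),
        "non_parallel->parallel": (non_parallel, parallel),
    }
    results = {}
    for name, (train, test) in setups.items():
        model = Perceptron().fit(*train)
        results[name] = accuracy(model, *test)
    return results


# Hypothetical placeholder data: (sentences, labels). Not real dialect text.
toy_parallel = (["aaa bbb", "ccc ddd"], ["dialect_1", "dialect_2"])
toy_non_parallel = (["bbb aaa aaa", "ddd ccc"], ["dialect_1", "dialect_2"])

results = cross_corpus_eval(toy_parallel, toy_non_parallel)
print(results)
```

In a real experiment, each corpus would be loaded from disk as a list of sentences with one dialect label per sentence, and the perceptron could be swapped for any of the classifiers listed in the abstract.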

Keywords

Arabic, Dialect Identification, Non-Parallel Corpus, PADIC, Parallel Corpus

DOI

https://doi.org/10.31730/osf.io/grxt6
https://doi.org/10.60763/africarxiv/481

URI

https://africarxiv.ubuntunet.net/handle/1/523

Collections

Preprint
