Impact of Parallel vs. Non Parallel Corpora on the Identification of Arabic Dialects

dc.contributor.author: Lichouri, Mohamed
dc.contributor.author: Abbas, Mourad
dc.contributor.author: Lounnas, Khaled
dc.date.accessioned: 2024-03-15T07:07:34Z
dc.date.available: 2024-03-15T07:07:34Z
dc.date.issued: 2023-01-31
dc.description.abstract: In this paper, we evaluate the performance of statistical and neural methods for classifying Arabic Dialects (AD). The evaluation is based on two kinds of corpora. The first is PADIC (Parallel Arabic DIalectal Corpus), a multi-dialectal corpus composed of six dialects: two Algerian dialects (of Algiers and Annaba cities), Palestinian, Syrian, Tunisian, and Moroccan, in addition to Modern Standard Arabic (MSA). The second (AraDial) is a manually collected corpus that contains the same dialects and the same number of sentences as PADIC (6412 sentences for each dialect). In our experiments, we used both statistical and neural classifiers, namely: Gaussian Naive Bayes, Bernoulli Naive Bayes, kNN, Logistic Regression, SGD Classifier, Passive Aggressive Classifier, Perceptron, Linear Support Vector, and Convolutional Neural Network Classifiers. We evaluated these classifiers in two setups: training on the parallel corpus (PADIC) and testing on the non-parallel corpus (AraDial), and vice versa. The results show that training our system on the non-parallel corpus gives better results, with a mean score of 92.08%.
dc.identifier.doi: https://doi.org/10.31730/osf.io/grxt6
dc.identifier.uri: https://africarxiv.ubuntunet.net/handle/1/523
dc.identifier.uri: https://doi.org/10.60763/africarxiv/481
dc.subject: Arabic
dc.subject: Dialect Identification
dc.subject: Non-Parallel Corpus
dc.subject: PADIC
dc.subject: Parallel Corpus
dc.title: Impact of Parallel vs. Non Parallel Corpora on the Identification of Arabic Dialects
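
The two evaluation setups described in the abstract (train on one corpus, test on the other, then swap) can be sketched roughly as follows. This is only an illustration under stated assumptions: the character-bigram features, the small hand-written Bernoulli Naive Bayes, and the toy strings and labels (`d1`, `d2`) are all hypothetical stand-ins, not the authors' features, implementation, or data.

```python
import math
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return the set of character n-grams in a sentence (assumed feature type)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class BernoulliNB:
    """Tiny Bernoulli Naive Bayes over binary n-gram presence features."""

    def fit(self, sentences, labels):
        self.vocab = set()
        docs = defaultdict(list)
        for sentence, label in zip(sentences, labels):
            feats = char_ngrams(sentence)
            self.vocab |= feats
            docs[label].append(feats)
        total = len(sentences)
        self.log_prior = {y: math.log(len(d) / total) for y, d in docs.items()}
        # Laplace-smoothed P(feature present | class); always strictly in (0, 1).
        self.p = {
            y: {f: (sum(f in feats for feats in d) + 1) / (len(d) + 2)
                for f in self.vocab}
            for y, d in docs.items()
        }
        return self

    def predict(self, sentences):
        predictions = []
        for sentence in sentences:
            feats = char_ngrams(sentence)

            def score(y):
                # Sum log-likelihoods of presence/absence over the training vocabulary.
                return self.log_prior[y] + sum(
                    math.log(p if f in feats else 1.0 - p)
                    for f, p in self.p[y].items()
                )

            predictions.append(max(sorted(self.log_prior), key=score))
        return predictions


def accuracy(model, sentences, labels):
    """Fraction of sentences whose predicted label matches the true label."""
    preds = model.predict(sentences)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def cross_corpus_eval(corpus_a, corpus_b):
    """Train on corpus A and test on corpus B, then the reverse."""
    a_x, a_y = corpus_a
    b_x, b_y = corpus_b
    return {
        "train_a_test_b": accuracy(BernoulliNB().fit(a_x, a_y), b_x, b_y),
        "train_b_test_a": accuracy(BernoulliNB().fit(b_x, b_y), a_x, a_y),
    }


# Toy placeholder corpora (NOT real dialect data): (sentences, labels).
toy_parallel = (["abab ab", "baba ba", "xyxy xy", "yxyx yx"],
                ["d1", "d1", "d2", "d2"])
toy_non_parallel = (["ab ba ab", "bab", "xy yx", "yxy xyx"],
                    ["d1", "d1", "d2", "d2"])

results = cross_corpus_eval(toy_parallel, toy_non_parallel)
print(results)
```

With real data, `toy_parallel` and `toy_non_parallel` would be replaced by the PADIC and AraDial sentences with their dialect labels, and the stand-in classifier by any of the classifiers listed in the abstract.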

Files

Original bundle
Name: Impact of Parallel vs. Non Parallel Corpora on the Identification of Arabic Dialects.pdf
Size: 222.22 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.72 KB
Format: Item-specific license agreed to upon submission