CTAB: Corpus of Tunisian Arabizi

dc.contributor.author: Amara, Amina
dc.contributor.author: Turki, Houcemeddine
dc.contributor.author: Hadj Taieb, Mohamed Ali
dc.contributor.author: Ben Aouicha, Mohamed
dc.contributor.author: Ellouze, Kawthar
dc.date.accessioned: 2024-03-17T20:17:17Z
dc.date.available: 2024-03-17T20:17:17Z
dc.date.issued: 2021-05-11
dc.description.abstract: This dataset was created between 2017 and 2021 to provide a textual resource for studying how Tunisian people write Tunisian Arabic (ISO 639-3: aeb) in Latin script. The corpus consists of messages written in the Tunisian Arabic Chat Alphabet, or Arabizi, and was developed to address the lack of NLP datasets on the use of Latin script for transcribing Tunisian Arabic. The messages were collected automatically by web scraping public Facebook pages and are kept as they are, without any annotation, spelling adjustment, or morphological or syntactic labeling. Messages written in Latin script but not in Tunisian Arabic were then manually removed. Finally, all messages retrieved from the same Facebook page in the same period are stored in the same text file, one message per line.
dc.description.provenance: Submitted by Louis Kalampa (louiekalampa@gmail.com) on 2024-03-17T20:17:17Z No. of bitstreams: 15 CTAB-SAMPLE0013.txt: 21265 bytes, checksum: a4f6b25fe7a97df601750f538046bb4c (MD5) CTAB-SAMPLE0012.txt: 32657 bytes, checksum: 263ac4be6e096a3078e08ae8a5c4fb3f (MD5) CTAB-SAMPLE0011.txt: 39761 bytes, checksum: 89c64af867793fe690531722348edc1d (MD5) CTAB-SAMPLE0010.txt: 27756 bytes, checksum: 7424d805e2502b222805c03bf956a1e5 (MD5) CTAB-SAMPLE0009.txt: 20541 bytes, checksum: 2d51ebbeeaa5431c36765f76e438f3c8 (MD5) CTAB-SAMPLE0008.txt: 60071 bytes, checksum: 17e4180b6af6965217a5a0638353d0a8 (MD5) CTAB-SAMPLE0007.txt: 5070 bytes, checksum: 6f3a59e99c2330a45f520569c38bb18d (MD5) CTAB-SAMPLE0006.txt: 21659 bytes, checksum: b01d16e95e674e3ded630e92589cb33e (MD5) CTAB-SAMPLE0005.txt: 23349 bytes, checksum: 3b8f8919a844931bee9070d12f5de8a7 (MD5) CTAB-SAMPLE0004.txt: 22917 bytes, checksum: db569c9391eadb4b5f796facdf853802 (MD5) CTAB-SAMPLE0004.txt: 22917 bytes, checksum: db569c9391eadb4b5f796facdf853802 (MD5) CTAB-SAMPLE0003.txt: 3114 bytes, checksum: ea1a3a417892d9935db8ba70d0fcf03e (MD5) CTAB-SAMPLE0002.txt: 5912 bytes, checksum: 28a1c0d0bae0ccb61427446748e714bb (MD5) CTAB-SAMPLE0001.txt: 22391 bytes, checksum: dc42ee73cf7e5b125188785d0e26306f (MD5) CTAB.pdf: 50098 bytes, checksum: 5dee23175faaaf83fffea0d4bcda78de (MD5) [en]
dc.description.provenance: Made available in DSpace on 2024-03-17T20:17:17Z (GMT). No. of bitstreams: 15 CTAB-SAMPLE0013.txt: 21265 bytes, checksum: a4f6b25fe7a97df601750f538046bb4c (MD5) CTAB-SAMPLE0012.txt: 32657 bytes, checksum: 263ac4be6e096a3078e08ae8a5c4fb3f (MD5) CTAB-SAMPLE0011.txt: 39761 bytes, checksum: 89c64af867793fe690531722348edc1d (MD5) CTAB-SAMPLE0010.txt: 27756 bytes, checksum: 7424d805e2502b222805c03bf956a1e5 (MD5) CTAB-SAMPLE0009.txt: 20541 bytes, checksum: 2d51ebbeeaa5431c36765f76e438f3c8 (MD5) CTAB-SAMPLE0008.txt: 60071 bytes, checksum: 17e4180b6af6965217a5a0638353d0a8 (MD5) CTAB-SAMPLE0007.txt: 5070 bytes, checksum: 6f3a59e99c2330a45f520569c38bb18d (MD5) CTAB-SAMPLE0006.txt: 21659 bytes, checksum: b01d16e95e674e3ded630e92589cb33e (MD5) CTAB-SAMPLE0005.txt: 23349 bytes, checksum: 3b8f8919a844931bee9070d12f5de8a7 (MD5) CTAB-SAMPLE0004.txt: 22917 bytes, checksum: db569c9391eadb4b5f796facdf853802 (MD5) CTAB-SAMPLE0004.txt: 22917 bytes, checksum: db569c9391eadb4b5f796facdf853802 (MD5) CTAB-SAMPLE0003.txt: 3114 bytes, checksum: ea1a3a417892d9935db8ba70d0fcf03e (MD5) CTAB-SAMPLE0002.txt: 5912 bytes, checksum: 28a1c0d0bae0ccb61427446748e714bb (MD5) CTAB-SAMPLE0001.txt: 22391 bytes, checksum: dc42ee73cf7e5b125188785d0e26306f (MD5) CTAB.pdf: 50098 bytes, checksum: 5dee23175faaaf83fffea0d4bcda78de (MD5) Previous issue date: 2021-05-11 [en]
dc.identifier.doi: https://doi.org/10.5281/zenodo.4781769
dc.identifier.doi: https://doi.org/10.60763/africarxiv/704
dc.identifier.uri: https://africarxiv.ubuntunet.net/handle/1/748
dc.subject: Tunisian Arabic
dc.subject: Latin Script
dc.subject: Arabic Chat Alphabet
dc.title: CTAB: Corpus of Tunisian Arabizi
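
The abstract describes the corpus layout as one text file per Facebook page and collection period, with one message per line. A minimal sketch of reading that layout in Python follows. The function name load_ctab_messages, the idea of a local download directory, and the UTF-8 encoding are assumptions made for illustration; the record itself does not state the file encoding.

```python
from pathlib import Path


def load_ctab_messages(corpus_dir):
    """Return {file name: [messages]} for each CTAB-SAMPLE*.txt file in corpus_dir.

    Hypothetical helper: the directory is wherever the sample files were saved.
    """
    corpus = {}
    for path in sorted(Path(corpus_dir).glob("CTAB-SAMPLE*.txt")):
        # UTF-8 is an assumption; errors="replace" avoids a crash on stray bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        # One message per line, as described in the abstract; blank lines are skipped.
        corpus[path.name] = [line.strip() for line in text.splitlines() if line.strip()]
    return corpus
```

Keeping the messages grouped by file name preserves the only grouping the record documents (same page, same period), which may matter for analyses that compare writing habits across sources.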

Files

Original bundle

Showing 5 of 15 files:
CTAB-SAMPLE0013.txt (20.77 KB, Plain Text)
CTAB-SAMPLE0012.txt (31.89 KB, Plain Text)
CTAB-SAMPLE0011.txt (38.83 KB, Plain Text)
CTAB-SAMPLE0010.txt (27.11 KB, Plain Text)
CTAB-SAMPLE0009.txt (20.06 KB, Plain Text)
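
The dc.description.provenance field lists an MD5 checksum for each file, so a downloaded copy can be checked against the record. The sketch below uses Python's standard hashlib module; the names md5_of_file, matches_record, and EXPECTED_MD5 are hypothetical, and only three of the recorded checksums are copied in as examples.

```python
import hashlib

# Checksums copied from the dc.description.provenance field of this record.
EXPECTED_MD5 = {
    "CTAB-SAMPLE0001.txt": "dc42ee73cf7e5b125188785d0e26306f",
    "CTAB-SAMPLE0002.txt": "28a1c0d0bae0ccb61427446748e714bb",
    "CTAB-SAMPLE0003.txt": "ea1a3a417892d9935db8ba70d0fcf03e",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, file_name):
    """Return True if the local file's MD5 equals the checksum listed for file_name."""
    return md5_of_file(path) == EXPECTED_MD5[file_name]
```

For example, matches_record("downloads/CTAB-SAMPLE0001.txt", "CTAB-SAMPLE0001.txt") would compare a local copy (the downloads/ path is an assumed location) against the recorded value. MD5 is adequate here for detecting corrupted downloads, not for security purposes.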

License bundle

license.txt (1.72 KB, Item-specific license agreed to upon submission)