Datasets

Permanent URI for this collection

https://africarxiv.ubuntunet.net/handle/1/273

CTAB: Corpus of Tunisian Arabizi
(2021-05-11) Amara, Amina; Turki, Houcemeddine; Hadj Taieb, Mohamed Ali; Ben Aouicha, Mohamed; Ellouze, Kawthar
This dataset was created between 2017 and 2021 to provide a textual resource for studying how Tunisian people write Tunisian Arabic (ISO 639-3: aeb) in Latin script. The corpus consists of messages written in the Tunisian Arabic Chat Alphabet, or Arabizi, and was developed to address the lack of NLP resources on the use of Latin script for transcribing Tunisian Arabic. The messages were collected automatically by web scraping public Facebook pages and are kept as they are, without annotation, spelling adjustment, or morphological and syntactic labeling. Messages written in Latin script but not in Tunisian Arabic were then manually removed. Finally, all messages retrieved from the same Facebook page in the same period are stored in a single text file, one message per line.
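Since the corpus is described as plain text files with one message per line, a minimal loading sketch in Python might look like the following. The `.txt` extension, the UTF-8 encoding, the skipping of blank lines, and the function names `read_messages` and `load_corpus` are assumptions made for illustration; they are not stated in the dataset description.

```python
from pathlib import Path


def read_messages(path):
    """Read one corpus file and return its messages, one per line.

    Assumes UTF-8 text; blank lines are skipped (an assumption, since
    the description only says each message occupies one line).
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def load_corpus(directory):
    """Map each file name in `directory` to its list of messages.

    The "*.txt" pattern is hypothetical; adjust it to the actual files.
    """
    return {
        path.name: read_messages(path)
        for path in sorted(Path(directory).glob("*.txt"))
    }
```

Keeping the messages grouped by file preserves the page-and-period grouping that the description mentions, which may matter for studies of writing habits over time.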
Database of Parenthetic Biomedical Abbreviations
(2020-11-19) Turki, Houcemeddine; Hadj Taieb, Mohamed Ali; Ben Aouicha, Mohamed
This dataset includes the biomedical abbreviations that appear in parentheses in the titles of scholarly publications indexed by PubMed between 1947 and 2019. Each abbreviation is extracted using the parenthetic level count algorithm and is linked to the title, PMID, and publication year of the corresponding research paper. Each abbreviation is then annotated with its length and the number of uppercase and lowercase letters it contains. Finally, entries with one or no uppercase letter, fewer than three characters, eight or more characters, or a high rate of non-alphanumeric characters are semi-automatically removed to ensure the consistency of the database.
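The extraction and filtering steps described above can be sketched roughly as follows. This is not the authors' implementation: it assumes that "parenthetic level count" means tracking parenthesis nesting depth and capturing each top-level span, and the 0.3 threshold for the rate of non-alphanumeric characters is a hypothetical value, since the description only says "a high rate". The function names are likewise illustrative.

```python
def extract_parenthetic(title):
    """Return the text of each top-level parenthesised span in a title.

    A depth counter rises on "(" and falls on ")"; a span is captured
    when the depth returns to zero. Unmatched ")" are ignored.
    """
    spans, depth, start = [], 0, None
    for index, char in enumerate(title):
        if char == "(":
            if depth == 0:
                start = index + 1
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(title[start:index])
    return spans


def describe(candidate):
    """Attach the length and letter-case counts named in the description."""
    return {
        "abbreviation": candidate,
        "length": len(candidate),
        "upper": sum(c.isupper() for c in candidate),
        "lower": sum(c.islower() for c in candidate),
    }


def keep(candidate, max_non_alnum_rate=0.3):
    """Apply the elimination rules; the rate threshold is an assumption."""
    if not candidate:
        return False
    upper = sum(c.isupper() for c in candidate)
    non_alnum = sum(not c.isalnum() for c in candidate)
    return (
        upper >= 2                      # drop one or no uppercase letter
        and 3 <= len(candidate) <= 7    # drop < 3 or >= 8 characters
        and non_alnum / len(candidate) <= max_non_alnum_rate
    )
```

For a title such as "Chronic obstructive pulmonary disease (COPD) in adults (a review)", `extract_parenthetic` yields both parenthesised spans, and `keep` retains only "COPD". The original pipeline is described as semi-automatic, so a manual review step would still follow a filter like this.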
