No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
dc.contributor.author Bustamante G.
dc.contributor.author Oncevay A.
dc.contributor.author Zariquiey R.
dc.date.accessioned 2024-05-30T23:13:38Z
dc.date.available 2024-05-30T23:13:38Z
dc.date.issued 2020
dc.description.abstract We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, extracting and processing texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation combines language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays.
dc.description.sponsorship Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica - Concytec
dc.identifier.scopus 2-s2.0-85096526337
dc.language.iso eng
dc.publisher European Language Resources Association (ELRA)
dc.relation.ispartof LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
dc.rights info:eu-repo/semantics/openAccess
dc.subject Yine
dc.subject Ashaninka
dc.subject Corpus creation
dc.subject Endangered languages
dc.subject Indigenous languages
dc.subject Low-resource languages
dc.subject Monolingual corpus
dc.subject PDF processing
dc.subject Shipibo-Konibo
dc.subject Yanesha
dc.title No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
dc.type info:eu-repo/semantics/article
dspace.entity.type Publication