Publication:
No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
dc.contributor.author | Bustamante G. | es_PE |
dc.contributor.author | Oncevay A. | es_PE |
dc.contributor.author | Zariquiey R. | es_PE |
dc.date.accessioned | 2024-05-30T23:13:38Z | |
dc.date.available | 2024-05-30T23:13:38Z | |
dc.date.issued | 2020 | |
dc.description.abstract | We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation considers language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays. | |
dc.description.sponsorship | Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica - Concytec | |
dc.identifier.scopus | 2-s2.0-85096526337 | |
dc.identifier.uri | https://hdl.handle.net/20.500.12390/2648 | |
dc.language.iso | eng | |
dc.publisher | European Language Resources Association (ELRA) | |
dc.relation.ispartof | LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings | |
dc.rights | info:eu-repo/semantics/openAccess | |
dc.subject | Yine | |
dc.subject | Ashaninka | es_PE |
dc.subject | Corpus creation | es_PE |
dc.subject | Endangered languages | es_PE |
dc.subject | Indigenous languages | es_PE |
dc.subject | Low-resource languages | es_PE |
dc.subject | Monolingual corpus | es_PE |
dc.subject | PDF processing | es_PE |
dc.subject | Shipibo-Konibo | es_PE |
dc.subject | Yanesha | es_PE |
dc.subject.ocde | https://purl.org/pe-repo/ocde/ford#6.02.02 | |
dc.title | No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru | |
dc.type | info:eu-repo/semantics/article | |
dspace.entity.type | Publication |