No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru

Bustamante G.; Oncevay A.; Zariquiey R.

dc.contributor.author	Bustamante G.	es_PE
dc.contributor.author	Oncevay A.	es_PE
dc.contributor.author	Zariquiey R.	es_PE
dc.date.accessioned	2024-05-30T23:13:38Z
dc.date.available	2024-05-30T23:13:38Z
dc.date.issued	2020
dc.description.abstract	We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages in the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation considers language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays.
dc.description.sponsorship	Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica - Concytec
dc.identifier.scopus	2-s2.0-85096526337
dc.identifier.uri	https://hdl.handle.net/20.500.12390/2648
dc.language.iso	eng
dc.publisher	European Language Resources Association (ELRA)
dc.relation.ispartof	LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	Yine
dc.subject	Ashaninka	es_PE
dc.subject	Corpus creation	es_PE
dc.subject	Endangered languages	es_PE
dc.subject	Indigenous languages	es_PE
dc.subject	Low-resource languages	es_PE
dc.subject	Monolingual corpus	es_PE
dc.subject	Pdf processing	es_PE
dc.subject	Shipibo-Konibo	es_PE
dc.subject	Yanesha	es_PE
dc.subject.ocde	https://purl.org/pe-repo/ocde/ford#6.02.02
dc.title	No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru
dc.type	info:eu-repo/semantics/article
dspace.entity.type	Publication

Files

Original bundle

Name: No data to crawl Monolingual corpus.pdf
Size: 961.05 KB
Format: Adobe Portable Document Format

Collections

1.1 Institutional events
6.1 Scientific research projects
