No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru

Bustamante G.; Oncevay A.; Zariquiey R.

Files

No data to crawl Monolingual corpus.pdf (961.05 KB)

Date

2020

Authors

Bustamante G.

Oncevay A.

Zariquiey R.

Publisher

European Language Resources Association (ELRA)

Abstract

We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, extracting and processing text from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation combines language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays.
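As an illustration of the evaluation idea mentioned in the abstract, the following is a minimal sketch of character-level perplexity. It assumes a character bigram model with add-one smoothing, which is a stand-in chosen for brevity and not necessarily the language model used by the authors. The function names `train_char_bigram` and `char_perplexity` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def train_char_bigram(sentences):
    """Count character bigrams and their left contexts (hypothetical helper)."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        # Pad each sentence with start and end markers.
        chars = ["<s>"] + list(sentence) + ["</s>"]
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def char_perplexity(model, sentences):
    """Per-character perplexity under the add-one smoothed bigram model."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot reserves mass for unseen characters
    log_prob, n = 0.0, 0
    for sentence in sentences:
        chars = ["<s>"] + list(sentence) + ["</s>"]
        for prev, cur in zip(chars, chars[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)


# Toy data only: a model trained on reference text should assign lower
# perplexity to similar text than to unrelated character noise.
model = train_char_bigram(["abab", "abab"])
clean = char_perplexity(model, ["abab"])
noisy = char_perplexity(model, ["zzzz"])
```

Under this sketch, a lower perplexity on held-out, manually extracted sentences would suggest that the automatically extracted corpus resembles clean text in the target language.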

Keywords

Ashaninka, Corpus creation, Endangered languages, Indigenous languages, Low-resource languages, Monolingual corpus, PDF processing, Shipibo-Konibo, Yanesha, Yine

URI

https://hdl.handle.net/20.500.12390/2648

Collections

1.1 Institutional events
6.1 Scientific research projects
