No data to crawl? Monolingual corpus creation from PDF files of truly low-resource languages in Peru

Castillo-Velarde C.D.
Flores Rojas J.L.
Kumar S.
Martínez Castro D.
Moya-Álvarez A.
Silva Y.
Srivastava S.
European Language Resources Association (ELRA)
We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Because these languages are entirely absent from the web, extracting and processing text from PDF files is a relevant option in a truly low-resource language scenario. Our procedure for monolingual corpus creation combines language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and poorly structured content. Through an evaluation based on language modelling, using character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for current natural language processing tasks.
