Language identification with scarce data: A case study from Peru

Espichán-Linares A.; Oncevay-Marcos A.

Date

2018

Authors

Espichán-Linares A.

Oncevay-Marcos A.

Publisher

Springer Verlag

Abstract

Language identification is a fundamental task in natural language processing, where corpus-based methods dominate the state-of-the-art results in multilingual setups. However, there is a need to extend this application to other scenarios with scarce data and many classes, analyzing which of the best-known methods is the best fit. In this regard, Peru offers a great challenge as a multicultural and multilingual country. Therefore, this study focuses on three steps: (1) to build from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) to fit both standard and deep learning approaches for language identification, and (3) to statistically compare the results obtained. The standard model outperforms the deep learning one, as expected, with 95.9% average precision, and both the corpus and the model will be valuable inputs for more complex tasks in the future.

Keywords

The standard model, Deep learning, Information management, Linguistics, Natural language processing systems, Best fit, Complex task, Corpus-based methods, Language identification, Learning approach, Multiple class, State of the art, Big data

URI

https://hdl.handle.net/20.500.12390/672

Collections

1.1 Institutional events
6.1 Scientific research projects
