Publication:
Language identification with scarce data: A case study from Peru

dc.contributor.author Espichán-Linares A. es_PE
dc.contributor.author Oncevay-Marcos A. es_PE
dc.date.accessioned 2024-05-30T23:13:38Z
dc.date.available 2024-05-30T23:13:38Z
dc.date.issued 2018
dc.description.abstract Language identification is a fundamental task in natural language processing, where corpus-based methods dominate the state of the art in multilingual setups. However, the task still needs to be extended to scenarios with scarce data and many classes, and it remains to be analyzed which of the best-known methods fits such scenarios best. Peru, as a multicultural and multilingual country, offers a great challenge in this respect. This study therefore proceeds in three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches for language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, with 95.9% average precision, and both the corpus and the model will be useful inputs for more complex tasks in the future.
dc.description.sponsorship Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica - Concytec
dc.identifier.doi https://doi.org/10.1007/978-3-319-90596-9_7
dc.identifier.isbn urn:isbn:9783319905952
dc.identifier.scopus 2-s2.0-85045991573
dc.identifier.uri https://hdl.handle.net/20.500.12390/672
dc.language.iso eng
dc.publisher Springer Verlag
dc.relation.ispartof Communications in Computer and Information Science
dc.rights info:eu-repo/semantics/openAccess
dc.subject The standard model
dc.subject Deep learning es_PE
dc.subject Information management es_PE
dc.subject Linguistics es_PE
dc.subject Natural language processing systems es_PE
dc.subject Best fit es_PE
dc.subject Complex task es_PE
dc.subject Corpus-based methods es_PE
dc.subject Language identification es_PE
dc.subject Learning approach es_PE
dc.subject Multiple class es_PE
dc.subject State of the art es_PE
dc.subject Big data es_PE
dc.subject.ocde https://purl.org/pe-repo/ocde/ford#2.00.00
dc.title Language identification with scarce data: A case study from Peru
dc.type info:eu-repo/semantics/conferenceObject
dspace.entity.type Publication