Natural Language Processing in the Spanish Language: We Are South American Rockers
When it comes to implementing natural language processing in Spanish, there are fewer resources and a considerable number of obstacles. And when you have fewer resources, creativity is key.
By Ivo Perich
February 18, 2021
In Chile, a rock band from the ’80s called Los Prisioneros had a song called “We are South American Rockers.” The lyrics go: “South American rockers, South American dreamers, no millions, no fans, no Cadillac (…) Elvis, shake in your grave… ’cause we are South American rockers!”
I like to use this analogy to explain how South American natural language processing (NLP) developers get things done: with no millions and no Cadillacs. When it comes to implementing natural language processing in Spanish, whatever the task, we have fewer resources and considerably more obstacles than our English-speaking colleagues. And when you have fewer resources, creativity is key. For us South Americans, work often starts with our heads in our hands, thinking, “I have no idea how to do this, but it’s okay, I have a week.” We have lots of experience dreaming up creative solutions with the clock ticking.
In the world of chatbots, there are basically two things you have to do if you want to understand and solve your user’s problems:
1. Intent classification. Understand what the user is trying to say.
2. Dialog management. Identify what to do with that intent, what branch in the conversation tree to follow for that intent.
In the end, intent classification is by far the more important and more difficult of these two tasks, and it is the part where AI comes in.
The problem is that the intent classification tools are not that good for Spanish.
Let’s go over the main barriers that NLP faces in Spanish.
To put it into perspective, although over 7,000 languages are spoken around the world, most NLP work concentrates on seven key languages: English, Chinese, Urdu, Farsi, Arabic, French, and Spanish. Even among these seven, the vast majority of technological advances have been achieved in English-based NLP systems.
For example, Microsoft LUIS, IBM Watson Assistant, Google Dialogflow, and Amazon Lex are developed mainly for English, and their Spanish classifiers, not surprisingly, are not as good as the English ones.
In addition to this, Spanish is considered to be a more intricate language compared to English, having a lot of verb conjugations, more articles, and generally more words to say the same thing as in English. So it’s more difficult to classify.
Here in Chile (in Argentina, too), we have an extra problem: Our Spanish is different from the standard. We use different conjugations, different grammar, different vocabulary.
We Latin American NLP developers always try to stay up to date on the state of the art in NLP, and we discuss with colleagues things like Google’s neural network BERT or OpenAI’s amazing GPT-3, but we are always aware of the sad truth: The target audience for these tools is not mainly Spanish speaking, so their performance in Spanish is just not that good. Not yet. We live, in that sense, in the past. (I would say we’re maybe five years behind.)
The clients, however, live in the present and are just as demanding as in the English market, and of course, as usual, they always want more.
You can now understand why South American NLP developers are used to starting their projects with their heads in their hands.
No millions, no fans, no Cadillac.
Given this sad state of affairs, with all these wonderful tools that work in a less than optimal way, what can be done?
Help the tool
The first thing to keep in mind is that, whether you use Amazon Lex, Dialogflow, LUIS, or Watson, you will have to help the tool.
Here is a brief list of common ways to do so:
1. Normalization. You can normalize the text by replacing the special Spanish characters, like á, é, or ñ, with their unaccented versions, a, e, or n, because users may or may not type them. If the verb tense of the sentence doesn’t matter, you can convert all the verbs to the infinitive with a tool like spaCy. These fixes are not foolproof (guess what, spaCy in Spanish is not that good), but hey, they’re a start.
2. Find patterns. Do a lot of testing. Use your friends and/or your client counterparts to understand how they talk to the bot. Get data from their help centers. Read the chats of real users with the client staff. Get all the data you can, and then look at it, look at it again, and then put on your headphones, and look at it again and again. Stare at the screen for hours, like a psychopath. In the end, patterns will appear, and you can use them for the next step.
3. Find rules. Use the patterns you found to make rules: Find words or groups of words that are related to only one intent, and find words or groups of words that indicate a particular intent is NOT the right one. Then use these rules to tell the tool more accurately what it has to do.
4. Divide and conquer. A classifier with fewer intents to choose from will probably classify better, so you can have several classifiers, each specialized in just one subject. For example, you can have a loans classifier that handles only intents such as get a loan, pay a loan, etc. Over these classifiers sits a system of rules that decides which one to use (for example, the loans classifier or the accounts classifier). In fact, instead of a system of rules you can even have a “boss classifier,” like a text classifier from Amazon SageMaker, whose only job is to decide which classifier to use. Either way, you divide the classification into a set of sub-classifications that will work better than one big, monolithic classifier.
5. Have fun. Experiment, read a lot, talk with other developers, try stuff, innovate, build little tools and toys, and be open. Recycle ideas: Things that worked in one project will probably be useful in another. I have created my own Spanish synonym finder to normalize text (with more or less success), scripts that find patterns by themselves to analyze client data, and other tools that I have used in several projects.
6. Keep in touch. It is important to keep the client happy, keep the expectations at the right level, get more and more data, and, at every moment, have good analyses to show them how the project is coming along and give them good insights. These analyses can be of several types, but that’s a topic for another post.
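To make items 1, 3, and 4 a bit more concrete, here is a minimal Python sketch of how accent stripping, keyword rules, and a rule-based router could fit together. It is only an illustration under stated assumptions, not the code from any real project: the rule table, the Spanish keywords, the intent names, and the three tiny domain "classifiers" are all hypothetical stand-ins. In practice each of those functions would be a call to a specialized classifier in Lex, Dialogflow, LUIS, or Watson.

```python
import unicodedata
from typing import Callable, Dict, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase the text and strip accents and tildes (á -> a, ñ -> n)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # After NFD, an accented letter is a base letter plus combining marks
    # (Unicode category "Mn"); dropping the marks leaves the base letter.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical routing rules, written against normalized text.
# "any": words that point to the domain; "none": words that rule it out.
DOMAIN_RULES: Dict[str, Dict[str, Tuple[str, ...]]] = {
    "loans": {"any": ("prestamo", "credito"), "none": ("tarjeta",)},
    "cards": {"any": ("tarjeta",), "none": ()},
    "accounts": {"any": ("cuenta", "saldo"), "none": ()},
}


def pick_domain(text: str) -> Optional[str]:
    """Return the first domain whose rule matches, or None if none does."""
    normalized = normalize(text)
    for domain, rule in DOMAIN_RULES.items():
        has_hint = any(word in normalized for word in rule["any"])
        ruled_out = any(word in normalized for word in rule["none"])
        if has_hint and not ruled_out:
            return domain
    return None


# Stand-ins for the specialized classifiers (assumption: in a real bot,
# each one would call a small, single-subject model in your NLP tool).
def classify_loans(text: str) -> str:
    return "pay_loan" if "pagar" in text else "get_loan"


def classify_cards(text: str) -> str:
    return "request_card"


def classify_accounts(text: str) -> str:
    return "check_balance" if "saldo" in text else "account_info"


CLASSIFIERS: Dict[str, Callable[[str], str]] = {
    "loans": classify_loans,
    "cards": classify_cards,
    "accounts": classify_accounts,
}


def classify(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Route the text to one domain classifier; (None, None) if no rule fires."""
    domain = pick_domain(text)
    if domain is None:
        return None, None
    return domain, CLASSIFIERS[domain](normalize(text))


if __name__ == "__main__":
    for message in ("Quiero pagar mi préstamo", "Quiero una tarjeta de crédito"):
        print(message, "->", classify(message))
```

Note how the word "tarjeta" acts as a negative rule: a message about a "tarjeta de crédito" mentions "crédito" but is kept away from the loans classifier. Plain substring matching like this is crude and will misfire on some words, so treat the rule table as a starting point to refine with the patterns you find in real conversations.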
In 2020, the NLP market in Latin America generated revenues of US$76 million, a figure expected to reach US$105 million in 2021 and US$250 million by 2025. There’s a big opportunity for the growing number of companies that are developing their NLP abilities for a market that is asking for more every day.
I work in one of those companies, Cognitiva, and we are doing well. We have happy clients in Chile, Perú, and Colombia. Our chatbots and other NLP projects reach 80–85 percent accuracy in our best cases.
Elvis, shake in your grave…
’Cause we are South American rockers!