Lodi, awit, omsim: The nuances of Filipino AI language training

Rappler.com

DOST researchers are working on a large language model that would better understand modern Filipino's mixed usage of Filipino, English, and occasional Spanish words, along with informal lingo and colloquialisms

MANILA, Philippines – Amid the mainstream adoption of artificial intelligence, researchers behind the iTANONG project are making it possible for Filipinos to use a mix of English, Filipino, and Taglish to query business, academic, and general information databases.

In 2022, the Department of Science and Technology – Advanced Science and Technology Institute (DOST-ASTI) launched the development of an AI-powered interface capable of understanding questions typed in natural language.

For job applicants, this means being able to ask, “Kailan pwedeng kunin ang aking NBI Clearance?” (“When can I get my NBI clearance?”) or “Ilang months na ang aking contribution sa PhilHealth?” (“For how many months have I been contributing to PhilHealth?”) 

For businesses, iTANONG can speed up transactions by letting employees check inventory or pending purchases in their internal records.

To address the challenges of integrating Filipino text in iTANONG, the team is building upon OpenAI’s Generative Pre-trained Transformer (GPT) model and combining it with their own in-house models.

What is Natural Language Processing (NLP)? 

Interest in using everyday language for computing traces back to the late 1940s, when research focused on machine translation. By 1950, Alan Turing had introduced the Turing Test to exhibit a computer’s ability to imitate human responses. Today, NLP plays a central role in the AI revolution.

With NLP, computers process natural language data to interpret the prompts that users type, as ChatGPT does.

AI-enabled text generation came to wide public attention after the launch of ChatGPT, a chatbot built on a large language model (LLM), in November 2022. LLMs are highly capable models designed to handle NLP tasks.

According to Elmer Peramo, the project leader behind iTANONG, a major gap in training LLMs today is the small amount of Filipino data available for the machine learning process.

In the world of NLP, Filipino is considered a “low-resource language.” DOST-ASTI uses synthetic data generation as well as data scraping from Filipino sources to tackle this deficiency.

Challenges of Filipino LLMs

“Although very powerful ‘yung mga lumalabas na large language models as of late, sometimes they still fail to capture the subtle nuances ng language natin (Although the large language models released as of late are very powerful, sometimes they still fail to capture the subtle nuances of our language),” said Aunhel John Adoptante, one of DOST-ASTI’s researchers.

Another major challenge of Philippine AI training is the local culture of code-switching. Brought on by the country’s history as a colony under American and Spanish regimes, modern-day Filipino speech now consists of a mix of Filipino, English, and occasionally Spanish words.

Add to this the distinctions of colloquial words, dialect-specific terminologies, and non-formal vernaculars such as Gaylingo and Jejemon.

To handle these variations, DOST-ASTI researcher Moses Visperas said that they use normalizing techniques and data filtering for their models.

“Magcomputer, mag-computer, and mag-kompyuter na iba’t iba ang spelling – dapat manormalize siya into one term (Magcomputer, mag-computer, and mag-kompyuter are spelled differently; they should be normalized into one term),” Visperas said.
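The kind of normalization Visperas describes can be illustrated with a minimal Python sketch. It is not DOST-ASTI's method: the function name `normalize_variant`, the `ROOT_VARIANTS` table, and the choice of "mag-computer" as the canonical form are all assumptions made for illustration.

```python
import re

# Hypothetical table of respelled roots; the single entry is illustrative only.
ROOT_VARIANTS = {"kompyuter": "computer"}

def normalize_variant(word: str) -> str:
    """Map spelling variants of a mag- verb to one assumed canonical form."""
    # Lowercase and drop hyphens so "Magcomputer" and "mag-computer" look alike.
    flat = word.lower().replace("-", "")
    match = re.fullmatch(r"(mag)(.+)", flat)
    if not match:
        return flat
    prefix, root = match.groups()
    # Replace a respelled root with its canonical spelling when one is listed.
    root = ROOT_VARIANTS.get(root, root)
    # Naive: any word starting with "mag" (e.g. "maganda") is treated as a
    # prefixed verb, so a real system would need a proper morphological check.
    return f"{prefix}-{root}"

variants = ["Magcomputer", "mag-computer", "mag-kompyuter"]
normalized = [normalize_variant(v) for v in variants]
print(normalized)
```

All three spellings from the quote end up as the same term, which is the property a normalizer needs before the text reaches a model.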

By introducing intentionally altered words, referred to as “noise,” into predominantly clean datasets, the team trains the computer to normalize informal expressions such as “sorryyy” and “suriii” into just “sorry.” Eventually, the model gets better at identifying alterations and deriving their meanings.
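As a rough illustration of noise injection, the Python sketch below builds (noisy, clean) training pairs by stretching the last letter of a clean word. The helper names `add_noise` and `make_pairs` and the single letter-repetition rule are assumptions; the article does not say which alterations the team actually applies, and respellings such as “suriii” would need additional rules.

```python
import random

def add_noise(word: str, rng: random.Random, max_repeat: int = 3) -> str:
    """Return a noisy variant by repeating the last letter 1 to max_repeat extra times."""
    extra = rng.randint(1, max_repeat)
    return word + word[-1] * extra

def make_pairs(clean_words, rng, per_word=2):
    """Build (noisy, clean) pairs a normalization model could be trained on."""
    pairs = []
    for word in clean_words:
        for _ in range(per_word):
            pairs.append((add_noise(word, rng), word))
    return pairs

rng = random.Random(0)  # fixed seed so the sketch is repeatable
pairs = make_pairs(["sorry", "salamat"], rng)
for noisy, clean in pairs:
    print(f"{noisy} -> {clean}")
```

Each pair tells the model which clean word a noisy input should map back to, so the clean dataset itself supplies the labels.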

The team is also strengthening their models to handle colloquial trends such as reversing words in casual usage: “lodi” for idol, “omsim” for mismo, and “Etivac” for Cavite.
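Reversed slang of this kind can be caught with a simple dictionary check, sketched below in Python. The `VOCAB` set and the `unreverse` function are hypothetical, and this rule-based lookup is only an illustration, not the learned approach the team describes.

```python
# Hypothetical vocabulary of known standard words, lowercased.
VOCAB = {"idol", "mismo", "cavite"}

def unreverse(token: str, vocab=VOCAB) -> str:
    """Return the reversed token if it is a known word, else the token unchanged."""
    candidate = token.lower()[::-1]
    return candidate if candidate in vocab else token

results = [unreverse(t) for t in ["lodi", "omsim", "Etivac", "awit"]]
print(results)
```

“Awit” passes through unchanged because it is not a reversal of any word in the vocabulary, which shows why slang formed in other ways needs a different treatment.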

Philippines and the AI revolution

According to Peramo, databases are usually called “backend assets” due to the high barriers that exist before non-technical users are able to access them. Programmers and computer scientists commonly use a programming language called Structured Query Language (SQL) to get valuable information from relational databases.

With iTANONG’s use of artificial intelligence, this hurdle is essentially eliminated. Even those who are not versed in SQL are able to seamlessly gather information from the same repository.
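To show the hurdle in concrete terms, the Python sketch below runs the SQL that a simple stock question would require, using the standard `sqlite3` module. The `inventory` table, its rows, the sample question, and the hand-written query are all invented for illustration; iTANONG's actual question-to-query translation is not described in this article.

```python
import sqlite3

# In-memory database with a hypothetical inventory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [("bigas", 40), ("asukal", 15)],
)

# What a non-technical user would like to type.
question = "Ilan ang stock ng bigas?"  # "How much rice is in stock?"

# What they would have to write without a natural language interface.
# The mapping from question to query is written by hand here.
sql = "SELECT quantity FROM inventory WHERE item = ?"
row = conn.execute(sql, ("bigas",)).fetchone()
stock = row[0]
print(f"{question} -> {stock}")
conn.close()
```

A natural language interface would generate the query and its parameter from the question, so the user never sees the SQL.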

According to Peramo, DOST encourages the adoption of artificial intelligence through initiatives in capacity-building, infrastructure development, and AI-focused research such as the iTANONG project.

“You really have to invest in these ICT infrastructures,” Peramo said about DOST-ASTI’s Computing and Archiving Research Environment. The facility provides access to high-performance computers. “Hindi ka makakapagbuild ng mga AI models without these.” (You can’t build AI models without these.)

Peramo also emphasized DOST’s initiatives to develop data regulation policies and agreements on ethical uses for artificial intelligence.

For Adoptante, the iTANONG project is a way to bridge the gap between the academe and the public: “For the nontechnical person, [so that] maappreciate nila ‘yung mga advancements and technical innovations in a way na magagamit nila in their day-to-day activities, such as getting the information that they need with the data that’s available.”

(“For the nontechnical person, [so that] they appreciate the advancements and technical innovations in a way that they can use in their day-to-day activities, such as getting the information that they need with the data that’s available.”)  – Jessica Bonifacio/Rappler.com

Jessica Bonifacio is an incoming third-year Environmental and Sanitary Engineering student at National University – Manila. She is a volunteer under Rappler’s Research Unit.
