Project SEALD Gets Rolling as AI Singapore, Google Forge Exciting Partnership to Enhance LLM Training in Southeast Asia

March 12, 2024 News AI Singapore Artificial Intelligence Google large language model Project SEALD SEA-LION

AI Singapore (AISG) and Google Research have embarked on Project SEALD (Southeast Asian Languages in One Network Data), a research collaboration to enhance datasets that can be used to train, fine-tune, and evaluate large language models (LLMs) in languages spoken across Southeast Asia (SEA). This collaboration seeks to improve cultural context awareness and capabilities in SEA LLMs, and advance their applicability across the region to bring broad benefits to society.

“VISTEC is excited to be part of this pan-ASEAN natural language processing (NLP) development offered by Project SEALD, a vital collaborative mechanism that sets our diverse NLP communities in one collective and strategic direction. In particular, Project SEALD will alleviate the resource constraints associated with incorporating SEA languages into AI innovations by delivering new pre-trained language models, datasets, and benchmarks. VISTEC is proud to be an official partner, contributing our expertise in Thai NLP to this project,” said Sarana Nutanong of the Vidyasirimedhi Institute of Science and Technology in Thailand.

Project SEALD: Improving Inclusivity in SEA LLMs

Starting with Indonesian, Thai, Tamil, Filipino, and Burmese, the research under Project SEALD will help build a diverse and high-quality data corpus of languages spoken in SEA to support the training of models under SEA-LION (Southeast Asian Languages in One Network)—an initiative by AISG to develop a family of LLMs specifically pre-trained and instruction-tuned to be more representative of SEA’s cultural contexts and linguistic nuances—and other models that can add value to SEA-centric use cases.

Under Project SEALD, AISG and Google Research Asia Pacific (APAC) will work together on:

Developing translocalisation and translation models,
Establishing best practices for instruction tuning datasets,
Creating tools to enable translocalisation at scale, and
Publishing pre-training recipes for SEA languages.

AISG and Google will release the datasets and output from Project SEALD in open source to advance the progress of the SEA LLM ecosystem and foster strong regional expertise.

“The SEA-LION LLM project has always been about building a community and ecosystem that will continuously work together to enhance the quality of the SEA-LION data corpus and continuously improve SEA-LION’s capabilities. We are happy that Google now stands as a key part of the SEA-LION ecosystem and we look forward to building better datasets through Project SEALD in collaboration with Google for the benefit of the entire community,” said Leslie Teo, Senior Director of AI Products at AISG.

Addressing Underrepresentation in AI

As a specific use case, Project SEALD is working to improve communications with under-represented populations of migrant workers in Singapore, who may speak and understand a variety of regional languages with greater fluency than English. Data collection efforts to better capture linguistic nuances within this community will provide the foundation for enhanced engagement by both the Singapore Government and employers.

When integrated into one of the generative AI solutions first developed under the AI Trailblazers initiative by the Singapore Government and Google Cloud, the datasets and output from Project SEALD can aid outreach across a variety of important domains, such as redressal of worker grievances and extension of assistance schemes.

Lastly, Project SEALD will engage with ecosystem partners—academia, industry, and government—in various ways. These include working with industry players for data collection, curation, and quality checks, collaborating with academia in different SEA countries to implement state-of-the-art techniques in evaluation and benchmarking, and partnering with government stakeholders in Singapore and across the region to advance use cases for public good.

Advancing SEA LLMs for the Region

Building on this collaboration, AISG is working with Google Cloud to make its SEA-LION LLMs available on Google Cloud’s Model Garden on Vertex AI, which provides organisations with access to first-party, third-party, and open models that meet Google Cloud’s strict enterprise safety and quality standards.

“Google is proud to be partnering with AISG to put Singapore and SEA on the map of AI model development,” said Yolyn Ang, Vice President, Knowledge and Information Partnerships, at Google APAC. “By focusing on languages spoken and used in SEA and cultural understanding, Project SEALD will significantly improve the existing corpus and evaluation benchmarks for these languages. This will open new opportunities and make AI more inclusive, accessible, and helpful for individuals and businesses throughout the region.”

Through Vertex AI, organisations can use enterprise-grade tools to easily customise these models to address relevant use cases and integrate them into their applications. In addition, AISG will continue to make its SEA-LION LLMs available on Hugging Face, which has been partnering with Google Cloud to help developers train, tune, and serve open models quickly and cost-effectively.

AISG has also initiated collaborations across Singapore and other SEA countries. For example, AISG has signed Memorandums of Understanding (MOUs) or Letters of Intent (LOIs) with Indonesian, Malaysian, and Vietnamese entities for the development of datasets and applications for regional LLMs. In addition, AISG has been engaging partners in Thailand, the Philippines, and Indonesia to build resources on regional language syntax and semantics. Finally, in the Singapore context, AISG works closely with public sector and R&D stakeholders on safety alignment and multimodality.

In APAC, Google Research has a similar large-scale language inclusivity project ongoing in India with the Indian Institute of Science via Project Vaani—an initiative that is gathering, transcribing, and open-sourcing speech data from across all of India’s 773 districts.

“As we continue to work with AISG through XFORM, Inc. in developing localized, comprehensive, and inclusive datasets, we are looking forward to contributing to Project SEALD, which will make a significant contribution in building localized, culture-driven, context-sensitive, and open-source LLMs for SEA through the Ateneo Social Computing Science Laboratory,” added Maria Regina Estuar, Head, Ateneo Social Computing Science Laboratory and CEO at XFORM, Inc.
