By Huzaifa Sidhpurwala, Principal Product Security Engineer, Red Hat
Artificial intelligence (AI), and especially generative AI (gen AI), offers immense opportunities for open research and innovation. Nevertheless, since its inception, AI’s commercialization has raised concerns about transparency, reproducibility and, most importantly, security, privacy and safety.
There have been many debates about the risks and benefits of open sourcing AI models. This is familiar territory for the open source community, where initial doubt and skepticism often evolve into acceptance. However, there are significant differences between open source code and open source AI models.
What is open source AI?
The definition of an “open source AI model” is still evolving as researchers and industry experts continue to delineate its framework. We don’t intend to dive into that debate or define its parameters in this article. Instead, our focus is on showing how the IBM Granite model is open source and why an open source model is inherently more trustworthy.
Open source licenses
A fundamental part of the open source movement involves publishing software code under licenses that grant users independence and control, giving them the right to inspect, modify and redistribute the code without restrictions. OSI-approved licenses like Apache 2.0 and MIT have been key to enabling worldwide collaborative development, freedom of choice and accelerated progress.
Several models, such as the IBM Granite model and its variants, are released under the permissive Apache 2.0 license. While a number of AI models are now being released under permissive licenses, they all face several challenges, which we discuss below.
How does an open license help security and safety?
This goes to the core principles of open source: a permissive license allows more users to use and experiment with the model, which means more security and safety issues can be discovered, reported and, in most cases, fixed.
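To make this concrete, here is a minimal sketch of pulling and exercising a permissively licensed Granite model locally with the Hugging Face transformers library. The model identifier and generation settings are illustrative assumptions, not an official Red Hat or IBM example; check the model card for the exact name and license terms.

```python
# A minimal sketch (not an official example): downloading and querying a
# permissively licensed Granite model with the Hugging Face transformers API.
# The model identifier below is assumed; substitute any Apache 2.0 variant.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.0-8b-instruct"  # illustrative; check the model card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Because the weights carry a permissive license, anyone can download, inspect
# and probe the model locally, including for security and safety testing.
prompt = "Explain why permissive licensing matters for AI security."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```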
Open data
The term “large” in “large language model” (LLM) refers to the vast amount of data required to train the model, in addition to the many parameters that constitute it. A model’s capability is often correlated with the number of input tokens used to train it, often trillions for a capable model.
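For a sense of what an input token is, the short sketch below counts the tokens in a single sentence using a Hugging Face tokenizer. The Granite tokenizer identifier is assumed for illustration; any causal-LM tokenizer behaves similarly.

```python
# Illustrative only: counting the tokens in a short text. Training corpora
# for modern LLMs contain trillions of such tokens.
from transformers import AutoTokenizer

# Assumed tokenizer identifier; any causal-LM tokenizer would work here.
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-8b-instruct")

text = "Open data lets the community inspect what a model was trained on."
token_ids = tokenizer(text)["input_ids"]
print(f"{len(token_ids)} tokens: {tokenizer.convert_ids_to_tokens(token_ids)}")
```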
For most closed models, the data sources used to pre-train and fine-tune the model are secret and form the very basis of differentiation from similar products created by other companies. We believe that for an AI model to be truly open source, it is important to reveal the data used to pre-train and fine-tune that model.
The corpus of data used to train Granite foundation models is documented in detail, along with the governance and safety workflows applied to the data before it is sent to the training pipeline.
How does open data help security and safety?
The output a large language model generates during inference depends on the data it was trained on. Open data gives community members a way to examine the data used to train the model and verify that no hazardous data enters the pipeline. Open governance practices also help reduce model bias, because biases can be identified and removed at the pre-training stage itself.
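As a toy illustration of the kind of check an open corpus enables, the sketch below scans a handful of documents against a blocklist before they would reach a training pipeline. The documents and blocklist are invented for demonstration; real data governance workflows are far more sophisticated.

```python
# Toy illustration: with an open training corpus, anyone can run their own
# governance checks on the data before it reaches the training pipeline.
# The documents and blocklist here are invented for demonstration purposes.
BLOCKLIST = {"credit card number", "social security number"}

documents = [
    "An encyclopedia article about photosynthesis.",
    "Please enter your credit card number to continue.",  # should be filtered out
    "A public-domain novel chapter.",
]

def is_safe(doc: str) -> bool:
    """Flag documents containing obviously hazardous or personal-data patterns."""
    lowered = doc.lower()
    return not any(term in lowered for term in BLOCKLIST)

clean_corpus = [doc for doc in documents if is_safe(doc)]
print(f"kept {len(clean_corpus)} of {len(documents)} documents")
```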
Freedom to modify and share
This brings us to the challenges of models released with permissive licenses:
- Because of the way these models are created and distributed, it is not possible to contribute directly to the models themselves; community contributions instead show up as forks of the original model. This forces consumers to choose a “best-fit” model that isn’t easily extensible, and such forks are expensive for model creators to maintain.
- Most people find it difficult to fork, train and refine models because they lack expertise in AI and machine learning (ML) technologies.
- There is a lack of community governance or best practices around review, curation and distribution of forked models.
Red Hat and IBM introduced InstructLab, a model-agnostic open source AI project that simplifies the process of contributing to LLMs. The technology enables model upstreams with sufficient infrastructure resources to create regular builds of their open source licensed models.
These resources are not used to rebuild and retrain the entire model, but rather to refine it through the addition of new skills and knowledge. Such projects can accept pull requests for these refinements and include them in the next build.
In short, InstructLab allows the community to contribute to AI models without forking them. These contributions can be sent “upstream,” allowing the developers to rebuild the original model with the new taxonomy and share it with other users and contributors.
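As a rough sketch of what such an upstream contribution might look like, the snippet below writes a small set of seed question-and-answer examples to a qna.yaml file of the kind used in the InstructLab taxonomy. The field names and values are illustrative assumptions; consult the InstructLab taxonomy documentation for the current schema before submitting a real pull request.

```python
# A minimal sketch of preparing an InstructLab-style taxonomy contribution:
# a handful of seed question/answer examples written out as a qna.yaml file
# that could be proposed upstream as a pull request. Field names are
# illustrative and should be checked against the current taxonomy schema.
import yaml  # provided by the PyYAML package

contribution = {
    "created_by": "example-contributor",  # hypothetical GitHub handle
    "task_description": "Teach the model to answer basic Apache 2.0 licensing questions.",
    "seed_examples": [
        {
            "question": "Can I redistribute a model released under Apache 2.0?",
            "answer": "Yes. Apache 2.0 permits redistribution, with attribution and a copy of the license.",
        },
        {
            "question": "Does Apache 2.0 require me to publish my modifications?",
            "answer": "No. You may keep modifications private, although sharing them benefits the community.",
        },
    ],
}

# In a real contribution this file would live in the appropriate branch of the
# taxonomy tree and be submitted as a pull request to the upstream project.
with open("qna.yaml", "w", encoding="utf-8") as fh:
    yaml.safe_dump(contribution, fh, sort_keys=False, allow_unicode=True)
```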
How does the freedom to modify and share help security and safety?
This allows community members to add their own data to the base model in a trustworthy way. They can also tune the model’s safety behavior through the taxonomy, which adds additional safety guardrails. The community can also improve the security and safety posture of the model without repeating pre-training, which is expensive and time-consuming.
IBM and Red Hat are part of the AI Alliance, which is seeking to define what open source AI means at an industry scale in terms of governance, process and practice.
Open, transparent and responsible AI will help advance AI safety, giving the open community of developers and researchers the ability to address the significant risks of AI and mitigate them with the most appropriate solutions.