By Huzaifa Sidhpurwala, Principal Product Security Engineer, Red Hat
Artificial intelligence (AI), and especially generative AI (gen AI), offers immense opportunities for open research and innovation. Nevertheless, since its inception, AI’s commercialization has raised concerns about transparency, reproducibility and, most importantly, security, privacy and safety.
There have been many debates about the risks and benefits of open sourcing AI models. This is familiar territory for the open source community, where initial doubt and skepticism often evolve into acceptance. However, there are significant differences between open source code and open source AI models.
What is open source AI?
The definition of an “open source AI model” is still evolving as researchers and industry experts continue to delineate its framework. We don’t intend to dive into that debate or define its parameters in this article. Instead, our focus is on showing how the IBM Granite model is open source and why an open source model is inherently more trustworthy.
Open source licenses
A fundamental part of the open source movement involves publishing software code under licenses that grant users independence and control, giving them the right to inspect, modify and redistribute the code without restrictions. OSI-approved licenses like Apache 2.0 and MIT have been key to enabling worldwide collaborative development, freedom of choice and accelerated progress.
Several models, such as the IBM Granite model and its variants, are released under the permissive Apache 2.0 license. While a number of AI models are now being released under permissive licenses, they all face several challenges, which we discuss below.
How does an open license help security and safety?
This goes to the core principles of open source: a permissive license allows more users to use and experiment with the model, which means more security and safety issues can be discovered, reported and, in most cases, fixed.
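To make this concrete, here is a minimal sketch of pulling and exercising a permissively licensed Granite model locally with the Hugging Face transformers library. The model identifier and generation settings are illustrative assumptions, not an official Red Hat or IBM example; check the model card for the exact name and license terms.

```python
# A minimal sketch (not an official example): downloading and querying a
# permissively licensed Granite model with the Hugging Face transformers API.
# The model identifier below is assumed; substitute any Apache 2.0 variant.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.0-8b-instruct"  # illustrative; check the model card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Because the weights carry a permissive license, anyone can download, inspect
# and probe the model locally, including for security and safety testing.
prompt = "Explain why permissive licensing matters for AI security."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```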
Open data
The term “large” in “large language model” (LLM) refers to the vast amount of data required to train the model, in addition to the many parameters that constitute it. A model’s capability is often correlated with the number of input tokens used to train it, often trillions for a capable model.
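For a sense of what an input token is, the short sketch below counts the tokens in a single sentence using a Hugging Face tokenizer. The Granite tokenizer identifier is assumed for illustration; any causal-LM tokenizer behaves similarly.

```python
# Illustrative only: counting the tokens in a short text. Training corpora
# for modern LLMs contain trillions of such tokens.
from transformers import AutoTokenizer

# Assumed tokenizer identifier; any causal-LM tokenizer would work here.
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.0-8b-instruct")

text = "Open data lets the community inspect what a model was trained on."
token_ids = tokenizer(text)["input_ids"]
print(f"{len(token_ids)} tokens: {tokenizer.convert_ids_to_tokens(token_ids)}")
```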
For most closed models, the data sources used to pre-train and fine-tune the model are secret and form the very basis of differentiation from similar products created by other companies. We believe that for an AI model to be truly open source, it is important to reveal the data used to pre-train and fine-tune that model.
The corpus of data used to train Granite foundation models is documented in detail, along with the governance and safety workflows applied to the data before it is sent to the training pipeline.
How does open data help security and safety?
The output a large language model generates during inference depends on the data it was trained on. Open data gives community members a way to examine the data used to train the model and verify that no hazardous data enters the pipeline. Open governance practices also help reduce model bias, because biases can be identified and removed at the pre-training stage itself.
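As a toy illustration of the kind of check an open corpus enables, the sketch below scans a handful of documents against a blocklist before they would reach a training pipeline. The documents and blocklist are invented for demonstration; real data governance workflows are far more sophisticated.

```python
# Toy illustration: with an open training corpus, anyone can run their own
# governance checks on the data before it reaches the training pipeline.
# The documents and blocklist here are invented for demonstration purposes.
BLOCKLIST = {"credit card number", "social security number"}

documents = [
    "An encyclopedia article about photosynthesis.",
    "Please enter your credit card number to continue.",  # should be filtered out
    "A public-domain novel chapter.",
]

def is_safe(doc: str) -> bool:
    """Flag documents containing obviously hazardous or personal-data patterns."""
    lowered = doc.lower()
    return not any(term in lowered for term in BLOCKLIST)

clean_corpus = [doc for doc in documents if is_safe(doc)]
print(f"kept {len(clean_corpus)} of {len(documents)} documents")
```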
Freedom to modify and share
This brings us to the challenges of models released with permissive licenses:
- Because of the way these models are created and distributed, it is not possible to contribute directly to the models themselves; community contributions instead show up as forks of the original model. This forces consumers to choose a “best-fit” model that isn’t easily extensible, and such forks are expensive for model creators to maintain.
- Most people find it difficult to fork, train and refine models because they lack expertise in AI and machine learning (ML) technologies.
- There is a lack of community governance or best practices around review, curation and distribution of forked models.
Red Hat and IBM introduced InstructLab, a model-agnostic open source AI project that simplifies the process of contributing to LLMs. The technology enables model upstreams with sufficient infrastructure resources to create regular builds of their open source licensed models.
These resources are not used to rebuild and retrain the entire model, but rather to refine it through the addition of new skills and knowledge. Such projects can accept pull requests for these refinements and include them in the next build.
In short, InstructLab allows the community to contribute to AI models without forking them. These contributions can be sent “upstream,” allowing the developers to rebuild the original model with the new taxonomy and share it with other users and contributors.
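As a rough sketch of what such an upstream contribution might look like, the snippet below writes a small set of seed question-and-answer examples to a qna.yaml file of the kind used in the InstructLab taxonomy. The field names and values are illustrative assumptions; consult the InstructLab taxonomy documentation for the current schema before submitting a real pull request.

```python
# A minimal sketch of preparing an InstructLab-style taxonomy contribution:
# a handful of seed question/answer examples written out as a qna.yaml file
# that could be proposed upstream as a pull request. Field names are
# illustrative and should be checked against the current taxonomy schema.
import yaml  # provided by the PyYAML package

contribution = {
    "created_by": "example-contributor",  # hypothetical GitHub handle
    "task_description": "Teach the model to answer basic Apache 2.0 licensing questions.",
    "seed_examples": [
        {
            "question": "Can I redistribute a model released under Apache 2.0?",
            "answer": "Yes. Apache 2.0 permits redistribution, with attribution and a copy of the license.",
        },
        {
            "question": "Does Apache 2.0 require me to publish my modifications?",
            "answer": "No. You may keep modifications private, although sharing them benefits the community.",
        },
    ],
}

# In a real contribution this file would live in the appropriate branch of the
# taxonomy tree and be submitted as a pull request to the upstream project.
with open("qna.yaml", "w", encoding="utf-8") as fh:
    yaml.safe_dump(contribution, fh, sort_keys=False, allow_unicode=True)
```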
How does the freedom to modify and share help security and safety?
This allows community members to add their own data to the base model in a trustworthy way. They can also tune the model’s safety behavior through the taxonomy, which adds additional safety guardrails. The community can also improve the security and safety posture of the model without repeating pre-training, which is expensive and time-consuming.
IBM and Red Hat are part of the AI Alliance, which is seeking to define what open source AI means at an industry scale in terms of governance, process and practice.
Open, transparent and responsible AI will help advance AI safety, giving the open community of developers and researchers the ability to address the significant risks of AI and mitigate them with the most appropriate solutions.