Benefits of Open-Source AI
Public Transparency and Democratic Accountability
Artificial intelligence (AI) decision-making can be inscrutable in ways that frustrate democratic accountability. The inability to understand how a model reached a decision on any given question presents problems for any process that needs to be auditable, repeatable, and generally subject to external scrutiny.1 The inability to understand why a model gave the answer it did makes AI unreliable for use in critical applications. By offering access to both the technical components of a model and transparency into the decision-making process of model developers, increased openness in AI offers part of the solution to this inscrutability.
While the technical aspects of this problem, known as explainability, are an important unsolved issue requiring attention, good transparency practices implemented by model developers can help in understanding how certain biases appear in a model and can keep developers accountable for the choices they make in designing and training AI models. Generative AI models trained on data scraped from the public internet often embody the errors and biases in that data, which exacerbates long-standing concerns about algorithmic bias and its discriminatory effects.2 Increased access to information about data used for training is a vital first step in understanding how bias arises in an AI model.
The risks of AI stem from more than the models—they include the whole systems in which those models are used. As Benjamin Brooks, a fellow at Harvard University’s Berkman Klein Center, recently observed in a filing before Australia’s Department of Industry, Science, and Resources, “the risk posed by the system will depend on a range of factors, including the intended use-case, specific deployment environment, the extent of human oversight, and the possibility of correction and redress.” Similar to the Open Source Initiative, Brooks suggests thinking of AI as a complete system and a model as “a component: a prediction engine.”3 In other words, the details about how a model is set up influence what possible risks it poses. Those risks can be introduced in a variety of ways that extend beyond the model itself, from insecurities in the deployment environment to having no system for correcting the AI when it makes a mistake. An AI system consists of both a model and a software project run by the people who maintain it.
Transparency and accountability comprise one part of addressing AI explainability by offering a framework for evaluating non-technical aspects of a model, such as project management, oversight, and decisions about technical design. Moving beyond what a model is doing technically enables a shift from only focusing on AI as a new, mysterious technology to helpfully identifying the aspects AI shares with other software. There are many lessons in open source about how to structure and organize a large, globally distributed, technical project in fair and transparent ways. The types of lessons that an AI model project could emulate range from community implementation of best practices for managing code, tracking bugs, codes of conduct, or even building a nonprofit’s organizational or legal structure around a single software project.
In addition, open models can be more easily modified and deployed in service of specific public-interest goals than closed models. As Divya Siddarth and Saffron Huang of the Collective Intelligence Project—an initiative that aims to direct technological development toward the collective good—and Audrey Tang, Taiwan’s first Minister of Digital Affairs, state, “The open-source community can play a large role here, partnering with democratic innovation organizations to train open models that align with public perspectives.”4 They emphasize that the key features of transparency and open innovation also bolster trust in AI and broaden the base of people who participate in shaping AI’s impact on society.5
Open code can deliver some transparency benefits on its own, but given the explainability challenge facing all of AI—open code, model weights, and training data should not be misconstrued as a silver bullet for achieving fully explained AI decision-making. Understanding how AI systems arrive at any given answer is still a puzzle that must be solved to build trust in AI. Open models, accompanied by other mechanisms of transparency, make more freely available all of the pieces that AI researchers will need to solve the puzzle of explainability.
But openness alone cannot democratize AI or equitably distribute its benefits throughout society, partly because of the concentration of power in the technology industry6 and also because privately run AI models cannot perfectly align with specific democratic or public-interest objectives. This realization has increased calls for the development of a “public” AI infrastructure. Like the spectrum between the concepts of open and closed systems, it is also possible to think of a spectrum that spans the concepts of wholly private and wholly public AI models.7 Proponents of public AI define the term differently but consistently note that government-owned or other non-corporate models can be made democratically accountable in ways that privately owned models cannot.8 While a public AI infrastructure could be built with closed AI models, in most contexts, the transparency that accompanies open models is aligned with many goals of public AI.
Unexpected Innovation and Competition
In 2008, Jonathan Zittrain, an American professor of internet law and the George Bemis Professor of International Law at Harvard Law School, published The Future of the Internet and How to Stop It. In it, he warned that the internet was losing the “generative” potential of its earlier years and growing increasingly controlled by a few powerful, private gatekeepers. Zittrain wasn’t using the term “generative” as many people now do with generative artificial intelligence (AI). He defined generativity as “a system’s capacity to produce unanticipated change through unfiltered contributions from broad and varied audiences.”9 He issued a warning that the internet’s generativity was under threat:
“The serendipity of outside tinkering that has marked that generative era gave us the Web, instant messaging, peer-to-peer networking, Skype, Wikipedia—all ideas out of left field. Now it is disappearing, leaving a handful of new gatekeepers in place, with us and them prisoner to their limited business plans.… Even fully grasping how untenable our old models have become, consolidation and lockdown need not be the only alternative. We can stop that future.”10
Those ideas “out of left field” proved profoundly transformative. The generative internet was worth protecting, but Zittrain’s prescient warning became a reality. “Web 2.0,” or the social media era of the internet, was dominated by the rise of large social media companies that became intermediaries for much of society’s online activity. The resulting environment is less generative. Internet monocultures have spread. A few corporations with access to massive amounts of data enjoy an outsized ability to determine the boundaries of online privacy, shape content creation and consumption, and narrow the range of experiences readily accessible online.
These same companies are at the forefront of today’s AI innovation race, and they are poised to extend their consolidation. But an ecosystem in which open models thrive alongside proprietary ones can culturally diversify the technology landscape, promote competition, and spur the kind of unexpected innovation that made the early internet such a powerful force for serving a variety of public-interest objectives.11
The past three decades of free and open-source software development have produced examples of lasting innovative impact in large, well-known projects, like the Linux kernel, and in smaller software projects built around the needs of specific communities. Time and again, the insights of people modifying open-source code to fit their own needs and open source’s tested value as a cost-effective foundation on which to build have advanced technological developments. The ability to examine and modify code has catalyzed innovative new technologies and changed the landscape of how software vulnerabilities are found and fixed.
Open source’s success, and the open standards that make up the internet, provide lessons that can make it easier for AI innovation to be generative in the way Zittrain meant it. As Leslie Daigle, former chair of the Internet Engineering Task Force Board’s oversight committee and internet architecture board, stated in a 2019 white paper: “The more proprietary solutions are built and deployed instead of collaborative open standards-based ones, the less the internet survives as a platform for future innovation.”12 A more recent essay makes a similar point by invoking an ecological imperative “to rewild the internet” in ways that allow spontaneity and broad-based innovation to flourish again.13 Encouraging the development of more open AI models can play a vital role in such a process. Society cannot foresee the specific innovations that open AI models will create—that is precisely the point—but the lessons of open-source software demonstrate that they will occur and that some will have broad, transformative impacts.
“Society cannot foresee the specific innovations that open AI models will create—that is precisely the point—but the lessons of open-source software demonstrate that they will occur and that some will have broad, transformative impacts.”
The Linux kernel serves as an illustrative and concrete example of how open source can have a broad yet unforeseen impact. Linux is everywhere in modern computing and is far more than a platform for tech enthusiasts; billions of non-technical users interact with Linux systems every day. Not only does Google have a long history of running Linux on its servers14 and having its own customized version of Linux for its developers,15 but both Google’s mobile operating system, Android, and its laptop operating system, ChromeOS, are built using Linux at their core. All Chromebook and Android users are running Linux. Many web hosting providers also use Linux-based systems that run a whole stack of open-source software to provide internet services, including web servers, databases, and remote code processing. These providers run the gamut in size, from small hosting providers to some of the biggest players in hosting, like Amazon Web Services. Open-source web servers make up roughly half of all web servers on the internet.16 Without Linux and popular open-source projects for tasks like running websites, the internet as we know it may have taken shape differently.
While Google, Amazon, Meta, Microsoft, and Apple certainly all have proprietary “closed source” software, those private tools are often built on top of or interact with open-source software. Recognizing the broader value of an open ecosystem, all of those companies have contributed code, paid labor, and financial or material support to open-source communities. Even one of the largest historical players in the closed-source software market, Microsoft, has deepened its embrace of open source over the last decade. Notably, the company has simplified the process of running certain open-source software in Windows and purchased GitHub, arguably the world’s largest repository of open-source code. There is a shared understanding in tech—from developers of small software projects to large corporate players—that open source has created a solid foundation on which tech innovations, both open and proprietary, are more easily built.
In 1991, when software engineer Linus Torvalds first released the Linux kernel, he could not have predicted that in over three decades, the software would be a cornerstone of the internet or run on billions of smartphones. The year 1991 was still relatively early in the personal computing revolution and the internet was in its commercial infancy; the future was not clear. To the extent that society is in the midst of an AI revolution, that revolution is still in its early days. While experts can speculate about possible paths in AI’s development, they should do so with an acknowledgment that unexpected paths are inevitable. By analogy to the 1990s, people are using thirty-pound desktop PCs on dial-up internet and trying to imagine the era of smartphones. There is simply no way to clearly forecast all the surprising ways in which people will use AI over the next 20 years, but based on historical trends, it is very likely that open models will sit at the foundation of some of the biggest advancements in AI.
Importantly, open models built for all sizes and purposes—not simply large foundation models dominating discussion today—will spur critical innovation.17 As smaller, bespoke models emerge and have transformative impact, the generative potential of open-source innovation in the AI context will become increasingly clear. Open models can be designed and modified to address needs in fields as diverse as medicine, cybersecurity, urban planning, and climate change. A few dominant proprietary models can perhaps be adapted to tackle a subset of these problems, but the interests of large corporations may not naturally align with context-specific applications that serve the public interest. A healthy open-source ecosystem is far more conducive to communities’ ability to define problems and solutions on their own terms.
As has been the case with open-source software, open AI models can allow people from various means and backgrounds to respond to use cases in their communities that aren’t being considered by the private sector and fill the technical gaps they identify. For example, the OpenCellular project is an open-source effort aimed at allowing communities that are not currently served by mobile network operators (MNOs) to form their own MNOs by distributing both the software to build them and hardware schematics.18 Open-source models can similarly provide a toolset for people who find uses for AI in places where larger AI companies like OpenAI, Anthropic, or Google may not be looking or incentivized to look.
This kind of innovation requires enough time and space to take shape and scale. While the impact of many innovations—like Linux—that are attributable to open source can easily be identified in hindsight, they were not obvious when created. Given the space to develop and the ability to interact with real-word use cases, open-source AI models can follow a similar trajectory as software and catalyze creative new uses.
We must acknowledge that while open models’ innovative potential can contribute to a more competitive AI software ecosystem, they will still exist alongside market concentration elsewhere in the AI supply chain. This is particularly true at the infrastructure level, where there are large players who dominate in hardware and cloud computing. Proponents of open models should understand the resource challenges that smaller players currently face in building and training AI, but research toward less resource-intensive model training19 combined with open-source software’s history of enabling competition will produce a more innovative AI ecosystem.
We should not overstate open AI models’ ability to serve as a panacea for stopping the consolidation of AI. But the history of open-source software has shown that openness leads to a more innovative environment and brings competitive benefits to the entire tech ecosystem. This lesson should find an analog in AI. Open-source AI models can draw from the history of open-source software, anticipating that the iterative creativity that has brought hard-to-foresee innovation in open-source software for decades will bring about impactful new uses for AI, particularly because those contributions might come from the most unexpected places.
Educational and Research Purposes
One of the key benefits of open source has been its ability to lower the barriers people face when learning how technologies work. The ability to download, study, and freely modify code has led to a wide availability of open-source programming languages and training materials. This has made possible both new forms of training, like coding bootcamps, as well as new avenues for computer science research. Lowering barriers is not only about cost, but also about the ability to freely use and modify code for both training and research purposes.
Open-source models are vital to democratizing education about artificial intelligence (AI) as a technology and avoiding the concentration of knowledge among a small group of people granted access by a few companies to their closed models. Open models and code empower a wider range of people to access these technologies in ways that allow them to gain a hands-on understanding of how they work. Enabling a wider variety of researchers, students, technologists, and hobbyists—along with companies—to examine, run, and modify AI models in unrestricted ways will lead to greater insights about the technology. Such insights are possible when people are equipped with access to a technology and can form deep knowledge of that technology based on their experiences using it.
Consolidating AI around a few large players with closed-source models runs the risk of confining most high-level AI skills and expertise within the walls of the large tech companies that build those models. Without open models, there would be fewer opportunities for those who are not employed by an AI company—or in that company’s approved training pipeline—to learn about the fundamentals of AI technologies. Open models can create multiple pathways to learning AI.
Furthermore, the ability to examine and modify how model code works will lower barriers to academic research on AI. A wider variety of people researching the technology will produce active exploration about a larger and more diverse set of questions about AI.
Indeed, open models may already be impacting AI research, where researchers have already designed experiments using existing open-source AI models to advance the general understanding in the field. For example, Carnegie Mellon and Apple researchers, in 2024, used Mistral’s open-source models to explore ways of creating higher quality training regimens for large-language models using synthetic data.20 What they found was a method for producing more accurate models, using a smaller training corpus and spending less overall time training. This type of research, which could make model training more efficient (and thus less resource intensive), was only possible because the researchers could conduct experiments using an open model. Because they used open tooling and clearly described their methodologies, their more efficient method for producing more accurate models can be replicated, expanded, and used collectively to strengthen AI models for everyone.
In a seminal 1999 article, Lawrence Lessig, the Roy L. Furman Professor of Law and Leadership at Harvard Law School, described the “Open Source Software Moment” taking shape at the time.21 Laying out the ideals of open source, he observed that “putting into the commons one’s work product—of giving away what one makes” might seem “alien to our tradition” but actually functioned much like science where “progress [is] made and given to the next generation.”22 Open models can create multiple pathways to learning AI, and they become a critical part of ensuring AI knowledge is more accessible to people who want to learn about it, whether they are hobbyists or professional researchers. They also make it easier for all of those people to share what they learn with others.
Mitigated Security Risks
Commentary about security in artificial intelligence (AI) often simplistically equates greater model openness with greater risk. Some commentary goes further, claiming that open-source AI is “uniquely dangerous”23 when compared to closed models like ChatGPT. All AI models carry security risks, but imprecise claims about open AI models miss the greater nuance required when discussing the security risks of AI and how those risks differ between open and closed models. As the National Telecommunications and Information Administration (NTIA) noted in a report required by Executive Order 14110 on “the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence,” it is more appropriate to analyze the marginal risks that open models present when compared to closed models and information that is publicly available online.24
There is a broad list of harms that could come from AI, including widespread disinformation, but researchers have concluded there is much less evidence about the marginal risks posed by open models.25 Academic researchers have highlighted certain areas, like the computer generation of child sexual abuse material,26 where there is evidence that open models pose concerning extra risks. We do not suggest that developers, policymakers, or researchers should take a cavalier approach to such risks but instead propose that they should rigorously study, monitor, and mitigate against marginal risks that do emerge in open models.
The NTIA’s report espouses this view, concluding that “the government should not restrict the wide availability of model weights for dual-use foundation models at this time.”27 Instead, the NTIA calls for building governmental capacity for monitoring risks and better understanding the benefits of open models. The NTIA report highlights open models’ ability to benefit security, including in cyber deterrence and defense, advancing safety research and identifying vulnerabilities, and promoting transparency and accountability through third-party auditing.28
AI comes with security concerns, and a responsible approach to building and maintaining any AI model requires vigilance in monitoring potential security vulnerabilities. Security concerns germane to AI fall into three broad categories. The first consists of security concerns that resemble the vulnerabilities that have long been present in other kinds of software development. The second consists of concerns about the downstream uses of AI, which concern how the technology is used rather than how it is developed. The third consists of AI models that could be used to fuel novel cyberattacks.
All code is at risk of having vulnerabilities, regardless of whether developers use an open or closed model for licensing it, and this is likely to hold true even in AI. As researchers at the Wilson Center have noted, “Vulnerabilities can come from dependency management (what, how, and which software packages are pulled into a new software project) to bad-faith actors (people that intentionally break into systems, or contributors intentionally changing the software to be exploitable) whether the software is developed internally or in the open.”29
An additional source of vulnerability that could be added to this list is simple human error. For example, in some coding languages, an error as small as forgetting to put quotes around a single variable (e.g., $FOO versus “$FOO”) could create a vulnerability in the code. The more complex a project’s codebase becomes, the more likely that human error will appear somewhere in that project’s code or even in one of its dependencies (and this risk increases with every new dependency).
Software vulnerabilities can occur whether or not the source code is open or closed. With AI, as with all software, there is a strong chance that some of the code running it will have errors that introduce vulnerabilities, regardless of the AI model’s licensing. However, discourse around open-source models must also account for the ways in which they will confer key security benefits long recognized in open-source software.
Actors from the U.S. private and public sectors have long recognized the ways in which open-source software is central to security. Microsoft’s 2023 Digital Defense Report emphasizes open-source software’s benefits to cybersecurity, noting that the nature of open-source collaboration is key to mapping the threat landscape and scaling responses to those threats.30 Notably, open-source software has long been critical to the Department of Defense, and military modernization efforts call for broadening the use of open-source software and properly securing it.31 The Cybersecurity and Infrastructure Security Agency has also highlighted open source’s relationship to strong security in the context of open-source AI models, that “the general consensus among the security community is that the benefits of open sourcing… outweigh the harms that might be leveraged by adversaries.”32
The ability to modify codebases and run AI models independently of their creators is something unique to open models. While some closed models may allow “red teaming” and have a method for bug reporting, security testers may still face constraints on what they can inspect about the model or restrictions on the types of attack they can simulate and when they can simulate them.
By contrast, running code on machinery that is not owned by a model’s writer allows security researchers to use a number of security analysis techniques and methods (such as simulating brute force attacks) that might not be available or allowable when evaluating a publicly available closed model like ChatGPT or Gemini. Such independent security research could uncover ways to manipulate model outputs or alter how an AI system makes choices, which would then make future versions of those models more impervious to such attacks.
In addition to allowing researchers to improve models, the ability to examine an AI model by completely accessing its model weights, training data, and code also allows for formal security vetting by independent third parties. Third parties, whether governmental or private, bring their own goals and lend their reputation to such reviews, which can increase public trust in a model.
Bugs and vulnerabilities will exist in AI code, as they do in all code, but given the strong track record of security in open source, there is a whole class of risks that aren’t substantially new or different—such as the environment where the model is run—and there are many well-established methods for addressing them.
The risks that are novel with AI generally manifest “downstream” from the technology itself. That is to say, the harm happens as a result of the application of AI rather than the AI itself. There is much well-founded concern about the use of AI in supercharging disinformation campaigns, but thus far, there is no clear evidence of differential impact based on whether the AI used in such campaigns are open or closed models. Indeed, Microsoft’s own research team notes that large-language models like ChatGPT, which runs on a closed foundation model, have been weaponized by adversary governments, including China.33
There are legitimate security concerns when it comes to AI. These concerns deserve attention and action that addresses them. However, to ensure that a narrow security lens isn’t used to hinder the development of an open AI ecosystem, experts must precisely identify where open models present risks beyond publicly available closed models or information that is publicly available online.
Citations
- See, e.g., Jaden Fiotto-Kaufman, Alexander R. Loftus, et al., “NNsight and NDIF: Democratizing Access to Foundation Model Internals,” arXiv, July 18, 2024, source.
- See, e.g., Spandana Singh, Charting a Path Forward: Promoting Fairness, Accountability, and Transparency in Algorithmic Content Shaping (New America’s Open Technology Institute, September 9, 2020), source.
- Benjamin Brooks, “Consultation on Safe and Responsible AI in Australia,” Department of Industry, Science and Resources, October 4, 2024, source.
- Divya Siddarth, Saffron Huang, and Audrey Tang, “A Vision of Democratic AI,” Digitalist Papers, September 22, 2024, source.
- Siddarth, Huang, and Tang, “A Vision of Democratic AI,” source.
- “We find that even though there are a handful of meaningfully transparent, reusable, and extensible AI systems, these and all other ‘open’ AI exists within a deeply concentrated tech company landscape. With scant exceptions that prove the rule, only a few large tech corporations can create and deploy large AI systems at scale…Given the immense importance of scale to the current trajectory of artificial intelligence, this means ‘open’ AI cannot, alone, meaningfully ‘democratize’ AI, nor does it pose a significant challenge to the concentration of power in the tech industry.” Widder, West, and Whittaker, “Open (For Business),” source.
- Nik Marda, Jasmine Sun, and Mark Surman, Public AI: Making AI Work for Everyone, By Everyone (Mozilla, September 2024), 7, source.
- See, e.g., Marda, Sun, and Surman, Public AI, source; Public AI Network, “Public AI,” source; Sitaraman and Pascal, “The National Security Case for Public AI,” source; Sanders, Schneier, and Eisen, “How Public AI Can Strengthen Democracy,” source.
- Zittrain, The Future of the Internet, source.
- Zittrain, The Future of the Internet, x, source.
- Rishi Bommasani, Sayash Kapoor, et al., “On the Societal Impact of Open Foundation Models,” arXiv, February 27, 2024, source.
- Leslie Daigle, The Internet Invariants: The Properties Are Constant, Even as the Internet Is Changing (Thinking Cat, May 16, 2019), 40, source.
- “[The] unpredictability [of…] internet infrastructure makes it generative, worthwhile, and deeply human.” Maria Farrell and Robin Berjon, “We Need to Rewild the Internet,” Noēma Magazine, April 16, 2024, source.
- Marc Merlin, “Live Upgrading Thousands of Servers from an Ancient Red Hat Distribution to a 10-Year Newer Debian Based One,” presented at the Large Installation System Administration Conference, Washington, DC, November 3–8, 2013, source.
- Kordian Bruck, Margarita Manterola, and Sven Mueller, “How Google Got to Rolling Linux Releases for Desktops,” Google Cloud (blog), July 12, 2022, source.
- “October 2024 Web Server Survey,” Netcraft, October 31, 2024, source.
- James Thomason, “Why Small Language Models Are the Next Big Thing in AI,” Bloomberg, April 12, 2024, source.
- “OpenCellular,” Telecom Infra Project, source.
- Paolo Faraboschi, Ellis Giles, Justin Hotard, Konstanty Owczarek, and Andrew Wheeler, “Reducing the Barriers to Entry for Foundation Model Training,” arXiv, October 14, 2024, source.
- Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly, “Rephrasing the Web: A Recipe for Compute & Data-Efficient Language Modeling,” arXiv, January 20, 2024, source.
- Lessig, “Open Code and Open Societies, 104, source.
- Lessig, “Open Code and Open Societies,” 1411–1412, source.
- David Evan Harris, “Open-Source AI Is Uniquely Dangerous,” Spectrum, January 12, 2024, source.
- Dual-Use Foundation Models, 10, source.
- Bommasani, Kapoor, et al., “On the Societal Impact of Open Foundation Models,” 6, source.
- David Thiel et al., Generative ML and CSAM: Implications and Mitigations (Stanford Internet Observatory, June 24, 2023), 7–8, source.
- Dual-Use Foundation Models, 36, source.
- Dual-Use Foundation Models, 17, source.
- Ashley Schuett, Alison Parker, and Alex Long, “Open Source Software and Cybersecurity: How Unique Is This Problem?” CTRL Forward (blog), Wilson Center, November 10, 2022, source.
- Microsoft Threat Intelligence, Microsoft Digital Defense Report 2023 (Microsoft, October 2023), 116, source.
- Ben FitzGerald, Jacqueline Parziale, and Peter L. Levin, Open Source Software and the Department of Defense (Center for a New American Security, August 30, 2016), source.
- Jack Cable and Aeva Black, ”With Open Source Artificial Intelligence, Don’t Forget the Lessons of Open Source Software,” Cybersecurity and Infrastructure Security Agency, July 29, 2024, source.
- Microsoft Threat Intelligence, “Staying Ahead of Threat Actors in the Age of AI,” Microsoft Security, February 14, 2024, source.