{"id":9825,"date":"2025-03-06T03:00:00","date_gmt":"2025-03-06T08:00:00","guid":{"rendered":"https:\/\/www.both.org\/?p=9825"},"modified":"2025-03-03T08:45:02","modified_gmt":"2025-03-03T13:45:02","slug":"cautions-when-using-ai-for-coding","status":"publish","type":"post","link":"https:\/\/www.both.org\/?p=9825","title":{"rendered":"Cautions when using AI for coding"},"content":{"rendered":"<p>Generative AI is a hot topic these days, and people are finding new ways to leverage large language model (LLM) systems to streamline processes. One way that people have put AI to work is in <em>writing code<\/em>, such as with AI \u201cassistants\u201d like GitHub\u2019s Copilot.<\/p>\n\n\n\n<p>While AI co-authors can help remove some of the boring parts of writing code (how many times do we need to write <em>another<\/em> implementation of \u201cread a data file into memory\u201d?) I think developers need to keep in mind the limitations of using AI in this way.<\/p>\n\n\n\n<p>The biggest drawback in using AI for coding is that <em>AI was trained on other work<\/em> and the \u201cgenerative\u201d nature of LLMs means that the AI will sometimes echo or repeat its input. 
At a high level, this presents two major risks, which can impact developers working either on open source projects or in a proprietary project:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Open source<\/th><th>Proprietary<\/th><\/tr><\/thead><tbody><tr><td>AI includes incompatibly-licensed open source code<\/td><td>AI inserts copyleft-licensed code into a proprietary codebase<\/td><\/tr><tr><td>AI copies proprietary code into your open source project<\/td><td>AI echoes other proprietary code into your closed project<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Let\u2019s look at the two major concerns and how they affect both open source and proprietary software projects:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"ai-inserts-open-source-code\">1. AI inserts open source code<\/h2>\n\n\n\n<p>Copilot and other AI coding \u201cassistants\u201d lure developers with the promise that AI can automate the development cycle. For example, GitHub\u2019s <a href=\"https:\/\/github.com\/features\/copilot\">Copilot features page<\/a> says developers can \u201cAsk GitHub Copilot a question, get the right answer for you, and accept the code with a single click\u201d and that \u201cGitHub Copilot generates what you need\u2014so you can build faster.\u201d<\/p>\n\n\n\n<p>However, AI always starts with <em>training data<\/em>, and that training data had to come from <em>somewhere<\/em>. GitHub\u2019s Copilot had a jump-start in this area because GitHub hosts so many repositories, both open source (free accounts) and proprietary (using GitHub Enterprise). This provided Copilot with a wealth of training data. Drawing from this broad set of training data, Copilot can find inventive solutions to coding challenges.<\/p>\n\n\n\n<p>Unfortunately, this training\u2014combined with the \u201cgenerative\u201d nature of AI\u2014means that Copilot can also insert copies from other projects into your code. 
One memorable example was shared by Armin Ronacher in 2021, showing how GitHub\u2019s Copilot \u201cautocompletes\u201d the <a href=\"https:\/\/x.com\/mitsuhiko\/status\/1410886329924194309\">fast inverse square root implementation<\/a> from Quake III. Id Software first released the <a href=\"https:\/\/github.com\/id-Software\/Quake-III-Arena\">Quake III Arena source code<\/a> in 2005 and published it on GitHub in 2012, so it was likely included in Copilot\u2019s training data. Copilot inserted the code into Ronacher\u2019s code session, adding a copy of the <a href=\"https:\/\/opensource.org\/license\/bsd-2-clause\">BSD 2-clause License<\/a>, also called the \u201cSimplified BSD License\u201d or the \u201cFreeBSD License.\u201d However, as Stefan Karpinski noted in a <a href=\"https:\/\/x.com\/StefanKarpinski\/status\/1410971061181681674\">followup comment on X<\/a>, Id Software actually released Quake III under the <a href=\"https:\/\/github.com\/id-Software\/Quake-III-Arena\/blob\/master\/COPYING.txt\">GNU General Public License, version 2<\/a>. Karpinski also highlighted that Copilot\u2019s inserted comment named the wrong person as the copyright holder.<\/p>\n\n\n\n<p>The critical detail is that while the Free Software Foundation lists the FreeBSD License as <a href=\"https:\/\/www.gnu.org\/licenses\/license-list.en.html#FreeBSD\">compatible with the GNU GPL<\/a> (meaning code released under the FreeBSD License can be included in projects covered by the GNU GPL), the reverse is not true. The FSF notes the FreeBSD License is a \u201clax, permissive non-copyleft free software license.\u201d Anyone can use source code released under the FreeBSD License, including in proprietary or \u201cclosed source\u201d projects, as long as the copyright notice and license text are retained. In contrast, the GNU GPL requires that any distributed program that includes code released under the GNU General Public License must also be released under the GPL. 
This also requires that the source code be made available; the GNU GPL version 2 says:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<ol start=\"3\" class=\"wp-block-list\">\n<li>You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:<\/li>\n<\/ol>\n\n\n\n<ol style=\"list-style-type:lower-alpha\" class=\"wp-block-list\">\n<li>Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,<\/li>\n\n\n\n<li>Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,<\/li>\n\n\n\n<li>Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)<\/li>\n<\/ol>\n<\/blockquote>\n\n\n\n<p>The downstream effects of Copilot inserting open source code can have a huge impact.<\/p>\n\n\n\n<p><strong>For open source projects,<\/strong> maintainers need to understand the origin of every source code contribution. This has usually meant <em>contributed by other developers<\/em>, but in the era of AI code assistants, developers need to consider if code generated by an LLM might have originated from an open source project. 
And if so, <em>is the license of the inserted code compatible with your project?<\/em><\/p>\n\n\n\n<p>That might mean AI inserting code covered by a free software license that is <a href=\"https:\/\/www.gnu.org\/licenses\/license-list.en.html#GPLIncompatibleLicenses\">incompatible with the GNU GPL<\/a> into a project\u2019s codebase that is actually licensed under the GNU GPL. Or it could mean an AI assistant inserting code released under the GNU GPL into a project that is covered by another, incompatible open source license.<\/p>\n\n\n\n<p><strong>For proprietary projects,<\/strong> the problem is made worse by the threat that an AI coding agent might insert code that was originally licensed under the GNU GPL, but without attribution or warning of the original license. In this case, if the issue were eventually uncovered (such as via an audit) the company would need to halt all distribution and sales of the software product until the full codebase can be fully investigated and any offending code contributions rewritten from scratch.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"ai-lifts-from-proprietary-source-code\">2. AI lifts from proprietary source code<\/h2>\n\n\n\n<p>The other worst-case scenario is an AI coding assistant inserting code that originated from a proprietary codebase. While I have not seen reports of this happening, I believe it is a matter of \u201cnot yet.\u201d<\/p>\n\n\n\n<p>Consider the Copilot example. GitHub trained Copilot on projects hosted at GitHub. 
And while GitHub claims that Copilot <a href=\"https:\/\/github.com\/features\/copilot#faq\">does not \u201ccopy\/paste\u201d code<\/a>, Microsoft also admits that Copilot can \u201cgenerate code suggestions <a href=\"https:\/\/learn.microsoft.com\/en-us\/answers\/questions\/1696286\/github-copilot\">based on patterns and examples<\/a> it has seen in public code\u201d and that \u201cthere is a possibility that suggestions might closely resemble existing public code snippets due to the nature of the training data.\u201d<\/p>\n\n\n\n<p>In the same reply, Microsoft advises that \u201cUsers should review and validate the suggestions provided by Copilot to ensure they meet their specific requirements and adhere to intellectual property laws\u201d and \u201cFor enterprises, it\u2019s important to consider regulatory compliance and internal policies regarding the use of AI-powered tools like Copilot.\u201d This shifts the onus to developers to ensure that the source code generated by an AI coding assistant does not violate someone else\u2019s intellectual property.<\/p>\n\n\n\n<p><strong>For proprietary projects,<\/strong> an AI coding agent inadvertently inserting another organization\u2019s proprietary code may not present an immediate risk. Even if this were to happen, the risk of discovery is much lower due to the \u201cclosed source\u201d nature of proprietary software development. 
However, with <a href=\"https:\/\/www.gao.gov\/cybersecurity\">most types of cyberattacks increasing<\/a> (see also the <a href=\"https:\/\/www.crowdstrike.com\/en-us\/global-threat-report\/\">CrowdStrike Global Threat Report<\/a>), including ransomware attackers publicly posting proprietary data and source code, the risk remains that another organization will discover its proprietary code copied into your product by an AI agent.<\/p>\n\n\n\n<p><strong>For open source projects,<\/strong> there is a nonzero risk that code generated by AI and merged into an open source software project might have been copied from proprietary code. This risk might be very small, especially with Microsoft\u2019s claims that specific code is <a href=\"https:\/\/www.youtube.com\/watch?v=UbTlzJm6uv8\">not used to train Copilot<\/a>, but the risk is still there.<\/p>\n\n\n\n<p>This concern is unfortunately one-sided for open source projects, because open source is necessarily <em>in the open<\/em> where anyone can <a href=\"https:\/\/www.gnu.org\/philosophy\/free-sw.en.html\">study how the program works<\/a>. That includes review by companies who might discover re-use of their proprietary code, even when it was unintentionally and unknowingly inserted by an AI co-author. If left unresolved, the project\u2019s developers might be the subject of a costly lawsuit. 
Open source developers want to write software, not be the next <a href=\"https:\/\/en.wikipedia.org\/wiki\/SCO%E2%80%93Linux_disputes\">SCO v Linux<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While AI agents can help streamline development, keep in mind these cautions when using AI for coding.<\/p>\n","protected":false},"author":33,"featured_media":7679,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_lmt_disableupdate":"","_lmt_disable":"","footnotes":""},"categories":[307,150],"tags":[123,152],"class_list":["post-9825","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-programming","tag-ai","tag-programming"],"modified_by":"David Both","_links":{"self":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/9825","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=9825"}],"version-history":[{"count":5,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/9825\/revisions"}],"predecessor-version":[{"id":9830,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/posts\/9825\/revisions\/9830"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=\/wp\/v2\/media\/7679"}],"wp:attachment":[{"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=9825"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=9825"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.both.org\/index.php?rest
_route=%2Fwp%2Fv2%2Ftags&post=9825"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}