ChatGPT exposes a capability abyss that open source can't close
As impressive as ChatGPT is, it raises a big red flag. Chillingly, the implication of the Turing test (that a machine must be considered intelligent if a human can't tell that the conversation is with a machine) is not the biggest one. The main problem is the set of capabilities needed to create a competitor. As it stands, ChatGPT is far from open source, and it's unlikely that anything similar will be any time soon. That's an immense problem for innovation and for democratized access to technology. Why? Read on.
I am as impressed as anyone with the abilities of ChatGPT. Many of us have used the web-based interface to get incredible, surprising, and ridiculous answers to our questions. Most of us are awed by the capabilities. But the ChatGPT service is more than just a recreational tool sporadically used to see what it can do; it will also have a paid API that developers can use to implement conversation-based AI into their products.
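For a sense of what building on such an API looks like, here is a minimal sketch using the GPT3 completion endpoint OpenAI already sells access to, via their official openai Python package. The model name, prompt, and parameters are illustrative choices on my part, not recommendations:

```python
# A minimal sketch of calling OpenAI's paid GPT3 completion API.
# Assumes the official `openai` package (pip install openai) and an
# API key in the environment; model and parameters are illustrative.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    model="text-davinci-003",  # a GPT3-family model offered at the time of writing
    prompt="Summarize the GNU GPL in one paragraph.",
    max_tokens=150,
    temperature=0.7,
)

print(response.choices[0].text.strip())
```

Note how little of the hard part lives in your code: the model, and everything it cost to build, sits behind the API.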
Someday, someone else will have produced something even better, and developers will migrate to that. As long as the vendors keep this open and reasonably priced, many of us will be happy. But I have always preferred open-source products when possible. Will I find free and open-source alternatives here: the software necessary to train and use these models, and open-source pre-trained models?
I find that very unlikely; at best, a few parts will be made accessible. Granted, Facebook has made a 175-billion-parameter model available under an almost-but-not-quite-free license; as of now, the license only permits research use. The restrictive license even prohibits researchers from distributing models created with the software or based on its notes and documentation.
So, for now, we can only expect paid access. Training the GPT3 model that ChatGPT is based on is prohibitively expensive and time-consuming: the authors of this report (see also the Reddit discussion about it) have tried to calculate how expensive it was. They claim that even using the lowest-priced GPU cloud provider, the cost would be around $4,600,000. If an individual wanted to do this, even after forking over $18,900 for the fastest GPU/AI chip on the market, the training would take 355 years (to be fair, Facebook claims that its model was trained using merely eighteen of the $18,900 GPUs mentioned above).
The $4,600,000 doesn't include the cost of acquiring the documents and web pages needed to train the model. I haven't seen anywhere how many web pages were necessary to train GPT3, but the previous model, GPT2, was based on 8 million web pages and amounted to 1.5 billion parameters. GPT3 consists of 175 billion parameters, more than a hundred times as many. Whether the number of underlying web pages and documents grew by a similar factor, I don't know. But I think we can confidently say that the amount increased massively.
It's safe to assume this process is also very time-consuming and costly. Some work is needed to weed out junk documents, which increases the cost even more. How much is difficult to estimate, but let's stipulate $400,000. Then the total adds up to $5,000,000. And this is before you can even start developing a ChatGPT-like system.
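To keep the arithmetic honest, here it is spelled out. Every number below is one of the rough estimates quoted above, not a measurement:

```python
# Back-of-the-envelope arithmetic using the rough estimates quoted above.
cloud_training_cost = 4_600_000   # USD, cheapest-GPU-cloud estimate for GPT3
data_collection_cost = 400_000    # USD, stipulated cost of gathering and cleaning data
total = cloud_training_cost + data_collection_cost
print(f"Total before any product development: ${total:,}")  # $5,000,000

# Owning hardware instead: one $18,900 top-end GPU needs ~355 years,
# so even 18 of them (as Facebook reportedly used) would still need
# roughly 355 / 18 years, assuming naive linear scaling.
print(f"18 GPUs: ~{355 / 18:.0f} years")  # ~20 years
```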
OpenAI may have paid less. Microsoft, one of its primary investors, provided the necessary computing power through its Azure cloud service; Microsoft has also obtained an exclusive license to GPT3. Code-generation tools based on GPT3 (GitHub Copilot, built on the Codex model) are already available in VSCode, Microsoft's open-source programming IDE. Lately, there has been speculation that Microsoft will add GPT3-based text generation to Microsoft Word. The technology (remember, exclusively licensed) will perhaps become ubiquitous across Microsoft's vast range of products.
In addition, Microsoft is rumored to be investing $10 billion in OpenAI at a valuation of $29 billion. Microsoft is already an investor, and the deal would reportedly leave it with 49% ownership of OpenAI. This puts a price tag on how strategically important AI has become for Big Tech companies.
Open source enabled the Internet, computing, and innovation by creating alternatives (often better ones) to commercial software in almost every category. Nearly every app and service you use daily, large or small, depends on significant amounts of open-source software.
Historically, some of the important open-source projects have been:
- Apache replaced most commercial web server software (nginx, another open-source product, has since grown to equal Apache in market share).
- The caching software Varnish made large-scale web services possible (the porn industry was among its earliest adopters).
- LibreOffice became a viable alternative to Microsoft Office. Parts of it, such as its dictionaries and spell checkers, have even found their way into other open-source products.
- For many, MySQL and Postgres replaced enterprise database software such as Oracle and Microsoft SQL Server. Although these competitors still exist, a third, Sybase, has been all but driven from the market; even its owner, SAP, hasn’t used the product in almost a decade.
- Open-source programming languages have all but replaced any closed-source products, removing the problem of incompatible vendor-specific implementations.
- V8, the JavaScript engine in Google’s Chrome, now powers many other products. Most important are Node.js and Electron-based desktop apps such as Microsoft Teams and Slack.
- Elasticsearch and similar products gave individuals and companies the tools necessary to build private, in-house search engines — on a smaller scale, of course, but similar to what Google did with the web.
- Google’s TensorFlow made deep learning accessible to non-specialists.
- OpenSSL has become the technology that powers most of the encryption and security you use today, from secure communication to web servers, VPNs, and the conversations you have on your phone, whether text messages or audio and video calls.
- The GNU project consists mostly of command-line utilities for Unix-like systems, plus applications such as GNU Octave, a MATLAB alternative, and PSPP, a statistical tool similar to SPSS.
- And, most important of all, Linux became a free — in any sense of the word — alternative to Microsoft Windows, macOS, Solaris, etc. It didn’t just enter the server space but also spread to desktop computers. From there, Linux found its way into tablets and cell phones, fridges and home entertainment systems, GPS systems, satellites and Mars missions, powerful supercomputers, software containers (if you run them on macOS or Windows, the software internally runs on Linux), and even miniature embedded hardware. Name a category, and I all but guarantee you it’s there.
In short, open-source technologies have enabled most of the innovation we’ve seen in the last 25 years. Small businesses have created products with very little investment, and some have become large companies. Large corporations have benefited, too: macOS is not the same OS it was when Linux was conceived; the macOS of 2023 is a near-perfect amalgamation of proprietary software on top and open-source software underpinning the whole thing. Ironically, many of these technologies underpin the systems that created GPT and ChatGPT, too; I can almost guarantee you that the training was done on large Linux clusters.
Linux and related open-source software also made low-cost hardware useful and available to even more people than before, from low-income families and individuals in the West to inhabitants of developing countries. A couple of hundred dollars will now buy you a Raspberry Pi, mouse, keyboard, and a cheap screen: equipment good enough for entertainment, productivity, and developing the Next Big Thing.
Unless the Next Big Thing depends on large-scale deep learning, of course.
Building a conversation-based model on top of a language model is not only a different playing field. It is a different game altogether. Wikipedia explains the ChatGPT development process like this:
ChatGPT was fine-tuned on top of GPT-3.5 using supervised learning as well as reinforcement learning. Both approaches used human trainers to improve the model’s performance. In the case of supervised learning, the model was provided with conversations in which the trainers played both sides: the user and the AI assistant. In the reinforcement step, human trainers first ranked responses that the model had created in a previous conversation. These rankings were used to create ‘reward models’ that the model was further fine-tuned on using several iterations of Proximal Policy Optimization (PPO).
As you can see, adding a conversation-based user interface to GPT3 takes a lot of human effort. I have yet to see that cost estimated anywhere, but I wouldn't be surprised if it approaches the cost of training GPT3. And even if the technology cost of training the AI models goes down over time, a certain level of human interaction will still be necessary for creating a conversation interface.
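The "reward model" step is the conceptual heart of that recipe. As a toy illustration, here is a self-contained sketch of the pairwise ranking loss typically used to train such a model, with made-up three-number feature vectors standing in for real model responses; nothing here is OpenAI's actual code:

```python
# A toy illustration of the reward-model step in RLHF: given human
# rankings, learn a score so preferred responses outrank rejected ones.
# The "features" are made up; real reward models score actual LLM output.
import math

def score(weights, features):
    # The "reward model": here, just a linear score over features.
    return sum(w * f for w, f in zip(weights, features))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Each pair: (features of the response a trainer preferred,
#             features of the response they rejected).
pairs = [
    ([0.9, 0.1, 0.4], [0.2, 0.8, 0.3]),
    ([0.8, 0.2, 0.9], [0.1, 0.9, 0.5]),
    ([0.7, 0.3, 0.2], [0.3, 0.6, 0.1]),
]

weights, lr = [0.0, 0.0, 0.0], 0.5
for _ in range(200):
    for preferred, rejected in pairs:
        # Gradient step on the pairwise loss -log sigmoid(score_diff).
        p = sigmoid(score(weights, preferred) - score(weights, rejected))
        for i in range(len(weights)):
            weights[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])

print(weights)  # now scores every preferred response above its rejected one
```

In the real pipeline, this learned score then drives the PPO fine-tuning; the expensive part is not the math but the armies of human trainers producing the rankings.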
All of this means that it is improbable that a present-day Linus Torvalds-like student has started tinkering with their own conversation-based AI to be offered under a free license. Such an effort would require systems and people to oversee it from the get-go, and lots of both. It seems unlikely that such an effort could be managed from a dorm room.
It's more likely that a commercial operator would fund and manage such a project to disrupt and destroy the market for its competitors. But at the moment, that too seems unlikely, as even Google is shaken to the core by ChatGPT.
Business Insider reports:
Google’s management has issued a “code red” amid the launch of ChatGPT — a buzzy conversational-artificial-intelligence chatbot created by OpenAI — as it’s sparked concerns over the future of Google’s search engine.
Should Google be able to build something like ChatGPT into their search engine, I'm sure they would look upon that capability as part of their search engine's secret sauce. As with Coca-Cola, the obvious corporate instinct is that this is a recipe you want to keep private.
Even if a grass-roots open-source project managed to create something remotely similar to GPT3 and ChatGPT, the bar for best in class keeps rising. Yes, GPT3 is the best we have now, but GPT4 is already in development. It's speculated that it will be released in a few months (this was written in January 2023).
OpenAI has said that GPT4 won't be much bigger than GPT3 if you count parameters. Instead, it's hinted that the most significant jump in quality will come from using significantly more computing power when training the new model. And that power, as discussed, is even harder to come by (the 18 Tesla V100 GPUs that Facebook used to create its language model will probably be dwarfed by whatever GPT4 demands).
It’s reasonable to expect GPT4 to have more features and functionality than earlier versions. There is speculation that the model will have Q&A features built in (think a GPT3+ChatGPT all-in-one solution) and the ability to generate images and video based on text input. What is true and what is wishful thinking is difficult to determine now. But it doesn't take psychic powers to predict that it will significantly improve on GPT3, and you can safely assume that it won’t be open and free software.
And if the best we end up with is that these companies make their models available through APIs, we’d still have the problem of bias. It's a problem because you must take what you’re given, with limits on how much you can skew the bias in another direction.
Examples of biases are:
- What’s considered an acceptable level of nudity differs between regions.
- What’s regarded as acceptable language differs between cultures.
- What’s politically acceptable varies by country.
In addition, we have biases that AI models implicitly pick up when being trained. Remember Tay? It only took 24 hours for it — a Microsoft-made AI Twitter bot — to become a foul-mouthed racist.
In short, if you want your model to have a bias, it’d better be your bias.
One could argue that my outlook on this is too bleak, but I can’t help thinking that this has made the control over computing power and the corresponding real-life power you get from it more imbalanced than ever. This is the imbalance that open source initiatives, such as the GNU project, Linux, and others, tried to even out — with great success.
Can something be done? Frankly, I don’t know.
The EU will probably launch initiatives to create European AI labs to compete with the US- and China-based mastodons. They always do, but I’ve yet to see such initiatives succeed.
Alternatively, one could imagine a well-funded non-profit organization/foundation managing a project like this. However, I think that successful non-profit organizations grow out of already established and arguably successful products — the Mozilla organization comes to mind. But where Mozilla is funded by the revenue they get from search engines, it’s harder to see how an open AI initiative would support itself initially.
Given all this, the best thing one could achieve is probably not an entity that creates software and models for others to use. Instead, I’d look for a solution that cheaply, or even freely, provides the computing power needed to train large deep-learning models such as GPT and its cousins.
If you were around in the early 2000s, you might remember SETI@home. This screen saver put your computer to work while you weren’t using it, analyzing radio-telescope data. The result was an extensive, distributed network of computing power that looked for signs of life in the voluminous amounts of data these telescopes had amassed.
What would happen if someone made something like this today and used it to create a GPTx-like model, available under a liberal open-source license? It would not be a high-speed solution: data transfer rates would be abysmal compared to those within OpenAI’s large Azure cluster, and the processors available would probably be puny in comparison, too.
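To make the idea concrete, here is a self-contained toy simulation of SETI@home-style data-parallel training: each simulated volunteer computes a gradient on its own shard of data, and a coordinator averages the results into one update. A real system would also have to handle stragglers, dropouts, and untrusted or malicious results; everything below is hypothetical:

```python
# A toy simulation of volunteer, data-parallel training. The "model" is a
# single weight fit to y = 3x; real volunteer training would face slow
# nodes, dropouts, and untrusted results on top of this basic loop.
import random

random.seed(42)

# Fake dataset, split into one shard per volunteer.
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(1000))]
num_volunteers = 10
shards = [data[i::num_volunteers] for i in range(num_volunteers)]

def local_gradient(weight, shard):
    # Gradient of mean squared error, computed on one volunteer's shard.
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

weight, lr = 0.0, 0.1
for step in range(200):
    # Volunteers work independently; only small gradients cross the
    # network, never the raw training data itself.
    gradients = [local_gradient(weight, shard) for shard in shards]
    weight -= lr * sum(gradients) / len(gradients)

print(round(weight, 3))  # converges to ~3.0
```

For a GPT-sized model, the gradients themselves are enormous, which is exactly where the abysmal transfer rates mentioned above would bite hardest.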
But minimizing the capability abyss is imperative for innovation, paramount for giving consumers freedom of choice, and absolutely necessary to avoid models skewed or biased by their creators’ views. If the end goal were clear and worthy enough that people signed up to donate their idle computing power, we might end up with a tortoise-and-hare situation.
Hopefully, one where slow and steady wins in the end.