GPT Models as an Operating System / Communication Protocol for Autonomous Agents
While GPT models have been around for a few years, it was only after the arrival of OpenAI’s ChatGPT that they caught the wider attention of the public. Managers and IT specialists alike are frantically searching for ways to make the best use of this new technology. As usual, the first ideas will not necessarily turn out to be the most mature ones, and it will take some time until the full potential – and the restrictions – of these GPT models are generally understood.
With the notable exception of Auto-GPT, most proposed use cases focus in one way or another on classical Natural Language Understanding (NLU) tasks: answering questions about a corpus of documents, text summarisation, simplifying language, chatting with users, and so on. Probably every large corporation has already had the idea of creating its own HelpdeskGPT assistant, and project teams are trying out whether it is possible to train ChatGPT on the company’s internal data to enable it to answer questions about internal processes.
While these developments are exciting enough, I argue that treating GPT models as nothing more than a sophisticated NLU engine does not yet tap into the full potential of this technology. Two very recent developments are worth noting in this context.
The first is OpenAI’s latest move to allow selected developers to create plugins. A plugin is, in short, nothing but the ability of ChatGPT, as a natural language model, to make a call to a specific REST API. This approach opens up completely new ways for users to interact with ChatGPT. Imagine a user talking in natural language with a GPT model, telling it to book a restaurant table at a given place and time. Or think of a user browsing the films currently playing at the cinema and then deciding to book a show for two. Clearly, there are many details yet to be sorted out, such as security concerns, financial transactions and so on, but the intention is clear: create a virtual assistant that helps a user navigate the complexity of the world by means of natural language.
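To make this a bit more tangible, here is a rough sketch of what such a plugin call boils down to once the model has extracted place, date, time and party size from the user’s message. The endpoint URL, the parameter names and the book_table helper are invented for illustration; real ChatGPT plugins describe their API to the model through an OpenAPI specification, and the plugin host performs the HTTP call on the model’s behalf.

```python
import requests

# Hypothetical REST endpoint of a restaurant-booking service (illustrative only).
BOOKING_API = "https://api.example-restaurant.com/v1/bookings"

def book_table(restaurant_id: str, date: str, time: str, party_size: int) -> dict:
    """The kind of call a booking plugin ultimately triggers on behalf of the model."""
    payload = {
        "restaurant_id": restaurant_id,
        "date": date,            # e.g. "2023-06-16"
        "time": time,            # e.g. "19:30"
        "party_size": party_size,
    }
    response = requests.post(BOOKING_API, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()       # confirmation details handed back to the model

# User: "Book me a table for two at Luigi's next Friday at 7:30 pm."
# After the model has resolved the intent, the plugin host ends up calling:
# book_table("luigis-trattoria", "2023-06-16", "19:30", 2)
```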
The second recent development is the emergence of Auto-GPT. Similarly to ChatGPT’s plugins, Auto-GPT’s plugins allow a user to make calls to third-party APIs. Currently, however, many users are fascinated by its ability to develop “execution plans”, i.e. to break a goal-oriented task down into multiple sub-steps and execute them individually. Yet Auto-GPT’s approach and architecture are distinct from ChatGPT’s: Auto-GPT runs locally on your own computer (e.g. in a VM). Its trick is i) to tell ChatGPT quite precisely upfront what the allowed responses are, ii) to use ChatGPT to figure out the user’s intent, and iii) to have it select the appropriate response in JSON format, plugging in any missing parameters. Auto-GPT parses the selected response and derives from it which API call to make. APIs in this context are represented as Auto-GPT plugins (which are technically different from ChatGPT plugins) running locally on the same computer as Auto-GPT itself. This allows a much greater degree of freedom, for example manipulating local files, making calls from the local machine to the internet, and calling third-party services that are not allowed from within ChatGPT. Obviously, security is a very big concern here.
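The pattern can be sketched in a few lines of Python: the model is told which commands are allowed, it answers with a JSON object naming one command and its arguments, and a local loop parses that answer and dispatches it to the matching function. The command set, the call_llm stub and the JSON layout below are simplifications of my own, not Auto-GPT’s actual code.

```python
import json

def read_file(args: dict) -> str:
    with open(args["path"]) as f:
        return f.read()

def write_file(args: dict) -> str:
    with open(args["path"], "w") as f:
        f.write(args["content"])
    return f"wrote {args['path']}"

# Commands the local runtime is willing to execute (simplified and illustrative).
COMMANDS = {"read_file": read_file, "write_file": write_file}

SYSTEM_PROMPT = (
    "You are an autonomous agent. Respond ONLY with JSON of the form "
    '{"command": {"name": "<read_file|write_file>", "args": {...}}}'
)

def call_llm(system_prompt: str, user_goal: str) -> str:
    """Placeholder for the actual GPT call (e.g. via the OpenAI chat API)."""
    raise NotImplementedError

def execute_step(user_goal: str) -> str:
    reply = call_llm(SYSTEM_PROMPT, user_goal)   # the model answers in JSON
    command = json.loads(reply)["command"]
    handler = COMMANDS[command["name"]]          # map the chosen command ...
    return handler(command["args"])              # ... onto a local function call
```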
What both examples demonstrate, to different degrees, is the possibility of using a GPT model as glue code to translate user intents expressed in natural language into technical commands expressed in a formalized language (JSON). As stated above, it is this ability that will allow building virtual assistants of yet unknown complexity and power.
But the same technology allows yet another type of use case that, in my personal view, will be the real game changer. The translation of intents expressed in natural language into formal API calls can be used to connect multiple autonomous agents with each other. The agents serve as entry and exit points for third-party systems to communicate with each other, translating formal API calls into natural language and back again. In this way, the GPT model becomes a natural-language communication protocol for two autonomous agents to talk to each other. Alternatively, we could say that GPT serves as an operating system for autonomous agents to interact with each other.

The crucial point here is that in the past all attempts to build autonomous, interacting agents remained limited by the need to create a common formalized communication protocol or ontology. Each agent had to be taught exactly how to interact with every other agent it ever needed to communicate with, resulting in an exponential explosion of complexity as the number of communication paths and protocols between agents grew. All attempts to use formal logic (e.g. predicate logic and the like) remained confined to very narrow domains. With GPT it should be possible to reduce the number of communication protocols to only one: natural language. Agents will be able to talk to each other like humans do. A virtual assistant could literally ask a restaurant agent in natural language – even overcoming language translation barriers: “Can I book a table for 4 persons next Thursday evening?” With the restaurant agent answering: “Sure, do you prefer a table next to the bar or next to the doorway? And what is your preferred time?” Obviously, this will result in plenty of failed attempts at communication between agents, misinterpretations, perhaps even Freudian slips and the like; in short, it will result in a style of communication among agents that is not too far away from how humans communicate. Occasionally, agents might even create confusion and chaos, for example ordering the wrong pair of shoes without the user noticing. But that is the price to pay for a universal communication protocol.
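A minimal sketch of this idea, with the GPT calls stubbed out: each agent owns a formal API on the inside but exposes nothing to the outside except natural-language messages, and the model does the translation in both directions. The Agent class, the llm helper and the prompts are hypothetical placeholders, not an existing library.

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for a GPT call; the prompt carries the translation instruction."""
    raise NotImplementedError

class Agent:
    """Speaks natural language outwardly, a formal API inwardly."""

    def __init__(self, name: str, api_description: str):
        self.name = name
        self.api_description = api_description   # e.g. an OpenAPI-style summary

    def ask(self, intent: dict) -> str:
        # Outgoing direction: formal intent -> natural-language message.
        return llm(f"Express this request as a polite message to another agent: {intent}")

    def receive(self, message: str) -> dict:
        # Incoming direction: natural-language message -> formal call on our own API.
        return json.loads(llm(
            f"Given this API: {self.api_description}\n"
            f"Translate the following message into a JSON call: {message}"
        ))

# The assistant asks in natural language; the restaurant agent maps the message onto
# its own booking API without the assistant knowing anything about that API:
# assistant  = Agent("assistant", "no public API, only user intents")
# restaurant = Agent("restaurant", "POST /bookings {date, time, party_size}")
# message      = assistant.ask({"action": "book_table", "party_size": 4, "day": "Thursday evening"})
# booking_call = restaurant.receive(message)
```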
Obviously, establishing a network of autonomous agents communicating with natural language among themselves will require a lot more engineering than is currently available. We will need ways to identify agents to each other, secured connections (e.g. through public/private key exchanges), secure financial transactions, security for sharing specific items of sensitive data, and more. In the next few months and years we will see an explosion of added services running on top of GPT models as the underlying operating system or communication bus that enables agents to talk to each other.
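For the identification part, one plausible building block is ordinary public-key signatures: every agent signs its outgoing messages, and the receiving agent verifies the signature against a key published in some registry before acting on the content. The sketch below shows an Ed25519 sign/verify round trip with the Python cryptography package; how keys get published and discovered is deliberately left open.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# The sending agent generates a key pair; the public key would be published in a registry.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

message = "Can I book a table for 4 persons next Thursday evening?".encode()
signature = private_key.sign(message)

# The receiving agent verifies that the message really comes from the claimed sender.
try:
    public_key.verify(signature, message)
    print("signature valid - process the request")
except InvalidSignature:
    print("signature invalid - reject the message")
```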
The possibilities of autonomous agents talking to each other via natural language are beyond my comprehension at this point. As are the dangers. Imagine an ERP system noticing a shortage of certain screws. It sends an event to its own autonomous agent. The agent reaches out to a GPT model or a specific “registry agent” or “service search agent”, asking whether there is any other agent out there who could sell that specific type of screw. The GPT model provides two distinct sales agents that have this particular type of screw in stock. Together with this information it also provides the required API protocol and pre-fills the data to be sent to each of those endpoints. The ERP agent now makes calls to the APIs of both sales agents and finally decides from which to buy the screws. Obviously, some common payment gateway must be used. A few days later the shipment is made to the company buying the screws.

What is genuinely different from the past in this example is that the ERP agent does not need prior knowledge of the exact protocols the two sales agents use. This is the part that GPT takes care of. GPT is the middleman here, bridging the gap between the natural language the ERP agent and the sales agents use and the formal APIs they offer. GPT must have enough understanding to be able to interpret the APIs of all those agents. But this is a problem that is already being actively researched. The most promising approach I have seen so far is to teach the GPT model upfront about the correct (and incorrect) use of each API that needs to be added. With a few examples, it should be able to generalize the use of the API by itself and thus add yet another API to its list of offered endpoints. Thus, GPT becomes the aforementioned communication bus for autonomous agents.
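Teaching the model an API with a few examples can be sketched roughly as follows: the prompt pairs a couple of natural-language requests with correct (and one deliberately incorrect) calls to a hypothetical screw-vendor API, and the model is then asked to produce the call for a new, unseen request. The endpoint and field names are invented for illustration, and the call_llm stub stands in for the actual GPT call.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the actual GPT call."""
    raise NotImplementedError

# Few-shot prompt that teaches the model a hypothetical screw-vendor API.
FEW_SHOT_PROMPT = """You translate purchasing requests into calls to POST /orders.

Request: "We need 500 M4x20 hex screws, delivery to plant 2."
Correct call: {"endpoint": "/orders", "item": "M4x20-hex", "quantity": 500, "ship_to": "plant-2"}

Request: "Order some M6 bolts."
Incorrect call: {"endpoint": "/orders", "item": "M6"}   (quantity and ship_to are missing)
"""

def order_call_for(request: str) -> str:
    # The model is expected to generalize from the examples to a new request.
    prompt = FEW_SHOT_PROMPT + f'\nRequest: "{request}"\nCorrect call:'
    return call_llm(prompt)

# order_call_for("We are short 2000 M3x10 countersunk screws at plant 1.")
```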
What Auto-GPT already offers at this point is an extension to this setup in the form of “memory” (e.g. through Pinecone). Once two agents have succeeded in communicating with each other, much of the prior discovery and setup work for making API calls is no longer needed. The ERP agent has by then learned how to contact those two sales agents, and it can keep this knowledge in memory. Only when the interaction with one of the two sales agents unexpectedly stops working does the ERP agent have to contact the GPT communication bus again to figure out the right interaction model with these two, or with some new, sales agents. How exactly this is implemented is beyond the scope of this article; the reader is referred to Auto-GPT once more.
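In code, such a memory can be as simple as a lookup from a partner agent to the interaction protocol that has already been worked out; a plain dictionary stands in below for a vector store such as Pinecone. The interesting part is the fallback path: only when a cached protocol stops working does the agent go back to the GPT communication bus. The helper functions are hypothetical placeholders.

```python
# Cache of interaction protocols already learned for known partner agents.
# A plain dict stands in here for a vector store such as Pinecone.
protocol_memory: dict[str, dict] = {}

def discover_protocol(agent_id: str) -> dict:
    """Placeholder: ask the GPT 'communication bus' how to talk to this agent."""
    raise NotImplementedError

def call_agent(protocol: dict, payload: dict) -> dict:
    """Placeholder: perform the actual API call described by the protocol."""
    raise NotImplementedError

def talk_to(agent_id: str, payload: dict) -> dict:
    # Use the remembered protocol if we have one; discover it only on first contact.
    if agent_id not in protocol_memory:
        protocol_memory[agent_id] = discover_protocol(agent_id)
    try:
        return call_agent(protocol_memory[agent_id], payload)
    except Exception:
        # The interaction stopped working: rediscover the protocol via the GPT bus.
        protocol_memory[agent_id] = discover_protocol(agent_id)
        return call_agent(protocol_memory[agent_id], payload)
```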