Google’s AVIS program can dynamically select a series of steps to be taken, such as identifying an object in a photo, then searching for information about that object. UCLA, Google
Artificial intelligence programs dazzle with their ability to produce an answer to any request. However, the quality of the response often leaves something to be desired, because programs such as ChatGPT simply respond to the text they are given, without special knowledge of the subject, and can therefore produce false answers.
A recent research project by the University of California, Los Angeles (UCLA) and Google allows large language models (LLMs) such as ChatGPT to select a specific tool, such as a web search or optical character recognition, and then use it to search for an answer from another source over several steps.
A primitive form of “planning” and “reason”
The result is a primitive form of “planning” and “reason”, a way for a program to determine at each moment how a question should be addressed and, once addressed, whether the solution is satisfactory.
The details of this project, called AVIS (for “Autonomous Visual Information Seeking with Large Language Models”), are posted on arXiv. AVIS is based on Google’s Pathways language model, or PaLM, a large language model that has given rise to multiple versions adapted to a variety of approaches and experiments in generative AI.
AVIS belongs to a recent line of research aimed at turning machine learning programs into “agents” that do more than simply predict the next word. These include BabyAGI, an “AI-powered task management system” presented this year, and PaLM-E, also presented this year by Google researchers, which can instruct a robot to follow a series of actions in physical space.
No pre-established action plan
The great advance of the AVIS program is that, unlike BabyAGI and PaLM-E, it does not follow a pre-established action plan. Instead, it uses an algorithm called a “planner”, which selects actions on the fly, as the situation arises. These choices are generated as the language model evaluates the text of the request, decomposes it into sub-questions, then maps these sub-questions to a set of possible actions.
Even the choice of actions is a new approach.
The researchers conducted a survey of 10 people who had to answer the same types of questions, such as “What is the name of the insect?” appearing on an image. Their choices of tools, such as Google Image Search, were recorded.
The capacity of the reasoner
The authors then integrated these examples of human choices into what they call a “transition graph”, a model of how humans choose tools at each moment.
The planner then uses the graph, choosing from the “relevant examples in context […] which are assembled from the decisions previously made by humans”. This is a way to get the program to model itself on human choices, using past examples as so many additional inputs for the language model.
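As a rough illustration of the idea, the following Python sketch builds a transition graph from recorded tool-use sequences and then looks up which actions a planner could consider next. It is not the AVIS implementation: the session data, the tool names and the functions `build_transition_graph` and `allowed_actions` are all hypothetical, chosen only to make the idea concrete.

```python
from collections import defaultdict

# Hypothetical tool-use sequences recorded from human participants.
# The tool names are illustrative assumptions, not taken from the paper.
human_sessions = [
    ["START", "object_detection", "image_search", "answer"],
    ["START", "image_search", "web_search", "answer"],
    ["START", "object_detection", "web_search", "answer"],
]

def build_transition_graph(sessions):
    """Map each state to the set of actions humans took next from it."""
    graph = defaultdict(set)
    for session in sessions:
        # Pair every step with the step that followed it.
        for current, following in zip(session, session[1:]):
            graph[current].add(following)
    return dict(graph)

def allowed_actions(graph, state):
    """Return, in sorted order, the actions a planner could consider from a state."""
    return sorted(graph.get(state, ()))

graph = build_transition_graph(human_sessions)
print(allowed_actions(graph, "START"))
```

In a system of this kind, the language model would choose among only these allowed actions, with the recorded human decisions supplied as in-context examples.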
To keep its choices in check, the AVIS program has a second algorithm, a “reasoner”, which evaluates the usefulness of each tool after the language model has tried it, before deciding whether to provide an answer to the initial question. If the choice of a particular tool was not useful, the reasoner sends the planner back to the drawing board.
The total AVIS workflow consists of designing questions, selecting tools, and then using the reasoner to check if the tool has produced a satisfactory answer. UCLA, Google
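The planner and reasoner loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors’ code: `run_agent`, `plan` and `judge` are hypothetical names, the three verdict labels are an assumption made for illustration, and the tools are stubs standing in for real searches and for calls to a language model.

```python
def run_agent(question, tools, plan, judge, max_steps=5):
    """Loop: the planner picks a tool, the tool runs, the reasoner judges the output."""
    evidence = []   # outputs the reasoner judged useful but not yet sufficient
    tried = set()   # tools already used, so the planner does not repeat them
    for _ in range(max_steps):
        tool_name = plan(question, evidence, tried)
        if tool_name is None:           # the planner has nothing left to try
            break
        tried.add(tool_name)
        output = tools[tool_name](question)
        verdict = judge(question, output)
        if verdict == "answer":         # the output is enough to answer
            return output
        if verdict == "informative":    # useful, but keep searching
            evidence.append(output)
        # Otherwise the output is discarded and the planner chooses again.
    return None

# Stub tools (assumptions): the first returns nothing useful, the second succeeds.
tools = {
    "image_search": lambda q: "",
    "web_search": lambda q: "monarch butterfly",
}

def plan(question, evidence, tried):
    """Hypothetical planner: pick the first tool that has not been tried yet."""
    for name in tools:
        if name not in tried:
            return name
    return None

def judge(question, output):
    """Hypothetical reasoner: any non-empty output counts as a final answer."""
    return "answer" if output else "uninformative"

result = run_agent("What is the name of the insect?", tools, plan, judge)
print(result)
```

Here the first tool yields nothing, so the stub reasoner rejects it and the planner moves on to the second tool, whose output is accepted as the answer.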
The researchers tested AVIS on some standard automated benchmarks for visual question answering, such as OK-VQA, introduced in 2019 by researchers at Carnegie Mellon University. On this test, AVIS achieved “an accuracy of 60.2, higher than most existing methods adapted to this data set,” they report. In other words, the general approach here seems to surpass methods that have been carefully adapted to a specific task, an example of the increasing generality of machine learning.
In conclusion, the researchers emphasize that they plan to go beyond image-based questions in their future work. “We want to extend our dynamic decision-making framework powered by LLM to other reasoning tasks,” they write.
Source: “ZDNet.com”