AI’s PERCEPTION TEST
The journey of cutting-edge artificial intelligence research is getting more fascinating by the day. After all, the final goal is to reach Artificial General Intelligence, where a machine is able to do all that we do, with a mathematical precision that seems to belong to a different world. Along the way, machines' dependence on humans will keep going down gradually, until they are able to manage on their own. DeepMind has been one of the pioneers in this journey and has delivered milestone after milestone. It now works under the banner of Google, with Demis Hassabis leading from the front as usual.
Though the company is still not making profits, it has been delivering pathbreaking AI milestones, from drastically bringing down energy consumption at Google's data centers to new research in matrix multiplication. Now comes a foray into an area without which the AI revolution can never claim to be anywhere near its promised land. That missing link is perception. DeepMind is introducing the Perception Test, a multimodal benchmark that uses real-world videos. The goal is to help evaluate the perception capabilities of a machine learning model. There is no denying that benchmarks have played an important role in this journey.
Benchmarks, to be precise, have helped define research goals and enabled researchers to track progress towards those goals. Perception is an important component of intelligence. It is the process of experiencing the world through the senses. Intelligent agents with human-level perceptual understanding need to be developed. Such agents would provide a pathbreaking edge in the crucial fields of robotics, self-driving cars, personal assistants and medical imaging. Perceiver, Flamingo and BEiT-3 are a few multimodal models that aim to be more general models of perception. As no dedicated benchmark has been available, they are assessed on multiple specialized datasets. These include Kinetics for video action recognition, MOT for object tracking, VQA for image question answering, and so on.
Each of the existing benchmarks focuses on a small subset of perception. Object tracking tasks, for example, mostly capture the low-level appearance of objects, such as color or texture, and image benchmarks leave out temporal aspects. Very few benchmarks provide tasks across both visual and audio modalities. To address many of these problems, DeepMind has built a dataset of purpose-designed videos of real-world activities, labeled according to six types of tasks: object tracking, point tracking, temporal action localization, temporal sound localization, multiple-choice video question-answering and grounded video question-answering. A balanced dataset has been the focus. The design drew inspiration from synthetic datasets such as CATER and CLEVRER. The tasks probe knowledge of semantics, understanding of physics, temporal reasoning or memory, and abstraction capabilities. The Perception Test has been created to stimulate and guide future research into general perception models.
PERCEPTION TESTS WOULD BE THE BEGINNING OF CREATING PERCEPTION IN THE MACHINE.