Date: 2025-02-22

Hyderabad: Microsoft has presented Magma, a foundation model that can understand both images and language in the digital and physical worlds, to complete tasks such as using an app or moving a robot, respectively.

Developed for multimodal AI agents by researchers from Microsoft Research, the University of Maryland, the University of Wisconsin-Madison, KAIST, and the University of Washington, Magma is the first foundation model capable of interpreting and grounding multimodal inputs within its environment.

Magma combines verbal and spatial intelligence

Magma has both verbal and spatial intelligence and is able to formulate plans and execute actions to achieve a described goal.

It is an extension of vision-language (VL) models that not only retains their VL understanding ability (verbal intelligence) but can also plan and act in the visual-spatial world (spatial intelligence), Microsoft explains, adding that the model can not only complete agentic tasks like UI navigation but also manipulate robots.

Magma at a glance (GitHub/Microsoft)

"By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial, and temporal intelligence to navigate complex tasks and settings," Microsoft further explains.

How researchers trained Magma

Magma is pre-trained on large amounts of heterogeneous VL data, including images, videos and robotics data. Researchers labelled actionable items in images, such as clickable buttons in a graphical user interface, using Set-of-Mark (SoM). They labelled actionable object movements in videos, such as the path of a robotic arm, using Trace-of-Mark (ToM).

What makes Magma special is its acquisition of spatial intelligence from large-scale training data, facilitated by SoM and ToM. Microsoft says that Magma sets new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are tailored specifically to these tasks.

"On VL tasks, Magma also compares favourably to popular VL models that are trained on much larger datasets," Microsoft adds.

For training, researchers broke the text into smaller pieces (tokens), while images and videos from different domains were encoded by a shared vision encoder, which translated visual information into a format that the model could understand. These discrete and continuous tokens were then fed into a large language model (LLM) to generate outputs of verbal, spatial, and action types.
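The pipeline described above can be illustrated with a toy sketch. It is not Magma's code: the lookup table, the averaging "encoder", the embedding width, and the choice to place visual vectors before text vectors are all assumptions made for illustration. It shows only the general idea of mapping discrete text tokens and continuous visual features into one sequence of same-width vectors that a language model could consume.

```python
EMBED_DIM = 3  # tiny embedding width, for illustration only

# Hypothetical lookup table: one vector per discrete text token id.
TEXT_EMBEDDINGS = {
    0: [0.1, 0.0, 0.2],
    1: [0.0, 0.3, 0.1],
    2: [0.5, 0.5, 0.0],
}


def embed_text(token_ids):
    """Discrete path: look up a vector for each text token id."""
    return [TEXT_EMBEDDINGS[i] for i in token_ids]


def encode_patches(patches):
    """Continuous path: a stand-in 'vision encoder' that averages each
    patch's pixel values and repeats the mean EMBED_DIM times."""
    return [[sum(p) / len(p)] * EMBED_DIM for p in patches]


def build_input_sequence(patches, token_ids):
    """Concatenate visual and text vectors into one sequence for the LLM.
    Putting the visual vectors first is an assumption of this sketch."""
    return encode_patches(patches) + embed_text(token_ids)


# Made-up inputs: two image patches and a two-token text prompt.
patches = [[0.0, 1.0], [0.5, 0.5, 0.5]]
sequence = build_input_sequence(patches, [2, 0])
```

A real vision encoder is a learned network rather than an average, but the shape of the result is the same: every element of `sequence` is a vector of width `EMBED_DIM`, regardless of whether it came from text or pixels.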

Set-of-Mark (SoM) for Action Grounding (GitHub/Microsoft)

SoM is used to ground actions across all data types. In the image above, SoM prompting enables effective action grounding in images for UI screenshots (left), robot manipulation (middle), and human video (right) by having the model predict numeric marks for clickable buttons or robot arms in image space.
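As a rough illustration of the idea, the sketch below numbers a few candidate regions and converts a predicted mark back into a pixel location. The `Box` type, the helper names, and the coordinates are hypothetical and are not taken from Microsoft's implementation; the model's prediction is simply hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned region in pixel coordinates (hypothetical helper type)."""
    x0: int
    y0: int
    x1: int
    y1: int

    def center(self):
        """Return the (x, y) midpoint of the region."""
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def assign_marks(boxes):
    """Give each candidate region a numeric mark, starting at 1."""
    return {mark: box for mark, box in enumerate(boxes, start=1)}


def mark_to_click(marks, predicted_mark):
    """Turn a predicted mark number back into a pixel location to click."""
    return marks[predicted_mark].center()


# Illustrative values only: three made-up button regions on a screenshot.
buttons = [Box(10, 10, 110, 50), Box(10, 70, 110, 110), Box(200, 10, 300, 50)]
marks = assign_marks(buttons)
click_xy = mark_to_click(marks, 2)  # pretend the model answered "2"
```

The point of the design is that the model only has to output a small integer instead of raw pixel coordinates; the mapping from mark to location is handled outside the model.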

Trace-of-Mark (ToM) for Action Planning (GitHub/Microsoft)

Meanwhile, ToM specifically helps in labelling and understanding movements in videos and robotics data. The image above shows ToM supervision for robot manipulation (left) and human action (right). According to Microsoft, this supervision compels the model to comprehend temporal video dynamics and anticipate future states before acting, while using fewer tokens than next-frame prediction to capture longer temporal horizons and action-related dynamics without ambient distractions.
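The token-saving argument can be made concrete with a small back-of-the-envelope sketch. Everything numeric here is an assumption chosen for illustration: the gripper path is made up, one token per coordinate is a simplification, and 256 tokens per frame is a hypothetical figure, not one reported by Microsoft.

```python
def future_trace(positions, t, horizon):
    """Return one mark's positions for the `horizon` frames after frame t."""
    return positions[t + 1 : t + 1 + horizon]


def trace_token_count(num_marks, horizon, tokens_per_point=2):
    """Tokens needed to write out traces, assuming one token per coordinate."""
    return num_marks * horizon * tokens_per_point


def next_frame_token_count(horizon, tokens_per_frame):
    """Tokens needed to predict every future frame in full."""
    return horizon * tokens_per_frame


# Made-up pixel path of a single mark (e.g. a robot gripper) over five frames.
gripper_path = [(100, 200), (104, 196), (110, 190), (118, 182), (128, 172)]
trace = future_trace(gripper_path, t=0, horizon=3)

# Compare costs for a 3-frame horizon: 4 tracked marks vs. full frames.
trace_cost = trace_token_count(num_marks=4, horizon=3)
frame_cost = next_frame_token_count(horizon=3, tokens_per_frame=256)
```

Under these assumed numbers the traces cost 24 tokens against 768 for full frames, which is the kind of gap that lets a trace-based target cover a longer horizon within the same budget.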

Capabilities of Magma in the real world

Microsoft also presented a zero-shot evaluation of Magma's agentic intelligence, claiming it is the only model that can handle the full spectrum of tasks. In UI navigation, the model performed well on tasks such as checking the weather in a city, turning on flight mode, sharing files, and texting a specific person.

Magma multimodal agentic foundation (GitHub/Microsoft)

In robot manipulation, Microsoft claims that Magma consistently outperformed OpenVLA (finetuning) across soft object manipulation and pick-and-place operations. It demonstrated reliable performance in both in-distribution and out-of-distribution generalisation tasks on real robots, Microsoft says.

In spatial reasoning evaluation, Microsoft says that Magma surpasses GPT-4o, answering spatial reasoning questions relatively well despite relying on much less pretraining data.

In multimodal understanding, Magma performed competitively and even outperformed some state-of-the-art approaches, such as Video-Llama2 and ShareGPT4Video, on most benchmarks, despite using much less video instruction-tuning data.