











There’s a new bit of jargon in the AI world, but it’s more than just a detail. It involves adding a familiar letter to a familiar acronym, and although that may sound trivial, catching up might feel a little like déjà vu.
Do a quick conventional search for “LLMM.” You won’t come up with much, unless you check out the AI overviews, where Gemini in Google or Copilot in Bing tells you what this is.
“MLLM” does a bit better – you might find a result from IBM, some academic papers, and a page from GitHub. But the idea of the Multimodal Large Language Model, or to some, the Large Language Multimodal Model, hasn’t really made it into the mainstream, to places like CNBC or Newsweek. It’s still sort of the province of the true tech geek – for now.
The essential concept of a Multimodal Large Language Model is that it works on different kinds of data, although there’s the implication that it does this through specific kinds of design. PhD researcher and engineer Sebastian Raschka defines the MLLM this way on a self-published platform:
“Multimodal LLMs are large language models capable of processing multiple types of inputs, where each ‘modality’ refers to a specific type of data—such as text (like in traditional LLMs), sound, images, videos, and more.”
If you assume that the machines do this by attaining something like a sophisticated form of distillation, you’d be right. But there’s another component to this, too. In some ways, it sounds like engineers are going back to the well of using classical ML techniques to enhance what an LLM, as a central “brain,” can do.
This starts with attaching sensor tools to the LLM itself, to bring that multimodal data in.
“Recent research shows that Multimodal Large Language Models (MLLMs) can be enhanced with sensory gear (e.g., IoT sensors, wearables, cameras) by using visual prompting to ground them in real-world sensor data,” explains a summary of a paper called “By My Eyes” that’s pioneering this kind of research, where authors write:
“We design a visual prompt that directs MLLMs to utilize visualized sensor data alongside the target sensory task descriptions. Additionally, we introduce a visualization generator that automates the creation of optimal visualizations tailored to a given sensory task, eliminating the need for prior task-specific knowledge.”
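As a rough illustration of the idea in that quote, here is a minimal Python sketch that turns raw sensor readings into a visualization and pairs it with a task description. It is not the paper's implementation: the function names `rasterize` and `build_visual_prompt` are hypothetical, and a small text grid stands in for the rendered image a real MLLM would receive.

```python
from typing import Dict, List


def rasterize(readings: List[float], height: int = 5) -> List[str]:
    """Draw readings as a tiny plot: one column per sample, '#' marks the value."""
    lo, hi = min(readings), max(readings)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a flat signal
    rows = [[" "] * len(readings) for _ in range(height)]
    for x, value in enumerate(readings):
        # Scale the value to a row index, where level 0 is the bottom row.
        level = int((value - lo) / span * (height - 1) + 0.5)
        rows[height - 1 - level][x] = "#"
    return ["".join(row) for row in rows]


def build_visual_prompt(task: str, readings: List[float]) -> Dict[str, object]:
    """Bundle the task description with the visualized sensor data (hypothetical format)."""
    return {"instruction": task, "visualization": rasterize(readings)}


# Example: three accelerometer-style samples rising steadily.
prompt = build_visual_prompt("Is the wearer standing up?", [0.0, 1.0, 2.0])
for line in prompt["visualization"]:
    print(line)
```

The point of the sketch is the pairing: the model is handed a picture of the signal plus a plain-language task, rather than a long list of raw numbers.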
If the traditional token-based LLM approach imitated human writing by scouring the internet and applying prediction models, the new MLLM/LLMM system is able to, in a sense, learn by seeing. It’s not limited to text as an input, or an output. And it’s interactive.
“From a Human Computer Interaction (HCI) and Human Augmentation (HA) perspective, MLLMs also offer various opportunities,” writes Jun Rekimoto in an article maintained at the Association for Computing Machinery’s Digital Library. “If such models can recognize the world in ways similar to humans, a range of applications becomes possible. These include technologies that can record and understand skilled human actions for transfer to others, assess skill development, recognize real-world behaviors to provide personalized assistance and assist individuals with disabilities by augmenting their sensory perception of the environment.”
That said, there’s a lot that MLLMs can do that bypasses traditional inference. That’s especially true when it comes to real-world tasks involving physics. The developer world pondered, for about a year, how to teach LLMs about physics through text, and then the world realized that you could just equip the LLM to see, and teach it that way.
Take the term “feature extraction.”
A model, perhaps a convolutional neural network, can look at an image, analyze it, and extract features to classify and identify what’s in view. Now, you can attach that CNN to an LLM which will then process what the CNN sees and identifies. That’s a powerful combination, and it’s feeding a good deal of research into this kind of build.
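To make that pairing concrete, here is a toy sketch in plain Python, under heavy assumptions: the two edge kernels are hand-picked rather than learned, the projection weights are made up, and the names `extract_features` and `project` are illustrative only. A real system would use a trained vision encoder and a trained projector.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding) 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def extract_features(image: Matrix, kernels: List[Matrix]) -> List[float]:
    """One feature per kernel: ReLU, then global max pooling over the feature map."""
    features = []
    for kernel in kernels:
        fmap = conv2d(image, kernel)
        features.append(max(max(0.0, v) for row in fmap for v in row))
    return features


def project(features: List[float], weights: Matrix) -> List[float]:
    """Linear projector: maps visual features into the LLM's embedding space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


# A 4x4 image that is dark on the left and bright on the right.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
vertical_edge = [[-1.0, 1.0], [-1.0, 1.0]]
horizontal_edge = [[-1.0, -1.0], [1.0, 1.0]]

features = extract_features(image, [vertical_edge, horizontal_edge])
visual_token = project(features, [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]])
print(features)      # [2.0, 0.0]: a vertical edge is present, a horizontal one is not
print(visual_token)  # [1.0, 0.0, 2.0]
```

The resulting vector plays the role of a "visual token" that the language model can read alongside ordinary text tokens.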
Suppose you have a ball bouncing through a room and you want the LLM to “follow the ball.” How do you encode all of that information into the neural net? How do you “show” the model what the ball’s trajectory is like based on real-world physics?
Well, it’s a lot easier if the LLM can see.
Some experts also point out that LLMs equipped this way can capture relational data from the start, eliminating repetitive querying. Some sources estimate that these models can cut floating-point operations (FLOPs) by up to 75%.
Within the realm of MLLM design, there’s more jargon emerging. For example, there’s the idea of token sparsification or compression. Here’s an explanation from a page at GitHub:
“Token compression reduces the number of visual tokens processed by MLLMs while preserving critical cross-modal semantics, enabling more efficient training and faster inference without large accuracy regressions. The field is fragmented across encoders, projectors, and LLM-side techniques; a centralized, searchable resource is needed.”
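One simple flavor of this is score-based pruning: keep only the visual tokens that matter most and drop the rest. The sketch below assumes each token already comes with an importance score (for example, how much attention the text prompt pays to it); the function name `compress_tokens` and the sample scores are hypothetical, and published methods are considerably more sophisticated.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def compress_tokens(tokens: Sequence[T], scores: Sequence[float],
                    keep_ratio: float) -> List[T]:
    """Keep the highest-scoring fraction of tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank token positions by score, take the top k, then restore spatial order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept]


# Eight image-patch tokens, labeled by what they depict, with made-up scores.
patches = ["sky", "sky", "ball", "floor", "wall", "ball_shadow", "wall", "floor"]
scores = [0.01, 0.02, 0.60, 0.05, 0.03, 0.20, 0.04, 0.05]

print(compress_tokens(patches, scores, keep_ratio=0.25))  # ['ball', 'ball_shadow']
```

Here the language model would process two visual tokens instead of eight, which is where the efficiency gain comes from.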
Then there’s structural pruning and knowledge distillation (here’s a paper), in which similar goals apply. Engineers are finding many ways to increase the efficiency of these models. As for attention mechanisms, there’s a lot of work being done on those, too, but maybe that’s another article.
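For readers curious about what knowledge distillation looks like in practice, here is a minimal sketch of the classic soft-target loss, in which a small "student" model is trained to match the output distribution of a larger "teacher." The logits are made-up numbers and the temperature of 2.0 is an arbitrary choice; this is the textbook formulation, not necessarily what any particular MLLM paper uses.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float], student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL divergence from teacher to student on softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 4.0, 1.0]

# A student that mimics the teacher closely gets a smaller loss.
print(distillation_loss(teacher, good_student) < distillation_loss(teacher, poor_student))
```

Minimizing this loss during training nudges the smaller model toward the larger one's behavior, which is how distillation aims to trade size for only a modest loss in quality.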
So although it may look a little like Roman numerals, the MLLM, as a descendant of the LLM, has a lot of potential. You may indeed hear a lot more about it, this year and in the years to come.