The Rise Of The Multimodal LLM
John Werner · 2026-05-23 · via Forbes - Innovation
Illustration of abstract stream. Artificial intelligence. Big data, technology, AI, data transfer, data flow, large language model, generative AI, binary concept

getty

There’s a new bit of jargon in the AI world, and it’s more than just a detail. It adds a familiar letter to a familiar acronym, and although that may sound trivial, catching up on it might feel a little like déjà vu.

Do a quick conventional search for “LLMM.” You won’t come up with much, unless you check out the AI overviews, where Gemini in Google or Copilot in Bing tells you what this is.

“MLLM” does a bit better: you might find a result from IBM, some academic papers, and a page from GitHub. But the idea of the Multimodal Large Language Model, or to some, the Large Language Multimodal Model, hasn’t really made it into the mainstream, to places like CNBC or Newsweek. It’s still sort of the province of the true tech geek, for now.

What Is a Multimodal Large Language Model?

The essential concept of a Multimodal Large Language Model is that it works on different kinds of data, although there’s the implication that it does this through specific kinds of design. PhD researcher and engineer Sebastian Raschka defines the MLLM this way on a self-published platform:

“Multimodal LLMs are large language models capable of processing multiple types of inputs, where each ‘modality’ refers to a specific type of data—such as text (like in traditional LLMs), sound, images, videos, and more.”

If you assume that the machines do this by attaining something like a sophisticated form of distillation, you’d be right. But there’s another component to this, too. In some ways, it sounds like engineers are going back to the well of using classical ML techniques to enhance what an LLM, as a central “brain,” can do.

This starts with attaching sensor tools to the LLM itself, to bring that multimodal data in.

“Recent research shows that Multimodal Large Language Models (MLLMs) can be enhanced with sensory gear (e.g., IoT sensors, wearables, cameras) by using visual prompting to ground them in real-world sensor data,” explains a summary of “By My Eyes,” a paper that’s pioneering this kind of research, and whose authors write:

“We design a visual prompt that directs MLLMs to utilize visualized sensor data alongside the target sensory task descriptions. Additionally, we introduce a visualization generator that automates the creation of optimal visualizations tailored to a given sensory task, eliminating the need for prior task-specific knowledge.”
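To make the idea concrete, here is a minimal sketch of visual prompting, assuming a toy setup: raw sensor readings are drawn as a small plot, and the plot is paired with a task description. The function names, the message format, and the synthetic signal are all hypothetical illustrations, not the paper's implementation.

```python
import math

def render_line_plot(samples, width=40, height=10):
    """Rasterize a 1-D signal into a height x width grid of 0/1 pixels."""
    lo, hi = min(samples), max(samples)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat signal
    grid = [[0] * width for _ in range(height)]
    for x in range(width):
        # Pick the sample that falls under this pixel column.
        idx = x * (len(samples) - 1) // (width - 1)
        # Scale the value to a row index; row 0 is the top of the image.
        y = round((samples[idx] - lo) / span * (height - 1))
        grid[height - 1 - y][x] = 1
    return grid

def build_visual_prompt(image, task_description):
    """Bundle the rendered plot with the task text (hypothetical format)."""
    return {
        "image": image,
        "text": "The image is a plot of sensor readings over time. "
                + task_description,
    }

# Synthetic stand-in for two seconds of accelerometer data at 50 Hz.
signal = [math.sin(2 * math.pi * 2 * t / 50) for t in range(100)]
plot = render_line_plot(signal)
prompt = build_visual_prompt(plot, "Is the wearer walking or standing still?")

# Show the "image" the model would receive, drawn as text.
print("\n".join("".join("#" if p else "." for p in row) for row in plot))
```

In a real system the grid would be an actual image file and the prompt would go to a multimodal model. The point of the sketch is only that the model is shown a picture of the data rather than a long list of numbers.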

The Art of Imitation

If the traditional token-based LLM approach imitated human writing by scouring the internet and applying prediction models, the new MLLM/LLMM system is able to, in a sense, learn by seeing. It’s not limited to text as an input, or an output. And it’s interactive.

“From a Human Computer Interaction (HCI) and Human Augmentation (HA) perspective, MLLMs also offer various opportunities,” writes Jun Rekimoto in an article maintained at the Association for Computing Machinery’s Digital Library. “If such models can recognize the world in ways similar to humans, a range of applications becomes possible. These include technologies that can record and understand skilled human actions for transfer to others, assess skill development, recognize real-world behaviors to provide personalized assistance and assist individuals with disabilities by augmenting their sensory perception of the environment.”

That said, there’s a lot that MLLMs can do that bypasses traditional inference. That’s especially true when it comes to real-world tasks involving physics. The developer world pondered, for about a year, how to teach LLMs about physics through text, and then the world realized that you could just equip the LLM to see, and teach it that way.

Terms from the Aughts

Take the term “feature extraction.”

A model, perhaps a convolutional neural network (CNN), can look at an image, analyze it, and extract features to classify and identify what’s in view. Now, you can attach that CNN to an LLM, which will then process what the CNN sees and identifies. That’s a powerful combination, and it’s feeding a good deal of research into this kind of build.
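As a rough illustration of that pairing, the sketch below runs a single hand-written convolution over a tiny image, cuts the resulting feature map into "visual tokens," and projects them to the same width as a set of text embeddings so that both can be fed to a language model as one sequence. Everything here is a toy assumption: the kernel, the random projection weights, the embedding width, and the stand-in text embeddings are made up for illustration, and a production system would use a trained vision encoder and projector.

```python
import random

def conv2d(image, kernel):
    """Valid 2-D cross-correlation followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(max(0.0, s))
        out.append(row)
    return out

def to_visual_tokens(feature_map, patch=2):
    """Cut the feature map into patch x patch blocks; each block is one token."""
    tokens = []
    for i in range(0, len(feature_map) - patch + 1, patch):
        for j in range(0, len(feature_map[0]) - patch + 1, patch):
            tokens.append([feature_map[i + a][j + b]
                           for a in range(patch) for b in range(patch)])
    return tokens

def project(tokens, weights):
    """Linear projector: map each visual token to the LLM's embedding width."""
    return [[sum(t[k] * weights[k][d] for k in range(len(t)))
             for d in range(len(weights[0]))] for t in tokens]

random.seed(0)
EMBED_DIM = 8  # toy embedding width (assumption)

# 6x6 image with a bright vertical bar; the kernel responds to vertical edges.
image = [[1.0 if c in (2, 3) else 0.0 for c in range(6)] for _ in range(6)]
kernel = [[-1.0, 0.0, 1.0]] * 3

features = conv2d(image, kernel)            # 4x4 feature map
visual_tokens = to_visual_tokens(features)  # 4 tokens of 4 values each
weights = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(4)]
visual_embeds = project(visual_tokens, weights)

# Stand-in text embeddings for a 3-token prompt such as "follow the ball".
text_embeds = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
               for _ in range(3)]

# The LLM would receive visual and text embeddings as one sequence.
llm_input = visual_embeds + text_embeds
print(len(llm_input), "tokens of width", len(llm_input[0]))
```

The design choice worth noticing is the projector: the vision model and the language model do not share a vocabulary, so a small mapping layer translates image features into vectors the language model can treat like word embeddings.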

Suppose you have a ball bouncing through a room and you want the LLM to “follow the ball.” How do you encode all of that information into the neural net? How do you “show” the model what the ball’s trajectory is like based on real-world physics?

Well, it’s a lot easier if the LLM can see.

Some experts also point out that LLMs equipped this way can know more about relational data from the start, eliminating repetitive querying. Some sources estimate that these models can cut floating-point operations (FLOPs) by up to 75%.

More Techniques

Within the realm of MLLM design, there’s more jargon emerging. For example, there’s the idea of token sparsification or compression. Here’s an explanation from a page at GitHub:

“Token compression reduces the number of visual tokens processed by MLLMs while preserving critical cross-modal semantics, enabling more efficient training and faster inference without large accuracy regressions. The field is fragmented across encoders, projectors, and LLM-side techniques; a centralized, searchable resource is needed.”
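One simple flavor of token compression can be sketched as follows, assuming each visual token is a small vector and that relevance is scored by a dot product with a query vector. This is only one illustrative pruning strategy with hypothetical names and toy data. It is not the method of any particular paper or of the GitHub resource quoted above.

```python
def compress_tokens(visual_tokens, query, keep_ratio=0.25):
    """Keep the visual tokens most similar to the query, in original order."""
    # Relevance score: dot product between each token and the query vector.
    scores = [sum(v * q for v, q in zip(tok, query)) for tok in visual_tokens]
    keep = max(1, int(len(visual_tokens) * keep_ratio))
    # Indices of the highest-scoring tokens.
    top = sorted(range(len(visual_tokens)),
                 key=lambda i: scores[i], reverse=True)[:keep]
    # Restore the original ordering so positional structure is preserved.
    return [visual_tokens[i] for i in sorted(top)]

# 16 toy tokens laid out as (x, y) coordinates on a 4x4 grid.
tokens = [[float(i % 4), float(i // 4)] for i in range(16)]
query = [1.0, 1.0]  # hypothetical query that favors the far corner

kept = compress_tokens(tokens, query)
print(len(tokens), "->", len(kept))
```

With `keep_ratio=0.25`, the language model sees a quarter as many visual tokens. Real systems choose what to keep far more carefully, since dropping the wrong tokens is what causes the accuracy regressions the quote mentions.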

Then there’s structural pruning and knowledge distillation, in which similar goals apply. Engineers are finding many ways to increase the efficiency of these models. As for attention mechanisms, there’s a lot of work being done on that, too, but maybe that’s another article.

So although it may look a little like Roman numerals, the MLLM, as a descendant of the LLM, has a lot of potential. You may indeed hear a lot more about it, this year and in the years to come.