























Two-year-old startup Mindbeam AI Inc. today released an open-source artificial intelligence inference framework designed to make large language models run more efficiently on standard consumer processors, a move the company says could reduce reliance on expensive graphics processing units for some AI workloads.
Litespark-Inference is a software library that enables ternary large language models to run on central processing units from Apple Inc., Intel Corp., Advanced Micro Devices Inc. and Arm Holdings plc with significantly improved performance compared with conventional CPU-based inference. The company published benchmarks showing that the framework delivers throughput improvements ranging from 17- to 96-fold over standard PyTorch implementations while reducing memory requirements by more than 80%.
Mindbeam, whose Litespark LLM pretraining frameworks accelerate training and inference workloads for generative AI applications, focuses on a class of neural networks known as ternary models. These models constrain each weight to one of three values: -1, 0 or +1. That drastically reduces the overhead of the large multiplication operations normally required during inference, at the cost of some precision.
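To illustrate why ternary weights reduce multiplication overhead, here is a minimal Python sketch of a matrix-vector product in which every weight is -1, 0 or +1. It is not taken from Litespark-Inference; the function names and sample values are illustrative assumptions. Because each weight can only add, subtract or skip an input, the inner loop needs no multiplications at all.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary weight matrix by a vector using only adds and subtracts.

    Illustrative sketch only: `weights` is a list of rows whose entries are
    -1, 0 or +1, and `x` is a list of floats. Not the Litespark kernel.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the input
            elif w == -1:
                acc -= xi      # -1 weight: subtract the input
            # 0 weight: the input is skipped entirely
        out.append(acc)
    return out


def dense_matvec(weights, x):
    """Reference implementation using ordinary multiplications."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Hypothetical sample data for the comparison.
W = [[1, 0, -1],
     [-1, 1, 0]]
x = [0.5, 2.0, -1.5]

assert ternary_matvec(W, x) == dense_matvec(W, x)
```

Both functions return the same result for ternary inputs; the difference is that the first replaces every multiply with a conditional add or subtract, which is the property optimized kernels can exploit.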
“We think from a different perspective,” said founder and Chief Executive Nii Osae. “Is there a way that we can do inference with ternary bit models?”
The release comes as the cost of using tokens in AI inference is climbing and organizations are searching for ways to lower the cost of deploying models, particularly in memory-constrained edge use cases. Most LLM inference today relies on GPUs, which are expensive and in short supply. Mindbeam argues that CPUs, which sit alongside GPUs in virtually every AI system, are an underutilized resource.
“In the inference pipeline, inputs come from the user, go to the CPU first and then to the GPU,” Osae said. “The CPU is just passing the messages. Why can’t we place the CPU in the inference stack?”
The company emphasized that it’s not attempting to replace GPUs. Instead, it sees CPUs as complementary accelerators that can improve overall system efficiency. “Now GPUs can process more tokens because they’re having extra help from CPUs,” Osae said.
The software supports two deployment models. One enables AI developers to run language models entirely on local hardware without requiring GPUs. Another is aimed at cloud providers, where CPUs and GPUs work together in a disaggregated inference architecture.
According to the company’s benchmarks, an Apple M5 processor running the framework achieved nearly 40 tokens per second, compared with about 2.3 tokens per second using PyTorch, a popular open-source framework used to build, train and deploy neural networks.
On systems supporting Intel’s AVX-512 Vector Neural Network Instructions, a dedicated set of CPU instructions designed to accelerate deep learning and machine learning inference, throughput reached nearly 34 tokens per second, representing a reported 96-fold improvement over a baseline without the ternary enhancement. Memory consumption fell from roughly 4.6 gigabytes to less than 800 megabytes.
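The article does not describe how Litespark-Inference stores weights. One common way to shrink ternary models, shown here purely as an assumption, is to pack each weight into two bits so that four weights share a byte. The Python sketch below uses hypothetical helper names to pack and unpack such weights, then compares the result with 16-bit storage and with the memory figures reported above.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, four weights per byte.

    Assumed encoding for illustration: -1 -> 0b00, 0 -> 0b01, +1 -> 0b10.
    """
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)   # two bits per weight
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    out = []
    for byte in packed:
        for j in range(4):
            out.append(((byte >> (2 * j)) & 0b11) - 1)
    return out[:count]


weights = [1, -1, 0, 0, 1, 1, -1, 0]          # hypothetical sample weights
packed = pack_ternary(weights)
assert unpack_ternary(packed, len(weights)) == weights

# Eight weights at 16 bits each take 16 bytes; packed, they take 2 bytes.
fp16_bytes = len(weights) * 2
packing_saving = 1 - len(packed) / fp16_bytes  # 0.875 for this example

# Reduction implied by the figures reported in the article (GB), treating
# "less than 800 megabytes" as an upper bound of 0.8 GB.
reported_saving = 1 - 0.8 / 4.6
print(f"packing saving: {packing_saving:.1%}, reported saving: at least {reported_saving:.1%}")
```

Under these assumptions, two-bit packing alone saves 87.5% relative to 16-bit weights, which is in the same range as the reported reduction of more than 80%. A real model also holds activations and other data that are not ternary, so the two numbers are not expected to match exactly.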
Mindbeam is publishing the source code at https://github.com/Mindbeam-AI/LitesparkInference and encouraging others to perform their own benchmarks.
The framework takes advantage of specialized single instruction, multiple data (SIMD) instructions available in modern processors, including Arm’s NEON SDOT dot-product instruction and the vector neural network instructions in Intel and AMD chips. SIMD is a processor architecture and programming technique that allows a single CPU instruction to perform the same operation on multiple pieces of data simultaneously. Mindbeam developed custom kernels that automatically detect available processor features and optimize execution accordingly.
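The idea of detecting processor features and choosing a matching kernel can be sketched as a simple dispatch table. The Python example below is a hedged illustration, not Mindbeam’s code: the kernel names are hypothetical, and the feature strings are assumed to follow the flag names Linux reports in /proc/cpuinfo (`avx512_vnni`, `avx_vnni`, and `asimddp` for the Arm dot-product extension).

```python
# Hypothetical kernel names, ordered from most to least specialized.
# The flag strings are assumed to match Linux /proc/cpuinfo naming.
KERNEL_PREFERENCE = [
    ("avx512_vnni", "ternary_avx512_vnni"),  # Intel/AMD AVX-512 VNNI
    ("avx_vnni", "ternary_avx_vnni"),        # 256-bit VNNI variant
    ("asimddp", "ternary_neon_sdot"),        # Arm NEON dot product (SDOT)
]
FALLBACK_KERNEL = "ternary_portable"         # plain scalar implementation


def select_kernel(cpu_flags):
    """Return the name of the best kernel for the given CPU feature flags."""
    flags = set(cpu_flags)
    for flag, kernel in KERNEL_PREFERENCE:
        if flag in flags:
            return kernel
    return FALLBACK_KERNEL


# Usage with hypothetical flag sets:
print(select_kernel({"sse2", "avx2", "avx512_vnni", "avx_vnni"}))
print(select_kernel(["asimd", "asimddp"]))
print(select_kernel([]))
```

Ordering the table from most to least capable means a processor that supports several instruction sets gets the widest one, while unknown hardware falls back to a portable path.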
Osae said the initial release supports Apple Silicon, Intel and AMD processors, with future versions targeting cloud-specific hardware such as Amazon Web Services Inc.’s Inferentia chips.
In the future, the company plans to extend the technology beyond language models, with power-sensitive robotics and edge computing applications being primary targets. “We’re targeting action models for robotics because robotics and edge ecosystems need very efficient energy-saving models for inference,” Osae said.
He said Mindbeam intends to commercialize cloud-focused versions of the technology later this year.