Accelerate search queries with full-text search indexes on Databricks
Yu Xu · 2026-06-16 · via Databricks

Full-text search indexes can accelerate queries by 100x or more on open format tables, without changing table layouts

by Yu Xu, Yingyi Bu and Ivan Vezilić

Every data team hits the same wall as tables grow to hundreds of gigabytes, terabytes, and beyond: text search queries become painfully slow. Searching for an error message across terabytes or petabytes of application logs, finding a suspicious IP address in security data, or locating specific content in a compliance dataset can all end up scanning far more data than necessary, which makes fast, targeted lookups hard at scale. Today, we’re excited to announce a solution: full-text search indexes are available in Beta on Databricks Runtime 18.2.

Teams are often forced into workarounds: maintaining duplicate tables, building external search systems like Elasticsearch or Splunk, or over-engineering table layouts to optimize for one query pattern at the expense of others.

What are full-text search indexes?

A full-text search index accelerates substring and keyword queries across text columns. Once you create one, the query engine uses it automatically, so searches that used to scan an entire table read only a small fraction of it. (We cover how this works below.)

Full-text search indexes are ideal for high-cardinality lookups and for searching across many text columns at once. Common examples include:

  • Log analytics and SIEM (Security Information and Event Management): Searching for error messages, IP addresses, or suspicious patterns across terabytes of security and application logs.
  • Trust and Safety investigations: Finding specific content across massive content moderation datasets.
  • Compliance auditing: Locating records containing specific terms in regulatory reporting data.

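Getting started takes a single SQL statement that creates the index; the exact syntax is in the full-text search indexes documentation. Once the index exists, queries are accelerated automatically: the query engine discovers the search index and uses it to skip the vast majority of files, often speeding up queries by orders of magnitude.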

How it works under the hood

Full-text search indexes are stored separately from the base table. When you create an index, Databricks tokenizes the text content and builds an internal index: a compact lookup structure that maps tokens to the matching rows. At query time, the engine consults this index to identify which files might contain matching rows, then reads only those files.
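
To make the idea concrete, here is a minimal Python sketch of file-level pruning with an inverted index. It is an illustration only, not the Databricks implementation: the `FILES` table, the tokenizer, and the `build_index` and `search` helpers are all hypothetical, the index maps tokens to files rather than rows, and it handles whole-keyword lookups only (not substrings).

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a table: file name -> rows of text stored in that file.
FILES = {
    "part-0": ["user login ok", "disk quota warning"],
    "part-1": ["connection timeout from 10.0.0.7", "user logout"],
    "part-2": ["payment accepted", "connection reset by peer"],
}

def tokenize(text):
    """Lowercase the text and split it into tokens (dots are kept so IPs stay whole)."""
    return re.findall(r"[a-z0-9.]+", text.lower())

def build_index(files):
    """Build an inverted index that maps each token to the files containing it."""
    index = defaultdict(set)
    for name, rows in files.items():
        for row in rows:
            for token in tokenize(row):
                index[token].add(name)
    return index

def search(files, index, keyword):
    """Return (matching rows, files read), reading only the candidate files."""
    keyword = keyword.lower()
    candidates = sorted(index.get(keyword, set()))
    matches = [
        row
        for name in candidates
        for row in files[name]
        if keyword in tokenize(row)
    ]
    return matches, candidates

index = build_index(FILES)
matches, files_read = search(FILES, index, "timeout")
print(matches, files_read)  # ['connection timeout from 10.0.0.7'] ['part-1']
```

In this toy table, the lookup for "timeout" reads one file out of three; the other two are never opened.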

This architecture delivers several key advantages:

  • Zero impact on write performance: Indexes are maintained asynchronously. Writing to the base table is never slowed down by indexing.
  • Automatic query optimization: The Databricks query engine evaluates available indexes and selects the best access path, with no query hints required.
  • Correctness guaranteed: Even when an index is stale (behind the base table), query correctness is preserved. Databricks scans both indexed and non-indexed portions of the table as needed, so results are always complete and accurate.
  • Works with Delta and Iceberg: Full-text search indexes support Unity Catalog managed Delta and Iceberg tables on both serverless and classic compute.
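
The correctness guarantee for stale indexes can be sketched as follows. This is a simplified model under stated assumptions, not the engine's actual logic: `files`, `indexed_files`, `index`, and `search_with_stale_index` are hypothetical names, and the index is a plain token-to-files dictionary. Files the index covers are pruned through the index, while files written after the last refresh are always scanned.

```python
# Hypothetical table state: file name -> rows.
# "part-2" was written after the last index refresh.
files = {
    "part-0": ["user login ok"],
    "part-1": ["connection timeout"],
    "part-2": ["timeout while reading block"],
}

# The stale index only covers part-0 and part-1.
indexed_files = {"part-0", "part-1"}
index = {
    "user": {"part-0"}, "login": {"part-0"}, "ok": {"part-0"},
    "connection": {"part-1"}, "timeout": {"part-1"},
}

def search_with_stale_index(files, index, indexed_files, keyword):
    """Keyword search that stays complete even when the index lags the table."""
    keyword = keyword.lower()
    # Indexed portion: trust the index to tell us which files can match.
    from_index = index.get(keyword, set()) & indexed_files
    # Non-indexed portion: nothing is known about these files, so scan them all.
    unindexed = set(files) - indexed_files
    to_read = sorted(from_index | unindexed)
    matches = [
        row
        for name in to_read
        for row in files[name]
        if keyword in row.lower().split()
    ]
    return matches, to_read

matches, files_read = search_with_stale_index(files, index, indexed_files, "timeout")
print(matches)     # ['connection timeout', 'timeout while reading block']
print(files_read)  # ['part-1', 'part-2']
```

The row in `part-2` is still found even though the index has never seen that file, and `part-0` is still skipped.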

How do full-text search indexes relate to Liquid Clustering?

Liquid clustering and full-text search indexes solve different problems. Liquid clustering organizes your data physically so queries that filter on the clustering key can skip large chunks of data efficiently. Clustering helps with equality and range filters on column values, but it cannot help locate a substring or keyword within a field.

Full-text search indexes, by contrast, search within the text of column values, enabling fast substring and keyword lookups. This means full-text search indexes speed up queries even on columns you've already clustered on, because clustering alone cannot find a match within a field's content.

In short, liquid clustering optimizes filtering by column values; full-text search indexes optimize searching within column values. They complement each other and work together on the same table.
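
A small Python sketch illustrates the difference. It assumes a simplified model in which data skipping relies on per-file minimum and maximum values of the clustered column; the `FILE_STATS` data and both helper functions are hypothetical.

```python
# Hypothetical per-file (min, max) statistics for a clustered column `host`.
FILE_STATS = {
    "part-0": ("api-01", "api-20"),
    "part-1": ("db-01", "db-09"),
    "part-2": ("web-01", "web-40"),
}

def files_for_equality(stats, value):
    """Range pruning: keep only files whose [min, max] range could contain `value`."""
    return sorted(name for name, (lo, hi) in stats.items() if lo <= value <= hi)

def files_for_substring(stats, needle):
    """A substring can sit anywhere inside a value, so min/max ranges rule out no file."""
    return sorted(stats)

print(files_for_equality(FILE_STATS, "db-05"))  # ['part-1']
print(files_for_substring(FILE_STATS, "-0"))    # ['part-0', 'part-1', 'part-2']
```

The equality filter prunes two of the three files, while the substring filter cannot prune any. That gap is what a full-text search index is meant to close.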

Customer performance results

A Trust and Safety team adopted full-text search indexes for investigations over a petabyte-scale table.

A substring search that previously had to scan the entire table now runs more than 100x faster, making interactive investigations practical for the first time.

We are actively working on more efficient index layouts and other optimizations to deliver even greater speedups in upcoming releases.

Getting started

Full-text search indexes are available in Beta on Databricks Runtime 18.2. To get started, see the full-text search indexes documentation.

What’s next

The next milestone for full-text search indexes is Databricks Runtime 18.3, where you can expect:

  • Full Unity Catalog integration with automatic permission inheritance.
  • Automatic maintenance via Predictive Optimization: No more manual REFRESH INDEX. Your indexes are kept up to date automatically.

We’re eager to hear your feedback during the Beta period. Try full-text search indexes on your workloads today and help us shape the Public Preview release.