Accelerate search queries with full-text search indexes on Databricks
Yu Xu · 2026-06-16 · via Databricks

Full-text search indexes can accelerate queries by 100x or more on open format tables, without changing table layouts

by Yu Xu, Yingyi Bu and Ivan Vezilić

Every data team hits the same wall as tables grow to hundreds of gigabytes, terabytes, and beyond: text search queries become painfully slow. That holds whether you are searching for an error message across terabytes or petabytes of application logs, finding a suspicious IP address in security data, or locating specific content in a compliance dataset. These queries can end up scanning far more data than necessary, which makes fast, targeted lookups hard at scale. Today, we’re excited to announce a solution: full-text search indexes are available in Beta on Databricks Runtime 18.2.

Teams are often forced into workarounds: maintaining duplicate tables, building external search systems like Elasticsearch or Splunk, or over-engineering table layouts to optimize for one query pattern at the expense of others.

What are full-text search indexes?

A full-text search index accelerates substring and keyword queries across text columns. Once you create one, the query engine uses it automatically, so searches that used to scan an entire table read only a small fraction of it. (We cover how this works below.)

Full-text search indexes are ideal for high-cardinality lookups and for searching across many text columns at once. Common examples include:

  • Log analytics and SIEM (Security Information and Event Management): Searching for error messages, IP addresses, or suspicious patterns across terabytes of security and application logs.
  • Trust and Safety investigations: Finding specific content across massive content moderation datasets.
  • Compliance auditing: Locating records containing specific terms in regulatory reporting data.

Getting started takes a single SQL statement that creates the index. After that, queries are accelerated automatically: the query engine discovers the search index and uses it to skip the vast majority of files, often accelerating queries by orders of magnitude.

How it works under the hood

Full-text search indexes are stored separately from the base table. When you create an index, Databricks tokenizes the text content and builds an internal index: a compact lookup structure that maps tokens to the matching rows. At query time, the engine consults this index to identify which files might contain matching rows, then reads only those files.
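The lookup path described above can be illustrated with a toy inverted index. This is a minimal sketch, not Databricks' implementation: the file names, the simple tokenizer, and the `build_index` and `search` helpers are all hypothetical, and the index maps tokens to files rather than to rows to keep the example short.

```python
import re
from collections import defaultdict

# Hypothetical data files of a table, each holding a few log lines.
FILES = {
    "part-0": ["user login ok", "disk quota warning"],
    "part-1": ["connection timeout from 10.0.0.7", "user logout"],
    "part-2": ["payment accepted", "cache refresh done"],
}

def tokenize(text):
    # Assumed tokenizer: lowercase, then split on anything that is not
    # a letter, digit, or dot (dots are kept so IP addresses stay whole).
    return [t for t in re.split(r"[^a-z0-9.]+", text.lower()) if t]

def build_index(files):
    # Map each token to the set of files that contain it.
    index = defaultdict(set)
    for name, rows in files.items():
        for row in rows:
            for token in tokenize(row):
                index[token].add(name)
    return index

def search(files, index, keyword):
    # Consult the index first, then read only the candidate files.
    keyword = keyword.lower()
    files_read = sorted(index.get(keyword, set()))
    matches = [
        row
        for name in files_read
        for row in files[name]
        if keyword in tokenize(row)
    ]
    return matches, files_read

INDEX = build_index(FILES)
matches, files_read = search(FILES, INDEX, "timeout")
print(matches)     # rows containing the keyword
print(files_read)  # the only files that had to be opened
```

In this toy table the keyword lookup opens one file out of three; the other two are skipped without being read, which is the effect the index has at much larger scale.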

This architecture delivers several key advantages:

  • Zero impact on write performance: Indexes are maintained asynchronously. Writing to the base table is never slowed down by indexing.
  • Automatic query optimization: The Databricks query engine evaluates available indexes and selects the best access path, with no query hints required.
  • Correctness guaranteed: Even when an index is stale (behind the base table), query correctness is preserved. Databricks scans both indexed and non-indexed portions of the table as needed, so results are always complete and accurate.
  • Works with Delta and Iceberg: Full-text search indexes support Unity Catalog managed Delta and Iceberg tables on both serverless and classic compute.
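The correctness guarantee for a stale index can be sketched in a few lines. This is an illustrative model under stated assumptions, not the actual engine logic: the index is assumed to record which files it covers (`INDEXED_FILES`, a hypothetical name), and any file written after the last refresh is scanned directly.

```python
# Hypothetical table: part-2 was written after the index was last refreshed.
FILES = {
    "part-0": ["error: disk full", "ok"],
    "part-1": ["ok", "ok"],
    "part-2": ["error: timeout", "ok"],
}

# Stale index: it covers only part-0 and part-1.
INDEXED_FILES = {"part-0", "part-1"}
TOKEN_TO_FILES = {
    "error": {"part-0"},
    "disk": {"part-0"},
    "full": {"part-0"},
    "ok": {"part-0", "part-1"},
}

def files_to_read(keyword):
    # Indexed files are pruned through the index; files the index has not
    # seen yet cannot be ruled out, so they are always included.
    from_index = TOKEN_TO_FILES.get(keyword, set())
    not_indexed = set(FILES) - INDEXED_FILES
    return from_index | not_indexed

def search(keyword):
    hits = []
    for name in sorted(files_to_read(keyword)):
        hits += [row for row in FILES[name] if keyword in row]
    return hits

print(sorted(files_to_read("error")))
print(search("error"))
```

The stale index still lets the query skip `part-1`, while the unindexed `part-2` is scanned in full, so the newer matching row is not lost.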

How do full-text search indexes relate to Liquid Clustering?

Liquid clustering and full-text search indexes solve different problems. Liquid clustering organizes your data physically so queries that filter on the clustering key can skip large chunks of data efficiently. Clustering helps with equality and range filters on column values, but it cannot help locate a substring or keyword within a field.

Full-text search indexes, by contrast, search within the text of column values, enabling fast substring and keyword lookups. This means full-text search indexes speed up queries even on columns you've already clustered on, because clustering alone cannot find a match within a field's content.

In short, liquid clustering optimizes filtering by column values; full-text search indexes optimize searching within column values. They complement each other and work together on the same table.
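To make the distinction concrete, here is a small sketch. It assumes that clustering enables file skipping through per-file minimum and maximum values of the clustering key; the file layout, the `host` column, and the helper names are hypothetical, and real engines keep richer statistics.

```python
from collections import defaultdict

# Hypothetical files clustered on `host`, each with min/max statistics.
FILES = [
    {"name": "part-0", "min": "api-01", "max": "api-09",
     "hosts": ["api-01", "api-04", "api-09"]},
    {"name": "part-1", "min": "db-01", "max": "db-07",
     "hosts": ["db-01", "db-03", "db-07"]},
    {"name": "part-2", "min": "web-01", "max": "web-05",
     "hosts": ["web-01", "web-02", "web-05"]},
]

def prune_equality(files, value):
    # A file can contain `value` only if min <= value <= max.
    return [f["name"] for f in files if f["min"] <= value <= f["max"]]

def prune_substring(files, needle):
    # Min/max bounds say nothing about text inside a value,
    # so no file can be ruled out for a substring filter.
    return [f["name"] for f in files]

def build_token_index(files):
    # Assumed tokenizer: split host names on "-".
    index = defaultdict(set)
    for f in files:
        for host in f["hosts"]:
            for token in host.split("-"):
                index[token].add(f["name"])
    return index

TOKEN_INDEX = build_token_index(FILES)

print(prune_equality(FILES, "db-03"))   # clustering prunes by value
print(prune_substring(FILES, "03"))     # clustering cannot prune here
print(sorted(TOKEN_INDEX["03"]))        # a token index can
```

The equality filter and the token lookup each narrow the read to a single file, but they answer different questions, which is why the two features can coexist on the same column.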

Customer performance results

A Trust and Safety team adopted full-text search indexes for investigations over a petabyte-scale table.

A substring search that previously had to scan the entire table now runs more than 100x faster, making interactive investigations practical for the first time.

We are actively working on more efficient index layouts and other optimizations to deliver even greater speedups in upcoming releases.

Getting started

Full-text search indexes are available in Beta on Databricks Runtime 18.2. To get started, see the full-text search indexes documentation.

What’s next

The next milestone for full-text search indexes is Databricks Runtime 18.3, where you can expect:

  • Full Unity Catalog integration with automatic permission inheritance.
  • Automatic maintenance via Predictive Optimization: No more manual REFRESH INDEX. Your indexes are kept up to date automatically.

We’re eager to hear your feedback during the Beta period. Try full-text search indexes on your workloads today and help us shape the Public Preview release.