How to Give Claude Access to Snowflake Without Exposing PII

You want Claude — or Cursor, or ChatGPT, or any MCP-aware agent — to answer questions about your Snowflake data. You also do not want the agent to read social security numbers, free-text customer notes, or anything subject to GDPR / HIPAA / SOC 2. The default MCP setup hands the agent everything its connection role can see. That is the problem.

This post walks through five layers of defense, ordered from cheapest to most thorough. Each is independent — pick the ones that match your risk tolerance. The whole stack takes roughly an hour to set up on an existing Snowflake account.

The Default Posture (and Why It Is Wrong)

A typical MCP server for Snowflake — including the official one — connects with a service account, exposes a query tool, and lets the model run any SQL the role can run. That role is usually scoped to a warehouse and a database, but rarely to columns or row sets. The model gets a fluent SQL interface to your warehouse and the warehouse trusts every query it sees.

The blast radius is large. According to the 2025 IBM Cost of a Data Breach Report, the average cost of a data breach hit $4.88M, with breaches involving extensive cloud data exposure costing 23% more than average. Letting an AI agent run uncurated queries against a production warehouse is exactly the cloud-data-exposure category that drives the premium.

Layer 1: A Dedicated MCP Role

First step, every time: create a role that exists only for the agent. Do not reuse the analytics role, do not reuse the dbt role, and definitely do not use SYSADMIN.

Grant USAGE on the warehouse you want the agent to use. Use a small, dedicated warehouse (X-Small or Small) so a runaway query has a bounded cost ceiling.
Grant USAGE on the database and the specific schemas the agent should see.
Grant SELECT on the specific views the agent should query — not raw tables. Views give you a place to apply masking, filters, and joins without modifying the underlying data.
Never grant CREATE, INSERT, UPDATE, DELETE, or TRUNCATE. The agent is a read-only role.

A read-only role with view-only SELECT grants is roughly 80% of what most teams need. The remaining 20% is where the PII risk actually lives.

Layer 2: Column-Level Masking Policies

Snowflake supports masking policies that fire based on the executing role. The same SELECT statement returns the raw value for an analyst role and a masked value for the agent role. This is the single most important PII control because it does not depend on the agent or the MCP server behaving correctly.

A masking policy that returns SHA2(email) for any role except ANALYTICS_HUMAN means even if the model is jailbroken into producing a SELECT * query, it gets hashes, not addresses. The policy is enforced at the SQL engine layer, not at the application layer.

Apply masking policies to every column tagged as PII. If you do not have PII tags yet, an audit tool (or the Data Workers governance agent) can scan the schema and tag candidate columns automatically — emails, phone numbers, SSNs, free-text columns, IP addresses, dates of birth.

Layer 3: Row Access Policies

Masking hides values. Row access policies hide entire rows. For multi-tenant data — or any case where the agent should see only one customer's, one region's, or one fiscal-year's data — row access policies are the right primitive.

Common patterns: scope the agent role to the last 90 days of data, exclude rows tagged sensitive = true, restrict to a specific tenant_id. Like masking policies, these are enforced inside the engine — no application-layer code can bypass them.

Layer 4: Audit Logging

Every query the agent runs should be auditable for at least 30 days. Snowflake's QUERY_HISTORY view is the source of truth — it includes the SQL text, the executing role, the start and end times, and the rows returned. Pipe it into your SIEM (Datadog, Splunk, S3+Athena) so you can answer 'what did the agent see last week' without writing custom code.

Tag every agent-driven query with a comment header (e.g., /* mcp_agent=data_workers, session=abc123 */) so you can filter QUERY_HISTORY trivially.
Set up an alert for any agent query that returns more than 10,000 rows. That is almost never the intended behavior.
Set up a hard query timeout on the agent's warehouse (try 60 seconds to start). Runaway agents are cheap when they cannot run for 30 minutes.

Layer 5: Schema-Aware Catalog as a Guardrail

The most subtle PII leak is the one that comes from the agent picking the wrong table. The agent does not know that customers_legacy was deprecated in 2024 but never deleted. It does not know that orders_raw has unredacted payment data but orders has the cleaned version. Without a catalog, the agent picks whichever table sounds right.

A data catalog that the agent reads before writing SQL solves this. The agent asks the catalog: 'Where is order data?' and the catalog responds with the governed view, the ownership, the freshness, and the PII tags. The agent never sees the legacy table because the catalog never surfaces it.

This is exactly what Data Workers' Catalog Agent does. It exposes catalog discovery as MCP tools, so when Claude queries it for 'order data', it gets the governed answer — same response shape, same masking policies applied. The catalog itself enforces what the agent can see.

What Each Layer Buys You

Layer	Defends Against	Setup Time	Production Impact
Dedicated MCP role	Privilege escalation	10 min	None
Column masking	PII column exfiltration	20 min per table	<1ms per query
Row access policies	Tenant / scope leakage	30 min per table	<5ms per query
Audit logging	Detection after the fact	1 hr (with SIEM)	Storage cost
Catalog guardrail	Wrong-table selection	1 day to wire MCP	Adds 1 round-trip

Frequently Asked Questions

Do these controls work with ChatGPT and Cursor too, or just Claude? Yes. All of these are Snowflake-side controls. They apply regardless of which MCP client is connecting — Claude, Cursor, OpenClaw, ChatGPT (via remote MCP), or a custom agent.

What about BigQuery and Databricks? Same five layers. BigQuery has authorized views and column-level access controls; Databricks has Unity Catalog row filters and column masks. The naming differs, the pattern does not.

Will masking break joins or aggregations? Masking policies preserve datatype, so JOIN and GROUP BY still work — they just operate on the masked value. For HASH-based masks, this means 'group by hashed email' still gives you per-customer counts.

How do I know if my current MCP server is leaking PII? Run a query against QUERY_HISTORY filtered to your agent role, look at the SQL text, and check whether any of those queries select columns that should have been masked. If you cannot tell whether a column should be masked, you do not have PII tagging yet — start there.

The Data Workers governance agent ships a pii_audit_snowflake tool that does the scan, the tag, and the policy generation in one call. It is open source. The point of this post, though, is that you do not need it — five SQL-level controls and an hour of work close the largest part of the gap. The catalog guardrail is the icing.

If you are running into specific issues setting this up, we keep notes on what works in our open-source repo at github.com/DataWorkersProject/dataworkers-claw-community. Issues and PRs welcome.

Originally published at https://dataworkers.io/blog/claude-snowflake-without-pii-exposure/. Data Workers is an open-source autonomous agent swarm for data engineering — see the repo.

推荐订阅源

DEV Community