
Stanislav’s Big Data Stream

meet your new data lakehouse: S3 Iceberg Tables
Stanislav Kozlovski · 2024-12-05 · via Stanislav’s Big Data Stream

Reading time: 13 minutes.

  • S3 Tables is a new S3 Bucket Type, where anything you write via the S3 API gets translated into Parquet files and organized under the Iceberg open table format.

    • It offers additional functionality in maintaining/optimizing the tables, and in regular Iceberg fashion - is expected to be open to use from many query engines.

    • It is around 37% more expensive than S3 Standard.

  • S3 Metadata is a new automagic way to store metadata of any S3 bucket’s objects.

    • Once enabled, the system automatically stores and maintains (!) the metadata in an S3 Table (Iceberg table), in near real time.

    • This allows you to more easily leverage modern query engines to analyze/visualize/process your data lake’s metadata.

  • Talk about Iceberg Catalogs, the open table format war and project neutrality

Stan: Hey! Welcome to the first edition of Big Data Stream! I’m excited to get this newsletter going.

Today, we will be diving into the seemingly-endless topic of Apache Iceberg and by extension - S3.

S3 announced two major features this past re:Invent.

  • S3 Tables

  • S3 Metadata

Let’s dive into it.


This is first-class Apache Iceberg support in S3.

You use the S3 API, and behind the scenes it stores your data into Parquet files under the Iceberg table format. That’s it.

It’s an S3 Bucket type, of which there were only 2 previously:

  1. S3 General Purpose Bucket - the usual, replicated S3 buckets we are all used to

  2. S3 Directory Buckets - these are single-zone buckets (non-replicated).

    1. They also have a hierarchical structure (file-system directory-like) as opposed to the usual flat structure we’re used to.

    2. They were released alongside the Single Zone Express low-latency storage class in 2023

  3. new: S3 Tables (2024)

AWS is clearly trending toward releasing more specialized bucket types.

The “managed Iceberg service” acts a lot like an Iceberg catalog:

  • single source of truth for metadata

  • automated table maintenance via:

    • compaction - combines small table objects into larger ones

    • snapshot management - first expires, then later deletes old table snapshots

    • unreferenced file removal - deletes stale objects that are orphaned

  • table-level RBAC via AWS’ existing IAM policies

  • single source of truth and place of enforcement for security (access controls, etc)

While these sound somewhat basic, they are all very useful.

AWS is quoting massive performance advantages:

  • 3x faster query performance

  • 10x more transactions per second (tps)

This is quoted in comparison to you rolling out Iceberg tables in S3 yourself.

I haven’t tested this personally, but it sounds possible if the underlying hardware is optimized for it.

If true, this gives AWS a structural advantage that’s impossible to beat - so vendors will be forced to build on top of it.

Out of the box, it works with open source Apache Spark.

And with proprietary AWS services (Athena, Redshift, EMR, etc.) via a few-clicks AWS Glue integration.

There is this very nice demo from Roy Hasson on LinkedIn that goes through the process of working with S3 Tables through Spark. It basically integrates directly with Spark so that you run `CREATE TABLE` in the system of choice, and an underlying S3 Tables bucket gets created under the hood.
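To make the Spark integration a bit more concrete, here is a minimal sketch of the session configuration involved. The catalog class names are the ones I understand the AWS-provided Iceberg catalog to use, and the catalog name, account ID and bucket name in the ARN are placeholders, so treat every value as an assumption to verify against the current AWS documentation.

```python
def s3_tables_spark_conf(catalog_name: str, table_bucket_arn: str) -> dict:
    """Build the Spark settings that point an Iceberg catalog at a table bucket."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Standard Iceberg Spark catalog wrapper
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Assumed class name of the S3 Tables catalog implementation
        f"{prefix}.catalog-impl": "software.amazon.s3tables.iceberg.S3TablesCatalog",
        # The "warehouse" is the ARN of the table bucket
        f"{prefix}.warehouse": table_bucket_arn,
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
    }


conf = s3_tables_spark_conf(
    "s3tablesbucket",  # hypothetical catalog name
    "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket",  # placeholder ARN
)
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The printed flags are meant to be passed to `spark-shell` or `spark-submit` together with the Iceberg runtime and the S3 Tables catalog jars; after that, `CREATE TABLE` statements issued against that catalog name go through the S3 Tables API.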

The pricing is quite complex, as usual. You roughly have 4 costs:

  1. Storage Costs - these are 15% higher than Standard S3.

    1. They’re also in 3 tiers (first 50TB, next 450TB, over 500TB each month)

    2. S3 Standard: $0.023 / $0.022 / $0.021 per GiB

    3. S3 Tables: $0.0265 / $0.0253 / $0.0242 per GiB

  2. PUT and GET request costs - the same $0.005 per 1000 PUT and $0.0004 per 1000 GET

  3. Monitoring - a necessary cost for tables, $0.025 per 1000 objects a month.

    1. this is the same as S3 Intelligent Tiering’s Archive Access monitoring cost

  4. Compaction - a completely new Tables-only cost, charged at both GiB-processed and object count 💵

    1. $0.004 per 1000 objects processed

    2. $0.05 per GiB processed 🚨

Here’s what I estimate the cost would look like:

For comparison, 1 TiB in S3 Standard would cost you $21.5-$23.5 a month. So this ends up around 37% more expensive.
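The tiered storage prices listed above can be turned into a small calculator. This is only a sketch: it assumes the tier boundaries are measured in TiB and covers storage alone (no requests, monitoring or compaction).

```python
# Tier sizes in GiB: first 50 TiB, next 450 TiB, everything above 500 TiB.
TIER_SIZES_GIB = [50 * 1024, 450 * 1024, float("inf")]
S3_STANDARD = [0.023, 0.022, 0.021]   # $ per GiB-month, from the list above
S3_TABLES = [0.0265, 0.0253, 0.0242]  # $ per GiB-month, from the list above


def monthly_storage_cost(gib: float, prices: list) -> float:
    """Charge each slice of data at the price of the tier it falls into."""
    cost, remaining = 0.0, gib
    for tier_size, price in zip(TIER_SIZES_GIB, prices):
        billed = min(remaining, tier_size)
        cost += billed * price
        remaining -= billed
        if remaining <= 0:
            break
    return cost


one_tib = 1024
standard = monthly_storage_cost(one_tib, S3_STANDARD)
tables = monthly_storage_cost(one_tib, S3_TABLES)
print(f"S3 Standard: ${standard:.2f}  S3 Tables: ${tables:.2f}")
print(f"storage-only premium: {tables / standard - 1:.1%}")
```

For 1 TiB this comes out at roughly $23.55 versus $27.14, which is the ~15% storage premium on its own; the rest of the gap comes from the other cost lines.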

Compaction can be the “hidden” cost here. In Iceberg you can compact for four reasons:

  • bin-packing: combining smaller files into larger files.

  • merge-on-read compaction: merging the delete files generated from merge-on-reads with data files

  • sort data in new ways: you can rewrite data with new sort orders better suited for certain writes/updates

  • cluster the data: compact and sort via z-order sorting to better optimize for distinct query patterns

My understanding is that S3 Tables currently only supports the bin-packing compaction, and that’s what you’ll be charged on.

This is a one-time compaction. Iceberg has a target file size (defaults to 512MiB). The compaction process looks for files in a partition that are either too small or too large and attempts to rewrite them at the target size. Once done, that file shouldn’t be compacted again. So we can easily calculate the assumed costs.
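The selection step described above can be sketched in a few lines. This is a toy model, not what S3 Tables runs: the 75% and 180% thresholds are assumptions modeled on Iceberg's default rewrite settings, and `DataFile` and `plan_bin_pack` are hypothetical names.

```python
from dataclasses import dataclass

MIB = 1024 * 1024
TARGET_SIZE = 512 * MIB              # target file size quoted above
MIN_SIZE = int(TARGET_SIZE * 0.75)   # assumed "too small" threshold
MAX_SIZE = int(TARGET_SIZE * 1.80)   # assumed "too large" threshold


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


def needs_rewrite(f: DataFile) -> bool:
    """A file is a compaction candidate if it is outside the accepted size band."""
    return f.size_bytes < MIN_SIZE or f.size_bytes > MAX_SIZE


def plan_bin_pack(files):
    """Return (files to rewrite, number of output files) for one partition."""
    candidates = [f for f in files if needs_rewrite(f)]
    total = sum(f.size_bytes for f in candidates)
    outputs = -(-total // TARGET_SIZE)  # ceiling division; 0 when nothing qualifies
    return candidates, outputs


partition = [DataFile(f"part-{i}.parquet", 100 * MIB) for i in range(20)]
partition.append(DataFile("already-compacted.parquet", 512 * MIB))

to_rewrite, n_outputs = plan_bin_pack(partition)
print(len(to_rewrite), "small files rewritten into", n_outputs, "files")
```

The file that is already at the target size is skipped, which is why the per-GiB compaction charge can be treated as a one-time cost for each batch of ingested data.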

If you ingest 1 TB of new data every month, you’ll be paying a one-time fee of $51.2 to compact it (1024 GiB * $0.05 per GiB).

The per-object compaction cost is tricky to estimate. It depends on your write patterns. Let’s assume you write 100 MiB files - that’d be ~10.5k objects. $0.042 to process those. Even if you write relatively-small 10 MiB files - it’d be just $0.42. Insignificant.

Storing that 1 TB of data will cost you $25-27 each month.

Post-compaction, if each object is then 512 MiB (the default size), you’d have 2048 objects. The monitoring cost would be around $0.0512 a month. Pre-compaction, it’d be $0.2625 a month.
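The arithmetic in the last few paragraphs can be reproduced with a short script. It is a sketch that uses only the prices quoted in this post, treats the 1 TB as 1024 GiB, and uses the first S3 Tables storage tier; `estimate` is a hypothetical helper name.

```python
GIB_PER_TIB = 1024
MIB_PER_TIB = 1024 * 1024

# Prices quoted earlier in this post
COMPACTION_PER_GIB = 0.05
COMPACTION_PER_1K_OBJECTS = 0.004
MONITORING_PER_1K_OBJECTS = 0.025
STORAGE_PER_GIB = 0.0265          # first S3 Tables tier
COMPACTED_FILE_MIB = 512          # default target file size


def estimate(written_file_mib: float) -> dict:
    """Cost lines for ingesting 1 TiB written as files of the given size."""
    written_objects = MIB_PER_TIB / written_file_mib
    compacted_objects = MIB_PER_TIB / COMPACTED_FILE_MIB
    return {
        "compaction, per GiB (one-time)": GIB_PER_TIB * COMPACTION_PER_GIB,
        "compaction, per object (one-time)": written_objects / 1000 * COMPACTION_PER_1K_OBJECTS,
        "storage (monthly)": GIB_PER_TIB * STORAGE_PER_GIB,
        "monitoring before compaction (monthly)": written_objects / 1000 * MONITORING_PER_1K_OBJECTS,
        "monitoring after compaction (monthly)": compacted_objects / 1000 * MONITORING_PER_1K_OBJECTS,
    }


for label, dollars in estimate(written_file_mib=100).items():
    print(f"{label:<42} ${dollars:,.4f}")
```

The per-GiB compaction fee dominates everything except storage itself, which is why it is the line worth watching if you rewrite data often.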

The second feature out of the box is a simpler one. Automatic metadata management.

S3 Metadata is this simple feature you can enable on any S3 bucket.

Once enabled, S3 will automatically store and manage metadata for that bucket in an S3 Table (i.e., the new Iceberg thing).

That Iceberg table is called a metadata table and it’s read-only. S3 Metadata takes care of keeping it up to date, in “near real time”.

The metadata that gets stored is roughly split into two categories:

  • user-defined: basically any arbitrary key-value pairs you assign

    • product SKU, item ID, hash, etc.

  • system-defined: all the boring but useful stuff

    • object size, last modified date, encryption algorithm

The cost for the feature is somewhat simple:

  • $0.00045 per 1000 updates

    • this is almost the same as regular GET costs. Very cheap.

    • they quote it as $0.45 per 1 million updates, but that’s confusing.

  • the S3 Tables Cost we covered above

    • since the metadata will get stored in a regular S3 Table, you’ll be paying for that too. Presumably the data won’t be large, so this won’t be significant.

A big problem in the data lake space is the lake turning into a swamp.

Data Swamp: a data lake that’s not being used (and perhaps nobody knows what’s in there)

To an inexperienced person, it sounds trivial. How come you don’t know what’s in the lake?

But imagine I give you 1000 Petabytes of data. How do you begin to classify, categorize and organize everything? (hint: not easily)

Organizations usually resort to building their own metadata systems. They can be a pain to build and support.

With S3 Metadata, the vision is most probably to make metadata management as easy as “set this key-value pair on your clients writing the data”.

It then automatically flows into an Iceberg table and is kept up to date as you delete/update/add new tags/etc.

Since it’s Iceberg, that means you can leverage all the powerful modern query engines to analyze, visualize and generally process the metadata of your data lake’s content. ⭐️
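To give a feel for the kind of question this makes answerable, here is a toy sketch that uses an in-memory SQLite table as a stand-in for the metadata table. The table name, the columns and the `team` tag are all hypothetical; the real metadata table has its own schema and would be queried through an Iceberg-aware engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in schema: one system-defined field (size) and one user-defined tag (team)
conn.execute("CREATE TABLE object_metadata (key TEXT, size INTEGER, team TEXT)")
conn.executemany(
    "INSERT INTO object_metadata VALUES (?, ?, ?)",
    [
        ("logs/2024/12/01.parquet", 400, "platform"),
        ("logs/2024/12/02.parquet", 600, "platform"),
        ("exports/users.csv", 250, "growth"),
        ("tmp/scratch.bin", 900, None),  # nobody tagged this object
    ],
)

# "Who owns what in the lake, and how much of it is unaccounted for?"
query = """
    SELECT COALESCE(team, '(untagged)') AS owner,
           COUNT(*)                     AS objects,
           SUM(size)                    AS total_bytes
    FROM object_metadata
    GROUP BY owner
    ORDER BY total_bytes DESC
"""
report = conn.execute(query).fetchall()
for owner, objects, total_bytes in report:
    print(owner, objects, total_bytes)
```

The untagged bucket of bytes is the swamp: with the metadata in a queryable table, finding it becomes a one-line aggregation instead of a custom crawler.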

Sounds promising. Especially at the low cost point!

All this is offered behind a fully managed AWS-grade first-class service?

I don’t see how all lakehouse providers in the space aren’t panicking.

Sure, their business won’t go to zero - but this must be a very real threat for their future revenue expectations.

People don’t realize the advantage cloud providers have in selling managed services, even if their product is inferior.

  • leverages the cloud provider’s massive sales teams

  • first-class integration

  • ease of use (just click a button and deploy)

  • no overhead in signing new contracts, vetting the vendor’s compliance standards, etc. (enterprise b2b deals normally take years)

  • no need to do complex networking setups (VPC peering, PrivateLink) just to avoid the egregious network costs

I saw this first hand at Confluent, trying to win against AWS’ MSK.

The difference here?

S3 is a much, MUCH more heavily-invested and better polished product…

And the total addressable market (TAM) is much larger.

[Image: a visualization I made as part of the social media posts on the subject matter - “AWS is deploying a warship in the Open Table Formats war”]

What we’re seeing is a small incremental step in an obvious age-old business strategy: move up the stack.

What began as the commoditization of storage with S3’s rise in the last decade+, is now slowly beginning to eat into the lakehouse stack.

Subscribe to not miss out on future issues!

Some notes on Iceberg announcements I have include:

But beside that - the open table format wars have been going on for a while.

It marked the rise of the so-called headless data architecture, where the storage layer (data) is decoupled from the query layers (engines) that use it. 💡

Despite being decoupled, the two layers will always be indirectly tightly coupled because efficiency in the query layer relies deeply on how the storage layer is done (e.g. optimizations that allow for data to be fetched efficiently for faster querying).

Because the table format essentially defines the storage layer, you had big dogs like Snowflake and Databricks aggressively competing with each other on the table formats.

Let’s not forget the absolutely humongous acquisition Databricks did by spending $1-2 BILLION on a company that was raking in $1-5M/yr. It’s precisely this competition that resulted in them outbidding each other for Tabular (a company founded by the Iceberg creators).

Talk about a revenue multiple…

I covered the story in a viral post here:

💙 LinkedIn Link

💚 Twitter/X Link

In my opinion, that is the seminal moment when Iceberg won the open table format war. When Iceberg’s main competitor - Databricks Delta Lake - shelled out this outrageous sum to acquire the creators of Iceberg.

AWS stood, watched and ate its popcorn. 🍿

In classic AWS fashion, they don’t compete in the protocol war.

They wait for it to settle (or show signs of it), then just come in and eat the cake with their scale.

So yes, I think Iceberg did win. 🎉.

I realize this comes as a bitter truth to all the people that bet on the competitor(s), and I’m happy to be proven wrong - but it seems obvious to my neutral eyes.

[Image caption: I forgot to even mention Apache Paimon. I think the ship has sailed...]

(for those that aren’t aware of the term)

Let’s not forget why Iceberg is the buzz to begin with. It’s an Open Table Format.

The open table format war was precisely around that - openness. The idea was that:

Storage and compute products should be largely interchangeable and easily swapped, by using open standards like Iceberg.

The Open Table Format revolution sells you the following narrative:

  • zero copy - share your database’s storage by storing it in one place but using it from different query engines (avoids copying data and the insane network costs associated with that)

    • this is a much bigger proposition that we won’t dive into, but it essentially promises you both a data warehouse and data lake (coined the data lakehouse in classic industry jargon) with the same storage layer

  • openness - avoid lock-in by being able to easily swap query engines

    • also increase interoperability between tools, therefore reducing complexity and cost by allowing you to use the right tool for the job

It sold you the narrative that YOU control your data and are free to port it anywhere. In other words, you stay in control.

I think yes, but there will always be a tug of war.

The goal of every infrastructure company is to lock you in.

Never forget that, despite what they say. It’s the clear incentive for everybody.

The most successful infrastructure software company is called Oracle and it is notorious for its bad practices. While companies don’t want the bad reputation associated with it, they definitely salivate over its business.

So how are companies trying to lock you in with an open table format?

The metadata.

An Iceberg Catalog is a piece of software that manages a collection of Iceberg tables and most importantly - their metadata.

Recall that a table is simply a collection of files - so the metadata is the source of truth of what constitutes a table. It is the equivalent of a relational database’s information_schema - without it, query engines would never be able to make sense of the data.

A catalog is essentially a metastore that also gatekeeps access to your Iceberg tables.

To mutate a table, you have to go through the catalog. You can read without it, though.

This has many benefits, like:

  • access control - define and enforce your security in one place

  • thread-safe access (to the root Iceberg metadata file) - allows multi-engine architectures via atomic transactions

That’s what locks you in.

If you define a gazillion security policies in the catalog’s proprietary format and it ends up storing other additional metadata - good luck moving to another provider.
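The “atomic transactions” role of the catalog mentioned above boils down to a compare-and-swap on the pointer to the table’s current root metadata file. Here is a minimal in-memory sketch of that idea; `ToyCatalog` and `CommitConflict` are hypothetical names, and a real catalog does this against a database or service rather than a Python dict.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed since the writer last read it."""


class ToyCatalog:
    def __init__(self):
        self._lock = threading.Lock()
        self._pointers = {}  # table name -> location of current metadata file

    def current(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Swap the pointer only if nobody else committed in the meantime."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table} is no longer at {expected}")
            self._pointers[table] = new


catalog = ToyCatalog()
catalog.commit("db.events", None, "metadata/v1.json")

base = catalog.current("db.events")                    # both engines read v1
catalog.commit("db.events", base, "metadata/v2.json")  # engine A commits first

try:
    catalog.commit("db.events", base, "metadata/v2-other.json")  # engine B is stale
except CommitConflict:
    print("engine B must refresh its view of the table and retry")
```

Readers only need the pointer and the files behind it, which is why reads can bypass the catalog while every write has to pass through this single choke point.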

And most certainly we have seen a battle there. It’s not an accident that around the Tabular acquisition we saw major moves in the Catalog wars:

  • June 3: Snowflake’s Open Source Polaris Iceberg Catalog announced

  • June 4: Databricks acquires Tabular

  • June 13: Databricks’ Unity Delta Catalog open sourced

This is amidst a gazillion other existing catalogs, like Starburst Catalog, Hive Metastore, JDBC, Project Nessie, LakeKeeper, AWS Glue and more I’m missing.

There is not one canonical catalog and that’s a problem - it’s a huge overhead to compare them all and choose one. Plus it’s a huge burden to maintain client libraries in all the different languages that the catalog may be implemented in.

To be fair, the Iceberg project saw the proliferation of programming languages that engines need to talk to catalogs from, and has attempted to standardize on a REST API for Catalogs (since version 0.14.0 - July 2022). Adoption of this REST API has been good (Lakekeeper, Nessie, Gravitino, Starburst, Dremio, Databricks, Snowflake).
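The appeal of a REST API is that any engine, in any language, only needs an HTTP client. As a rough sketch of what that looks like, this builds the URL for loading a table, following the route layout of the REST catalog spec as I understand it; the host name is made up and `table_url` is a hypothetical helper.

```python
from urllib.parse import quote

# Multi-level namespaces are joined with the unit separator character
NAMESPACE_SEPARATOR = "\x1f"


def table_url(base_uri: str, namespace: list, table: str, prefix: str = "") -> str:
    """Build the 'load table' URL for a REST catalog (sketch, no auth or config)."""
    parts = ["v1"]
    if prefix:  # optional prefix the catalog may hand out via its config endpoint
        parts.append(quote(prefix, safe=""))
    encoded_namespace = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    parts += ["namespaces", encoded_namespace, "tables", quote(table, safe="")]
    return base_uri.rstrip("/") + "/" + "/".join(parts)


url = table_url("https://catalog.example.com", ["analytics", "web"], "events")
print("GET", url)
```

A `GET` on that URL is expected to return the table’s metadata as JSON, so a client in any language can skip the Java catalog classes entirely.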

As far as I can tell, any implementation of this Catalog interface is a catalog.

S3’s Tables Catalog is yet another catalog implementation. It implements the Iceberg Catalog interface in Java - by extending the abstract BaseMetastoreCatalog class. (I’m really not sure why they didn’t implement the REST API)

Basically a wrapper on top of the S3 Tables API.

There is one big differentiator here - access control.

AWS provides you the SAME authentication scheme (IAM Policy) that you’re:

  • used to

  • already a heavy user of (automation, etc.)

  • able to apply to both your data files and your Catalog APIs

So setting up a separate catalog suddenly becomes extra effort in duplicating/maintaining access policies that must be justified.

AWS has a lot of verticals in which it can try to lock you in, namely:

  • the RBAC-like IAM policies they offer for table-level access control

  • their own API

  • the table maintenance features they offer on top

  • better performance via underlying S3 optimizations

The Iceberg community already took a bittersweet L with the Tabular acquisition.

It’s ironic - Tabular had a very good blog post called “The Case for Independent Storage”.

There, Ryan Blue made the very strong argument that the industry needs and benefits from a neutral catalog provider, because its incentives would be aligned with the customers.

“I strongly believe in independent storage because I’ve seen the benefits first-hand. Our customers routinely see 30-60% savings from automatic tuning, and because we are neutral, these benefits apply across all compute environments. They’ve told us horror stories about tables that weren’t clustered, so every query was a full table scan and needlessly cost an extra $5 million per year. When you buy compute and storage from the same vendor, that company has little incentive to find and fix these types of problems. But that’s exactly what an independent storage vendor can and should do.”
- Ryan Blue

Not only that, but vendors can make their storage solutions more performant with their own query engines, or simply less performant with foreign query engines.

While some people were skeptical at the time of acquisition that now 2/3 of the open table formats are “owned” by one company (Databricks) - the Iceberg community seems somewhat diverse.

Only 3/16 Iceberg PMC members were part of Tabular, and the top contributors were somewhat well split between Tabular, Dremio, Apple and a few others.

$ git clone https://github.com/apache/iceberg && cd iceberg
$ git shortlog --since=2022 --summary --numbered --all --no-merges | head -20
  350  Fokko D.         (Tabular)
  184  Anton O.         (Apple)
  179  Eduard T.        (Tabular)
  142  Eduard T.        (Tabular)
  123  Ajantha B.       (Dremio)
   68  Ryan B.          (Tabular)
   58  Steven Z. W.     (Apple)
   56  Amogh J.         (Tabular)
   51  Bryan K.         (Tabular)
   49  Manu Z.          (Unknown)
   48  Amogh J.         (Tabular)
   43  Prashant S.      (Amazon)
   40  Xianyang L.      (Tencent)
   37  Szehon H.        (Apple)
   34  Robert S.        (Dremio)
   33  pvary            (Apple)
   30  Daniel W.        (Tabular)
   30  Kyle B.          (Tabular)
   29  Yufei G.         (Apple)
   28  Hongyue/Steve Z. (Apple)
$ # ^^ last names truncated and org name added

(source for the table: Gilles Philippart, June 2024)

While there was concern that Databricks might have tried to steer the project more into its own lane, this action from Amazon is likely to result in push-back to any such endeavours.

If anything, the native S3 support is likely to invite more contributors from Amazon and by extension - the other two clouds.

I will be posting more in this newsletter, as it allows me to write in a more free-form style and include a ton of references. I find that freeing.

To NOT miss out on future issues: subscribe 👇

Nevertheless, feel free to check the other places I post:

And if you enjoyed this piece… consider sharing it with your team in Slack ✌️
