
In-depth: ClickHouse vs Elasticsearch - PostHog
Mathew Prega · 2023-03-20 · via PostHog's RSS Feed

Elasticsearch and ClickHouse are both open-source frameworks with advantages over conventional databases like PostgreSQL for performing tasks over lots of data, but they serve very different needs.

Elasticsearch, as the name implies, was designed to power better search. It can efficiently return search results, such as grocery items on a grocer’s website, accounting for things such as spelling mistakes. It's the bedrock product for Elastic, which sells Elastic Cloud – a managed solution that bundles Elasticsearch with other data products.


ClickHouse, meanwhile, excels at aggregating data for uses like business analytics or financial statistics. While the database, ClickHouse, remains open source, it is managed by the for-profit ClickHouse Inc. ClickHouse Inc.’s main offering is ClickHouse Cloud, a managed service similar to Elastic Cloud, just for deploying ClickHouse instead. However, ClickHouse also merges notable contributions by Altinity, a separate company that sells Altinity.Cloud, a managed service for deploying ClickHouse in Kubernetes.


Elasticsearch and ClickHouse are interesting to compare because of their vastly different architectures, each optimized for its own goal. Comparing them is a good meditation on how physical and virtual data layouts can be tuned toward a specific performance goal.

Sometimes, the relationship between an open-source tool and its lead developer is complicated. ClickHouse's relationship is straightforward, but Elastic has a complex history with open source.

What is Elasticsearch?

Elasticsearch was originally released in 2010 under an open-source license. The premise behind Elasticsearch was that Apache Lucene, an open-source library designed to efficiently index and search documents, needed better infrastructure for scaling. Apache Lucene made it easy to organize and search a series of documents, such as user profiles; Elasticsearch made it easy to distribute those profiles, which might number in the billions, across multiple locations, indexed both physically and virtually.

Elasticsearch is considered a NoSQL database because it stores data as JSON documents, indexed by Apache Lucene, rather than as relational tables. Specifically, it is a document-store NoSQL database with a focus on searching and retrieving data. It is rarely used as an application's primary store of data, though. Elasticsearch data is often duplicated from a more traditional database like PostgreSQL, with Elasticsearch leveraged only to improve search results.

In 2021, Elastic abandoned Elasticsearch's Apache 2.0 open-source license in favor of a dual license: the Server Side Public License (SSPL) and its own Elastic License. It was a controversial move motivated by Elastic’s irritation with Amazon profiting off of Elasticsearch by operating a managed service without ever contributing to the codebase. In response, Amazon forked the last open-source version of Elasticsearch into a new open-source project, OpenSearch. Similar to Elastic (and ClickHouse Inc.), Amazon offers a managed version of OpenSearch.

Elasticsearch’s new license allows developers to run Elasticsearch themselves, but forbids cloud providers from offering Elasticsearch as a for-profit managed service. Most open-source advocates do not consider the Elastic License to be open source; however, it would be unfair to equate Elastic’s transparency with that of a purely closed-source solution like Snowflake.

Elastic also develops Kibana, a visualization program that plugs into Elasticsearch. It was also developed under an open-source license then shifted to an Elastic License in 2021. Kibana provides an interface for designing a dashboard that showcases Elasticsearch data.

What is ClickHouse?

ClickHouse is a traditional open-source project, but it started as a proprietary application. ClickHouse was originally built by Yandex for Yandex.Metrica, a massive analytics tool popular in Russia. Eventually, ClickHouse spun out into an independent, open-source project. Today, it is managed by ClickHouse Inc. with notable contributions by a separate organization, Altinity Inc.

ClickHouse was designed to return aggregate values of big data at millisecond speeds. ClickHouse accomplishes this through a series of clever techniques, including using a columnar store, dynamic materialized views, and specialized engines that take advantage of multiple cores.

Similar to Elasticsearch, ClickHouse can be (optionally) deployed through various managed, closed-source solutions. ClickHouse Inc. offers a managed service known as ClickHouse Cloud. ClickHouse Cloud includes a GUI, similar to Kibana, for querying and visualizing data. Separately, Altinity Inc. offers a managed service known as Altinity.Cloud that specializes in deploying ClickHouse on Kubernetes.

The biggest, defining difference between Elasticsearch and ClickHouse is their respective techniques for storing and organizing data.

ClickHouse is a columnar database; it stores data in tables, just with an inverted structure (on disk) relative to a traditional MySQL or PostgreSQL table. This columnar layout simplifies aggregating data.

Elasticsearch isn’t columnar – it isn’t even a table-based database. It stores data as documents, grouping sets of documents into shards, which are part of physical and virtual collections respectively known as nodes and indices.

Elasticsearch’s structure explained

Elasticsearch is best understood by separating the virtual structures from the physical structures.

Documents (virtual)

A base item in Elasticsearch is known as a document. A document is akin to a row of table data in MySQL: it has attributes, known as fields, stored as JSON.

An Elasticsearch document might look something like this:
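As a minimal sketch, assume a bank-account document with three fields: accountname, balance, and email. The values below are invented for illustration. The document is built as a Python dictionary and serialized to the JSON form that Elasticsearch stores:

```python
import json

# Hypothetical bank-account document; every value here is made up.
document = {
    "accountname": "Jane Doe",
    "balance": 1250.75,
    "email": "jane.doe@example.com",
}

# Elasticsearch stores and returns documents as JSON.
document_json = json.dumps(document, indent=2)
print(document_json)
```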

Fields (virtual)

A field is an attribute of a document. In the previous example, the fields were accountname, balance, and email. Fields make it easy for documents to be indexed and retrieved. Obviously, they also are used to store the data that applications use and present to users.

Indices (virtual), nodes (physical), and shards (virtual)

The three core components of Elasticsearch’s infrastructure are indices, nodes, and shards. A document in Elasticsearch is part of two discrete collections:

  1. A node, a physical machine that stores the data. Akin to a physical MySQL server, such as a device sitting in an Idaho data center.

  2. An index, a virtual collection that defines what type of data it is. Akin to a table in MySQL, like a collection of bank accounts, student profiles, or property listings.

A shard, meanwhile, is the intersection of a specific node and a specific index. A shard is also a single instance of Apache Lucene. It holds a subset of an index's documents, such as two hundred user profiles out of a total set of forty thousand.

Elasticsearch effectively creates a cartesian layout of physical and virtual coordinates.


Inverted index

In each shard (or Apache Lucene instance) is an inverted index. An inverted index is like the index at the back of a book: it maps string components (such as words, numbers, or prefixes) to all the documents they appear in.

Inverted indexes dramatically improve search time.


Inverted indexes dramatically speed up most queries. If a user queries for all the Reviews that use the word “outstanding”, Elasticsearch can return that collection extraordinarily fast because each shard in the Reviews index leverages an inverted index to find relevant Reviews, and Elasticsearch bundles Reviews into a single collection for the end user.
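The lookup described above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions (whitespace tokenization, lowercase matching, made-up reviews), not how Lucene actually stores its index:

```python
from collections import defaultdict

# Hypothetical reviews, keyed by document ID.
reviews = {
    1: "Outstanding service and friendly staff",
    2: "The food was cold",
    3: "Friendly staff but outstanding bills",
}

def build_inverted_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

index = build_inverted_index(reviews)

# A lookup is one dictionary access instead of a scan over every review.
matches = sorted(index.get("outstanding", set()))
print(matches)
```

The cost of the query depends on the number of matching documents, not on the total number of reviews, which is why the structure pays off as the collection grows.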

Inverted indices do not only index words and numbers, but also derivatives of words. This helps account for human error. For instance, Elasticsearch (or rather, Apache Lucene) can convert each word into its phonetic form and store that in an inverted index as well. That way, users can find documents with “bear” spelled “bare” with a single query.

Likewise, Elasticsearch stores prefixes, suffixes, and n-grams. And, given each word is also stored in an inverted index, Elasticsearch can leverage simple word-likeness algorithms like Levenshtein Distance to account for typos.
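To make the typo-tolerance idea concrete, here is a small sketch of Levenshtein distance and a fuzzy lookup over a vocabulary of indexed words. The function names and the `max_distance` parameter are illustrative assumptions, and a brute-force scan like this is far simpler than what a production search engine does:

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of `a` handled so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]

def fuzzy_lookup(term, vocabulary, max_distance=1):
    """Return indexed words within max_distance edits of the query term."""
    return sorted(word for word in vocabulary if levenshtein(term, word) <= max_distance)

vocabulary = {"outstanding", "standing", "out"}
print(fuzzy_lookup("outstandng", vocabulary))
```

A query for the misspelled "outstandng" is one edit away from "outstanding", so the lookup still finds the intended term.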

In short, Elasticsearch extended Apache Lucene’s inverted index into a scalable, distributed system that leverages its benefits via parallelization.

This split between a virtual and physical index is what makes Elasticsearch’s ultra-fast search possible. Because Elasticsearch can run a query on many nodes in parallel, each node only has to search its own slice of the data.

For instance, imagine 100,000 documents stored in a single, consolidated index on one node. If Elasticsearch took approximately 1 second to search 10,000 documents, then searching 100,000 documents would take ~10 seconds.

Now imagine 100,000 Documents sharded across 10 nodes with 10,000 documents each. Because each node would take ~1 second to search all of the documents in its shard, and this process is parallelized, Elasticsearch can cut the search time from ~10 seconds to ~1 second.
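The arithmetic and the scatter-gather shape of this example can be sketched as follows. The search rate of 10,000 documents per second is the figure assumed in the text, the helper names are hypothetical, and threads stand in for separate nodes, so the sketch shows the structure of the technique rather than a real speedup:

```python
from concurrent.futures import ThreadPoolExecutor

def estimated_seconds(total_docs, shard_count, docs_per_second=10_000):
    """Search time when every shard is scanned in parallel at the assumed rate."""
    return (total_docs / shard_count) / docs_per_second

def make_shards(documents, shard_count):
    """Deal documents round-robin into shard_count lists of roughly equal size."""
    return [documents[i::shard_count] for i in range(shard_count)]

def search_shard(shard, term):
    """Scan a single shard for documents containing the term."""
    return [doc for doc in shard if term in doc]

def search_all(shards, term):
    """Scatter the query to every shard, then gather and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = list(pool.map(lambda shard: search_shard(shard, term), shards))
    return [doc for part in partial_results for doc in part]

documents = [f"review {n}" for n in range(100_000)]
shards = make_shards(documents, 10)
hits = search_all(shards, "review 99999")

print(estimated_seconds(100_000, 1))   # one node holding everything
print(estimated_seconds(100_000, 10))  # ten nodes, 10,000 documents each
print(hits)
```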

Simply, “divide and conquer” is Elasticsearch’s middle name and trademark feature.

Replica Shards

Elasticsearch nodes and shards aren’t just used to distribute data, but also replicate it.

Elasticsearch has two types of shards: primary shards and replica shards. A replica shard is an exact copy of a primary shard, kept in case the primary becomes unavailable. A primary shard and its replica hold the same set of data. Therefore, they should never be located on the same node.

Elasticsearch can replicate data at scale without having to replicate the entire database.


Replica shards help database operations in two distinct ways:

  1. They protect users against data loss in case a node – which is a physical machine – fails.

  2. Replicas are actively used for querying, so if multiple queries are targeting the same data, replica shards can help distribute the reads, expediting results.
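Both benefits can be illustrated with a toy placement scheme. The round-robin rule and the function names below are assumptions made for the sketch; Elasticsearch's real shard allocator is considerably more sophisticated:

```python
from itertools import cycle

def place_shards(shard_count, nodes):
    """Put each primary on one node and its replica on the next node in the list."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    placement = {}
    for shard_id in range(shard_count):
        primary = nodes[shard_id % len(nodes)]
        replica = nodes[(shard_id + 1) % len(nodes)]
        placement[shard_id] = {"primary": primary, "replica": replica}
    return placement

def surviving_copy(copies, failed_node):
    """Return a node that still holds the shard after failed_node goes down."""
    return copies["replica"] if copies["primary"] == failed_node else copies["primary"]

nodes = ["node-a", "node-b", "node-c"]
placement = place_shards(shard_count=6, nodes=nodes)

# Benefit 1: losing node-a does not lose shard 0, because its replica lives elsewhere.
print(surviving_copy(placement[0], "node-a"))

# Benefit 2: reads for shard 0 can alternate between its two copies.
readers = cycle(placement[0].values())
first_four_reads = [next(readers) for _ in range(4)]
print(first_four_reads)
```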

Clusters

A cluster is a group of nodes. Many applications only have one cluster, though some may have multiple clusters spread over different geographies to serve clients with lower latency. Each Elasticsearch cluster has a single master node that helps delegate and manage other nodes.

ClickHouse’s structure explained

ClickHouse is engineered to process data in a massive, consolidated place. Unlike Elasticsearch, ClickHouse’s optimizations don’t happen through distributing data, but by efficiently pre-processing it in anticipation of queries.

There are three major components that enable ClickHouse to return aggregations, such as averages, sums, and standard deviations, in millisecond times over petabytes of data.

Component 1: Columnar layout

ClickHouse’s columnar layout – which flips rows and columns in storage relative to a MySQL database – makes aggregations efficient.

ClickHouse’s biggest magic trick really comes down to swapping rows and columns


Row-oriented databases physically read data row by row. By extension, if an analyst is trying to calculate the average bank account balance in a PostgreSQL database, the database would need to read every bank account row, including the columns the query doesn't care about. At scale, that alone could exhaust memory. But in ClickHouse, the same query would only need to read one (physical) row of data, the one holding bank balances, and collapse it into an average.

Again, this is a physical row of data. As far as ClickHouse’s interface goes, data is still stored in a traditional format. ClickHouse’s syntax still treats individual entries as rows and attributes as columns. But under the hood, ClickHouse stores the data in an inverted arrangement, optimized for merging attribute data into single values.
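The difference between the two layouts can be sketched with plain Python structures. The account data is invented, and real databases work with disk pages and compressed column files rather than lists, so treat this only as a picture of which values each layout has to touch:

```python
# Hypothetical bank accounts in a row-oriented layout: one record per account.
rows = [
    {"accountname": "alice", "balance": 120.0, "email": "alice@example.com"},
    {"accountname": "bob", "balance": 80.0, "email": "bob@example.com"},
    {"accountname": "carol", "balance": 100.0, "email": "carol@example.com"},
]

# The same data in a column-oriented layout: one contiguous list per attribute.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def average_from_rows(records, field):
    """Visit every full record even though only one field is needed."""
    return sum(record[field] for record in records) / len(records)

def average_from_columns(table, field):
    """Read only the single column the aggregation needs."""
    values = table[field]
    return sum(values) / len(values)

print(average_from_rows(rows, "balance"))
print(average_from_columns(columns, "balance"))
```

Both functions return the same average; the columnar version simply never touches the account names or emails.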

Component 2: Materialized views

ClickHouse’s second superpower is dynamic materialized views.


Materialized views are not a new concept – in MySQL or PostgreSQL, a materialized view is a new table that can be queried from, rendered by a SQL query accessing other tables. However, once new data is added to the core tables, that materialized view goes out-of-date. Because creating materialized views is often expensive in traditional databases given their non-columnar layout, refreshing materialized views can only happen occasionally.

But ClickHouse truly makes materialized views dynamic. It doesn’t accomplish this only because of the columnar layout of its data. It also leverages incremental data structures that merge data strategically.
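The idea of keeping a view current by folding in each new batch of rows, instead of recomputing it from scratch, can be sketched as below. The class and its methods are hypothetical and illustrate the concept only; they are not ClickHouse's implementation:

```python
from collections import defaultdict

class IncrementalAverageView:
    """Keeps a per-key running sum and count so averages never need a full rescan."""

    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def on_insert(self, rows):
        """Fold a newly inserted batch of (key, value) rows into the stored state."""
        for key, value in rows:
            self._sums[key] += value
            self._counts[key] += 1

    def average(self, key):
        """Answer from the pre-aggregated state in constant time."""
        if key not in self._counts:
            raise KeyError(key)
        return self._sums[key] / self._counts[key]

view = IncrementalAverageView()
view.on_insert([("eu", 10.0), ("us", 30.0)])
view.on_insert([("eu", 20.0)])  # the view stays current after every insert
print(view.average("eu"))
```

The cost of an update is proportional to the size of the new batch, not to the size of the underlying table, which is what makes frequent refreshes affordable.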

Component 3: Specialized engines

ClickHouse has a series of specialized engines that enable developers to take advantage of multiple CPUs in parallel on the same machine. For instance, there are engines for summing data (SummingMergeTree) and for removing duplicates (ReplacingMergeTree). This technique has some resemblance to Elasticsearch’s parallelization across multiple machines to expedite search; ClickHouse does it at a more granular, per-machine level.
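As a rough sketch of what these two engines do to rows that share a key, the functions below model only the end result of a merge. In ClickHouse the merging happens in the background on data parts; the function names and sample rows here are made up for illustration:

```python
def summing_merge(rows):
    """Collapse rows sharing a key into one row whose value is their sum."""
    merged = {}
    for key, value in rows:
        merged[key] = merged.get(key, 0) + value
    return sorted(merged.items())

def replacing_merge(rows):
    """Keep only the most recently inserted row for each key."""
    latest = {}
    for key, value in rows:
        latest[key] = value  # later rows overwrite earlier ones
    return sorted(latest.items())

# Hypothetical (page, views) rows in insertion order.
inserted = [("page_a", 1), ("page_b", 2), ("page_a", 3)]

print(summing_merge(inserted))
print(replacing_merge(inserted))
```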

Sharding

ClickHouse has some overlap with Elasticsearch’s sharding features. ClickHouse relies on Apache ZooKeeper to coordinate multiple instances of ClickHouse should data need to be split across machines. However, this concept of sharding is closer to Elasticsearch’s support for multiple clusters: it addresses a big data distribution problem rather than serving as a smaller optimization for speeding up queries.

Architecture summary

At a high level, ClickHouse and Elasticsearch’s differences showcase how they are designed to fit their own purposes. ClickHouse consolidates data so it can constantly update materialized views to serve number-hungry queries. Meanwhile, Elasticsearch is designed to find specific items, treating search queries as a group project where every node does its part.

While ClickHouse supports multiple instances, coordinated by Apache ZooKeeper, it does not offer a decentralized solution competitive with Elasticsearch’s model. Likewise, while Elasticsearch offers data frame analytics, which has some overlap with ClickHouse’s materialized views, it is either more expensive or not as dynamic as ClickHouse’s fine-tuned aggregation machine.

Elasticsearch and ClickHouse both have small, medium, and enterprise customers.

Elasticsearch’s customers utilize it to return a specific chunk of data quickly to users. For instance, Uber uses Elastic to return data relevant to calculating surge pricing on a minute-by-minute basis. Tinder uses Elastic to fetch potential matches that might fit a user’s profile. T-Mobile uses Elasticsearch to deliver specific user profiles to customer support reps to promote better NPS.

In all of these examples, Elastic fetches something specific very efficiently.

Conversely, ClickHouse is used to return aggregations of data. The most obvious example would be us. We use ClickHouse to power PostHog, an open-source analytics suite that involves hundreds of aggregate values. Previously, PostHog was powered by PostgreSQL, which quickly spiraled out of control as we grew.

Others, like us, also use ClickHouse to power user-facing features – Rokt, an e-commerce platform, uses ClickHouse to power its analytics panels. However, some companies leverage ClickHouse for internal use cases, such as the Washington Post, which uses ClickHouse to power its in-house analytics suite.

ClickHouse was built to perform aggregations, but it’s naive to say that Elasticsearch doesn’t have the structure to compete with ClickHouse on some aggregations.

To understand this, remember the primary philosophy behind ClickHouse’s design: pre-calculate aggregations ahead of queries to enable millisecond-level fetches. ClickHouse accomplishes this through materialized views and specialized engines, which are optimized for mathematical queries traversing numeric data.

Elasticsearch, meanwhile, can achieve similar performance on certain queries. For instance, if a product needs the number of college alumni who are unemployed, Elasticsearch can count the entries in the inverted index for terms that match unemployed alumni. In other words, fast search sometimes equates to great analytics by just adding a COUNT() function.
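A minimal sketch of that counting trick: if each term's posting list is a set of document IDs, the count is the size of the intersection. The postings and the `count_matching` helper are invented for the example:

```python
# Hypothetical posting lists: term -> IDs of the profiles containing it.
postings = {
    "alumni": {1, 2, 3, 5, 8},
    "unemployed": {2, 3, 7, 8, 9},
}

def count_matching(postings, terms):
    """Count documents containing every term by intersecting posting lists."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return 0
    return len(set.intersection(*sets))

print(count_matching(postings, ["alumni", "unemployed"]))
```

No document is ever opened; the answer comes entirely from the index, which is why this kind of aggregation is cheap for a search engine.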

While both Elasticsearch and ClickHouse are fundamentally backend products, we can compare their respective GUI products – Kibana for Elasticsearch and ClickHouse Cloud for ClickHouse. ClickHouse Cloud is a much younger product; Kibana, conversely, has been around for nearly a decade and has an extensive UI.


In a nutshell, comparing the analytics efficiency of ClickHouse and Elasticsearch has the same sort-of, not-really awkwardness of other comparisons – they both excel in their respective categories using radically different methods to cater to a different type of need. However, Elastic’s Kibana product is more mature than ClickHouse Cloud’s competitive offering.

ClickHouse and Elasticsearch are both fantastic solutions for data aggregation and fast search respectively.

Elasticsearch grew quickly thanks to fast-paced development by its parent company, Elastic. Elastic has spearheaded the development of Elasticsearch, Kibana, and other accessory products like Beats, a data shipper. For many enterprise customers, Elastic being a pseudo-closed-source solution is balanced by the fact that it can leverage its enterprise revenue to foster a massive engineering effort to improve Elastic.

Conversely, while the team behind ClickHouse is smaller, ClickHouse has an avid developer community, with contributors from outside the two major ClickHouse developers, ClickHouse Inc. and Altinity Inc. One of the reasons ClickHouse has grown over the last few years is its open-source, pro-community brand; another is its blistering performance.

Overall, Elasticsearch remains a good solution if data aggregation involves searching text. It is a more mature project with an entire suite dedicated to interfacing with Elasticsearch data. However, it is no longer a true open-source product like ClickHouse is, and isn't designed to support the kind of high-performance use cases ClickHouse excels in.
