What launching Experimentation taught us about running effective A/B tests - PostHog
Neil Kakkar · 2022-03-23 · via PostHog's RSS Feed

We just launched our Experimentation suite, and there's a ton we learned about running successful experiments.

It was a no-brainer product decision: since you're already analysing your data in PostHog, and you're already using Feature Flags to roll out new features, why not have the capability to test how well these features are doing? Plus, what is the world without a great open-source A/B testing tool?

Experiments allow you to choose a target metric, choose specific people to run this experiment on, and set how long the experiment runs for.

PostHog - Experiment Creation

Thanks to Feature Flags, you can then easily validate whether each variant looks good, launch your experiment, and wait for data to come in. We then run a Bayesian analysis on the data to give a probability for each variant being the best, a graph of how things are looking for each variant, and whether the results are statistically significant or not.
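To make "a probability for each variant being the best" concrete, here is a minimal sketch of one common Bayesian approach. It assumes a conversion-style metric and a uniform Beta prior. The function name, the prior, and the numbers are all illustrative assumptions; the article does not describe PostHog's actual implementation.

```python
import random


def prob_each_variant_is_best(results, draws=20_000, seed=0):
    """Estimate P(variant has the highest conversion rate) by Monte Carlo.

    results maps variant name -> (conversions, exposures). Each rate gets a
    Beta(1 + conversions, 1 + failures) posterior, i.e. a uniform prior
    (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in results}
    for _ in range(draws):
        # Draw one plausible conversion rate per variant from its posterior.
        sampled = {
            name: rng.betavariate(1 + conv, 1 + (n - conv))
            for name, (conv, n) in results.items()
        }
        # Credit the variant with the highest sampled rate in this draw.
        wins[max(sampled, key=sampled.get)] += 1
    return {name: count / draws for name, count in wins.items()}


# Hypothetical data: (conversions, exposures) per variant.
example = {"control": (100, 1000), "test": (130, 1000)}
probabilities = prob_each_variant_is_best(example)
print(probabilities)
```

The share of draws each variant wins approximates the probability that it is the best one, which is the kind of number the results page reports.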

PostHog - Experiment Results

It's a powerful tool for building great products, but that's enough about how experiments work for now. If you're interested in the technical details, check out the Experimentation user guide.

Instead, I'm going to share the three key things we've learned about running effective A/B tests so you can get the most out of this new feature.

This article is Chapter III in our A Universe of New Features launch week series

Let's say you're running an experiment to optimize the number of times users interact with PostHog graphs. Specifically, you're testing out different layouts for funnels - horizontal and vertical - and want to find which one leads to more interactions.

You can choose one of two metrics, but which one is right?

  1. Number of interactions across all graphs, not just funnels.
  2. Number of interactions for funnels.

Note that you're choosing total interactions here, not unique interactions, so if one person clicks on the funnel three times, that counts as three interactions for either metric, as it should.

There's a big problem with metric #1: It's global, and a lot more susceptible to things out of our control. For example, if Trends power users are somehow assigned to the control group, the data will have a big skew towards control which has nothing to do with the different funnel layouts.

We found this to be the case in reality - the more specific the metric, the fewer outside factors affect your result. Focusing on local optimization gives you better local information.
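A toy illustration of that skew, using made-up per-user counts (every number here is hypothetical): a single Trends power user in the control group dominates the global metric, while the funnel-only metric is unaffected.

```python
# Hypothetical per-user interaction counts across ALL graphs (metric #1).
# The last control user is a Trends power user who barely touches funnels.
control_all_graphs = [4, 5, 6, 5, 180]
test_all_graphs = [5, 6, 7, 6, 6]

# Hypothetical funnel-only interaction counts (metric #2) for the same users.
control_funnels = [1, 2, 1, 2, 2]
test_funnels = [2, 3, 2, 3, 2]

global_control, global_test = sum(control_all_graphs), sum(test_all_graphs)
local_control, local_test = sum(control_funnels), sum(test_funnels)

print("global metric:", global_control, "vs", global_test)  # control looks far ahead
print("local metric: ", local_control, "vs", local_test)    # test is ahead
```

The global totals say control wins by a wide margin, yet that gap comes entirely from one user's Trends usage and has nothing to do with the funnel layout being tested.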

At the same time, you don't want to discard second order effects. What if the horizontal funnel layout prevents users from switching to other graphs? This might increase funnel interactions (local metric #2 increases), but at a cost to the global metric (#1).

To solve this problem, we introduced secondary metrics. We encourage making the local metric the main metric, with the option of adding a few secondary metrics. We don't do significance analysis on these secondary metrics, but we show the metric values for each variant, so you can ensure that there's no huge drop in global metrics while deciding on results.
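As a rough sketch of how secondary metrics can be used as a guardrail, the snippet below compares each variant's secondary metric values to control and flags large drops. The function names, the 10% threshold, and the numbers are assumptions for illustration, not PostHog's API.

```python
def secondary_metric_changes(values, control="control"):
    """Relative change of each variant vs. control, per secondary metric.

    values maps metric name -> {variant name -> metric value}.
    No significance test is run here; these are raw comparisons only.
    """
    changes = {}
    for metric, by_variant in values.items():
        baseline = by_variant[control]
        changes[metric] = {
            variant: (value - baseline) / baseline
            for variant, value in by_variant.items()
            if variant != control
        }
    return changes


def large_drops(changes, threshold=-0.10):
    """Return (metric, variant) pairs whose relative change is below threshold."""
    return [
        (metric, variant)
        for metric, by_variant in changes.items()
        for variant, change in by_variant.items()
        if change < threshold
    ]


# Hypothetical numbers: interactions across all graphs (the global, secondary
# metric) fall in the test variant even if funnel interactions go up.
secondary = {"all_graph_interactions": {"control": 5000, "test": 4200}}
changes = secondary_metric_changes(secondary)
print(changes)
print(large_drops(changes))
```

A flagged drop is not a verdict, since no significance analysis is run on it. It is a prompt to look closer before shipping the winning variant.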

Another advantage of local metrics over global ones is that global metrics can be hard to move reliably. Local metrics allow you to see changes faster, since they're narrower in scope, and thus move quicker.

You just finished running the experiment above, and the results are in. Horizontal funnel layout had 1,000 interactions, while vertical funnel layout had 1,200. The results ended up being significant, with vertical funnel layout being 20% better.

All well and good... except this goes against everything you intuitively know about using your own product. You find the vertical view congested, hard to parse, and sort of terrible.

Do you completely trust the data, or your intuition?

Both have issues. Your intuition might be how you see the world, but not necessarily how people who use the product see the world. At the same time, what if there was a bug in the vertical layout implementation, which counted each interaction twice? Maybe the 'real' number was 600, instead of 1,200, which massively changes your product direction.
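The arithmetic behind that scenario is worth spelling out. This small sketch uses the numbers from the example above; the helper name is an assumption.

```python
def relative_lift(baseline, variant):
    """Relative change of variant over baseline, e.g. 0.2 means 20% better."""
    return (variant - baseline) / baseline


horizontal = 1000
vertical_reported = 1200
# If a bug counted every vertical interaction twice, the real count is half.
vertical_corrected = vertical_reported // 2

lift_reported = relative_lift(horizontal, vertical_reported)
lift_corrected = relative_lift(horizontal, vertical_corrected)

print(f"reported:  {lift_reported:+.0%}")   # +20%
print(f"corrected: {lift_corrected:+.0%}")  # -40%
```

A single double-counting bug turns a 20% win for the vertical layout into a 40% loss, which is why the raw result alone is not enough to settle the question.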

Perhaps unhelpfully, I'd recommend neither blindly trusting the data, nor your intuition. Experiments show you what is happening, but can't answer why. The real institutional knowledge comes with answering the why, and building up an accurate model of who your users are, what they need, and how they interact with your product.

To answer the why, you need to talk through the causes. Create hypotheses about why this is happening, watch user session recordings, and then make a call about what you want to do.

That is, bring data to conversations, but also talk through causes.

You've finally got results for the experiment above, and figured out why they're like this. Turns out, the vertical layout promotes interaction – it allows users to see all steps of the funnel in one go, click on the steps that seem surprising, see the persons involved in that step, watch their recordings, etc. The horizontal layout, meanwhile, makes it harder to take in all this information at a glance, causing faster bounces.

That's a model that keeps on giving, even when things change.

Let's say it's now three months in the future, and you've done a design revamp. Horizontal bars are thinner now, while vertical bars are thicker. As a result, horizontal funnels fit in cleanly on screen, while vertical funnels don't.

You could run an experiment again to find out if user preferences have changed, but if your model is right, you already know what to expect: interactions should start going down, and you can make the call to revert to a horizontal layout.

Ultimately, experiment results don't stand the test of time - we cannot stress enough the importance of extracting a useful model out of your experiment results.

Another interesting thing we learned is that we can't simply run experiments for web products like you would a clinical trial. Rigor is important, but if it takes you a year to make up your mind about a vertical vs. horizontal layout, you'll be in trouble.

This kind of rigor makes sense when you're developing a new drug and optimizing for risk mitigation: there are lives at stake, and mistakes do result in casualties. Further, you can take things slow because human biology is reliably consistent.

By contrast, web products have much lower stakes, and exist in an ever-changing environment. Culture and individual preferences can change rapidly, and the cost of getting experiments wrong isn't too high - you can easily revert them later on.

Moving quickly trumps rigor in web product experiments.

We built Experimentation with this in mind. It's a web product, built for products that move quickly.

Another example where this difference shines through is the peeking problem.

The strawman, pop-sci version of the peeking problem goes something like: "You shouldn't look at experiment results while the experiment is running because that can lead to you ending experiments early, when the data is skewed in favour of one variant, thanks to random chance."

However, peeking isn't the problem. The problem is taking action too quickly after peeking.
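A quick simulation shows why acting on every peek is the risky part. The sketch below runs A/A tests, where both variants share the same true conversion rate, and compares stopping at the first significant-looking peek against checking only at the end. It swaps in a simple two-proportion z-test purely for illustration, not the Bayesian analysis described earlier, and all parameters are assumptions.

```python
import math
import random


def z_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-proportion z-test with equal sample size n per variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > z_crit


def false_positive_rates(sims=400, looks=10, users_per_look=200, rate=0.1, seed=1):
    """Return (rate if you act on any significant peek, rate if you only check at the end)."""
    rng = random.Random(seed)
    acted_on_peek = 0
    significant_at_end = 0
    for _sim in range(sims):
        conv_a = conv_b = n = 0
        flagged = False
        last = False
        for _look in range(looks):
            # Both variants have the SAME true rate, so any "win" is noise.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            last = z_significant(conv_a, conv_b, n)
            flagged = flagged or last
        acted_on_peek += flagged
        significant_at_end += last
    return acted_on_peek / sims, significant_at_end / sims


act_rate, end_rate = false_positive_rates()
print(f"false positives when acting on any peek: {act_rate:.1%}")
print(f"false positives when checking only at the end: {end_rate:.1%}")
```

Looking at the numbers at every step changes nothing by itself. The error rate only grows when a fleeting significant result is treated as a reason to stop.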

We built this into our product. Peeking is fun, almost addictive, when you can see your experiment results changing in real time. It gives a sense of excitement, seeing your hypotheses being proven right or wrong. More importantly, it keeps you coming back to the experiment, tracking its progress.

To solve the taking-action-without-enough-information problem, we made it clear in our UI when it is okay to end an experiment. Specifically, this is when results become significant, or the pre-determined duration has passed. This changed the conversation from 'peeking early and ending the experiment if results look good' to 'waiting for the green light to switch on', and led to a much better overall experience.
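The green-light rule can be written as a one-line check. This is a minimal sketch of the rule as stated in the paragraph above; the function and parameter names are hypothetical, not PostHog's code.

```python
from datetime import date


def can_end_experiment(is_significant, start_date, planned_days, today):
    """Green light: results are significant OR the planned duration has passed."""
    days_running = (today - start_date).days
    return is_significant or days_running >= planned_days


start = date(2022, 3, 1)
# One week in, not significant: keep waiting.
print(can_end_experiment(False, start, 14, date(2022, 3, 8)))
# One week in, significant: green light.
print(can_end_experiment(True, start, 14, date(2022, 3, 8)))
# Two weeks in, still not significant: green light, the planned duration is over.
print(can_end_experiment(False, start, 14, date(2022, 3, 15)))
```

Peeking stays free under this rule. Only the decision to end the experiment is gated.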

That's all for this post. We'd love to have you start your own experiments and tell us what you learn. Feel free to open an issue in our GitHub repo, join us directly for a call with our Product & Engineering team, or submit a ticket if you have feedback to share.
