One Tuesday I wasted two hours chasing a Bitrix24 (an ERP/CRM platform) API method that doesn't exist. The model I asked described it as if it were right there in the docs - full description, code example, confident tone. The method was crm.item.userfield.add. Made up.
The real one is userfieldconfig.add. It's in the official documentation.
That evening I kept thinking about one thing: what if I could see when models disagree? Not which one is right - I won't always know. Just a signal. Something's off here, check before you use it.
So I built a tool. Three models, same question, in parallel. Watch where they split. I added an interface, then more features, then other people started using it. Now it's a product, which still feels weird to say about something I built for my own Tuesday afternoons.
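The core idea fits in a few lines. This is a minimal sketch of the fan-out, not the product's code: `ask_all` and `make_stub` are hypothetical names, and the stubs stand in for real model clients, which would make network calls.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical model client type: takes a question, returns an answer string.
ModelFn = Callable[[str], Awaitable[str]]


async def ask_all(question: str, models: Dict[str, ModelFn]) -> Dict[str, str]:
    """Send the same question to every model concurrently."""
    names = list(models)
    # gather() returns results in the same order as the awaitables passed in.
    answers = await asyncio.gather(*(models[name](question) for name in names))
    return dict(zip(names, answers))


def make_stub(answer: str) -> ModelFn:
    """Build a fake model that always returns `answer` (illustration only)."""
    async def stub(question: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, as a real call would
        return answer
    return stub


if __name__ == "__main__":
    models = {
        "model_1": make_stub("answer A"),
        "model_2": make_stub("answer B"),
        "model_3": make_stub("answer A"),
    }
    replies = asyncio.run(ask_all("same question for everyone", models))
    for name, reply in replies.items():
        print(name, "->", reply)
```

Running the requests concurrently means the wait is roughly that of the slowest model, not the sum of all three.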
A few weeks ago I ran a benchmark - 60 questions, half general knowledge, half narrow technical (specific API methods, library behavior, niche platforms).
General questions: median consensus 92.5%. Models hedge on subjective questions and tend to say the same things in slightly different words.
Technical questions: median consensus 33%.
The Bitrix case is the clearest example. Question: how do you create a custom user field for a smart process in Bitrix24?
Three answers:
- Model 1: crm.item.userfield.add
- Model 2: crm.userfield.add
- Model 3: userfieldconfig.add
I checked all three against the official docs. Only one - userfieldconfig.add - was the right method for smart processes. The other two were either invented or borrowed from a different part of the API where they don't apply.
All three answered with the same confident tone. No hedging, no uncertainty. If you'd asked just one and gotten a wrong answer, you'd have had no reason not to trust it.
Worth being precise about what the consensus score means.
It doesn't tell you which answer is correct - the synthesizer model underneath doesn't have access to ground truth either. It tells you something simpler: when three independently queried models converge, you're asking about something well-covered in training data. When they diverge, the data is thin or inconsistent, and at least one model is guessing.
33% consensus means three models, three different answers. Someone's wrong. Probably two of them.
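To make that number concrete, here is a toy version of the arithmetic. It is a sketch under a loud assumption: it treats each answer as a single method name and counts exact matches, whereas the real score comes from a synthesizer model comparing full answers. `agreement_percent` and `normalize` are hypothetical names.

```python
from collections import Counter
from typing import Sequence


def normalize(answer: str) -> str:
    """Ignore case and surrounding whitespace when comparing answers."""
    return answer.strip().lower()


def agreement_percent(answers: Sequence[str]) -> float:
    """Share of answers that fall in the largest agreeing group, in percent."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    largest_group = counts.most_common(1)[0][1]
    return 100 * largest_group / len(answers)


if __name__ == "__main__":
    bitrix_answers = [
        "crm.item.userfield.add",
        "crm.userfield.add",
        "userfieldconfig.add",
    ]
    # Three models, three different answers: the largest group has one member.
    print(round(agreement_percent(bitrix_answers)))
```

With three models, exact matching can only produce 33, 67, or 100, so a finer-grained score like 92.5 needs a softer comparison than this.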
General questions cluster at 90–95%. That's just well-covered territory, not a useful signal either way. The outliers are what matter - specific API methods, recent spec changes, niche platform behavior. These appear rarely enough in training data that different models develop different "memories" of the same fact.
You can't fix this by switching to a better model. It's a triangulation problem.
I'm a single developer. I built this because I kept running into the same specific thing - not just "AI got it wrong," but "AI got it wrong and sounded exactly as confident as when it gets it right." That's hard to work around without a cross-check.
Free tier: 3 queries - try it on a question where you've been trusting a single model. Founding tier: $9/month for the first 100 people, price locked for 3 years.