Why We Banned 'Within the Realm of...' From Our AI Game Descriptions

A small portal's accidental brush with Google's scaled-content-abuse algorithm — and the 11-rule prompt change that fixed it.

TL;DR

We run a 6,800-game HTML5 portal where every game description is generated by GPT-4.1-mini using a prompt that tries to sound editorial. Last week we caught the prompt mid-disaster: 56% of descriptions across our entire corpus opened with the same six words ("What separates casual from committed play..."), 38% used the same skill-tier framing ("Veterans of [genre] recognize..."), and 86% violated our internal jargon budget by stuffing four or more of hitbox, frame-pacing, tick rate, RNG floor into a single page.

A wedding dress-up game on our site discussed tick rate. Dress-up games don't have a tick rate. That's when we knew we'd built a textbook scaled-doorway pattern straight into the GPT prompt, exactly what Google's March 2024 spam-policy update called out:

"Scaled content abuse is when many pages are generated for the primary purpose of manipulating search rankings... It does not matter if you use generative AI, manual means, or a mix to produce this scaled content."

Our indexing rate was 18.6%. The pattern was both the cause and the diagnostic.

This post is what we found, what we changed, and the verification script we now run in CI to prevent regression.

How a "neutral expert" prompt becomes a doorway pattern

The original prompt was good-faith. We wanted descriptions that read like a reviewer wrote them, not like a marketer. To anchor the model away from generic Adventure-of-a-lifetime copy, we provided positive examples of "neutral-expert voice patterns" the model could draw from. Six of them, including:

"Veterans of [X] recognize..."
"Serious players of the genre find..."
"What separates casual from committed play is..."
"A commonly overlooked mechanic..."
"The game rewards a counter-intuitive approach..."

We also asked for a closing "Expert Tip: ..." line, a counter-intuitive truth, and "genre jargon used unapologetically", with hitbox, tick rate, frame-pacing, metagame, and RNG floor as examples.

What happened: the model picked the path of least resistance. Each of the six voice patterns is high-quality on its own. But across 6,800 generations, the model converged on a tiny subset. It's the LLM equivalent of giving a student six sample answers and being surprised they only ever quote the first three.

We discovered the problem when we wrote a homogeneity check (more on that below) and learned that 3,797 of our 6,728 entries opened with the same eleven-word sentence. We had effectively templated 56% of our content corpus.

What this looks like to Google

Google's classifiers don't need to "understand" voice patterns; n-gram fingerprinting is enough. When 56% of pages on a domain share an opener of more than 40 characters, the domain pattern-matches the "scaled content" spam category described in the March 2024 update. The response we saw followed a predictable sequence:

Google still crawls each page (the URLs are in our sitemap).
Google evaluates the content quality signal at the domain level.
Pages that fingerprint as "templated content" land in "Crawled - currently not indexed" in Search Console.

The 16-million-page IndexCheckr study puts the cross-web indexing average at 37%. We were at 18.6%. GSQI's December 2024 spam-update case studies include a directory site with 140k programmatic URLs at 12% effective indexing rate — same shape, larger scale.

The diagnostic is unambiguous: when "Crawled - currently not indexed" dominates your index report and your domain-wide content shares a structural fingerprint, the algorithm has evaluated and rejected your content, not failed to find it.

The 11-rule v3 prompt

We rewrote the system prompt with anti-doorway constraints as first-class. Here's the structure (full prompt is in our open-source repo):

Hard constraints (kept from v2):

No first-person pronouns — write as a critic, not a player.
No em-dashes (we sanitize these but reject if they leak past).
No marketing filler ("amazing", "unbelievable adventure", "dive into", "embark on").
No AI connector words ("In summary", "Furthermore", "It is worth noting").
No invented numbers or personal events ("scored 4,500", "on level 7").
Game name appears max 3 times in 180 words.
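
Most of these hard constraints can be checked mechanically after generation. The sketch below is one way such a check might look, not the code from our repo: the phrase lists are copied from the examples in the rules, and the function name, the pronoun list, and the return format are assumptions made for illustration. The "no invented numbers" rule is left out because it needs the source material to verify.

```python
import re

# Phrase lists taken from the examples quoted in the rules above.
# A real list would presumably be longer (assumption).
MARKETING_FILLER = ("amazing", "unbelievable adventure", "dive into", "embark on")
AI_CONNECTORS = ("in summary", "furthermore", "it is worth noting")

# Hypothetical pronoun list; "I" can false-positive on Roman numerals.
FIRST_PERSON = re.compile(r"\b(i|me|my|mine|we|us|our|ours)\b", re.IGNORECASE)


def check_hard_constraints(text: str, game_name: str, max_name_uses: int = 3) -> list[str]:
    """Return the violated hard constraints; an empty list means the text passes."""
    lowered = text.lower()
    problems = []
    if FIRST_PERSON.search(text):
        problems.append("first-person pronoun")
    if "\u2014" in text:  # U+2014 is the em-dash
        problems.append("em-dash")
    for phrase in MARKETING_FILLER:
        if phrase in lowered:
            problems.append(f"marketing filler: {phrase}")
    for phrase in AI_CONNECTORS:
        if phrase in lowered:
            problems.append(f"connector: {phrase}")
    if lowered.count(game_name.lower()) > max_name_uses:
        problems.append("game name overused")
    return problems
```

Plain substring matching is crude (it would flag "amazingly"), which is acceptable for a reject-and-retry gate where a false positive only costs one extra generation.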

Anti-doorway diversity (new in v3):

BANNED OPENING TEMPLATES — the description must NOT begin with: "Within the realm of...", "Within the crowded field of...", "Within the niche of...", "Across the field of...", "Among browser-based...", "In the world of...", or any near-paraphrase of "[Prep] the [realm/field/niche/world] of [genre]".
BANNED VOICE PATTERNS — do NOT use anywhere: "Veterans of [X] recognize...", "Veterans of the genre...", "Serious players of the genre find...", "What separates casual from committed play...", "Players of [X] often discover...". The same ideas may be expressed, but rephrased every time using the game's specific mechanics.
JARGON BUDGET — across the entire description, use AT MOST 2 of: hitbox, tick rate, frame-pacing, metagame, RNG floor, input lag. Default to 0–1. Pick only what is actually visible in the game.
OPENING ANCHOR — the first sentence anchors on something CONCRETE and game-specific (not generic genre framing). Pick one of six modes: (a) Concrete action/mechanic, (b) Visual/scene specifics, (c) Input/control feel, (d) Design-choice observation, (e) Physics/numerical constraint, (f) Contrast/contradiction.
NAME ANCHORING — the game name must appear in the first 80 characters.

Server-side rejection: after generation, we regex-test the output against the banned openings/voices and the jargon-count budget. Any violation is rejected and the generation retried.
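
A minimal sketch of that reject-and-retry step, under stated assumptions: the regexes are an approximation of the banned lists above, counting distinct jargon terms (with hyphens normalized to spaces) is one interpretation of the budget, and `generate` stands in for whatever function calls the model. None of these names come from the actual pipeline.

```python
import re
from typing import Callable, Optional

# Approximates "[Prep] the [realm/field/niche/world] of ..." at the very start,
# allowing up to two words in between ("the crowded field of").
BANNED_OPENING = re.compile(
    r"^\s*(?:within|across|among|in)\s+the\s+(?:\w+\s+){0,2}(?:realm|field|niche|world)\s+of\b"
    r"|^\s*among\s+browser-based\b",
    re.IGNORECASE,
)

# Voice patterns may appear anywhere in the text, so no start anchor.
BANNED_VOICE = re.compile(
    r"veterans of\b"
    r"|serious players of the genre"
    r"|what separates casual from committed play"
    r"|players of [\w\s-]{1,40}? often discover",
    re.IGNORECASE,
)

# Stored without hyphens so "tick-rate" and "tick rate" count as the same term.
JARGON_TERMS = ("hitbox", "tick rate", "frame pacing", "metagame", "rng floor", "input lag")


def jargon_terms_used(text: str) -> list[str]:
    """Distinct jargon terms present in the text."""
    normalized = text.lower().replace("-", " ")
    return [term for term in JARGON_TERMS if term in normalized]


def violations(text: str, jargon_budget: int = 2) -> list[str]:
    """Return the names of the violated rules; an empty list means the text passes."""
    found = []
    if BANNED_OPENING.search(text):
        found.append("banned opening")
    if BANNED_VOICE.search(text):
        found.append("banned voice pattern")
    if len(jargon_terms_used(text)) > jargon_budget:
        found.append("jargon budget exceeded")
    return found


def generate_with_retry(generate: Callable[[], str], max_attempts: int = 3) -> Optional[str]:
    """Call the (hypothetical) generator until an output passes, or give up."""
    for _ in range(max_attempts):
        candidate = generate()
        if not violations(candidate):
            return candidate
    return None
```

Capping the retries matters: a prompt that keeps producing rejected output should surface as a `None` to investigate, not as an unbounded API bill.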

What v3 actually produces

A handful of side-by-side examples (real games, real outputs):

Game: urban-echo (parkour runner)

v2: "Within the realm of stylized urban runner-platformers, this title positions itself as a deceptively straightforward exercise in rooftop momentum. Veterans of the genre recognize that what separates casual from committed play..."
v3: "Sliding under a low-hanging pipe in Urban Echo requires split-second timing that often outweighs raw speed. This browser-based parkour game challenges players to navigate a series of urban rooftops..."

Game: bazooka-survivors (bullet-hell arena)

v2: "Among browser-based bullet hell shooters, this title stands out by combining relentless enemy waves with surprisingly deliberate pacing..."
v3: "Swarms of enemies press relentlessly in Bazooka Survivors, forcing rapid decisions amid chaotic bullet patterns. This arcade shooter demands not only quick reflexes but also strategic positioning..."

Game: granny-the-game (stealth horror)

v2: "Within the claustrophobic confines of a decrepit house, players face a tense stealth challenge that hinges on sound as much as sight..."
v3: "Granny: 5-day stealth horror in a creepy house. Find the keys, avoid Granny's hearing radius, escape before sundown. Browser play, no download required."

The v3 outputs share no opening fingerprint across pages of different genres. They mention specific game mechanics (low-hanging pipe, hearing radius, 5-day timer) that the model could only emit by actually engaging with the source material referenced in the prompt.

The verification script (run this in CI)

We open-sourced the homogeneity check. It does five things:

Counts banned-opening hits across the corpus
Counts banned-voice phrase occurrences
Builds a normalized 50-character prefix per entry (strips game name + digits) and reports duplicates
Counts jargon-term overuse per entry (>2 = budget exceeded)
Optional --strict mode that exits 1 if any threshold fails — perfect for CI gates
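
The prefix step is the least obvious of the five, so here is a hedged sketch of how it could work. The normalization choices (lowercasing, turning punctuation into spaces, collapsing whitespace) are assumptions inferred from the shape of the sample output; the function names are illustrative and may not match the published script.

```python
import re
from collections import Counter


def normalized_prefix(text: str, game_name: str, length: int = 50) -> str:
    """Strip the game name and digits, normalize, and keep the first `length` characters."""
    stripped = text
    if game_name:  # an empty pattern would match between every character
        stripped = re.sub(re.escape(game_name), " ", stripped, flags=re.IGNORECASE)
    stripped = re.sub(r"\d+", " ", stripped)
    # Lowercase, replace anything that is not a letter or whitespace with a space.
    stripped = re.sub(r"[^a-z\s]", " ", stripped.lower())
    # Collapse runs of whitespace so spacing differences do not hide duplicates.
    stripped = re.sub(r"\s+", " ", stripped).strip()
    return stripped[:length]


def duplicate_prefixes(entries: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """entries are (game_name, description) pairs; returns prefixes seen at least twice."""
    counts = Counter(normalized_prefix(desc, name) for name, desc in entries)
    return [(prefix, n) for prefix, n in counts.most_common() if n >= 2]
```

Removing the game name before taking the prefix is what lets two templated descriptions of different games collide on the same fingerprint.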

Sample output on our v2 corpus before the fix:

[1] Banned opening frequency (target: each < 5% of corpus)
  FAIL Within the realm of: 1576 / 6728 (23.42%)
  FAIL Within the crowded field of: 428 / 6728 (6.36%)

[2] Banned voice phrase frequency (target: each < 5%)
  FAIL Veterans of [X] recognize: 2571 (38.21%)
  FAIL What separates casual from committed play: 3804 (56.54%)

[3] Top 12 normalized prefixes:
  (70x | 1.04%) "within the realm of browser puzzle games this titl"
  (56x | 0.83%) "within the realm of browser based games this title"

Same script on the v3 entries we've migrated:

--strict: PASSED

There's a subtle lesson in step 3 of that script. When we first wrote it, the prefix-uniqueness test fired a false positive on small samples — every prefix in an 11-entry test set hit 1/11 = 9.09%, which our naïve threshold flagged as failure. The fix is to require actual duplication (count >= 2) before any percentage threshold matters. Single occurrence is unique by definition. We learned this on a real CI failure that wasted ~$2 of API calls — not catastrophic, but a nice reminder that homogeneity checks need to distinguish "same prefix" from "small sample size".
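
The fix can be sketched in a few lines. The 5% default here is borrowed from the banned-phrase targets and is an assumption for the prefix check; `prefix_failures` is a hypothetical name.

```python
def prefix_failures(counts: dict[str, int], corpus_size: int, max_share: float = 0.05) -> list[str]:
    """Return prefixes that are both duplicated and at or above the share threshold."""
    failures = []
    for prefix, count in counts.items():
        if count < 2:
            # A single occurrence is unique by definition, whatever its percentage.
            continue
        if count / corpus_size >= max_share:
            failures.append(prefix)
    return failures
```

With this guard, an 11-entry sample of 11 distinct prefixes passes, while a prefix appearing 6 times in the same sample still fails. In a strict mode the caller would exit non-zero whenever the returned list is non-empty.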

What this hasn't fixed yet (honesty section)

The v3 prompt is correct. But only a small fraction of our 6,820-entry corpus has been migrated as of writing. We're running batched migration via GitHub Actions (~$0.30 per 100-slug batch). At our current cadence we'll hit 50% v3 coverage in about three weeks.

Why not all-at-once? Two reasons:

Cost: ~$30 to redo the whole corpus + es/pt/fr translations. Manageable, but not casual.
Risk: a single all-at-once rewrite means a single deploy where the entire site's content changes. If the v3 prompt has any latent issue we haven't found, it would ship to all 6,820 pages simultaneously. Batched migration gives us four daily checkpoints to verify the v3 corpus's homogeneity score before committing to the next batch.
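
The batch-then-verify loop can be sketched as follows. This is an illustration, not our GitHub Actions workflow: `regenerate` and `corpus_passes` are hypothetical callables standing in for the generation step and the homogeneity check.

```python
from typing import Callable, Iterator


def batches(slugs: list[str], size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` slugs."""
    for start in range(0, len(slugs), size):
        yield slugs[start:start + size]


def migrate_in_batches(
    slugs: list[str],
    regenerate: Callable[[list[str]], None],
    corpus_passes: Callable[[], bool],
    size: int = 100,
) -> int:
    """Regenerate batch by batch and stop at the first failed checkpoint.

    Returns the number of batches that passed the check.
    """
    passed = 0
    for batch in batches(slugs, size):
        regenerate(batch)
        if not corpus_passes():
            break  # leave the remaining slugs untouched until the issue is understood
        passed += 1
    return passed
```

Stopping on the first failure limits a latent prompt issue to one batch of pages instead of the whole site.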

We're also not naïvely expecting a linear recovery. Google's domain-level quality assessment is built from the pages it has crawled. To shift the assessment, we need to change enough of the corpus that the next time Google re-evaluates the domain, the n-gram fingerprint has measurably moved. The literature suggests 4–8 weeks after the corpus reaches >50% v3 before the indexing rate visibly responds.

What we'd do differently

Bake banned patterns into the prompt from day one, not as positive examples. The mistake we made was thinking "giving the model six varied voice patterns will produce variety". The model doesn't pick uniformly from six options across 6,800 generations — it picks the easiest one. The fix is to prevent high-frequency openers, not to suggest alternatives.

Treat the homogeneity check as a CI gate, not a post-hoc audit. If we'd had this in place at v2, we would have caught the 56% same-opener rate after generation #500, not after #6,728. Cheap dollars vs expensive ones.

Don't trust LLM "creativity" claims. The v2 prompt was 480 words long and produced a 56%-identical opener. The v3 prompt is 730 words long, with 11 explicit constraints. Most of those constraints are negative ("don't do X"). LLMs need negative space defined; positive examples are not enough.

If your portal or content site uses an LLM-driven description pipeline, run a homogeneity check on your corpus today. It takes about three minutes. If it passes, great. If it fails the way ours did, you have a working diagnosis before the next core update.

The site this lesson came from: DooDoo.Love — a free HTML5 browser games portal with 6,800+ titles.

About the author: Steuber Alberto is the editor at DooDoo.Love. Reach me at support@doodoo.love.
