Building a Clinical Speech-Therapy App With a Real SLP: 4 Lessons From PhoenixSteps

Originally published on the AstroLexis blog. Cross-posted here for the community.

My son's speech-language pathologist became my co-founder. PhoenixSteps is what came out of it: a pediatric clinical app that does what existing apps don't because we built it together — engineer plus therapist plus actual patient (also my kid). Here are four lessons from the last six months, including how we taught Apple's Vision framework to do something Apple flatly refused to.

How this started

My son has a speech sound disorder. Specifically, rotacismo (rhotacism): he struggles with /r/ and /rr/, which in Spanish are foundational phonemes that show up in roughly one in every six words. His speech-language pathologist is Stefania (Stefa, from here on). We've been seeing her weekly for over a year and the progress has been real, but inconsistent: he'd nail a sound during a session and lose it by mid-week.

The gap was obvious to both of us. He'd do exercises with Stefa for forty minutes, then we'd go home and the exercises mostly stopped, because:

The "drill at home" sheet Stefa sent had no feedback loop. My kid would say "ratón" five times and have no idea if any of them were correct.
Existing pediatric speech-therapy apps in Spanish are either commercially mediocre (gamified versions of basic flashcards) or clinically rigid (built for adult speech rehab, not children).
Tools that actually run the clinical exercises a Spanish-speaking SLP would prescribe, with audio feedback, automatic scoring, and progress tracking the therapist can read, basically did not exist for a private practice working with a 4-year-old.

I asked Stefa if she'd want to co-design something. She said yes. That's how PhoenixSteps started — and the four lessons below are the ones I wish I'd known going in.

Lesson 1: A clinical co-creator changes everything about what you ship

I had built consumer iOS apps before. I had not built a clinical tool. The thing I underestimated was how much of the actual product is the protocol, not the software.

Stefa works from named, published clinical protocols — Borrás, Bosch, the AELFA articulation drills. When she prescribes an exercise, she's pulling from a tradition that has decades of consensus on order, dosage, and progression. "Lengua a la nariz" isn't a cute idea — it's Borrás Exercise 29, with specific instructions about duration, repetitions per day, and what to do if the child can't sustain the position.

Before working with Stefa, I would have built a "speech therapy app" that was basically a glorified flashcard deck with cute animations. With Stefa, the exercise catalog became:

Orofacial praxias — 7 exercises pulled directly from her clinical sheet, in the order she actually prescribes them.
R-group syllable warmups — "ra ra ra," "rrrr-on" — building muscle memory before tackling words.
R simple words — rosa, ratón, mira, perro — graded by Stefa for difficulty.
R-cluster words (sinfones) — bra, cra, dra, fra, gra, pra, tra. The hard ones.
Minimal pairs — R/RR, R/L, D/R, T/D. Auditory discrimination drills.
Carrier phrases — embedding the target sound in real sentences.
"Tren de la Risa" — a karaoke song Stefa wrote that hits every R context across 8 verses.

None of that comes out of an engineer's imagination. It comes out of a working SLP's notebook.

Lesson 1 distilled: if you're building a clinical product, the clinician is not a "domain advisor." They're a co-founder. Hire them, give them equity, and give them a real voice on the product roadmap.

Lesson 2: Apple won't give you what your patient needs. Build it yourself.

This is the technical story, and it's the one I'm most proud of.

One of the most prescribed praxias for kids working on /r/ is "lengua a la nariz" — extending the tongue tip toward the nose. The exercise builds the lingual elevation needed for the alveolar trill. Stefa wants the app to automatically verify the kid did the exercise correctly: tongue out, pointed up, sustained for 10 seconds.

This sounds like a job for ARKit. Apple has had face tracking with the TrueDepth camera since the iPhone X. ARFaceAnchor.blendShapes includes jawOpen, mouthSmileLeft, cheekPuff — and yes, tongueOut.

Except: tongueOut is a single scalar coefficient. It reads 0 when the tongue is in and climbs toward 1 as the tongue extends. Apple does not tell you where the tongue is pointing. Up, down, left, right: they all read the same.

I emailed Apple developer support. The answer was: no, the tongue is not modeled as 3D geometry, and there's no API to detect tongue direction. Tongue tracking is inherently unstable (occlusion by teeth and lips), so Apple chose not to ship something they couldn't validate at Face ID precision.

So Stefa and I built the detector ourselves.

The pipeline

1. ARKit captures the camera frame on the TrueDepth camera at 60 fps.
2. We grab the raw frame.capturedImage, the YUV pixel buffer ARKit hands you for free.
3. Vision detects face landmarks: VNDetectFaceLandmarksRequest returns outerLips, innerLips, and nose as 2D polygons.
4. We define three Regions of Interest (ROIs) outside the lip polygon:
   - UP ROI: rectangle between the top of the upper lip and the bottom of the nose
   - LEFT ROI: extending leftward from the left corner of the lips
   - RIGHT ROI: same, mirrored
5. We count pink/red pixels inside each ROI. The lip-skin transition is at Cr ≈ 18; the tongue is at Cr ≈ 25-50. We threshold at Cr > 25 to filter out facial skin and pale lips.
6. If an ROI has more than 400 "tongue-colored" pixels, the tongue is projecting in that direction. We cross-check with ARKit's tongueOut blendshape and mirror-compensate for the front-facing camera.
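
The counting and thresholding steps above can be sketched in a few lines. This is a minimal illustration in Python, not the app's own code, and several pieces are assumptions: the Cr plane is modeled as a plain list of rows holding Cr values already centered around zero (our reading of the "Cr ≈ 18" and "Cr ≈ 25-50" figures), the ROI type and function names are hypothetical, the 0.5 cutoff on tongueOut is invented for the example, and "mirror-compensate" is taken to mean swapping left and right. Only the three ROIs described above are covered, so anything else comes back as notVisible.

```python
from dataclasses import dataclass

CR_THRESHOLD = 25        # from the post: Cr > 25 filters out skin and pale lips
MIN_TONGUE_PIXELS = 400  # from the post: more than 400 hits means "tongue here"
TONGUE_OUT_MIN = 0.5     # assumption: cutoff for the tongueOut cross-check


@dataclass(frozen=True)
class ROI:
    """Hypothetical rectangle in image pixel coordinates."""
    x: int
    y: int
    width: int
    height: int


def count_tongue_pixels(cr_plane, roi):
    """Count pixels inside roi whose (centered) Cr value exceeds the threshold."""
    # Clamp to the image so a ROI hanging off the edge never wraps around.
    top, bottom = max(roi.y, 0), max(roi.y + roi.height, 0)
    left, right = max(roi.x, 0), max(roi.x + roi.width, 0)
    return sum(
        1
        for row in cr_plane[top:bottom]
        for cr in row[left:right]
        if cr > CR_THRESHOLD
    )


def detect_direction(cr_plane, rois, tongue_out, mirrored=True):
    """Return (direction, pixel_count) for the strongest ROI.

    rois maps "up" / "left" / "right" to ROI rectangles.
    tongue_out is ARKit's tongueOut coefficient in the range 0...1.
    """
    if tongue_out < TONGUE_OUT_MIN:
        return "notVisible", 0
    counts = {name: count_tongue_pixels(cr_plane, roi) for name, roi in rois.items()}
    name, pixels = max(counts.items(), key=lambda item: item[1])
    if pixels <= MIN_TONGUE_PIXELS:
        return "notVisible", pixels
    # Assumed meaning of mirror compensation: image-left is the child's right.
    if mirrored and name in ("left", "right"):
        name = "right" if name == "left" else "left"
    return name, pixels
```

Requiring both the color count and the blendshape to agree is what keeps a red shirt collar or a flushed cheek from registering as a tongue on its own.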

The detector reports up, down, left, right, center, or notVisible at 20 Hz with a confidence score. The first time I showed Stefa the demo (me sticking my tongue toward my nose and watching the screen say "ARRIBA conf 100% pix 3,974"), she didn't believe it was real until I sent her the source code.
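
Turning a stream of direction samples into "tongue up, sustained for 10 seconds" needs a small amount of state. The sketch below shows one way to do it, under stated assumptions: the HoldTimer class is hypothetical, and the 0.25-second tolerance for dropped samples is an invented value for illustration, not a figure from the clinical protocol.

```python
class HoldTimer:
    """Tracks whether a target direction has been held long enough."""

    def __init__(self, target="up", hold_seconds=10.0, max_gap_seconds=0.25):
        self.target = target
        self.hold_seconds = hold_seconds
        self.max_gap_seconds = max_gap_seconds  # assumption: tolerated dropout
        self._start = None     # timestamp of the first sample in the current hold
        self._last_hit = None  # timestamp of the latest on-target sample

    def update(self, direction, timestamp):
        """Feed one detector sample; return True once the hold is complete."""
        if direction == self.target:
            if self._start is None:
                self._start = timestamp
            self._last_hit = timestamp
        elif self._start is not None and timestamp - self._last_hit > self.max_gap_seconds:
            # The tongue has been off target for too long: start over.
            self._start = None
            self._last_hit = None
        if self._start is None:
            return False
        # Only time covered by on-target samples counts toward the hold.
        return self._last_hit - self._start >= self.hold_seconds
```

At 20 Hz, a clean 10-second hold is 201 consecutive on-target samples, from t = 0 through t = 10. The gap tolerance matters because a single noisy frame would otherwise reset a hold that a therapist watching in person would accept.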

Lesson 2 distilled: the most defensible technical work in a clinical product is the part Apple won't ship. If you can do something the platform doesn't expose — and it matters for the clinical outcome — that's your moat.

Lesson 3: Audio quality is a feature, not a detail

PhoenixSteps ships with about 325 pre-recorded voice prompts, all generated using OpenAI's gpt-4o-mini-tts with the "nova" voice. Why pre-recorded TTS instead of letting iOS synthesize on the fly?

Pediatric voice consistency. Kids learn faster when the audio prompt sounds the same every time.
Speed and articulation. Stefa wanted slower-than-normal pronunciation for warmups, regular pace for practice, a specific cadence for the song. Generating with explicit instructions ("habla en español neutro latinoamericano, ritmo lento y articulado, énfasis infantil sin caricaturizar") gets us the exact register a real SLP would use.
Reliability. Pre-recorded audio works offline, doesn't depend on a phone's TTS pipeline being up, doesn't get interrupted by Siri.
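
For readers who want to reproduce the pre-rendering step, the request for one prompt might be assembled like this. It is a sketch, not our generation script: the speech_request helper is hypothetical, and the only values taken from this post are the model name, the voice, and the quoted style instruction. The parameter names follow the OpenAI speech endpoint (model, voice, input, instructions, response_format); the network call itself is left as a comment because it needs an API key.

```python
MODEL = "gpt-4o-mini-tts"  # from the post
VOICE = "nova"             # from the post

# Style instruction quoted in the post; other registers (practice pace,
# song cadence) would need their own strings, which are not shown here.
SLOW_STYLE = (
    "habla en español neutro latinoamericano, ritmo lento y articulado, "
    "énfasis infantil sin caricaturizar"
)


def speech_request(text, instructions=SLOW_STYLE):
    """Build the keyword arguments for one text-to-speech request."""
    return {
        "model": MODEL,
        "voice": VOICE,
        "input": text,
        "instructions": instructions,
        "response_format": "mp3",
    }


params = speech_request("ratón")
# With the official Python SDK this would be passed along roughly as:
#   client.audio.speech.create(**params)
# and the returned bytes written to an .mp3 file.
```

Keeping the model, voice, and instruction in one place is what makes every prompt come out in the same register when the catalog is regenerated.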

We learned the hard way that the OpenAI API will occasionally return a truncated MP3 (we caught three files at 0.36 s when they should have been about 1.2 s). The fix was a post-generation validation step: every newly generated MP3 has to pass a minimum-duration check before it can be bundled.
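
A minimum-duration check of this kind is short enough to show in full. The version below is a sketch under assumptions: the function names are hypothetical, the expected duration per file is supplied by hand, the 0.6 ratio is an invented cutoff, and the duration probe shells out to ffprobe, which is assumed to be installed. The probe is passed in as a parameter so the check can be exercised without real audio files.

```python
import subprocess


def ffprobe_duration(path):
    """Ask ffprobe (assumed to be on PATH) for a file's duration in seconds."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def find_truncated(expected_seconds, probe=ffprobe_duration, min_ratio=0.6):
    """Return {path: measured_seconds} for files that look cut short.

    expected_seconds maps each file path to its rough expected duration.
    A file is flagged when it is shorter than min_ratio * expected.
    """
    truncated = {}
    for path, expected in expected_seconds.items():
        measured = probe(path)
        if measured < expected * min_ratio:
            truncated[path] = measured
    return truncated


# Example with a stand-in probe instead of real files:
fake_durations = {"raton.mp3": 0.36, "rosa.mp3": 1.18}
flagged = find_truncated(
    {"raton.mp3": 1.2, "rosa.mp3": 1.2},
    probe=fake_durations.__getitem__,
)
```

Here flagged contains only raton.mp3: 0.36 s is well under 60% of the expected 1.2 s, while 1.18 s passes. Anything flagged simply gets regenerated.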

Lesson 3 distilled: for pediatric/clinical apps, audio is content. Pre-render every prompt with a consistent voice and pace. Validate audio duration before bundling.

Lesson 4: HIPAA-equivalent privacy isn't optional

The users of PhoenixSteps are children. Their voice recordings and progress data are protected health information, so we designed the app to keep everything on the device or in the family's own account:

Speech recognition on-device (WhisperKit). Voice never leaves the iPhone.
Face tracking on-device (ARKit + Vision).
Progress data in SwiftData, syncing to the family's private iCloud.
No analytics, no third-party SDKs, no Crashlytics, no Facebook Pixel.
AI features gated by parental consent. Apple Foundation Models on-device, opt-in.

There is no central store of children's voice samples for anyone to breach, because PhoenixSteps never collects them in the first place.

Lesson 4 distilled: if you're building anything where the user is a minor or a patient, design as if the audit is happening tomorrow.

Where PhoenixSteps is right now

Not in the App Store yet. Build 28. Finishing the clinical pilot with Stefa.
Spanish-first. English localization on the roadmap once the clinical content is validated by an English-speaking SLP.
Free for parents, with an optional Pro tier for clinicians.
Stefa is a co-founder. Equity, not consulting.

If you're an SLP working with pediatric patients in Spanish, write to us at contact@astrolexis.space. We plan to add more clinical advisors as the product matures.

— Bruno Galtranch, founder, AstroLexis LLC. With Stefania, SLP and co-founder.
