Two independent desktop evaluations — one by Claude, one by Codex — both rate the desktop experience solid to strong (B+ and A-). The same evaluator that scored desktop A- then scored mobile C. That's the same tool, same criteria, same evaluator — applied to two different viewports.
The mobile surface has a systemic responsive-layout failure. Navigation tabs clip off-screen. CTAs get cut off. Accessibility scores F (18.81%) on mobile while scoring B+ on desktop. The design system works. The responsive implementation doesn't.
One notable evaluator split: Claude flags vocabulary jargon (H02 C / 58%) while Codex Desktop scores it A++ (100%). Two of three evaluators flag H09 error recovery as a real problem. Mobile's H11 F grade is the extreme end of a weakness all three evaluations register to differing degrees.
A first-time mobile visitor may need to scroll sideways or infer hidden labels. The same evaluator scored Desktop A- and Mobile C — a 22.9-point drop on identical criteria.
H11 Accessibility: F (18.81%) on mobile — undersized touch targets, WCAG AA contrast failures, missing ARIA roles. H08 Design: D (45.31%) on mobile — layout overflow makes CTAs unreliable. These block real users right now and should be treated as P0 bugs, not design backlog.
Mobile F (18.81%) vs Desktop B+ (78.06%)
Both desktop evaluations already score H11 in the B range, and the mobile evaluation collapses to F. A 59-point drop between Codex Desktop and Codex Mobile on the same framework confirms this is a responsive implementation failure, not a design language flaw.
2 of 3 evaluators flag error recovery as failing
Codex Desktop scored H09 at C- (50%) — identical to Codex Mobile. Claude Desktop scored B+ (75%). The 2-of-3 agreement overrides the lone optimistic reading. Error recovery pathways need structured remediation.
42pp gap between desktop evaluators — investigate
Claude Desktop (58.33%) and Codex Desktop (100%) diverge by the largest margin of any heuristic. Claude flags jargon (MCP, Plugin Zip, Claude Cowork) as a real barrier; Codex reads the vocabulary as appropriate for the target audience. Mobile (58.33%) aligns with Claude. Consider a user research session to resolve the split.
Technical product names used as primary navigation labels without consumer-readable explanations. Claude scored H02 at C (58.33%) — the only desktop evaluator to flag vocabulary as a structural friction point. Mobile aligns with Claude's reading (58.33%), creating a 2-vs-1 split that warrants user research to resolve.
Multiple primary-styled buttons appear in close proximity on key pages. Desktop scores A (85.94%) overall on design, but the CTA hierarchy finding is a targeted gap within an otherwise strong visual execution.
Desktop scores B- (68.75%) — not failing, but not safe to scale. Several text/background combos fail WCAG 2.1 AA. Some interactive components lack semantic ARIA labels. When mobile is added, the picture worsens dramatically.
Claude scores H10 at A++ (100%). Help docs are comprehensive, well-linked, and contextually placed. This is the benchmark standard the rest of the product should aspire to. Notably, Codex Desktop scored this B (72.45%) — the two desktop evaluators split 28 points here.
Codex's top desktop finding. Navigation, labels, and components show minor inconsistencies across screens that erode the trust pattern. Scores A+ overall (92.86%) — the inconsistency is a targeted issue, not a systemic failure. Cross-reference with Claude's similar H04 observation.
Codex Desktop scores H09 at C- (50%) — the same score as Codex Mobile. This is the most alarming desktop finding: error recovery is failing on both surfaces. Claude Desktop's B+ (75%) reading is the optimistic outlier. 2-of-3 evaluators agree: error recovery needs structured remediation.
Two of the top-5 Codex Desktop findings relate to Customer Journey (H13, 71.54% B). Decision points in the journey lack clarity and recovery affordances. This compounds on mobile where the journey narrative breaks further (63.21% C+).
Codex reads product vocabulary (MCP, Plugin Zip) as appropriate for the developer-oriented target audience. This is the largest single evaluator split in the audit — 41.67 percentage points from Claude. Neither reading is wrong; they reflect different user mental models. A user research session with non-developer visitors would clarify which reading is accurate for the growth market.
Touch targets below 44px WCAG minimum, contrast failures compounding under mobile ambient conditions, missing ARIA roles on tab navigation. The 59-point gap between Codex Desktop (78.06%) and Codex Mobile (18.81%) on the same heuristic confirms this is a responsive implementation failure.
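The contrast failures cited above are mechanically checkable. A minimal Python sketch of the WCAG 2.1 relative-luminance and contrast-ratio formulas (thresholds per the spec: 4.5:1 for normal text, 3:1 for large text at AA) can be dropped into a design-token lint step; the color values in the usage note are illustrative, not taken from the audited product:

```python
def _linearize(c: float) -> float:
    # sRGB channel (0-1) to linear light, per the WCAG 2.1
    # relative-luminance definition
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb: tuple[int, int, int]) -> float:
    r, g, b = (_linearize(c / 255) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    # Ratio of lighter to darker luminance, offset by 0.05 per the spec
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)

def passes_aa(fg, bg, large_text: bool = False) -> bool:
    # WCAG 2.1 AA: 4.5:1 normal text, 3:1 large text (>=18pt, or 14pt bold)
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)
```

For example, black-on-white yields the maximum ratio of 21:1 and passes AA; a mid-gray such as `(119, 119, 119)` on white lands just under 4.5:1 and fails for normal text. Running every text/background token pair through a check like this would turn the "several combos fail AA" finding into an exact list.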
Navigation tabs clip at viewport edge on screens under ~390px. Feature section content overflows horizontally. Primary CTAs partially obscured. Users must scroll horizontally to access core navigation — an anti-pattern that signals broken layout architecture. Desktop A (89.03%) proves the design system is sound; the responsive CSS layer is not.
Navigation and sidebar labels written for desktop (e.g., "Advanced Audit Report Generator") truncate to meaningless fragments on viewports under 420px. Mobile UX writing requires abbreviated variants or icon+label hybrid layouts at small breakpoints. Desktop B (70.79%) is acceptable; Mobile D- (35.11%) is not.
Mobile-specific error prevention (oversized tap zones, autocorrect-resistant inputs, keyboard-aware layout shifts, touch-feedback states) is not implemented. D (48.4%) vs A- (84.15%) on desktop shows this is a mobile-only gap where the desktop patterns simply were not extended.
The same help content that scores A++ on desktop (Claude) scores B (72.45%) for Codex Desktop and B- (67.45%) for mobile. Multi-column doc layouts, wide reference tables, and collapsible sidebars do not adapt to mobile viewports. The content is there — the presentation fails on small screens.
| Heuristic | Area | Claude D / Codex D / Codex M (%) | Owner | Timeline | Impact |
|---|---|---|---|---|---|
| 🟠 H11 | Accessibility | 69 / 78 / 19 | Mobile dev + a11y lead | Sprint 1 | Mobile F → touch targets, contrast, ARIA roles |
| 🟠 H09 | Error Recovery | 75 / 50 / 50 | Dev + content | Sprint 1 | 2-of-3 evaluators flag this; mobile path broken |
| 🟢 H08 | Aesthetic Design | 86 / 89 / 45 | Mobile / responsive dev | Sprint 1–2 | Mobile D — layout clipping; desktop A- baseline |
| 🟠 H14 | UX Writing | 71 / 71 / 35 | Content + front-end | Sprint 1–2 | Mobile D- label truncation; all sources agree B range on desktop |
| 🟡 H05 | Error Prevention | 75 / 84 / 48 | Mobile dev | Sprint 2 | Mobile D — no mobile error states |
| 🟡 H02 | System ↔ Real World | 58 / 100 / 58 | Content / product | Sprint 2 | Evaluator split: glossary strategy recommended |
| 🟡 H12 | Empathetic Engagement | 58 / 79 / 54 | Design + content | Sprint 2–3 | Onboarding tone; progressive jargon disclosure |
| 🟡 H13 | Customer Journey | 83 / 72 / 63 | Design + product | Sprint 3 | Mobile C+ breaks desktop A- narrative arc |
| 🟢 H10 | Help & Documentation | 100 / 72 / 67 | Content | Sprint 3 | Desktop split (A++ vs B) — mobile-proof docs |
| 🟢 H04 | Consistency | 82 / 93 / 75 | Design system | Sprint 3 | Mobile slippage vs strong desktop; button audit |
All three evaluations score H03 in the A range. Undo paths, cancel options, and navigation escape hatches are reliably present across desktop and mobile.
Status visibility scores well across all three evaluations — one of the few areas where mobile performance matches desktop. Loading states and feedback loops work.
Both desktop evaluators score H08 in the A range (85.94% / 89.03%). The design system is strong. The mobile failure (D / 45.31%) is a responsive CSS implementation issue — not a design language failure.
Desktop consistency is a genuine strength — especially for Codex Desktop (A+ / 92.86%). Even mobile scores B+ (75.01%), meaning consistency principles are more robustly applied than most heuristics.
Familiar UI patterns, predictable component placement, and icon conventions work well on both surfaces. The vocabulary debate (H02) is separate from recognition — patterns are legible even if labels are jargon-heavy.
Claude's A++ (100%) is a standout result. Even the lower Codex Desktop reading (B / 72.45%) and mobile reading (B- / 67.45%) represent acceptable baselines. The content foundation is strong; mobile delivery needs work.
Quality % = ((4 − avg_severity) / 4) × 100
Severity: 0 None, 1 Cosmetic, 2 Minor, 3 Major, 4 Catastrophic
Blended = (Claude Desktop + Codex Desktop + Codex Mobile) / 3
Overall = mean of all 14 blended heuristic scores
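The scoring model above is small enough to sketch directly; this Python version mirrors the four formulas (the sample severities in the usage note are illustrative, not actual audit data):

```python
def quality_pct(severities: list[int]) -> float:
    # Quality % = ((4 - avg_severity) / 4) * 100
    # Severity scale: 0 None, 1 Cosmetic, 2 Minor, 3 Major, 4 Catastrophic
    avg = sum(severities) / len(severities)
    return (4 - avg) / 4 * 100

def blended(claude_desktop: float, codex_desktop: float, codex_mobile: float) -> float:
    # Equal-weight blend of the three evaluation runs for one heuristic
    return (claude_desktop + codex_desktop + codex_mobile) / 3

def overall(blended_scores: list[float]) -> float:
    # Overall = mean of all 14 blended heuristic scores
    return sum(blended_scores) / len(blended_scores)
```

As a worked check: an average severity of 3.25 (mostly Major findings, one Catastrophic) yields `quality_pct([3, 3, 3, 4]) == 18.75`, which is the neighborhood of mobile H11's 18.81% — i.e., that F grade encodes an average finding severity between Major and Catastrophic.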
H11 + H08 sprint kick-off. WCAG audit. Fix touch targets, contrast, nav tab overflow.
H09 error recovery remediation on both surfaces. Schedule H02 vocabulary user research session.
H14 label shortening. H05 mobile error prevention. H12 onboarding rewrite based on research.
Re-audit mobile surface after Sprint 1–2. Target: mobile above 70% (from 58.26%). Close the 21pp desktop/mobile gap by 50%.