Two independent desktop evaluations — one by Claude, one by Codex — both rate the desktop experience solid to strong (B+ and A-). The same evaluator that scored desktop A- then scored mobile C. That's the same tool, same criteria, same evaluator — applied to two different viewports.
The mobile surface has a systemic responsive-layout failure. Navigation tabs clip off-screen. CTAs get cut off. Accessibility scores F (18.81%) on mobile while scoring B+ on desktop. The design system works. The responsive implementation doesn't.
One notable evaluator split: Claude flags vocabulary jargon (H02 C / 58%) while Codex Desktop scores it A++ (100%). Two of three evaluators flag H09 error recovery as a real problem. Mobile's H11 F grade is the extreme end of a weakness all three evaluations register to differing degrees.
A first-time mobile visitor may need to scroll sideways or infer hidden labels. The same evaluator scored Desktop A- and Mobile C — a 22.9-point drop on identical criteria.
H11 Accessibility: F (18.81%) on mobile — undersized touch targets, WCAG AA contrast failures, missing ARIA roles. H08 Design: D (45.31%) on mobile — layout overflow makes CTAs unreliable. These block real users right now and should be treated as P0 bugs, not design backlog.
Mobile F (18.81%) vs Desktop B+ (78.06%)
Both desktop evaluations already score H11 in the B range, and the mobile evaluation collapses to F. A 59-point drop between Codex Desktop and Codex Mobile on the same framework confirms this is a responsive implementation failure, not a design language flaw.
2 of 3 evaluators flag error recovery as failing
Codex Desktop scored H09 at C- (50%) — identical to Codex Mobile. Claude Desktop scored B+ (75%). The 2-of-3 agreement overrides the lone optimistic reading. Error recovery pathways need structured remediation.
42pp gap between desktop evaluators — investigate
Claude Desktop (58.33%) and Codex Desktop (100%) diverge by the largest margin of any heuristic. Claude flags jargon (MCP, Plugin Zip, Claude Cowork) as a real barrier; Codex reads the vocabulary as appropriate for the target audience. Mobile (58.33%) aligns with Claude. Consider a user research session to resolve the split.
Technical product names used as primary navigation labels without consumer-readable explanations. Claude scored H02 at C (58.33%) — the only desktop evaluator to flag vocabulary as a structural friction point. Mobile aligns with Claude's reading (58.33%), creating a 2-vs-1 split that warrants user research to resolve.
Multiple primary-styled buttons appear in close proximity on key pages. Desktop scores A (85.94%) overall on design, but the CTA hierarchy finding is a targeted gap within an otherwise strong visual execution.
Desktop scores B- (68.75%) — not failing, but not safe to scale. Several text/background combos fail WCAG 2.1 AA. Some interactive components lack semantic ARIA labels. When mobile is added, the picture worsens dramatically.
Claude scores H10 at A++ (100%). Help docs are comprehensive, well-linked, and contextually placed. This is the benchmark standard the rest of the product should aspire to. Notably, Codex Desktop scored this B (72.45%) — the two desktop evaluators split 28 points here.
Codex's top desktop finding. Navigation, labels, and components show minor inconsistencies across screens that erode the trust pattern. Scores A+ overall (92.86%) — the inconsistency is a targeted issue, not a systemic failure. Cross-reference with Claude's similar H04 observation.
Codex Desktop scores H09 at C- (50%) — the same score as Codex Mobile. This is the most alarming desktop finding: error recovery is failing on both surfaces. Claude Desktop's B+ (75%) reading is the optimistic outlier. 2-of-3 evaluators agree: error recovery needs structured remediation.
Two of the top-5 Codex Desktop findings relate to Customer Journey (H13, 71.54% B). Decision points in the journey lack clarity and recovery affordances. This compounds on mobile where the journey narrative breaks further (63.21% C+).
Codex reads product vocabulary (MCP, Plugin Zip) as appropriate for the developer-oriented target audience. This is the largest single evaluator split in the audit — 41.67 percentage points from Claude. Neither reading is wrong; they reflect different user mental models. A user research session with non-developer visitors would clarify which reading is accurate for the growth market.
Touch targets below 44px WCAG minimum, contrast failures compounding under mobile ambient conditions, missing ARIA roles on tab navigation. The 59-point gap between Codex Desktop (78.06%) and Codex Mobile (18.81%) on the same heuristic confirms this is a responsive implementation failure.
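The contrast failures cited above are mechanically checkable. A minimal Python sketch of the WCAG 2.1 relative-luminance and contrast-ratio formulas (thresholds per the spec: 4.5:1 for normal text, 3:1 for large text at AA) can be dropped into a design-token lint step; the color values in the usage note are illustrative, not taken from the audited product:

```python
def _linearize(c: float) -> float:
    # sRGB channel (0-1) to linear light, per the WCAG 2.1
    # relative-luminance definition
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb: tuple[int, int, int]) -> float:
    r, g, b = (_linearize(c / 255) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    # Ratio of lighter to darker luminance, offset by 0.05 per the spec
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)

def passes_aa(fg, bg, large_text: bool = False) -> bool:
    # WCAG 2.1 AA: 4.5:1 normal text, 3:1 large text (>=18pt, or 14pt bold)
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)
```

For example, black-on-white yields the maximum ratio of 21:1 and passes AA; a mid-gray such as `(119, 119, 119)` on white lands just under 4.5:1 and fails for normal text. Running every text/background token pair through a check like this would turn the "several combos fail AA" finding into an exact list.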
Navigation tabs clip at viewport edge on screens under ~390px. Feature section content overflows horizontally. Primary CTAs partially obscured. Users must scroll horizontally to access core navigation — an anti-pattern that signals broken layout architecture. Desktop A (89.03%) proves the design system is sound; the responsive CSS layer is not.
Navigation and sidebar labels written for desktop (e.g., "Advanced Audit Report Generator") truncate to meaningless fragments on viewports under 420px. Mobile UX writing requires abbreviated variants or icon+label hybrid layouts at small breakpoints. Desktop B (70.79%) is acceptable; Mobile D- (35.11%) is not.
Mobile-specific error prevention (oversized tap zones, autocorrect-resistant inputs, keyboard-aware layout shifts, touch-feedback states) is not implemented. D (48.4%) vs A- (84.15%) on desktop shows this is a mobile-only gap where the desktop patterns simply were not extended.
The same help content that scores A++ on desktop (Claude) scores B (72.45%) for Codex Desktop and B- (67.45%) for mobile. Multi-column doc layouts, wide reference tables, and collapsible sidebars do not adapt to mobile viewports. The content is there — the presentation fails on small screens.
| Heuristic | Area | Claude D / Codex D / Codex M (%) | Owner | Timeline | Impact |
|---|---|---|---|---|---|
| 🟠 H11 | Accessibility | 69 / 78 / 19 | Mobile dev + a11y lead | Sprint 1 | Mobile F → touch targets, contrast, ARIA roles |
| 🟠 H09 | Error Recovery | 75 / 50 / 50 | Dev + content | Sprint 1 | 2-of-3 evaluators flag this; mobile path broken |
| 🟢 H08 | Aesthetic Design | 86 / 89 / 45 | Mobile / responsive dev | Sprint 1–2 | Mobile D — layout clipping; desktop A- baseline |
| 🟠 H14 | UX Writing | 71 / 71 / 35 | Content + front-end | Sprint 1–2 | Mobile D- label truncation; all sources agree B range on desktop |
| 🟡 H05 | Error Prevention | 75 / 84 / 48 | Mobile dev | Sprint 2 | Mobile D — no mobile error states |
| 🟡 H02 | System ↔ Real World | 58 / 100 / 58 | Content / product | Sprint 2 | Evaluator split: glossary strategy recommended |
| 🟡 H12 | Empathetic Engagement | 58 / 79 / 54 | Design + content | Sprint 2–3 | Onboarding tone; progressive jargon disclosure |
| 🟡 H13 | Customer Journey | 83 / 72 / 63 | Design + product | Sprint 3 | Mobile C+ breaks desktop A- narrative arc |
| 🟢 H10 | Help & Documentation | 100 / 72 / 67 | Content | Sprint 3 | Desktop split (A++ vs B) — mobile-proof docs |
| 🟢 H04 | Consistency | 82 / 93 / 75 | Design system | Sprint 3 | Mobile slippage vs strong desktop; button audit |
All three evaluations score H03 in the A range. Undo paths, cancel options, and navigation escape hatches are reliably present across desktop and mobile.
Status visibility scores well across all three evaluations — one of the few areas where mobile performance matches desktop. Loading states and feedback loops work.
Both desktop evaluators score H08 in the A range (85.94% / 89.03%). The design system is strong. The mobile failure (D / 45.31%) is a responsive CSS implementation issue — not a design language failure.
Desktop consistency is a genuine strength — especially for Codex Desktop (A+ / 92.86%). Even mobile scores B+ (75.01%), meaning consistency principles are more robustly applied than most heuristics.
Familiar UI patterns, predictable component placement, and icon conventions work well on both surfaces. The vocabulary debate (H02) is separate from recognition — patterns are legible even if labels are jargon-heavy.
Claude's A++ (100%) is a standout result. Even the lower Codex Desktop reading (B / 72.45%) and mobile reading (B- / 67.45%) represent acceptable baselines. The content foundation is strong; mobile delivery needs work.
Quality % = ((4 − avg_severity) / 4) × 100
Severity: 0 None, 1 Cosmetic, 2 Minor, 3 Major, 4 Catastrophic
Blended = (Claude Desktop + Codex Desktop + Codex Mobile) / 3
Overall = mean of all 14 blended heuristic scores
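The scoring model above is small enough to sketch directly; this Python version mirrors the four formulas (the sample severities in the usage note are illustrative, not actual audit data):

```python
def quality_pct(severities: list[int]) -> float:
    # Quality % = ((4 - avg_severity) / 4) * 100
    # Severity scale: 0 None, 1 Cosmetic, 2 Minor, 3 Major, 4 Catastrophic
    avg = sum(severities) / len(severities)
    return (4 - avg) / 4 * 100

def blended(claude_desktop: float, codex_desktop: float, codex_mobile: float) -> float:
    # Equal-weight blend of the three evaluation runs for one heuristic
    return (claude_desktop + codex_desktop + codex_mobile) / 3

def overall(blended_scores: list[float]) -> float:
    # Overall = mean of all 14 blended heuristic scores
    return sum(blended_scores) / len(blended_scores)
```

As a worked check: an average severity of 3.25 (mostly Major findings, one Catastrophic) yields `quality_pct([3, 3, 3, 4]) == 18.75`, which is the neighborhood of mobile H11's 18.81% — i.e., that F grade encodes an average finding severity between Major and Catastrophic.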
H11 + H08 sprint kick-off. WCAG audit. Fix touch targets, contrast, nav tab overflow.
H09 error recovery remediation on both surfaces. Schedule H02 vocabulary user research session.
H14 label shortening. H05 mobile error prevention. H12 onboarding rewrite based on research.
Re-audit mobile surface after Sprint 1–2. Target: mobile above 70% (from 58.26%). Close the 21pp desktop/mobile gap by 50%.