mecheval.
mecheval.
AI builds the mech. We measure how badly it fits.
mechanical, physical, and CAD evaluation suite for AI models · pass5
OPERATOR says: only Opus 4.7 (direct) has passes so far (4 tasks).
Models
| model | tasks | runs | pass^5 | score | tokens | wall |
|---|---|---|---|---|---|---|
| claude-direct-claude-opus-4-7 | 5 | 25 | 4/5 | 0.97 | 1.5k | 6.5s |
| claude-mcp-claude-opus-4-7 | 5 | 5 | — | 0.59 | 497.7k | 105.2s |
Task × model matrix
| task ↓ · model → | claude-direct-claude-opus-4-7 | claude-mcp-claude-opus-4-7 |
|---|---|---|
| a1-cube-01 | PASS 1.00 | 1/1* 1.00 |
| a1-plate-01 | 0/5 0.83 | 0/1* 0.33 |
| a1-stepped-shaft-01 | PASS 1.00 | 1/1* 1.00 |
| a2-flanged-cap-01 | PASS 1.00 | 0/1* 0.43 |
| a2-l-bracket-01 | PASS 1.00 | 0/1* 0.17 |
| c-reacher-01 | — | — |
Cost · score Pareto
tokens (log) vs mean score across the most recent 5 attempts. Each dot is one (model, task) pair; click to drill into the task page.
wall-clock seconds (log) vs mean score. The left edge is fast; the right edge is patient.
Per task · per model
| model | task | attempts | pass^5 | score | tokens | wall |
|---|---|---|---|---|---|---|
| claude-direct-claude-opus-4-7 | a1-cube-01 | 5 | PASS | 1.00 | 961 | 2.7s |
| claude-direct-claude-opus-4-7 | a1-plate-01 | 5 | 0/5 | 0.83 | 1.5k | 7.1s |
| claude-direct-claude-opus-4-7 | a1-stepped-shaft-01 | 5 | PASS | 1.00 | 1.2k | 3.9s |
| claude-direct-claude-opus-4-7 | a2-flanged-cap-01 | 5 | PASS | 1.00 | 2.2k | 12.2s |
| claude-direct-claude-opus-4-7 | a2-l-bracket-01 | 5 | PASS | 1.00 | 1.6k | 6.6s |
| claude-mcp-claude-opus-4-7 | a1-cube-01 | 1 | 1/1* | 1.00 | 137.5k | 42.3s |
| claude-mcp-claude-opus-4-7 | a1-plate-01 | 1 | 0/1* | 0.33 | 364.9k | 118.4s |
| claude-mcp-claude-opus-4-7 | a1-stepped-shaft-01 | 1 | 1/1* | 1.00 | 402.2k | 69.4s |
| claude-mcp-claude-opus-4-7 | a2-flanged-cap-01 | 1 | 0/1* | 0.43 | 610.0k | 157.7s |
| claude-mcp-claude-opus-4-7 | a2-l-bracket-01 | 1 | 0/1* | 0.17 | 973.8k | 138.2s |
k/k* = fewer than 5 attempts at this (model, task) — pass5 pending. Score is the mean check-pass rate across the most recent 5 attempts. Corpus: 30 run blobs across 2 models, 6 tasks. Click any task, model, or run for full forensic detail.
| drawing mecheval. |
sheet INDEX |
scale pass^5 · 4/5 |
| date 2026-04-28 |
drawn by muni |
project mecheval |