FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
Preprint, 2026
We introduce FORGE, a fine-grained multimodal benchmark for manufacturing that combines real-world 2D images and 3D point clouds with domain-specific semantic annotations. Evaluating 18 state-of-the-art MLLMs across workpiece verification, surface inspection, and assembly verification reveals that domain knowledge, rather than visual grounding alone, is the key bottleneck.
Key Insight
When MLLMs are applied to manufacturing tasks, domain-specific knowledge is often a more critical bottleneck than raw visual grounding.