Assembler: Scalable 3D Part Assembly via Anchor Point Diffusion



Teaser. Given 3D part meshes and a conditioning image, Assembler assembles them into a complete 3D object. Parts are shown in different colors.

Abstract

We present Assembler, a scalable and generalizable framework for 3D part assembly that reconstructs complete objects from input part meshes and a reference image. Unlike prior approaches that mostly rely on deterministic part pose prediction and category-specific training, Assembler is designed to handle diverse, in-the-wild objects with varying part counts, geometries, and structures. It addresses the core challenges of scaling to general 3D part assembly through innovations in task formulation, representation, and data. First, Assembler casts part assembly as a generative problem and employs diffusion models to sample plausible configurations, effectively capturing the ambiguities arising from symmetry, repeated parts, and multiple valid assemblies. Second, we introduce a novel shape-centric representation based on sparse anchor point clouds, enabling scalable generation in Euclidean space rather than SE(3) pose prediction. Third, we construct a large-scale dataset of over 320K diverse part-object assemblies using a synthesis and filtering pipeline built on existing 3D shape repositories. Assembler achieves state-of-the-art performance on PartNet and is the first to demonstrate high-quality assembly for complex, real-world objects. Building on Assembler, we further introduce a part-aware 3D modeling system that generates high-resolution, editable objects from images, demonstrating its potential for interactive and compositional design.
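The abstract does not spell out how the sparse anchor point clouds are obtained, but a natural reading is a small, well-spread subset of surface samples per part. A minimal sketch using farthest point sampling, where the point counts (4096 surface samples, 64 anchors per part) and the helper sample_mesh_surface are our own illustrative assumptions, not details from the paper:

    import numpy as np

    def farthest_point_sampling(points, k):
        """Pick k well-spread anchor points from a dense part point cloud.

        points: (N, 3) array of surface samples from one part mesh.
        Returns a (k, 3) array of sparse anchor points.
        """
        n = points.shape[0]
        selected = np.zeros(k, dtype=np.int64)
        dist = np.full(n, np.inf)  # squared distance to nearest selected anchor so far
        selected[0] = np.random.randint(n)
        for i in range(1, k):
            diff = points - points[selected[i - 1]]
            dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
            selected[i] = int(np.argmax(dist))
        return points[selected]

    # Hypothetical usage: one sparse anchor cloud per input part mesh.
    # part_clouds = [sample_mesh_surface(m, 4096) for m in part_meshes]
    # anchors = [farthest_point_sampling(pc, 64) for pc in part_clouds]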


Pipeline. Overview of Assembler (left) and the part-aware 3D generation pipeline (right). (Left) The input part meshes are sampled into an anchor point representation, from which DoRA extracts shape features. These shape features are concatenated with noised point tokens, and a diffusion model is trained to generate the assembled anchor points. A simple least-squares fit then recovers the part poses from the generated and input anchor points, assembling the input meshes into the final object. (Right) The input image is first fed into a VLM to infer the parts and generate a reference image for each part. An image-to-3D generator then produces the part meshes. Finally, Assembler generates a complete, high-resolution, part-aware 3D model by assembling these part meshes.
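The caption's least-squares pose fit is not detailed further; a standard solution for corresponding point sets is the Kabsch (orthogonal Procrustes) algorithm, sketched below under that assumption. The function name and array shapes are ours:

    import numpy as np

    def fit_rigid_transform(src, dst):
        """Least-squares rigid transform (R, t) mapping src anchors onto dst.

        src, dst: (k, 3) corresponding anchor points (input part vs. generated).
        Minimizes sum_i ||R @ src[i] + t - dst[i]||^2 over R in SO(3), t in R^3.
        """
        src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
        H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        # Reflection guard: force det(R) = +1 so R is a proper rotation.
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t = dst_c - R @ src_c
        return R, t

    # Each input part mesh is then posed by v' = R @ v + t for its vertices v.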

Comparison of 3D Part Assembly on PartNet


3D Part Assembly on Toys4K


Examples of the Part-aware 3D Generation System