Chirpy3D

🐦Chirpy3D: Creative Fine-grained 3D Object Fabrication via Part Sampling 🐦

University of Surrey¹
University of Cambridge²
Universiti Malaya³
Imperial College London⁴

Abstract

In this paper, we push the boundaries of fine-grained 3D generation into truly creative territory. Current methods either lack intricate details or simply mimic existing objects -- we enable both. By lifting 2D fine-grained understanding into 3D through multi-view diffusion and modeling part latents as continuous distributions, we unlock the ability to generate entirely new, yet plausible parts through interpolation and sampling. A self-supervised feature consistency loss further ensures stable generation of these unseen parts. The result is the first system capable of creating novel 3D objects with species-specific details that transcend existing examples. While we demonstrate our approach on birds, the underlying framework extends beyond things that can chirp!

Methodology

Overall architecture of our \( \textbf{Chirpy3D} \). \( \textbf{(Top)} \) During training, we fine-tune a text-to-multi-view diffusion model (e.g., MVDream) with only 2D images of birds. We aim to learn the underlying part information by modeling a continuous part-aware latent space. This is achieved by learning a set of species embeddings \( \mathbf{e} \), projecting them into part latents \( \mathbf{l} \) through a learnable module \( f \), decoding these into word embeddings \( \mathbf{t} \) through a learnable module \( g \), and inserting the result into the text prompt. We train the diffusion model with the diffusion loss and three additional objectives: \( \mathcal{L}_{\text{reg}} \) to model part latents as a Gaussian distribution, \( \mathcal{L}_{\text{attn}} \) for part disentanglement, and our proposed \( \mathcal{L}_{\text{cl}} \) to enhance visual coherency. For efficient training, we add LoRA layers to the cross-attention layers of the U-Net. \( \textbf{(Bottom)} \) During inference, we can first preview multi-view images by selecting the desired part latents as the condition, then turn them into 3D representations (e.g., NeRF) through the SDS loss \( \mathcal{L}_\text{SDS} \).
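To make the data flow from species embedding to text prompt concrete, here is a minimal sketch in plain Python. It is not the released implementation: the part names, the dimensions, the use of single linear layers for \( f \) and \( g \), and the `<part>` placeholder tokens are all assumptions chosen for illustration, and the weights are random rather than learned.

```python
import random

PARTS = ["head", "wing", "body", "tail"]   # hypothetical part names
EMBED_DIM, LATENT_DIM, WORD_DIM = 8, 4, 6   # illustrative sizes only


def make_linear(rng, out_dim, in_dim):
    """Random weights and zero bias standing in for a learned linear layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def apply_linear(layer, x):
    """Compute W x + b for a vector x given as a list of floats."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


rng = random.Random(0)
# f: species embedding e -> one latent l per part (assumed: one linear head per part).
f_proj = {part: make_linear(rng, LATENT_DIM, EMBED_DIM) for part in PARTS}
# g: part latent l -> word embedding t (assumed: one decoder shared by all parts).
g_dec = make_linear(rng, WORD_DIM, LATENT_DIM)


def encode_species(e):
    """Map a species embedding to part latents, then to word embeddings."""
    part_latents = {part: apply_linear(f_proj[part], e) for part in PARTS}
    word_embeddings = {part: apply_linear(g_dec, l) for part, l in part_latents.items()}
    return part_latents, word_embeddings


def build_prompt(word_embeddings):
    """Return (token, embedding) pairs; placeholder tokens carry the decoded embeddings."""
    prompt = [(tok, None) for tok in ["a", "photo", "of", "a"]]
    prompt += [("<" + part + ">", word_embeddings[part]) for part in PARTS]
    prompt.append(("bird", None))
    return prompt


e = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]  # one species embedding
part_latents, word_embeddings = encode_species(e)
prompt = build_prompt(word_embeddings)
```

In this sketch, tokens paired with `None` would use the text encoder's ordinary embeddings, while each placeholder token is replaced by the decoded word embedding for its part.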

Multiview Generation

Not only can we generate multiview images of existing classes, but we can also generate hybrid versions of them (randomly interpolated for each part).

Existing Class (A)

Hybrid Class (A+B)

Existing Class (B)
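The per-part random interpolation used for the hybrid class can be sketched as follows. This assumes simple linear interpolation with an independent uniform weight per part; the function name, the dictionary layout, and the toy latents are hypothetical and are not taken from the released code.

```python
import random


def interpolate_parts(latents_a, latents_b, rng):
    """Blend two classes part by part, drawing a separate weight for each part."""
    hybrid, alphas = {}, {}
    for part in latents_a:
        alpha = rng.random()  # weight in [0, 1); 0 keeps class A, values near 1 approach class B
        alphas[part] = alpha
        hybrid[part] = [
            (1.0 - alpha) * a + alpha * b
            for a, b in zip(latents_a[part], latents_b[part])
        ]
    return hybrid, alphas


# Toy part latents for two existing classes (values are illustrative only).
latents_a = {"head": [0.0, 0.0], "wing": [0.0, 0.0]}
latents_b = {"head": [1.0, 2.0], "wing": [1.0, 2.0]}
hybrid, alphas = interpolate_parts(latents_a, latents_b, random.Random(0))
```

Because each part gets its own weight, the hybrid can take, for example, a head close to class A and wings close to class B.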

The following multiview images are generated by randomly sampled part latents.
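Since \( \mathcal{L}_{\text{reg}} \) encourages part latents to follow a Gaussian distribution, new parts can be drawn by sampling. The sketch below assumes a standard normal prior with independent dimensions, which may differ from the distribution the trained model actually uses; the part names and latent size are hypothetical.

```python
import random


def sample_part_latents(parts, latent_dim, rng):
    """Draw one latent vector per part from N(0, I) (assumed prior)."""
    return {
        part: [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        for part in parts
    }


parts = ["head", "wing", "body", "tail"]  # hypothetical part names
sampled = sample_part_latents(parts, latent_dim=4, rng=random.Random(42))
```

Each sampled latent would then be decoded by \( g \) into a word embedding and inserted into the prompt, in the same way as latents obtained from a species embedding.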

3D Birds

3D generation using threestudio with randomly sampled part latents.

Some examples from CUB200.

Additional examples from PartImageNet and sims4-faces.

Citation

@misc{ng2024chirpy3d,
      title={Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation},
      author={Kam Woh Ng and Jing Yang and Jia Wei Sii and Jian Kang Deng and Chee Seng Chan and Yi-Zhe Song and Tao Xiang and Xiatian Zhu},
      year={2025},
      eprint={2501.04144},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
