🐦Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation 🐦
- Kam Woh Ng1
- Jing Yang2
- Jia Wei Sii3
- Jiankang Deng4
- Chee Seng Chan3
- Yi-Zhe Song1
- Tao Xiang1
- Xiatian Zhu1
- University of Surrey1
- University of Cambridge2
- Universiti Malaya3
- Imperial College London4
Abstract
In this paper, we push the boundaries of fine-grained 3D generation into truly creative territory. Current methods either lack intricate details or simply mimic existing objects -- we enable both. By lifting 2D fine-grained understanding into 3D through multi-view diffusion and modeling part latents as continuous distributions, we unlock the ability to generate entirely new, yet plausible parts through interpolation and sampling. A self-supervised feature consistency loss further ensures stable generation of these unseen parts. The result is the first system capable of creating novel 3D objects with species-specific details that transcend existing examples. While we demonstrate our approach on birds, the underlying framework extends beyond things that can chirp!
Methodology
Overall architecture of our \( \textbf{Chirpy3D} \). \( \textbf{(Top)} \) During training, we fine-tune a text-to-multi-view diffusion model (e.g., MVDream) using only 2D images of birds. We aim to learn the underlying part information by modeling a continuous part-aware latent space. This is achieved by learning a set of species embeddings \( \mathbf{e} \), projecting them into part latents \( \mathbf{l} \) through a learnable \( f \), decoding these into word embeddings \( \mathbf{t} \) through a learnable \( g \), and inserting them into the text prompt. We train the diffusion model with the diffusion loss alongside multiple objectives: \( \mathcal{L}_{\text{reg}} \) to model the part latents as Gaussian distributions, \( \mathcal{L}_{\text{attn}} \) for part disentanglement, and our proposed self-supervised feature consistency loss \( \mathcal{L}_{\text{cl}} \) to enhance visual coherency. \( f \) and \( g \) are trainable modules. For efficient training, we add LoRA layers to the cross-attention layers of the U-Net. \( \textbf{(Bottom)} \) During inference, we can first preview multi-view images by selecting the desired part latents as the condition, before turning them into a 3D representation (e.g., NeRF) through the SDS loss \( \mathcal{L}_\text{SDS} \).
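To make the part-latent modeling concrete, below is a minimal PyTorch-style sketch of the training-time pipeline described above. The module names, dimensions, and the reparameterization trick are illustrative assumptions, not the paper's exact architecture; the resulting word embeddings \( \mathbf{t} \) would replace placeholder tokens in the prompt before the LoRA-augmented diffusion model computes the denoising loss.

```python
import torch
import torch.nn as nn

class PartLatentMapper(nn.Module):
    """Hypothetical sketch: species embedding e -> Gaussian part latents l -> word embeddings t."""
    def __init__(self, num_species=200, num_parts=4, emb_dim=768, latent_dim=32):
        super().__init__()
        self.species_emb = nn.Embedding(num_species, emb_dim)  # learnable e
        # f: predicts per-part Gaussian parameters (mean and log-variance)
        self.f = nn.Linear(emb_dim, num_parts * latent_dim * 2)
        # g: decodes each part latent into a word embedding for the text prompt
        self.g = nn.Linear(latent_dim, emb_dim)
        self.num_parts, self.latent_dim = num_parts, latent_dim

    def forward(self, species_id):
        e = self.species_emb(species_id)                        # (B, emb_dim)
        stats = self.f(e).view(-1, self.num_parts, 2, self.latent_dim)
        mu, logvar = stats[:, :, 0], stats[:, :, 1]
        l = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterized sample
        # L_reg (assumed KL form): keeps the part-latent space continuous and sample-able
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()
        t = self.g(l)                                           # (B, num_parts, emb_dim)
        return t, kl
```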
Multiview Generation
Not only can we generate multi-view images of existing species, we can also generate hybrid versions of them (with part latents randomly interpolated for each part). The following multi-view images are generated from randomly sampled part latents.
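As an illustration of the per-part interpolation, here is a self-contained sketch; the latent shapes and mixing scheme are assumptions for demonstration only.

```python
import torch

# Hypothetical part latents for two species: (num_parts, latent_dim).
l_a = torch.randn(4, 32)  # e.g., latents inferred for species A
l_b = torch.randn(4, 32)  # e.g., latents inferred for species B

# Draw an independent mixing weight per part, producing a plausible hybrid.
w = torch.rand(4, 1)                  # one weight per part, in [0, 1]
l_hybrid = w * l_a + (1.0 - w) * l_b  # per-part linear interpolation

# l_hybrid is then decoded into word embeddings (via g) and conditions the
# multi-view diffusion model exactly like a real species' part latents.
```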
3D Birds
3D generation using threestudio with randomly sampled part latents.
Some examples from CUB-200.
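For reference, the SDS loss \( \mathcal{L}_\text{SDS} \) used to distill the frozen multi-view diffusion model into a NeRF can be sketched as below. This is the standard score distillation gradient, not threestudio's actual API; all function signatures are illustrative.

```python
import torch

def sds_grad(diffusion, x, cond, t, alpha_bar):
    """Minimal SDS sketch: gradient of L_SDS with respect to a rendered image x.

    diffusion: frozen multi-view diffusion model (predicts noise)
    x:         image rendered from the NeRF at a sampled camera
    cond:      text / part-latent conditioning
    t:         sampled diffusion timestep
    alpha_bar: cumulative noise-schedule coefficient at t (scalar tensor)
    """
    noise = torch.randn_like(x)
    x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * noise  # forward diffuse
    with torch.no_grad():
        eps_pred = diffusion(x_t, t, cond)                       # predicted noise
    w = 1 - alpha_bar                                            # a common weighting choice
    return w * (eps_pred - noise)  # U-Net Jacobian omitted, as in standard SDS

# Usage sketch: x = nerf.render(camera); x.backward(gradient=sds_grad(...))
# then step the NeRF optimizer.
```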
Citation
@misc{ng2024chirpy3d,
title={Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation},
author={Kam Woh Ng and Jing Yang and Jia Wei Sii and Jiankang Deng and Chee Seng Chan and Yi-Zhe Song and Tao Xiang and Xiatian Zhu},
year={2025},
eprint={2501.04144},
archivePrefix={arXiv},
primaryClass={cs.CV}
}