Saturday, October 1st, 2022 Posted by Jim Thacker

Google’s DreamFusion turns text into 3D models


Google Research has unveiled DreamFusion, a new method of generating 3D models from text prompts.

The approach, which combines a text-to-2D-image diffusion model with Neural Radiance Fields (NeRF), generates textured 3D models of a quality suitable for use in AR projects, or as base meshes for sculpting.

And crucially, it does not require a set of real 3D models as training data – potentially paving the way for practical, mass-market AI-based text-to-3D tools.

DreamFusion turns text descriptions into textured 3D models
Developed by a team from Google Research and UC Berkeley, DreamFusion generates 3D models from text descriptions like ‘A highly detailed metal sculpture of a squirrel wearing a kimono playing the saxophone’.

As well as the geometry of the 3D model, the text also defines its materials and textures – something you can try in the online demo by swapping out ‘metal sculpture’ for ‘wooden carving’ or ‘DSLR photo’.



Combining Neural Radiance Fields and 2D diffusion
To generate a model, DreamFusion combines two key approaches: Neural Radiance Fields and 2D diffusion.

It progressively refines an initial, random 3D model to match 2D reference images showing the target object from different angles: an approach used by existing AI models like Nvidia’s Instant NeRF.

However, unlike Instant NeRF, the references are not photos of a real object, but synthetic images generated by a 2D text-to-image model of the type used by OpenAI’s DALL-E 2 and Stability.ai’s Stable Diffusion.

In this case, the 2D diffusion model is Google’s own Imagen, but the overall result is the same: a 3D model refined to match 2D reference images produced from the original text description.
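
To make the idea more concrete, below is a heavily simplified, hypothetical Python sketch of that optimisation loop – DreamFusion’s actual code has not been released. It uses PyTorch, a stub object in place of Imagen, and a toy one-parameter ‘renderer’ in place of a real NeRF; the names StubDiffusionModel, render_nerf and predict_noise are all invented for illustration.

```python
import torch

class StubDiffusionModel:
    """Stand-in for a frozen text-to-image diffusion model such as Imagen."""
    def predict_noise(self, noisy_image, t, text_embedding):
        # A real model would estimate the noise added at timestep t,
        # conditioned on the text prompt; here we just return random values.
        return torch.randn_like(noisy_image)

def render_nerf(nerf_params, camera_angle, resolution=64):
    # Toy stand-in for a differentiable volume renderer: a real NeRF would be
    # rendered from the given camera pose via ray marching.
    shade = torch.tanh(nerf_params["weights"] * camera_angle)
    return shade.expand(3, resolution, resolution)

nerf_params = {"weights": torch.randn(1, requires_grad=True)}
optimizer = torch.optim.Adam(nerf_params.values(), lr=1e-2)
diffusion = StubDiffusionModel()
text_embedding = torch.randn(512)  # placeholder embedding of the text prompt

for step in range(1000):
    # Render the current 3D scene from a randomly chosen viewpoint
    camera_angle = torch.rand(1) * 2 * torch.pi
    image = render_nerf(nerf_params, camera_angle)

    # Add noise at a random timestep and ask the frozen diffusion model to
    # predict it. The gap between predicted and true noise acts as a gradient
    # on the rendered image, nudging the 3D scene towards renders the model
    # finds plausible for the prompt (a simplified take on the paper's
    # Score Distillation Sampling).
    t = torch.rand(1)
    noise = torch.randn_like(image)
    noisy_image = image + t * noise
    predicted_noise = diffusion.predict_noise(noisy_image, t, text_embedding)

    loss = ((predicted_noise - noise).detach() * image).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```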

Still very much just a research demo
At the minute, the opportunities to play with DreamFusion are fairly limited.

The project’s GitHub page lets you choose from a range of preset text prompts, then displays the resulting 3D model, but it doesn’t let you enter your own text descriptions.

The assets themselves are also fairly low-resolution.

DreamFusion’s online gallery shows a range of models in .glb format that look like they would be suitable for use in an AR project, or as base meshes that could be refined manually for use in higher-detail work.
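
If you want to inspect one of those assets, here is a minimal, hypothetical Python sketch using the open-source trimesh library; the filename is a placeholder for whichever model you download from the gallery.

```python
import trimesh

# Load a downloaded gallery asset; force='mesh' collapses the glTF scene
# into a single mesh object. The filename below is a placeholder.
mesh = trimesh.load("dreamfusion_squirrel.glb", force="mesh")

print(f"{len(mesh.vertices)} vertices, {len(mesh.faces)} faces")

# Export to OBJ so the model can be refined as a base mesh in a sculpting app
mesh.export("dreamfusion_squirrel.obj")
```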

Paving the way to a new generation of commercial text-to-3D tools?
However, the real significance of research projects like DreamFusion is not what they can currently do, but how they could open the way for the development of more practical tools.

Whereas 2D diffusion models like DALL-E 2 could be trained on 2D images scraped from the internet, it is much harder to assemble equivalent training data for 3D.

As the abstract for DreamFusion puts it: “Adapting this approach to 3D synthesis would require large-scale datasets of labeled 3D assets and efficient [ways of] denoising 3D data, neither of which currently exist.”

By doing away with the need for such large-scale 3D datasets, DreamFusion raises the possibility of a new wave of generative AI art tools, but for 3D models, not 2D images.

And given that 2D AI art tools like DALL-E took less than two years to go from initial announcement to mass public availability, they could be here much sooner than you think.

Read more about DreamFusion on the project’s GitHub page
(Research paper and demo models only: no actual code)