What is a neural radiance field (NeRF)?
Neural radiance fields (NeRFs) are a machine learning technique that generates 3D representations of an object or scene from a set of 2D images. The technique encodes the entire object or scene into an artificial neural network, which predicts the color and density at any point in 3D space as seen from a given direction. From these predictions, it computes the light intensity -- or radiance -- that reaches a virtual camera, which lets it generate novel views from angles that were never photographed.
The process is analogous to how holograms encode different perspectives, which are revealed by shining a laser from different directions. In the case of NeRFs, instead of shining a light, an app sends a query indicating the desired viewing position and viewport size. The neural network then generates the color and density of points along the line of sight through each pixel, and these values are combined into the pixel colors of the resulting image.
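In practice, such a query is usually turned into one ray per pixel: Each ray starts at the camera's position and points through that pixel's location on an imaginary image plane. The following NumPy sketch illustrates this step for a simple pinhole camera. It is an illustration under stated assumptions, not the code of any particular NeRF product: `get_rays` is a hypothetical helper, the focal length is assumed to be given in pixels, and the camera is assumed to look down its own -z axis with +y pointing up, which is the convention the original NeRF research code uses.

```python
import numpy as np

def get_rays(height, width, focal, cam_to_world):
    """Return a ray origin and direction for every pixel of a pinhole camera.

    height, width: image size in pixels (the viewport)
    focal:         focal length in pixels (assumed)
    cam_to_world:  3x4 or 4x4 camera pose matrix (rotation + position)
    Assumes the camera looks down its local -z axis with +y up.
    """
    cam_to_world = np.asarray(cam_to_world, dtype=np.float64)
    # i holds each pixel's column index, j its row index; both have shape (H, W).
    i, j = np.meshgrid(np.arange(width, dtype=np.float64),
                       np.arange(height, dtype=np.float64), indexing="xy")
    # Direction through each pixel in camera coordinates (image rows grow downward).
    dirs = np.stack([(i - width * 0.5) / focal,
                     -(j - height * 0.5) / focal,
                     -np.ones_like(i)], axis=-1)                # (H, W, 3)
    # Rotate the directions into world coordinates.
    rays_d = dirs @ cam_to_world[:3, :3].T                       # (H, W, 3)
    # All rays start at the camera position.
    rays_o = np.broadcast_to(cam_to_world[:3, 3], rays_d.shape)  # (H, W, 3)
    return rays_o, rays_d
```

The viewport size sets how many rays there are, and the camera pose sets where they start and which way they point; the network is then queried at points along each ray.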
NeRFs show great promise for representing 3D data more efficiently than other techniques and could unlock new ways to automatically generate highly realistic 3D objects. Used alongside other techniques, NeRFs could compress 3D representations of the world from gigabytes to tens of megabytes. Time magazine called a NeRF implementation from the Silicon Valley chipmaker Nvidia one of the top inventions of 2022. Nvidia Director of Research Alexander Keller told Time that NeRFs "could ultimately be as important to 3D graphics as digital cameras have been to modern photography."
Applications of neural radiance fields
NeRFs can be used to generate 3D models of objects and to render 3D scenes for video games and for virtual and augmented reality environments in the metaverse.
Google has already started using NeRFs to translate street map imagery into immersive views in Google Maps. Engineering software company Bentley Systems has also used NeRFs as part of its iTwin Capture tool to analyze and generate high-quality 3D representations of objects using a phone camera.
Down the road, NeRFs could complement other techniques for representing 3D objects in the metaverse, augmented reality and digital twins, making those representations more efficient, accurate and realistic.
One big plus of NeRFs is that they operate on light fields that characterize shapes, textures and material effects directly -- the way different materials like cloth or metal look in light, for example. In contrast, other 3D processing techniques start with shapes and then add on textures and material effects using secondary processes.
Early applications. Early NeRFs were incredibly slow and required all of the pictures to be taken using the same camera in the same lighting conditions. First-generation NeRFs described by Google and University of California, Berkeley, researchers in 2020 took two or three days to train and required several minutes to generate each view. The early NeRFs focused on individual objects, such as a drum set, plants or Lego toys.
Ongoing innovation. In 2022, Nvidia pioneered a variant called Instant NeRFs that could capture fine detail in a scene in about 30 seconds and then render different views in about 15 milliseconds. Google researchers also reported new techniques for NeRF in the Wild, a system that can create NeRFs from photos taken by various cameras, in different lighting conditions, and with temporary objects in the scene. This also paved the way for using NeRFs to generate content variations based on simulated lighting conditions or time-of-day differences.
Emerging NeRF applications. Today, most NeRF applications render individual objects or scenes from different perspectives rather than combining objects or scenes. For example, the first Google Maps implementation used NeRF technology to create a short prerendered movie simulating a helicopter flying around a building. This sidestepped the challenges of computing the NeRF on users' different devices and of rendering multiple buildings together. However, researchers are also exploring ways to extend NeRFs to generate high-quality geospatial data, which would make it easier to render large scenes. NeRFs could eventually also provide a better way of storing and rendering other types of imagery, such as MRI and ultrasound scans.
Research on NeRFs is progressing at a feverish pace, with much of it aimed at improving the speed, precision or fidelity of the 3D representations as well as expanding use cases.
Here are some references to early NeRF implementations, starting with the original paper on NeRF:
- "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng.
- "NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections."
- "Getting Started with NVIDIA Instant NeRFs," a helpful tutorial from Nvidia Developer.
- "Awesome Neural Radiance Fields," a curated list of NeRF papers organized by use case.
How do neural radiance fields work?
The term neural radiance field describes the different elements of the technique. It is neural in the sense that it uses a multilayer perceptron, a relatively simple, older neural network architecture, to represent the scene. Radiance refers to the fact that this neural network models the brightness and color of rays of light from different perspectives. Field is a mathematical term for a function that assigns a value to every point in a space; here, the network assigns a color and a density to every 3D location and viewing direction.
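Following the notation of the original NeRF paper, the field can be written as a function with learnable network weights Θ that maps a 3D position x and a viewing direction d to an emitted color c and a volume density σ:

```latex
F_\Theta : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma),
\qquad \mathbf{x} = (x, y, z),\quad \mathbf{d} = (\theta, \phi),\quad \mathbf{c} = (r, g, b),\quad \sigma \ge 0
```

In that design, density depends only on position, while color can also depend on viewing direction, which is what lets a NeRF reproduce reflections and other view-dependent effects.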
NeRFs work differently from most other deep learning techniques: A series of images is used to train a single fully connected neural network that can only generate new views of that one object or scene. In comparison, conventional deep learning typically trains a neural network on large sets of labeled data so that it can respond appropriately to new examples of similar data.
In operation, the neural network takes as input a 3D location in space and the 2D direction (left-right and up-down) from which the simulated camera views that location. It responds with a color and a density for that point. To produce a pixel, the renderer samples many points along the camera ray passing through that pixel and blends their colors, weighted by density, into one final color. This mimics how light from objects along that line of sight reaches the viewer.
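The blending step is volume rendering. Below is a minimal NumPy sketch of the discrete form described in the original NeRF paper, assuming the network has already been queried at N sample points along one ray; `composite_ray` is a hypothetical name, and real implementations process many rays at once on a GPU.

```python
import numpy as np

def composite_ray(rgb, sigma, t_vals):
    """Blend per-sample colors along one ray into a single pixel color.

    rgb:    (N, 3) colors the network predicted at N sample points
    sigma:  (N,)   volume densities predicted at those points
    t_vals: (N,)   distances of the samples from the camera, increasing

    Discrete volume rendering:
        color = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * rgb_i
        T_i   = exp(-sum_{j < i} sigma_j * delta_j)
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    sigma = np.asarray(sigma, dtype=np.float64)
    t_vals = np.asarray(t_vals, dtype=np.float64)
    # Distance between neighboring samples; the last interval is treated as
    # (nearly) infinite so that a dense final sample is fully opaque.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity contributed by each segment of the ray.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of light that reaches sample i unblocked.
    optical_depth = np.concatenate([[0.0], np.cumsum(sigma * deltas)[:-1]])
    transmittance = np.exp(-optical_depth)
    weights = transmittance * alpha
    color = (weights[:, None] * rgb).sum(axis=0)
    return color, weights
```

Samples in empty space get zero weight and contribute nothing, while the first dense sample along a ray hides everything behind it. That is how occlusion emerges without any explicit surface model.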
Training neural radiance fields
NeRFs are trained from images of an object or scene captured from different points of view. The training algorithm first calculates the position and orientation from which each image was taken, and then uses this data to adjust the weights of the neural network until the images it renders match the photos.
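Written out, the training objective is a simple photometric loss. In the expression below, R is a batch of camera rays drawn from the training photos, C(r) is the observed color of the pixel that ray r passes through, and Ĉ(r) is the color the network renders for that ray:

```latex
\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left\lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2
```

This is a simplified form; the original NeRF paper sums the same squared error for two networks, a coarse one and a fine one, that are trained together.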
Here is the process in detail:
- The training process starts with a collection of images of a single object or scene taken from different perspectives, ideally with the same camera. In the very first step, a computer vision technique, typically structure from motion, calculates the location and direction of the camera for each photo in the collection.
- The pictures and camera locations are then used to train the neural network. The difference between the pixels the network renders and the corresponding pixels in the photos is used to tune the neural network weights. The process is repeated 200,000 times or so, until the network converges on a good NeRF. The early versions took days, but, as noted, recent Nvidia optimizations enable the whole process to run in parallel in tens of seconds.
- There is one more step that NeRF developers are still working to fully understand. When researchers first started experimenting with NeRFs, the images looked like smooth, blurry blobs that lacked the rich texture of natural objects. So, instead of feeding raw coordinates into the network, they encoded each input as a set of sine and cosine waves of increasing frequency, a technique called positional encoding, which enhanced the NeRF's ability to capture finer textures. Later work generalized this idea into Fourier feature mappings to achieve better results. Adjusting the number and range of these frequencies tunes the level of detail: Too few, and the scene looks smooth and washed out; too many, and it picks up noisy artifacts. While most researchers stuck with Fourier features, Nvidia took it one step further with a new encoding technique called multi-resolution hash encoding that it cites as a critical factor for producing superior results.
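The two encodings mentioned in the last step can be sketched in a few lines of NumPy. `positional_encoding` follows the sine/cosine formula from the original NeRF paper, and `spatial_hash` follows the hash function described in Nvidia's "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding" paper, using the per-dimension primes reported there. Both function names are hypothetical, and this is an illustrative sketch, not Nvidia's implementation: The learned feature tables and the interpolation that make up the rest of the hash encoding are omitted.

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Encode coordinates as sines and cosines of increasing frequency.

    For each coordinate value v this produces
    sin(2^0 pi v), cos(2^0 pi v), ..., sin(2^(L-1) pi v), cos(2^(L-1) pi v)
    with L = num_freqs, so p of shape (..., D) becomes (..., D * 2 * L).
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi              # 2^k * pi, k = 0..L-1
    scaled = p[..., None] * freqs                              # (..., D, L)
    # Interleave sin/cos per frequency, then flatten per point.
    enc = np.stack([np.sin(scaled), np.cos(scaled)], axis=-1)  # (..., D, L, 2)
    return enc.reshape(*p.shape[:-1], -1)


# Per-dimension primes for a 3D grid, as reported in the Instant-NGP paper.
HASH_PRIMES = (1, 2654435761, 805459861)

def spatial_hash(vertex, table_size):
    """Map an integer grid vertex (x, y, z) to a slot in a hash table.

    Computes (x*p1 XOR y*p2 XOR z*p3) mod table_size. Python integers do not
    overflow; for a power-of-two table size up to 2**32, the result matches
    32-bit wrapping arithmetic.
    """
    h = 0
    for coord, prime in zip(vertex, HASH_PRIMES):
        h ^= int(coord) * prime
    return h % table_size
```

The original NeRF paper used 10 frequencies for the 3D position and 4 for the viewing direction. In the full hash encoding, each resolution level looks up learned feature vectors at the eight corners of the grid cell containing a point and interpolates between them. Coarse levels that fit in the table are indexed directly; only finer levels rely on the hash, and collisions are tolerated rather than explicitly resolved.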
What are the limitations and challenges of neural radiance fields?
In the early days, NeRFs required a lot of compute power, needed a lot of pictures and were not easy to train. Today, compute and training are less of an issue, but NeRFs still require a lot of pictures. Other key NeRF challenges include speed, editability and composability:
- Time-intensive, but getting less so. On the speed front, training a NeRF requires hundreds of thousands of iterations, and early versions took several days on a single GPU. However, Nvidia has demonstrated a way to overcome this challenge through more efficient parallelization and optimization, which can generate a new NeRF in tens of seconds and render new views in tens of milliseconds.
- Challenging to edit, but getting easier. The editability challenge is a bit trickier. A NeRF captures and collects different views of objects into a neural network. This is much less intuitive to edit than other kinds of 3D formats, such as 3D meshes representing the surface of objects or voxels -- 3D pixels -- representing their 3D structure. Google's work on NeRF in the Wild suggested ways to change the color and lighting and even remove unwanted objects that appear in some of the images. For example, the technique can remove buses and tourists from pictures of the Brandenburg Gate in Berlin taken by multiple people.
- Composability remains a hurdle. The composability challenge relates to the fact that researchers have not found an easy way to combine multiple NeRFs into larger scenes. This could hamper certain use cases, such as rendering simulated factory layouts composed of NeRFs of individual pieces of equipment or creating virtual worlds that combine NeRFs of various buildings.