By Caio Viturino and Bobby Carlton
As companies and industries uncover the potential of the metaverse and digital twinning, leveraging it to streamline their workforce, improve employee training, automate warehouses, and much more, they will need a process that allows them to quickly and easily create 3D content. This is especially important since the creation of virtual worlds and complex content will become more prevalent for businesses moving forward.
One way of speeding up this process is through something called a Neural Radiance Field (NeRF), which can help us create and launch 3D digital solutions for a wide variety of enterprise use cases. However, there are some open questions about the technology.
What is NeRF?
NeRFs are neural representations of the geometry of complex 3D scenes. Unlike methods such as point clouds and voxel models, they are trained on dense photographic images, and they can then produce photo-realistic renderings that can be used in various ways for digital transformation.
This method combines a sparse set of input views with an underlying continuous scene function to generate novel views of complex scenes; the input views can come from a static set of photographs or from renders of something like a Blender model.
In a Medium post, Varun Bhaseen describes a NeRF as a continuous 5D function that outputs the radiance emitted in each direction (θ, φ) at each point (x, y, z) in space, along with a density at each point that acts like a differential opacity, determining how much radiance is accumulated by a ray passing through (x, y, z).
Bhaseen explains it further with the visual below, showing the steps involved in optimizing a continuous 5D model for a scene. It takes into account the various factors that affect the view-dependent color and volume density of the scene. In this example, 100 images were taken as input.
This optimization is performed on a deep multi-layer perceptron, without using any convolutional layers. Gradient descent is used to minimize the error between the views rendered from the representation and the observed images.
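To make the rendering step concrete, here is a minimal sketch (not the paper's implementation) of the standard volume-rendering quadrature NeRF uses to turn the densities and colors sampled along a ray into a single pixel color:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite per-sample densities and colors along one ray.

    sigmas: (N,) volume densities at N sample points along the ray
    colors: (N, 3) RGB values at those points
    deltas: (N,) distances between adjacent samples
    Returns the accumulated RGB color for the ray.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)  # opacity of each segment
    # Transmittance: how much light survives up to each sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

# A ray passing through empty space, then hitting a dense red region:
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
deltas = np.full(4, 0.1)
print(render_ray(sigmas, colors, deltas))  # ~[1, 0, 0]: the dense region dominates
```

Because every operation here is differentiable, gradient descent can push the densities and colors so that rendered pixels match the observed photographs.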
Can We Reconstruct the Environment Using Some Equipment?
We can! Equipment from Mosaic, for example, can model an environment in about 6 minutes and generate high-quality 3D models.
Unfortunately, this equipment is very expensive, and achieving a high-quality mesh requires a lot of training. AI-based methods, on the other hand, seem to manage this with just a cellphone camera. Another option that could be very useful is NeRF.
Who First Developed the Well-Known NeRF?
The first NeRF was published in 2020 by Ben Mildenhall and his collaborators. The method achieved state-of-the-art results at the time when synthesizing novel views of complex scenes from multiple RGB images. Its main drawback was training time: almost 2 days per scene, sometimes more, on the NVIDIA V100 GPU Mildenhall used.
Why Is NeRF Not Well Suited for Mesh Generation?
Unlike surface rendering, NeRF does not use an explicit surface representation; instead, it represents objects as a density field. Rather than shading a single surface point per ray, volume rendering accumulates color from multiple locations within the volume.
NeRF is capable of producing high-quality images, but the surfaces extracted as level sets of its density field are not ideal. This is because NeRF is never constrained to concentrate density at the specific level that represents the surface.
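A small illustration of the problem, using a hypothetical 1D density profile rather than a real NeRF: because the learned density tends to ramp up smoothly instead of jumping sharply at the surface, the surface position extracted as a level set shifts noticeably depending on which density threshold you pick:

```python
import numpy as np

# Hypothetical density profile along a ray crossing a surface near z = 0.5.
# The smooth ramp mimics how NeRF densities often behave around surfaces.
z = np.linspace(0.0, 1.0, 1001)
sigma = 40.0 / (1.0 + np.exp(-30.0 * (z - 0.5)))

def level_set_crossing(z, sigma, threshold):
    """First z where the density reaches the chosen level-set threshold."""
    return z[np.argmax(sigma >= threshold)]

for t in (5.0, 20.0, 35.0):
    print(t, level_set_crossing(z, sigma, t))
```

Across these three plausible thresholds, the extracted surface moves by more than a tenth of the scene's depth range, which is one reason level-set meshes from NeRF densities come out noisy.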
In a paper released by Nvidia, researchers introduced a new method called Instant NeRF, which can generate a high-quality radiance-and-density field in a fraction of the original training time. Unfortunately, this method was not able to produce good meshes either: the meshes generated through this approach came from a decent volumetric radiance-and-density field, but they looked "noisy".
What If We Use Photogrammetry Instead?
Unlike photogrammetry, NeRF does not require the creation of point clouds, nor does it need to convert them to objects. Its output is faster, but unfortunately the mesh quality is not as good. In the example here, Caio Viturino, Simulations Developer for FS Studio, tested the idea of generating meshes of an acoustic guitar from the NeRF volume rendering by using Nvidia's Instant NeRF. The results are pretty bad, with lots of "noise".
Viturino also tried photogrammetry (using a simple cellphone camera) through existing software, comparing its mesh output with NeRF's on the same set of images. The photogrammetry output looks better overall, but NeRF captures more of the object's fine detail.
Can NeRF Be Improved to Represent Indoor Environments?
In a paper released by Apple, the team led by Terrance DeVries explained how they were able to improve the NeRF model by learning to decompose large scenes into smaller pieces. Although they did not talk about surface or mesh generation, they did create a global generator that can perform this task.
Unfortunately, the algorithm's approach to generating a mesh is not ideal. The problem with NeRF is that it generates a volumetric radiance-and-density field instead of a surface representation. Several approaches have tried to generate a mesh from the volumetric field, but only for single objects (360-degree scans).
Can NeRF Be Improved to Generate Meshes?
It is well known that NeRF does not admit accurate surface reconstruction. Therefore, some suggest that the algorithm should be merged with implicit surface reconstruction.
Michael Oechsle (2021) published a paper that unifies volume rendering and implicit surface reconstruction, and it can reconstruct object meshes more precisely than NeRF. However, the method applies to single objects rather than whole scenes.
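As a simplified sketch of how such unified methods connect the two representations (not Oechsle's exact formulation), one common idea is to derive the volume density from a signed distance function (SDF), so the surface becomes the well-defined zero level set of the SDF rather than an arbitrary density threshold:

```python
import numpy as np

def sdf_sphere(p, radius=1.0):
    """Signed distance to a sphere at the origin: negative inside, zero on the surface."""
    return np.linalg.norm(p, axis=-1) - radius

def density_from_sdf(d, beta=0.05):
    """Map signed distance to volume density via a Laplace-CDF-style mapping
    (the style used in SDF-based volume rendering; beta controls how sharply
    density concentrates at the surface)."""
    return (1.0 / beta) * np.where(
        d <= 0, 1.0 - 0.5 * np.exp(d / beta), 0.5 * np.exp(-d / beta)
    )

# Inside, on, and outside the sphere:
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
d = sdf_sphere(pts)            # [-1.0, 0.0, 1.0]
print(density_from_sdf(d))     # high inside, moderate at surface, ~0 outside
```

Because the density is derived from the SDF, the learned zero level set can be meshed directly (e.g. with marching cubes) without guessing a threshold.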
Do We Really Need a Mesh of the Scene or Can We Use the Radiance Field Instead?
When it comes to surface reconstruction, NeRF is more accurate than point clouds or voxel models, and it does not need to perform precise feature extraction and alignment.
Michal Adamkiewicz used NeRF to perform trajectory optimization for a quadrotor robot directly in the radiance field, instead of using a 3D scene mesh. The NeRF environment used to test the trajectory-planning algorithms was generated from a synthetic 3D scene.
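The underlying idea can be sketched as follows (a toy illustration with a made-up density field, not Adamkiewicz's actual cost function): the planner penalizes a candidate trajectory by integrating the radiance field's density along it, so paths through dense regions (obstacles) score worse than detours through free space:

```python
import numpy as np

def collision_cost(waypoints, density_fn, n_samples=20):
    """Approximate collision penalty: integrate the field's density along
    straight segments between waypoints. High density along the path
    suggests a likely collision."""
    cost = 0.0
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        ts = np.linspace(0.0, 1.0, n_samples)
        pts = a[None, :] + ts[:, None] * (b - a)[None, :]
        cost += density_fn(pts).mean() * np.linalg.norm(b - a)
    return cost

# Hypothetical density field: a dense blob (obstacle) at the origin.
def density_fn(pts):
    return 100.0 * np.exp(-np.sum(pts**2, axis=-1) / 0.1)

through = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])                  # hits obstacle
around = np.array([[-1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 0.0, 0.0]])  # detours
print(collision_cost(through, density_fn) > collision_cost(around, density_fn))  # True
```

Since the density comes straight out of the trained NeRF, no mesh extraction is needed for this kind of planning, which is exactly why the radiance field alone can suffice.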
Unfortunately, it is not easy to create a mesh from the NeRF environment. To load the scene into Isaac Sim, we need a mesh representation of the NeRF.
Can We Map an Indoor Environment Using NeRF?
According to a report written by Xiaoshuai Zhang (2022), not yet. “While NeRF has shown great success for neural reconstruction and rendering, its limited MLP capacity and long per-scene optimization times make it challenging to model large-scale indoor scenes.”
The goal of Zhang’s paper is to incrementally reconstruct a large sparse radiance field from a long RGB image sequence (monocular RGB video). Although impressive and promising, 3D reconstruction from RGB images does not seem to be satisfactory yet. We can observe noise in the mesh produced by this method.
What If We Use RGB-D Images Instead of RGB Images?
Dejan Azinović (2022) proposed a new approach that uses RGB-D input to generate 3D reconstructions of scenes that are much better than NeRF's.
The image below shows how noisy the 3D mesh generated by the first proposed NeRF is compared to the Neural RGB-D surface reconstruction.
Enter the SNeRF!
A recent study conducted at Cornell University revealed that a variety of dynamic virtual scenes can be stylized using neural radiance fields at speeds fast enough to handle complex content. The result is the stylized neural radiance field (SNeRF).
Led by researchers Lei Xiao, Feng Liu, and Thu Nguyen-Phuoc, the team was able to create 3D scenes that can be used in various virtual environments simply by using SNeRF to adapt to the real-world environment and then use points to create the virtual scene. Imagine looking at a painting and then seeing the world through the lens of the painting.
What Can SNeRFs Do?
Through their work, the team was able to use a real-world environment as part of the creation process for scenes that can be deployed in various virtual environments.
The researchers were able to achieve this by using cross-view consistency, which is a type of visual feedback that allows them to observe the same object at different viewing angles, creating an immersive 3D effect.
The Cornell team was also able to take an image as a reference style and use it as part of the creation process by alternating the NeRF and stylization optimization steps. This method allowed them to quickly recreate a real-world environment and customize its look.
“We introduce a new training method to address this problem by alternating the NeRF and stylization optimization steps,” said the research team in their published paper. “Such a method enables us to make full use of our hardware memory capacity to both generate images at higher resolution and adopt more expressive image style transfer methods. Our experiments show that our method produces stylized NeRFs for a wide range of content, including indoor, outdoor and dynamic scenes, and synthesizes high-quality novel views with cross-view consistency.”
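The alternating scheme can be illustrated with a deliberately tiny toy example (scalars standing in for the radiance field, the training targets, and the style reference; none of these names or numbers come from the paper): one step fits the "field" to the current targets, the next re-stylizes the targets, and repeating the two pulls the field toward the style:

```python
# Toy alternating optimization: scalars stand in for the radiance field,
# the stylized training views, and the style target.
field = 0.0      # stands in for NeRF parameters
style = 10.0     # stands in for the reference style
targets = 0.0    # stands in for the stylized training views

for step in range(200):
    # "NeRF" step: a gradient-descent update pulling the field toward the targets
    field -= 0.1 * 2 * (field - targets)
    # "Stylization" step: move targets toward the style, anchored to the renders
    targets = 0.5 * field + 0.5 * style

print(round(field, 3))  # the field has converged toward the style value
```

Splitting the work into two smaller steps like this is what lets the real method fit each stage into limited GPU memory while still using high-resolution images and heavier style-transfer models.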
The researchers also had to address NeRF's memory limitations in order to render high-quality 3D images at a speed that felt like real time. Their method involves creating a loop of views that lets them target the appropriate points in the image and then rebuild it with more detail.
Can SNeRF Help Avatars?
Through this approach, Lei Xiao, Feng Liu, and Thu Nguyen-Phuoc were able to create expressive 4D avatars for use in conversations, applying distinct NeRF styles that let the avatars convey emotions such as anger, confusion, and fear.
The Cornell research team's work on 3D scene stylization is still ongoing. They created a method that uses implicit neural representations to affect an avatar's environment, and they took advantage of their hardware's memory capacity to create high-resolution images and adopt more expressive stylization methods in virtual reality.
However, this is just the beginning and there is a lot more work and exploration ahead.
If you’re interested in diving deeper into the Cornell research team’s work, you can access their report here.