The following blog post has been drafted by Bing AI Chat, based on LERF: Language Embedded Radiance Fields. I am including it on my blog as a memory-jogger to what looks like a really exciting development, and as an example of an AI drafted blog post.
Have you ever wondered what it would be like to point at any part of a 3D scene and ask questions about it using natural language? For example, you could ask “Where is the red car?” or “What is the name of this building?” or even “What is the most expensive item in this room?”.
Well, thanks to a new research paper by Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa and Matthew Tancik from UC Berkeley, this kind of language-driven interaction with 3D scenes is a big step closer with LERF: Language Embedded Radiance Fields.
LERF is a novel method that combines two powerful techniques: NeRF (Neural Radiance Fields) and CLIP (Contrastive Language-Image Pre-training). NeRF represents a 3D scene as a continuous function that maps a 3D position and viewing direction to a colour and a density. CLIP learns joint embeddings of images and text, which lets it perform zero-shot image classification based on natural language prompts.
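To make the CLIP half of that concrete, here is a minimal sketch of zero-shot classification in a joint embedding space: the image embedding is compared with one text embedding per candidate label, and a softmax over the scaled cosine similarities picks the best match. The function names, the toy 3-dimensional vectors and the `scale` value are illustrative assumptions; real embeddings would come from a trained CLIP model.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_probs(image_emb, text_embs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and K prompts.

    image_emb: (D,) image embedding (placeholder for a CLIP image embedding)
    text_embs: (K, D) text embeddings, one per candidate prompt
    scale:     assumed temperature applied to the similarities
    """
    sims = normalize(text_embs) @ normalize(image_emb)  # (K,) cosine similarities
    logits = scale * sims
    logits = logits - logits.max()                       # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy stand-ins for the embeddings of three prompts and one image.
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([1.0, 0.1, 0.0])
probs = zero_shot_probs(image_emb, text_embs)
best_prompt = int(np.argmax(probs))
```

No training on the candidate labels is needed: changing the list of prompts changes the classifier.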
By combining these techniques, LERF creates a system that allows users to explore and interact with 3D scenes using natural language queries, making it an intuitive way to navigate and understand complex virtual environments. LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays, supervising these embeddings across training views to provide multi-view consistency and smooth the underlying language field.
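The idea of volume rendering embeddings along a ray can be sketched in a few lines of NumPy. This uses the standard NeRF compositing weights (transmittance times per-sample opacity) and applies them to per-sample embedding vectors instead of colours. The final unit-length normalisation and all names here are my assumptions for illustration, not code from the paper, and the multi-scale part of LERF is left out.

```python
import numpy as np

def render_weights(sigma, delta):
    """Standard NeRF compositing weights for samples along one ray.

    sigma: (S,) densities at each sample
    delta: (S,) distances between consecutive samples
    """
    alpha = 1.0 - np.exp(-sigma * delta)                 # opacity of each sample
    # Transmittance: fraction of light surviving up to (not including) sample i.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha

def render_embedding(sigma, delta, embeddings):
    """Composite per-sample embeddings (S, D) into one unit-length vector (D,)."""
    weights = render_weights(sigma, delta)
    rendered = weights @ embeddings
    norm = np.linalg.norm(rendered)
    return rendered / norm if norm > 0 else rendered

# Toy ray: empty space, then one dense sample (a "surface") at index 2.
sigma = np.array([0.0, 0.0, 50.0, 0.0])
delta = np.full(4, 0.1)
embeddings = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 2.0],
                       [1.0, 1.0, 1.0]])
rendered = render_embedding(sigma, delta, embeddings)
```

In this toy ray almost all of the weight lands on the dense sample, so the rendered embedding points in that sample's direction. During training, a rendered embedding like this would be compared against the CLIP embedding of the corresponding image region in each training view.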
After optimisation, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real time. For example, you can ask LERF to highlight “the brightest spot” or “the most metallic object” or “the closest thing to me” in any given scene. You can also use more abstract or semantic queries such as “something I can sit on” or “something related to music” or “something blue”. LERF supports long-tail open-vocabulary queries hierarchically across the volume without relying on region proposals or masks.
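As a rough illustration of how a relevancy map could be scored, the sketch below compares each rendered embedding with the query embedding and with a few neutral "canonical" phrase embeddings, taking a pairwise softmax against each canonical phrase and keeping the minimum. This is my reading of the general approach; the function name and the toy orthogonal vectors standing in for real CLIP text embeddings are assumptions.

```python
import numpy as np

def relevancy(rendered, query, canonicals):
    """Relevancy score in (0, 1) for each rendered embedding.

    rendered:   (N, D) unit-length language embeddings, one per pixel or point
    query:      (D,)   unit-length embedding of the text query
    canonicals: (K, D) unit-length embeddings of neutral phrases
    """
    q = rendered @ query             # (N,)   similarity to the query
    c = rendered @ canonicals.T      # (N, K) similarity to each canonical phrase
    # Pairwise softmax exp(q) / (exp(q) + exp(c)), written as a sigmoid of q - c.
    pairwise = 1.0 / (1.0 + np.exp(c - q[:, None]))
    return pairwise.min(axis=1)      # worst case over the canonical phrases

# Toy example with orthogonal unit vectors as placeholder embeddings.
e0, e1, e2 = np.eye(3)
rendered = np.stack([e0, e1])        # point 0 matches the query, point 1 does not
scores = relevancy(rendered, query=e0, canonicals=np.stack([e1, e2]))
```

A score above 0.5 means the point looks more like the query than like any of the neutral phrases, so thresholding or colour-mapping these scores gives the highlighted regions.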
LERF has potential use cases in robotics, understanding vision-language models and interacting with 3D scenes. For example, you could use LERF to control a robot arm by telling it where to go or what to pick up using natural language. You could analyse how vision-language models perceive different aspects of 3D scenes by querying them with various prompts. Or you could simply have fun and play games with 3D scenes by challenging yourself or others with creative questions.
If you want to learn more about LERF and see some amazing demos of it in action, check out the project website at https://lerf.io/. You can also read the paper here: https://arxiv.org/abs/2303.09553.
: Kerr J., Kim C.M., Goldberg K., Kanazawa A., Tancik M., (2023). LERF: Language Embedded Radiance Fields. arXiv preprint arXiv:2303.09553.
: Mildenhall B., Srinivasan P.P., Tancik M., Barron J.T., Ramamoorthi R., Ng R., (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In Proceedings of European Conference on Computer Vision (ECCV).
: Radford A., Kim J.W., Hallacy C., et al., (2021). CLIP: Connecting Text and Images. OpenAI Blog.
: Kerr J., Kim C.M