Moving around in the world is naturally a multisensory experience, but today’s embodied agents are deaf, restricted solely to their visual perception of the environment. We explore audio-visual learning in complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object, use echolocation to anticipate its 3D surroundings, and discover the link between its visual inputs and spatial sound.

To support this goal, we introduce SoundSpaces: a platform for audio rendering based on geometrical acoustic simulations in two sets of publicly available 3D environments (Matterport3D and Replica). SoundSpaces makes it possible to insert arbitrary sound sources in an array of real-world scanned environments. Building on this platform, we pursue a series of audio-visual spatial learning tasks. Specifically, in audio-visual navigation, the agent is tasked with traveling to a sounding target in an unfamiliar environment (e.g., go to the ringing phone). In audio-visual floorplan reconstruction, a short video with audio is converted into a house-wide map, where audio allows the system to “see” behind the camera and behind walls. For self-supervised feature learning, we explore how echoes observed in training can enrich an RGB encoder for downstream spatial tasks including monocular depth estimation. Our results suggest how audio can benefit visual understanding of 3D spaces, and our research lays the groundwork for new research in audio-visual embodied AI.
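To give a flavor of the rendering idea behind such a platform, the sketch below spatializes an arbitrary mono sound source by convolving it with a precomputed binaural room impulse response (RIR) for a given source/receiver placement. This is a minimal illustration of the general technique, not SoundSpaces' actual API; the file names, shapes, and sampling assumptions are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

# Hypothetical inputs: a precomputed binaural RIR for one source/receiver
# pair, and an arbitrary mono source waveform (e.g., a ringing phone).
rir = np.load("rir_source3_receiver17.npy")   # shape: (2, rir_len), left/right ear
source = np.load("telephone_ring_mono.npy")   # shape: (n_samples,), mono waveform

# Spatialize the source by convolving it with each ear's impulse response.
binaural = np.stack(
    [fftconvolve(source, rir[ch], mode="full") for ch in range(2)]
)  # shape: (2, n_samples + rir_len - 1)

# Normalize to avoid clipping before playback or feature extraction.
binaural /= np.max(np.abs(binaural)) + 1e-8
```

Because the acoustics are captured in the impulse response, the same source clip can be re-rendered at any placement in the scene by swapping in the corresponding RIR.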
