By Hua Bao, Senior Audio DSP Engineer at GoPro

Virtual Reality is emerging as a fascinating technology with the potential to change how people communicate, share and entertain. The essence of VR is using technology to fool your brain into believing that you are tangibly experiencing the content that you are viewing.
My colleague, Alexandre Jenny, recently wrote about the importance of quality stitching to deliver a truly immersive experience. Another key component of that experience is audio. If audio is not captured and applied spatially to the video, your brain is reminded that the scene is not completely realistic. For example, imagine you see someone talking in front of you. You turn your head to the right, and the person is now on your left in the video, but you still hear the voice from in front of you. This would be a flawed experience, and it is why we’ve invested time and expertise into the design of Fusion to enable precise spatial audio.

Spatial audio is also known as 3D audio, 360 audio, VR audio, etc. Its goal is to align the sound spatially with the captured video. Conventional stereophonic audio locks the sound to your head, whereas spatial audio locks the sound in space. With spatial audio, a complex process of recording and transformation allows the direction and proximity of sounds to align with the video file. This enables the perceived location of each sound source to match what you see in the video when you drag the video on a computer screen, pan with your smartphone, or use a Head-Mounted Display (HMD) to fully immerse yourself in the content.
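To make the difference between head-locked and space-locked sound concrete, here is a minimal Python sketch. It is only an illustration, not Fusion's processing: the function names and the angle convention (degrees, positive to the listener's left) are assumptions made for this example.

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def perceived_azimuth(source_azimuth, head_yaw, space_locked=True):
    """Direction of a sound relative to the listener's nose, in degrees.

    Assumed convention: 0 is straight ahead, positive angles are to the left.
    head_yaw is how far the listener has turned (negative = turned right).
    """
    if space_locked:
        # Spatial audio: the source stays put in the scene, so turning
        # the head changes where it is heard.
        return wrap_degrees(source_azimuth - head_yaw)
    # Conventional stereo: the sound turns with the head.
    return wrap_degrees(source_azimuth)


# A person talking straight ahead; the viewer turns 90 degrees to the right.
print(perceived_azimuth(0.0, -90.0, space_locked=True))   # voice is heard on the left
print(perceived_azimuth(0.0, -90.0, space_locked=False))  # voice is still heard in front
```

With space-locked audio the voice moves to the viewer's left, matching the picture. With head-locked stereo it stays in front, which is the mismatch described earlier.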

Above: A spatial audio demo recorded with Fusion on a turntable. Sound is played from the hanging round speaker. Listen with headphones.

To capture spatial audio with Fusion, we’ve equipped the device with four microphones: three on the top and one in the front. Microphone placement is one of the most important design factors our team considered. Proper microphone placement and geometry help attain spatial accuracy and resolution in Ambisonics, the format used to define spatial audio. Ambisonics allows spatial audio to be rendered dynamically into stereo sound based on the viewer’s real-time perspective (or POV). In the engineering community, this process is known as binauralization.
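As a rough illustration of how an Ambisonics signal can be re-rendered for the viewer's current perspective, the sketch below encodes one sample into first-order Ambisonics, rotates the sound field to compensate for head yaw, and decodes it to two channels. Everything here is an assumption for illustration: the function names are hypothetical, the normalization is SN3D-style, and the decoder uses two virtual cardioid microphones as a simple stand-in for true binaural rendering, which would use head-related transfer functions. It does not represent Fusion's proprietary algorithm.

```python
import math


def encode_foa(sample, azimuth_deg, elevation_deg=0.0):
    """Encode a mono sample into first-order Ambisonics (W, X, Y, Z).

    Assumed convention: SN3D-style gains, X forward, Y left, Z up.
    The tuple order is chosen for readability only.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample
    x = sample * math.cos(az) * math.cos(el)
    y = sample * math.sin(az) * math.cos(el)
    z = sample * math.sin(el)
    return (w, x, y, z)


def rotate_for_head_yaw(bformat, head_yaw_deg):
    """Rotate the sound field so it stays fixed in space as the head turns.

    head_yaw_deg is positive when the listener turns to the left.
    W (omnidirectional) and Z (vertical) are unaffected by yaw.
    """
    w, x, y, z = bformat
    t = math.radians(head_yaw_deg)
    x_rot = x * math.cos(t) + y * math.sin(t)
    y_rot = -x * math.sin(t) + y * math.cos(t)
    return (w, x_rot, y_rot, z)


def decode_stereo(bformat):
    """Decode to (left, right) with virtual cardioids aimed left and right.

    This is a simplified stand-in for binauralization, not an HRTF renderer.
    """
    w, x, y, z = bformat
    left = 0.5 * (w + y)
    right = 0.5 * (w - y)
    return (left, right)


# A voice straight ahead, heard after the viewer turns 90 degrees to the right.
field = encode_foa(1.0, azimuth_deg=0.0)
left, right = decode_stereo(rotate_for_head_yaw(field, head_yaw_deg=-90.0))
print(round(left, 6), round(right, 6))
```

The useful property is that the recording is stored once in Ambisonics, and only the rotation and decoding steps depend on where the viewer is looking at playback time.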

Similar to the case with lenses, no two microphones are identical or perfect out of the factory. A process of acoustic calibration during production bridges the gap between the conceptual design of the spatial-audio system and the final product that is shipped to customers. After calibration, a proprietary transformation algorithm we’ve developed converts the raw recordings captured with Fusion’s microphones into Ambisonics.

Besides the Ambisonics transformation, other audio signal processing techniques are necessary to achieve the overall audio quality we strive for. Automatic gain control (AGC), which adjusts the level of the recorded sounds, is one of them. Unlike processing in mono or stereo format, processing in the Ambisonics format is challenging because the spatial cues must be preserved. After processing, each sound should still appear to come from the direction in which it was recorded.
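One common way to keep direction intact while adjusting level is to compute a single gain and apply it identically to every Ambisonics channel, since the direction of a sound is carried by the ratios between channels. The sketch below shows that idea under stated assumptions: the function names, the target level, and the gain limit are hypothetical, the gain is derived from the omnidirectional W channel one frame at a time, and a practical AGC would also smooth the gain over time. It is not a description of the AGC used in Fusion.

```python
import math


def linked_agc_gain(w_frame, target_rms=0.1, max_gain=8.0):
    """Compute one gain for a frame from the omnidirectional W channel.

    target_rms and max_gain are illustrative values, not tuned settings.
    """
    rms = math.sqrt(sum(s * s for s in w_frame) / len(w_frame))
    if rms == 0.0:
        return 1.0  # silence: leave the frame untouched
    return min(target_rms / rms, max_gain)


def apply_linked_agc(frame, target_rms=0.1, max_gain=8.0):
    """Apply the same gain to all channels of a first-order Ambisonics frame.

    frame maps channel names ('W', 'X', 'Y', 'Z') to lists of samples.
    Because every channel is scaled equally, the ratios between channels,
    and therefore the encoded directions, are unchanged.
    """
    gain = linked_agc_gain(frame["W"], target_rms, max_gain)
    return {name: [gain * s for s in samples] for name, samples in frame.items()}


# A quiet source at 45 degrees to the left, two samples long.
quiet = {
    "W": [0.010, -0.010],
    "X": [0.00707, -0.00707],
    "Y": [0.00707, -0.00707],
    "Z": [0.0, 0.0],
}
louder = apply_linked_agc(quiet)

# The level rises, but the direction implied by Y and X stays the same.
before = math.degrees(math.atan2(quiet["Y"][0], quiet["X"][0]))
after = math.degrees(math.atan2(louder["Y"][0], louder["X"][0]))
print(before, after)
```

Applying independent gains per channel, as one might do with separate mono tracks, would change those ratios and shift the apparent position of the sound.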

Finally, before uploading to YouTube or other platforms that support spherical content, the files are injected with metadata that signals to those platforms that the video and audio are from a spherical recording.

Above: Example of spatial audio out in the wild.

The audio team at GoPro is constantly exploring new ways to deliver audio with best-in-class quality. While spatial audio may seem like the natural way to present spherical content, it has not yet become standard for consumer cameras because the technology is not trivial, and we are proud of the advancements we are delivering with Fusion. As a pioneer in spherical and 360-degree content capture, GoPro has been steadily refining its spherical audio capabilities. By combining best-in-class spatial audio with Fusion’s spherical video, we are excited to be enabling a fully immersive experience: see what you hear and hear what you see, just like you do in real life.