D-SAV360

A Dataset of Gaze Scanpaths on 360° Ambisonic Videos

Dataset: 360° Videos with Ambisonics

Our dataset consists of 50 stereoscopic and 35 monoscopic videos with ambisonic audio. Monoscopic videos have a resolution of 3840x1920 and stereoscopic videos of 3840x3840, both at 60 fps in H.264 format. Each video is accompanied by its corresponding ambisonic audio file, encoded as WAV with a sampling rate of 48 kHz. Please browse the table below to explore the available videos in our dataset.

D-SAV360 includes a selection of 35 monoscopic videos from the YouTube video collection of Morgado et al. [2]. The 50 stereoscopic videos were captured with a Kandao Obsidian camera, which additionally provides depth estimation for the videos, and their ambisonic audio was recorded with a Zoom H2n microphone. The camera was mounted statically on a tripod at a height similar to that of a standing person, ensuring that there was no camera movement. The estimated depth videos are provided with a resolution of 3840x3840, 50 fps, and H.264 format. Additionally, we offer the raw recordings from the six fish-eye lenses, with a resolution of 3000x2160, 50 fps, mono audio, and H.264 format.
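For reference, the following minimal sketch shows one way to read a video frame and the accompanying four-channel ambisonic WAV in Python. The file names used here are placeholders, not the dataset's actual naming scheme.

    # Minimal sketch (placeholder file names): read one frame of a monoscopic video
    # (3840x1920, 60 fps, H.264) and its first-order ambisonic audio track
    # (4-channel WAV, 48 kHz).
    import cv2              # pip install opencv-python
    import soundfile as sf  # pip install soundfile

    cap = cv2.VideoCapture("video_0001.mp4")   # placeholder file name
    ok, frame = cap.read()                     # frame: H x W x 3 BGR array
    cap.release()

    audio, sr = sf.read("video_0001.wav")      # audio: (num_samples, 4), sr == 48000
    print(frame.shape if ok else "no frame", audio.shape, sr)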

Alternative download link: https://zenodo.org/records/10043919

Captured Gaze and Head Data

To collect eye-tracking data, we used the SRanipal Unity SDK developed for the Tobii eye tracker integrated into the HTC Vive Pro Eye. We provide head and gaze data for 87 participants, as well as the computed saliency maps for each video frame. Refer to the main paper (Section 5.1) for more information on how the saliency maps are obtained. We provide the code for extracting the ground-truth saliency maps in the GitHub repository. The data is provided in one CSV file per video, with the following columns for the gaze data:

The head data CSV files have the following columns:

Alternative download link: https://zenodo.org/records/10043919
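As an example of how this data can be consumed, the short sketch below loads the per-video gaze and head CSV files with pandas. The file names are placeholders; the column names are those listed above.

    # Hedged sketch (placeholder file names): load the gaze and head data of one
    # video and inspect the columns described above.
    import pandas as pd

    gaze = pd.read_csv("video_0001_gaze.csv")
    head = pd.read_csv("video_0001_head.csv")

    print(gaze.columns.tolist())   # gaze columns
    print(head.columns.tolist())   # head columns
    print(len(gaze), len(head))    # number of recorded samples per file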


Videos with the overlaid saliency maps obtained from participants:


Computed Optical Flow and Audio Energy Maps

We computed an optical flow estimation between consecutive frames using RAFT [1]. We provide the optical flow as an RGB representation for each frame in PNG format with a resolution of 1024x540. To compute the optical flow, we downsample the original videos to 8 fps. We also provide the computed Audio Energy Maps for each video frame, following the approach of Morgado et al. [2]. The Audio Energy Maps are provided for each frame in PNG format. Please acknowledge the authors and paper of RAFT if you use the optical flow estimation in your research, and Morgado et al. and their paper if you use the Audio Energy Maps.
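The Audio Energy Maps are distributed as precomputed PNGs, but to illustrate the underlying idea, the rough sketch below steers a simple first-order-ambisonics beamformer over an equirectangular grid and accumulates the energy of one frame-length window. The channel ordering (ACN: W, Y, Z, X), the beam shape, and the file name are assumptions for illustration; this approximates, but does not reproduce, the maps of Morgado et al. [2], so rely on the provided PNGs for research use.

    # Rough illustrative sketch: a directional audio energy map from the
    # first-order ambisonic track. ACN channel order (W, Y, Z, X) and the
    # cardioid-style beamformer are assumptions; this only approximates the
    # Audio Energy Maps of Morgado et al. [2].
    import numpy as np
    import soundfile as sf

    audio, sr = sf.read("video_0001.wav")      # (num_samples, 4), 48 kHz
    win = audio[: sr // 8]                     # one 1/8-second window (8 fps)

    # Equirectangular grid of look directions (longitude, latitude).
    H, W_px = 64, 128
    lon, lat = np.meshgrid(np.linspace(-np.pi, np.pi, W_px),
                           np.linspace(np.pi / 2, -np.pi / 2, H))
    dx = np.cos(lat) * np.cos(lon)
    dy = np.cos(lat) * np.sin(lon)
    dz = np.sin(lat)

    # Channel covariance over the window, then the energy g^T C g of a beam
    # steered toward every pixel direction.
    C = win.T @ win                                               # 4 x 4
    g = 0.5 * np.stack([np.ones_like(dx), dy, dz, dx], axis=-1)   # H x W x 4
    energy_map = np.einsum("hwi,ij,hwj->hw", g, C, g)             # H x W_px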

Alternative download link: https://zenodo.org/records/10043919

Video Summary

We provide a summary of the videos and their features in the table below. While the original videos have a duration of 60 seconds, we selected the most pertinent 30 seconds of each video for data collection; as a result, participants' gaze data is only available for those 30 seconds. The column "Start Timestamp" indicates the second at which the original 60-second video began playing. During the data collection process, we instructed participants to focus their gaze on a red cube to initiate video playback, thus ensuring that all participants started at the same point (see Section 4.1 in the paper for more details). The cube's position is given in normalized pixel coordinates (u, v), with 'u' representing the horizontal coordinate and 'v' the vertical coordinate. The precise 'u' coordinate of the cube for each video is provided in the "Cube Position" column of the table, while the 'v' coordinate remained fixed at zero. The table additionally indicates the features available for each video (monoscopic, stereoscopic, or depth).

Table columns: Name | Representative Frame | Start Timestamp | Cube Position | Monoscopic | Stereoscopic | Depth
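To relate the cube position in the table to an on-screen location, the hedged sketch below converts a normalized 'u' coordinate into a pixel column and a yaw angle on the equirectangular frame. The mapping convention assumed here (u running left to right over the full 360°, yaw 0° at the frame center) is an illustration rather than part of the dataset specification.

    # Hedged sketch (assumed convention): map the normalized 'u' cube coordinate
    # to a pixel column and a yaw angle on a 3840-pixel-wide equirectangular frame.
    def cube_u_to_column_and_yaw(u, width=3840):
        column = u * (width - 1)        # horizontal pixel coordinate
        yaw_deg = (u - 0.5) * 360.0     # degrees relative to the frame center
        return column, yaw_deg

    print(cube_u_to_column_and_yaw(0.25))   # e.g., a cube a quarter of the way across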

References

[1] Z. Teed and J. Deng. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In ECCV, 2020.

[2] P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang. Self-Supervised Generation of Spatial Audio for 360° Video. In NeurIPS, 2018.