D-SAV360: A Dataset of Gaze Scanpaths on 360º Ambisonic Videos

Dataset 360º Videos with Ambisonics

Our dataset comprises 50 stereoscopic and 35 monoscopic videos with ambisonic audio. Monoscopic videos have a resolution of 3840x1920 and stereoscopic videos 3840x3840; both are provided at 60 fps in H.264 format. Each video is accompanied by its corresponding ambisonic audio file, encoded in WAV format with a sampling rate of 48 kHz. Please browse the table below to explore the available videos.

The 35 monoscopic videos are a selection from the YouTube video collection of Morgado et al. [2]. The 50 stereoscopic videos were captured with a Kandao Obsidian camera, which additionally provides depth estimation for the videos, and the ambisonic audio was recorded with a Zoom H2n microphone. The camera was mounted statically on a tripod at a height similar to that of a standing person, with no camera movement.

The depth stereo videos are provided with a resolution of 3840x3840, 50 fps, and H.264 format. Additionally, we offer the raw recordings from the six fish-eye lenses with a resolution of 3000x2160, 50 fps, mono audio, and H.264 format.

*06-2025 Note: the total number of scanpaths in our dataset is 4,596, which differs from the number reported in the paper.

Alternative download link: https://zenodo.org/records/10043919

Captured Gaze and Head Data

To collect eye-tracking data, we used the SRanipal Unity SDK developed for the Tobii eye tracker integrated into the HTC Vive Pro Eye. We provide head and gaze data for 87 participants, along with the computed saliency maps for each video frame. Refer to the main paper (Section 5.1) for details on how the saliency maps are obtained; the code for extracting the ground-truth saliency maps is available in our GitHub repository. The gaze data is provided as one CSV file per video, with the following columns:

video: video name
stereo: flag indicating if the video was reproduced in stereoscopic mode
frame: number of the video frame with respect to the whole video duration of 60 seconds
t: timestamp of the sample in milliseconds
u: horizontal normalized panorama coordinate (range from 0 to 1)
v: vertical normalized panorama coordinate (range from 0 to 1)
fixation: flag indicating if the sample was classified as a fixation
id: participant's id
valid_fix_classification: flag indicating if the classification as fixation is valid
opness_L: the openness of the left eye (range between 0 and 1)
opness_R: the openness of the right eye (range between 0 and 1)
pupil_L: diameter of the left pupil in mm
pupil_R: diameter of the right pupil in mm
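
As a minimal sketch of how the gaze columns above might be consumed, the following snippet keeps only the samples that are flagged as fixations and whose fixation classification is valid. It assumes a comma-separated file with a header row using the column names listed above, and it assumes the flags are stored as 0/1 or true/false; neither detail is confirmed here, so check them against the actual files. The video name and sample values in the in-memory example are made up for illustration.

```python
import csv
import io


def is_set(value):
    """Interpret a flag cell. Assumption: flags are 0/1 or true/false."""
    text = str(value).strip().lower()
    if text in ("true", "yes"):
        return True
    try:
        return float(text) != 0.0
    except ValueError:
        return False


def load_valid_fixations(csv_file):
    """Return gaze samples that are fixations with a valid classification."""
    samples = []
    for row in csv.DictReader(csv_file):
        if is_set(row["fixation"]) and is_set(row["valid_fix_classification"]):
            samples.append({
                "id": row["id"],
                "frame": int(float(row["frame"])),
                "t_ms": float(row["t"]),  # gaze timestamps are in milliseconds
                "u": float(row["u"]),
                "v": float(row["v"]),
            })
    return samples


# Hypothetical rows, only to show the expected layout.
example = io.StringIO(
    "video,stereo,frame,t,u,v,fixation,id,valid_fix_classification,"
    "opness_L,opness_R,pupil_L,pupil_R\n"
    "0001,1,600,10000,0.50,0.50,1,7,1,1.0,1.0,3.1,3.0\n"
    "0001,1,601,10016,0.52,0.50,0,7,1,1.0,1.0,3.1,3.0\n"
    "0001,1,602,10033,0.90,0.10,1,7,0,0.2,0.3,3.1,3.0\n"
)
fixations = load_valid_fixations(example)
```

With a real file, the same function can be called on an opened file object, for example `load_valid_fixations(open(path, newline=""))`, where `path` points to one of the per-video gaze CSV files.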

The head data CSV files have the following columns:

video: video name
stereo: flag indicating if the video was reproduced in stereoscopic mode
t: timestamp of the sample in seconds
u: horizontal normalized panorama coordinate (range from 0 to 1)
v: vertical normalized panorama coordinate (range from 0 to 1)
id: participant's id
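
Because u and v are normalized panorama coordinates, distances between two samples are more meaningful on the sphere than in the image plane. The sketch below converts (u, v) to longitude and latitude and computes the angular distance between two samples. The mapping is an assumption: it treats the panorama as equirectangular, with u = 0.5 at the horizontal centre and v = 0 at the top of the frame. The dataset's actual convention should be verified before relying on the result.

```python
import math


def uv_to_lon_lat(u, v):
    """Map normalized panorama coordinates to (longitude, latitude) in radians.

    Assumed convention (not confirmed by the dataset description):
    equirectangular projection, u = 0.5 is longitude 0, v = 0 is the top.
    """
    lon = (u - 0.5) * 2.0 * math.pi  # range [-pi, pi]
    lat = (0.5 - v) * math.pi        # range [pi/2 (top), -pi/2 (bottom)]
    return lon, lat


def angular_distance_deg(uv_a, uv_b):
    """Great-circle angle in degrees between two (u, v) samples."""
    lon1, lat1 = uv_to_lon_lat(*uv_a)
    lon2, lat2 = uv_to_lon_lat(*uv_b)
    cos_angle = (math.sin(lat1) * math.sin(lat2)
                 + math.cos(lat1) * math.cos(lat2) * math.cos(lon1 - lon2))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))
```

Under this convention, samples at u = 0 and u = 1 on the same row describe the same direction, which a plain Euclidean distance in (u, v) would miss.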

Alternative download link: https://zenodo.org/records/10043919

Videos with the overlaid saliency maps obtained from participants:

Computed Optical Flow and Audio Energy Maps

We computed an optical flow estimation between frames using RAFT [1]. We provide the optical flow as an RGB representation for each frame, in PNG format with a resolution of 1024x540. To compute the optical flow, we downsample the original videos to 8 fps. We also provide the computed Audio Energy Maps for each video frame, following the approach of Morgado et al. [2]. The Audio Energy Maps are provided for each frame in PNG format. Please acknowledge the authors and paper of RAFT if you use the optical flow estimation in your research, and Morgado et al. and their paper if you use the Audio Energy Maps.
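
Since the optical flow is computed at 8 fps while the videos have a higher frame rate, aligning a video frame with a flow image needs an index conversion. The following is a rough sketch, assuming the flow samples are uniformly spaced, start at time zero, and are indexed from 0; the file naming and indexing of the released flow images are not described here, so treat the function name and defaults as illustrative.

```python
def flow_index_for_frame(frame, video_fps=60, flow_fps=8):
    """Index of the latest flow sample at or before a given video frame.

    Assumptions: both streams start at t = 0, flow samples are uniformly
    spaced at `flow_fps`, and indices are 0-based. Integer arithmetic
    avoids floating-point rounding at exact boundaries.
    """
    if frame < 0:
        raise ValueError("frame must be non-negative")
    return (frame * flow_fps) // video_fps


# Video frame 59 (just under one second at 60 fps) falls in flow sample 7,
# and frame 60 (exactly one second) starts flow sample 8.
first_second_last = flow_index_for_frame(59)
second_second_first = flow_index_for_frame(60)
```

For the videos described as 50 fps, pass `video_fps=50` instead of the default.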

Alternative download link: https://zenodo.org/records/10043919

UPDATE: We now offer an enhanced version of AEM from the AViSal360 paper. For more details, refer to the main publication and its supplementary materials.

Video Summary

The table below summarizes the videos and their features. While the original videos have a duration of 60 seconds, we selected the most pertinent 30 seconds of each video for data collection; as a result, participants' gaze data is only available for those 30 seconds. The "Start Timestamp" column shows the second at which the original 60-second video began playing. During data collection, we instructed participants to focus their gaze on a red cube to initiate video playback, ensuring that all participants started at the same point (see Section 4.1 in the paper for more details). The cube was displayed at normalized pixel coordinates (u, v), with 'u' representing the horizontal coordinate and 'v' the vertical coordinate. The precise 'u' coordinate of the cube for each video is provided in the "Cube Position" column, while the 'v' coordinate remained fixed at zero. The table also lists the features available for each video (monoscopic, stereoscopic, or depth).
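
Because gaze frame numbers refer to the full 60-second video while viewing covered only a 30-second excerpt, it can be useful to express a sample's position relative to the start of the excerpt. The sketch below does this under two assumptions: frame numbers are 0-based, and the video plays at 60 fps. The function name and the `clip_length_s` parameter are illustrative and not part of the dataset.

```python
def clip_time_seconds(frame, start_timestamp_s, video_fps=60, clip_length_s=30):
    """Seconds elapsed since the viewed excerpt started, or None if outside it.

    `frame` is counted over the whole 60-second video (assumed 0-based), and
    `start_timestamp_s` is the value from the "Start Timestamp" column.
    """
    t_video = frame / video_fps           # position in the full video
    t_clip = t_video - start_timestamp_s  # position within the excerpt
    if 0 <= t_clip <= clip_length_s:
        return t_clip
    return None


# Frame 900 of a 60 fps video is at 15 s; with a start timestamp of 10 s,
# the sample lies 5 s into the viewed excerpt.
offset = clip_time_seconds(900, 10)
```

Samples that fall outside the excerpt return `None`, which makes it easy to spot a wrong frame rate or start timestamp.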

UPDATE: We offer a CSV file with the categorization between indoor/outdoor and exploratory:
