6. Dataset
● Text (CSV) files describing, for each segment, the YouTube video ID, start time, end time, and
one or more labels.
● 128-dimensional audio features extracted at 1 Hz. The audio features were extracted with a
VGG-inspired acoustic model, then PCA-transformed and quantized to be compatible with the
audio features provided with YouTube-8M. They are stored as TFRecord files.
7. Dataset
● Evaluation
20,383 segments from distinct videos, providing at least 59 examples for each of the 527 sound
classes in use. Because of label co-occurrence, many classes have more examples.
● Balanced train
22,176 segments from distinct videos, chosen by the same criterion: at least 59 examples per
class, using the fewest total segments.
● Unbalanced train
2,042,985 segments from distinct videos, representing the remainder of the dataset.
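The per-class counts behind these splits can be checked directly from the segment CSV files. A small sketch, assuming the AudioSet CSV layout (YTID, start_seconds, end_seconds, then a quoted comma-separated label field); the sample rows here are illustrative, not real dataset entries:

```python
import csv
from collections import Counter
from io import StringIO

# Toy stand-in for a segments CSV file; real files follow this layout.
SAMPLE = StringIO(
    '# YTID, start_seconds, end_seconds, positive_labels\n'
    '--PJHxphWEs, 30.000, 40.000, "/m/09x0r,/t/dd00088"\n'
    '--ZhevVpy1s, 50.000, 60.000, "/m/09x0r"\n'
)

def label_counts(csv_file) -> Counter:
    """Count how many segments carry each label."""
    counts = Counter()
    reader = csv.reader(csv_file, skipinitialspace=True)
    for row in reader:
        if not row or row[0].startswith('#'):
            continue  # skip comment/header lines
        # The last field holds one or more labels separated by commas.
        counts.update(row[3].split(','))
    return counts

print(label_counts(SAMPLE))  # /m/09x0r appears in both sample segments
```

Because one segment can carry several labels, the sum of counts exceeds the number of segments, which is why many classes end up with more than the 59-example minimum.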
11. Training results
● Balanced training
Model name | Training time | Training last-step hit | Evaluation average hit
Logistic | 14m 3s | 0.5859 | 0.556
DBoF | 31m 46s | 1 | 0.522
LSTM | 1h 45m 53s | 0.9883 | 0.4581
● Unbalanced training
Model name | Training time | Training last-step hit | Evaluation average hit
Logistic | 2h 4m 14s | 0.875 | 0.5125
DBoF | 4h 39m 29s | 0.8848 | 0.5605
LSTM | 9h 42m 52s | 0.8691 | 0.5396
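The "hit" columns above are presumably the Hit@1 metric from the YouTube-8M evaluation code: the fraction of examples whose single top-scoring class is among the ground-truth labels. A minimal sketch of that metric, under that assumption:

```python
import numpy as np

def hit_at_one(predictions: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of examples whose highest-scored class is a true label.

    predictions: (num_examples, num_classes) model scores.
    labels:      (num_examples, num_classes) 0/1 ground truth.
    """
    top_class = np.argmax(predictions, axis=1)
    hits = labels[np.arange(len(labels)), top_class]
    return float(hits.mean())

# Toy check: the first example's top class (index 2) is a true label,
# the second example's top class (index 0) is not.
preds = np.array([[0.1, 0.2, 0.9], [0.8, 0.1, 0.3]])
truth = np.array([[0, 1, 1], [0, 0, 1]])
print(hit_at_one(preds, truth))  # 0.5
```

Note how multi-label ground truth makes this metric forgiving: any one of a segment's labels in the top slot counts as a hit.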