System description

[Figure: System block diagram.]

MLP-based system, DCASE2017 baseline

A multilayer perceptron (MLP) based system was selected as the baseline system for DCASE2017. Its main structure is close to that of current state-of-the-art systems, which are based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs), so it provides a good starting point for further development. The system is implemented around Keras, a high-level neural network API written in Python. Keras runs on top of multiple computation backends, of which Theano was chosen for this system.

System details:

  • Acoustic features: log mel-band energies extracted in 40 ms windows with a 20 ms hop size (see the feature-extraction sketch after this list).
  • Machine learning: neural network approach using a multilayer perceptron (MLP) type of network (2 layers with 50 neurons each, and 20% dropout between layers).
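
As a rough illustration of this front end, the sketch below computes log mel-band energies with librosa and stacks a 5-frame context to form the 200-dimensional feature vectors listed in the hyperparameter table that follows. The function name, the 44.1 kHz sample rate, and the flooring constant are illustrative assumptions, not details taken from the baseline code.

```python
import numpy as np
import librosa

def extract_log_mel(audio_path, n_mels=40, context=5):
    """Log mel-band energies stacked into 5-frame context vectors (40 x 5 = 200)."""
    y, sr = librosa.load(audio_path, sr=44100, mono=True)   # average channels to mono
    mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                         n_fft=int(0.04 * sr),       # 40 ms window
                                         hop_length=int(0.02 * sr),  # 20 ms hop
                                         n_mels=n_mels)
    log_mel = np.log(mel + 1e-10)   # small constant avoids log(0)
    half = context // 2
    # Stack each frame with its neighbours into one 200-dimensional vector
    vectors = [log_mel[:, i - half:i + half + 1].T.ravel()
               for i in range(half, log_mel.shape[1] - half)]
    return np.array(vectors)        # shape: (n_frames - context + 1, 200)
```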

System hyperparameters

|                          | Scene Classification (Task 1) | Binary SED (Task 2) | Multiclass SED (Task 3) |
|--------------------------|-------------------------------|---------------------|-------------------------|
| **Audio input**          |                               |                     |                         |
| Channels                 | 1 (multi-channel audio averaged to mono) | 1 (averaged to mono) | 1 (averaged to mono) |
| Normalization            | none                          | none                | none                    |
| **Acoustic features (Librosa 0.5)** |                    |                     |                         |
| Type                     | log mel energies              | log mel energies    | log mel energies        |
| Window length            | 40 ms                         | 40 ms               | 40 ms                   |
| Hop length               | 20 ms                         | 20 ms               | 20 ms                   |
| Mel bands                | 40                            | 40                  | 40                      |
| **Feature vector**       |                               |                     |                         |
| Aggregation              | 5-frame context               | 5-frame context     | 5-frame context         |
| Length                   | 200                           | 200                 | 200                     |
| **Neural network (Keras 2.0 + Theano, CPU device)** |    |                     |                         |
| Layers                   | 2 dense layers, 20% dropout between | 2 dense layers, 20% dropout between | 2 dense layers, 20% dropout between |
| Hidden units per layer   | 50                            | 50                  | 50                      |
| Initialization           | uniform                       | uniform             | uniform                 |
| Activation               | ReLU                          | ReLU                | ReLU                    |
| Output layer activation  | softmax                       | sigmoid             | sigmoid                 |
| Optimizer                | Adam                          | Adam                | Adam                    |
| Learning rate            | 0.001                         | 0.001               | 0.001                   |
| Epochs                   | 200, early stopping (monitoring starts after epoch 100, patience 10 epochs) | as Task 1 | as Task 1 |
| Batch size               | 256                           | 256                 | 256                     |
| Number of parameters     | 12906                         | 12906               | 12906                   |
| **Decision**             |                               |                     |                         |
| Decision                 | per-frame binarization + majority vote | per-frame binarization + sliding median filtering | per-frame binarization + sliding median filtering |
| Median filter window     | –                             | 0.54 s              | 0.54 s                  |
| Binarization threshold   | 0.5                           | 0.5                 | 0.5                     |
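
Given the hyperparameters above, the network can be reconstructed in a few lines of Keras 2.0 code. The sketch below is a plausible reconstruction, not the baseline's actual implementation; in particular the loss functions (categorical cross-entropy for the softmax output, binary cross-entropy for the sigmoid outputs) are assumptions, and n_classes depends on the task.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

def build_mlp(input_dim=200, n_classes=15, output_activation='softmax'):
    """Two 50-unit ReLU layers with 20% dropout, per the hyperparameter table."""
    model = Sequential()
    model.add(Dense(50, input_dim=input_dim,
                    kernel_initializer='uniform', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(50, kernel_initializer='uniform', activation='relu'))
    model.add(Dropout(0.2))
    # Softmax output for scene classification (Task 1),
    # sigmoid outputs for the SED tasks (Tasks 2 and 3).
    model.add(Dense(n_classes, kernel_initializer='uniform',
                    activation=output_activation))
    loss = ('categorical_crossentropy' if output_activation == 'softmax'
            else 'binary_crossentropy')
    model.compile(loss=loss, optimizer=Adam(lr=0.001))
    return model
```

Training would then call model.fit() with batch_size=256 for up to 200 epochs. Note that keras.callbacks.EarlyStopping covers the 10-epoch patience, but has no built-in option for delaying monitoring until epoch 100; that behaviour would need a small custom callback.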

GMM-based approach

A secondary system based on Gaussian mixture models (GMMs) is also included in the baseline in order to enable comparison with the traditional systems presented in the literature. The implementation of the GMM-based system closely follows the baseline system used in the DCASE2016 Challenge for Task 1 and Task 3. More details about the DCASE2016 system can be found in:

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. “TUT database for acoustic scene classification and sound event detection.” In Proceedings of the 24th European Signal Processing Conference (EUSIPCO 2016), Budapest, Hungary, 2016.

System details:

  • Acoustic features: 20 static MFCC coefficients (including the 0th) + 20 delta coefficients + 20 acceleration coefficients = 60 values per frame, calculated in a 40 ms analysis window with a 50% hop size (see the sketch after this list).
  • Machine learning: Gaussian mixture models, with 16 Gaussians per class model.
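
A hedged sketch of this front end with librosa: librosa.feature.mfcc produces the static coefficients and librosa.feature.delta the delta and acceleration streams, and dropping the 0th static coefficient yields the 59-dimensional variant used for the detection tasks (see the hyperparameter table below). The function name and sample rate are illustrative assumptions.

```python
import numpy as np
import librosa

def extract_mfcc_features(audio_path, include_zeroth=True):
    """20 static MFCCs + 20 deltas + 20 accelerations per 40 ms frame (20 ms hop)."""
    y, sr = librosa.load(audio_path, sr=44100, mono=True)    # average channels to mono
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_mels=40,
                                n_fft=int(0.04 * sr), hop_length=int(0.02 * sr))
    delta = librosa.feature.delta(mfcc, width=9)             # 9-frame delta window
    accel = librosa.feature.delta(mfcc, width=9, order=2)    # acceleration (2nd order)
    if not include_zeroth:
        mfcc = mfcc[1:, :]   # SED tasks omit the 0th static coefficient -> 59 values
    return np.vstack([mfcc, delta, accel]).T                 # (n_frames, 60 or 59)
```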

System hyperparameters

|                          | Scene Classification (Task 1) | Binary SED (Task 2) | Multiclass SED (Task 3) |
|--------------------------|-------------------------------|---------------------|-------------------------|
| **Audio input**          |                               |                     |                         |
| Channels                 | 1 (multi-channel audio averaged to mono) | 1 (averaged to mono) | 1 (averaged to mono) |
| Normalization            | none                          | none                | none                    |
| **Acoustic features (Librosa 0.5)** |                    |                     |                         |
| Type                     | MFCC (static, delta, acceleration) | MFCC (static, delta, acceleration) | MFCC (static, delta, acceleration) |
| Window length            | 40 ms                         | 40 ms               | 40 ms                   |
| Hop length               | 20 ms                         | 20 ms               | 20 ms                   |
| Mel bands                | 40                            | 40                  | 40                      |
| Number of coefficients   | 20                            | 20                  | 20                      |
| Delta window             | 9                             | 9                   | 9                       |
| **Feature vector**       |                               |                     |                         |
| Aggregation              | MFCC + delta + acceleration   | MFCC (0th omitted) + delta + acceleration | MFCC (0th omitted) + delta + acceleration |
| Length                   | 60                            | 59                  | 59                      |
| **Gaussian mixtures (scikit-learn GaussianMixture)** |   |                     |                         |
| Number of Gaussians      | 16                            | 16                  | 16                      |
| Covariance type          | diagonal                      | diagonal            | diagonal                |
| Number of parameters     | 1936                          | 1904                | 1904                    |
| Modelling                | one model per scene class     | model pair per event class (positive and negative model) | model pair per event class (positive and negative model) |
| Decision                 | likelihood accumulation + maximum | sliding likelihood accumulation + likelihood ratio + thresholding | sliding likelihood accumulation + likelihood ratio + thresholding |
| Accumulation window      | signal length                 | 0.5 s               | 1.0 s                   |
| Decision threshold       | –                             | 200                 | 100                     |
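
For the scene classification case, the modelling and decision rows above map onto scikit-learn in a few lines. This is a minimal sketch, assuming features are already extracted and normalized; the detection tasks would instead fit a positive/negative model pair per event class and threshold a sliding accumulated likelihood ratio.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_scene_models(features_per_class):
    """Fit one 16-component, diagonal-covariance GMM per scene class."""
    models = {}
    for label, feats in features_per_class.items():   # feats: (n_frames, n_dims)
        gmm = GaussianMixture(n_components=16, covariance_type='diag')
        models[label] = gmm.fit(feats)
    return models

def classify_scene(models, feats):
    """Accumulate frame log-likelihoods over the whole signal, pick the maximum."""
    scores = {label: gmm.score_samples(feats).sum()
              for label, gmm in models.items()}
    return max(scores, key=scores.get)
```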

Processing blocks

[Figure: Processing blocks of the system.]

The system implements the following basic processing blocks:

  1. Initialization, see initialize()
  • Prepares the dataset:
    • Downloads the dataset from the Internet if needed
    • Extracts the dataset package if needed
    • Makes sure the meta files are appropriately formatted
  2. Feature extraction, see feature_extraction()
  • Goes through all the training material and extracts the acoustic features
  • Features are stored file by file on the local disk (pickle files)
  3. Feature normalization, see feature_normalization()
  • Goes through the training material in the evaluation folds and calculates the global mean and standard deviation of the data per fold (see the sketch after this list)
  • Stores the normalization factors per fold (pickle files)
  4. System training, see system_training()
  • Loads the normalizers
  • Loads the training material file by file, forms the feature matrix, normalizes it, and optionally aggregates features
  • Trains the system with the features and metadata
  • Stores the trained acoustic models on the local disk (pickle files)
  5. System testing, see system_testing()
  • Goes through the testing material and performs the classification / detection
  • Stores the results (text files)
  6. System evaluation, see system_evaluation()
  • Reads the ground truth and the system output and calculates the evaluation metrics
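
As an illustration of step 3, the per-fold statistics can be accumulated file by file, so the whole fold never has to be held in memory at once. The class below is a hypothetical sketch, not the baseline's actual API.

```python
import pickle
import numpy as np

class Normalizer(object):
    """Accumulates per-fold feature statistics file by file (hypothetical sketch)."""

    def __init__(self, n_dims):
        self.n = 0                    # total frame count
        self.s1 = np.zeros(n_dims)    # running sum
        self.s2 = np.zeros(n_dims)    # running sum of squares

    def accumulate(self, feats):
        """feats: (n_frames, n_dims) feature matrix from one training file."""
        self.n += feats.shape[0]
        self.s1 += feats.sum(axis=0)
        self.s2 += (feats ** 2).sum(axis=0)

    def finalize(self):
        """Global mean and standard deviation over everything accumulated so far."""
        mean = self.s1 / self.n
        std = np.sqrt(self.s2 / self.n - mean ** 2)
        return mean, std

    def save(self, path):
        """Store the normalization factors for this fold as a pickle file."""
        with open(path, 'wb') as f:
            pickle.dump(self.finalize(), f)
```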