System description
MLP based system, DCASE2017 baseline
A multilayer perceptron (MLP) based system was selected as the baseline system for DCASE2017. Its main structure is close to the current state-of-the-art systems based on recurrent neural networks (RNN) and convolutional neural networks (CNN), and it therefore provides a good starting point for further development. The system is implemented on top of Keras, a high-level neural network API written in Python. Keras runs on multiple computation backends, of which Theano was chosen for this system.
System details:
- Acoustic features: log mel-band energies extracted in 40 ms windows with a 20 ms hop size.
- Machine learning: a multilayer perceptron (MLP) network (2 layers with 50 neurons each, and 20% dropout between the layers).
System hyperparameters
| Application | Scene Classification (Task 1) | Binary SED (Task 2) | Multiclass SED (Task 3) |
|---|---|---|---|
| **Audio input** | | | |
| Channels | 1 (multichannel audio is averaged to mono) | 1 (multichannel audio is averaged to mono) | 1 (multichannel audio is averaged to mono) |
| Normalization | none | none | none |
| **Acoustic features (librosa 0.5)** | | | |
| Type | log mel-band energies | log mel-band energies | log mel-band energies |
| Window length | 40 ms | 40 ms | 40 ms |
| Hop length | 20 ms | 20 ms | 20 ms |
| Mel bands | 40 | 40 | 40 |
| **Feature vector** | | | |
| Aggregation | 5-frame context | 5-frame context | 5-frame context |
| Length | 200 | 200 | 200 |
| **Neural network (Keras 2.0 + Theano, CPU device)** | | | |
| Layers | 2 dense layers with 20% dropout between them | 2 dense layers with 20% dropout between them | 2 dense layers with 20% dropout between them |
| Hidden units per layer | 50 | 50 | 50 |
| Initialization | uniform | uniform | uniform |
| Activation | ReLU | ReLU | ReLU |
| Output layer type | softmax | sigmoid | sigmoid |
| Optimizer | Adam | Adam | Adam |
| Learning rate | 0.001 | 0.001 | 0.001 |
| Epochs | 200 max, with early stopping (monitored from epoch 100, patience 10 epochs) | 200 max, with early stopping (monitored from epoch 100, patience 10 epochs) | 200 max, with early stopping (monitored from epoch 100, patience 10 epochs) |
| Batch size | 256 | 256 | 256 |
| Number of parameters | 12906 | 12906 | 12906 |
| Decision | per-frame binarization + majority vote | per-frame binarization + sliding median filtering | per-frame binarization + sliding median filtering |
| Median filter window | – | 0.54 s | 0.54 s |
| Binarization threshold | 0.5 | 0.5 | 0.5 |
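The parameter count in the table can be reproduced with simple arithmetic, assuming a 200-dimensional input, two 50-unit hidden layers and 6 output units; the output size is an assumption on our part, since the table lists a single number for all tasks:

```python
# A dense layer has inputs * units weights plus units biases;
# dropout layers add no trainable parameters.
def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units

hidden_1 = dense_params(200, 50)  # feature vector -> first hidden layer
hidden_2 = dense_params(50, 50)   # first -> second hidden layer
output = dense_params(50, 6)      # second hidden layer -> output (assumed size)

total = hidden_1 + hidden_2 + output
print(hidden_1, hidden_2, output, total)  # 10050 2550 306 12906
```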
GMM based approach
A secondary system based on Gaussian mixture models (GMM) is also included in the baseline in order to enable comparison with the traditional systems presented in the literature. The GMM-based implementation is very similar to the baseline system used in the DCASE2016 Challenge for Task 1 and Task 3. More details about the DCASE2016 system can be found in:
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, "TUT database for acoustic scene classification and sound event detection", in Proceedings of the 24th European Signal Processing Conference (EUSIPCO 2016), Budapest, Hungary, 2016. [PDF]
System details:
- Acoustic features: 20 static MFCC coefficients (including the 0th) + 20 delta coefficients + 20 acceleration coefficients = 60 values, calculated in a 40 ms analysis window with a 50% hop size (20 ms)
- Machine learning: Gaussian mixture models, 16 Gaussians per class model
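The per-model parameter counts listed in the hyperparameter table follow directly from the model size, counting for each component a mean vector, a diagonal variance vector and a mixture weight:

```python
# Parameters of a diagonal-covariance GMM with K components in D dimensions:
# K*D means + K*D variances + K mixture weights.
def gmm_params(n_dims, n_components):
    return 2 * n_components * n_dims + n_components

print(gmm_params(60, 16))  # 1936 (Task 1: full 60-dim feature vector)
print(gmm_params(59, 16))  # 1904 (Tasks 2/3: 0th MFCC coefficient omitted)
```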
System hyperparameters
| Application | Scene Classification (Task 1) | Binary SED (Task 2) | Multiclass SED (Task 3) |
|---|---|---|---|
| **Audio input** | | | |
| Channels | 1 (multichannel audio is averaged to mono) | 1 (multichannel audio is averaged to mono) | 1 (multichannel audio is averaged to mono) |
| Normalization | none | none | none |
| **Acoustic features (librosa 0.5)** | | | |
| Type | MFCC (static, delta, acceleration) | MFCC (static, delta, acceleration) | MFCC (static, delta, acceleration) |
| Window length | 40 ms | 40 ms | 40 ms |
| Hop length | 20 ms | 20 ms | 20 ms |
| Mel bands | 40 | 40 | 40 |
| Number of coefficients | 20 | 20 | 20 |
| Delta window | 9 | 9 | 9 |
| **Feature vector** | | | |
| Aggregation | MFCC + delta + acceleration | MFCC (0th omitted) + delta + acceleration | MFCC (0th omitted) + delta + acceleration |
| Length | 60 | 59 | 59 |
| **Gaussian mixtures (scikit-learn GaussianMixture)** | | | |
| Number of Gaussians | 16 | 16 | 16 |
| Covariance | diagonal | diagonal | diagonal |
| Number of parameters | 1936 | 1904 | 1904 |
| Modelling | one model per scene class | model pair per event class (negative and positive model) | model pair per event class (negative and positive model) |
| Decision | likelihood accumulation + maximum | sliding likelihood accumulation + likelihood ratio + thresholding | sliding likelihood accumulation + likelihood ratio + thresholding |
| Accumulation window | signal length | 0.5 s | 1.0 s |
| Decision threshold | – | 200 | 100 |
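The detection scheme for Tasks 2 and 3 (a positive/negative model pair per event class, sliding likelihood accumulation, likelihood ratio and thresholding) can be sketched with scikit-learn. The synthetic data, the window length in frames and the threshold value are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Synthetic 60-dim "MFCC" frames for one event class: event frames are
# shifted away from the background so the two models are separable
positive_train = rng.randn(500, 60) + 2.0
negative_train = rng.randn(500, 60)

# Model pair per event class: 16 diagonal-covariance Gaussians each
positive = GaussianMixture(n_components=16, covariance_type='diag',
                           random_state=0).fit(positive_train)
negative = GaussianMixture(n_components=16, covariance_type='diag',
                           random_state=0).fit(negative_train)

# Test signal: background, an event at frames 100-149, background again
test = np.vstack([rng.randn(100, 60),
                  rng.randn(50, 60) + 2.0,
                  rng.randn(100, 60)])

# Frame-wise log-likelihood ratio between the positive and negative model
llr = positive.score_samples(test) - negative.score_samples(test)

# Sliding likelihood accumulation, e.g. 0.5 s at a 20 ms hop = 25 frames
window = 25
accumulated = np.convolve(llr, np.ones(window), mode='same')

# Thresholding the accumulated ratio gives frame-wise event activity
activity = accumulated > 100.0
```

Decoding the binary activity vector into event onsets and offsets, and the per-task accumulation windows and thresholds, follow the table above.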
Processing blocks
The system implements the following basic processing blocks:
- Initialization, see `initialize()`
  - Prepares the dataset:
    - Downloads the dataset from the Internet if needed
    - Extracts the dataset package if needed
    - Makes sure that the meta files are appropriately formatted
- Feature extraction, see `feature_extraction()`
  - Goes through all the training material and extracts the acoustic features
  - Features are stored file-by-file on the local disk (pickle files)
- Feature normalization, see `feature_normalization()`
  - Goes through the training material in the evaluation folds and calculates the global mean and standard deviation of the data per fold
  - Stores the normalization factors per fold (pickle files)
- System training, see `system_training()`
  - Loads the normalizers
  - Loads the training material file-by-file, forms the feature matrix, normalizes it and optionally aggregates the features
  - Trains the system with the features and metadata
  - Stores the trained acoustic models on the local disk (pickle files)
- System testing, see `system_testing()`
  - Goes through the testing material and performs the classification / detection
  - Stores the results (text files)
- System evaluation, see `system_evaluation()`
  - Reads the ground truth and the system output and calculates the evaluation metrics
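As an illustration of the normalization step above, a minimal sketch; the function signatures and the file name are hypothetical and do not reflect the baseline's actual API:

```python
import os
import pickle
import tempfile
import numpy as np

def feature_normalization(fold_features, output_path):
    """Compute the global mean and std over all training files of one
    evaluation fold and store the factors as a pickle file."""
    all_features = np.vstack(fold_features)  # stack per-file feature matrices
    normalizer = {'mean': all_features.mean(axis=0),
                  'std': all_features.std(axis=0)}
    with open(output_path, 'wb') as f:
        pickle.dump(normalizer, f)
    return normalizer

def normalize(features, normalizer):
    """Apply stored normalization factors to a feature matrix."""
    return (features - normalizer['mean']) / normalizer['std']

# Illustrative fold with two "files" of 200-dim feature vectors
fold = [np.random.rand(100, 200), np.random.rand(80, 200)]
path = os.path.join(tempfile.mkdtemp(), 'norm_fold1.pickle')
factors = feature_normalization(fold, path)
normalized = normalize(fold[0], factors)
```

At training and testing time the stored factors are loaded again, so exactly the same normalization is applied to every file of the fold.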