System description

[Figure: System block diagram.]

MLP-based system, DCASE2017 baseline

A multilayer perceptron (MLP) based system was selected as the baseline system for DCASE2017. Its main structure is close to that of current state-of-the-art systems, which are based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs), so it provides a good starting point for further development. The system is implemented around Keras, a high-level neural network API written in Python. Keras runs on top of multiple computation backends, of which Theano was chosen for this system.

System details:

  • Acoustic features: log mel-band energies extracted in 40 ms windows with a 20 ms hop size (see the feature-extraction sketch after this list).
  • Machine learning: neural network approach using a multilayer perceptron (MLP) type of network (2 layers with 50 neurons each, and 20% dropout between layers).
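
As a rough illustration of this front end, the sketch below computes log mel-band energies with librosa and stacks a 5-frame context to form the 200-dimensional feature vectors listed in the hyperparameter table that follows. The function name, the 44.1 kHz sample rate, and the flooring constant are illustrative assumptions, not details taken from the baseline code.

```python
import numpy as np
import librosa

def extract_log_mel(audio_path, n_mels=40, context=5):
    """Log mel-band energies stacked into 5-frame context vectors (40 x 5 = 200)."""
    y, sr = librosa.load(audio_path, sr=44100, mono=True)   # average channels to mono
    mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                         n_fft=int(0.04 * sr),       # 40 ms window
                                         hop_length=int(0.02 * sr),  # 20 ms hop
                                         n_mels=n_mels)
    log_mel = np.log(mel + 1e-10)   # small constant avoids log(0)
    half = context // 2
    # Stack each frame with its neighbours into one 200-dimensional vector
    vectors = [log_mel[:, i - half:i + half + 1].T.ravel()
               for i in range(half, log_mel.shape[1] - half)]
    return np.array(vectors)        # shape: (n_frames - context + 1, 200)
```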

System hyperparameters

|                          | Scene Classification (Task 1) | Binary SED (Task 2) | Multiclass SED (Task 3) |
|--------------------------|-------------------------------|---------------------|-------------------------|
| **Audio input**          |                               |                     |                         |
| Channels                 | 1 (multi-channel audio averaged to mono) | 1 (averaged to mono) | 1 (averaged to mono) |
| Normalization            | none                          | none                | none                    |
| **Acoustic features (Librosa 0.5)** |                    |                     |                         |
| Type                     | log mel energies              | log mel energies    | log mel energies        |
| Window length            | 40 ms                         | 40 ms               | 40 ms                   |
| Hop length               | 20 ms                         | 20 ms               | 20 ms                   |
| Mel bands                | 40                            | 40                  | 40                      |
| **Feature vector**       |                               |                     |                         |
| Aggregation              | 5-frame context               | 5-frame context     | 5-frame context         |
| Length                   | 200                           | 200                 | 200                     |
| **Neural network (Keras 2.0 + Theano, CPU device)** |    |                     |                         |
| Layers                   | 2 dense layers, 20% dropout between | 2 dense layers, 20% dropout between | 2 dense layers, 20% dropout between |
| Hidden units per layer   | 50                            | 50                  | 50                      |
| Initialization           | uniform                       | uniform             | uniform                 |
| Activation               | ReLU                          | ReLU                | ReLU                    |
| Output layer activation  | softmax                       | sigmoid             | sigmoid                 |
| Optimizer                | Adam                          | Adam                | Adam                    |
| Learning rate            | 0.001                         | 0.001               | 0.001                   |
| Epochs                   | 200, early stopping (monitoring starts after epoch 100, patience 10 epochs) | as Task 1 | as Task 1 |
| Batch size               | 256                           | 256                 | 256                     |
| Number of parameters     | 12906                         | 12906               | 12906                   |
| **Decision**             |                               |                     |                         |
| Decision                 | per-frame binarization + majority vote | per-frame binarization + sliding median filtering | per-frame binarization + sliding median filtering |
| Median filter window     | –                             | 0.54 s              | 0.54 s                  |
| Binarization threshold   | 0.5                           | 0.5                 | 0.5                     |
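
Given the hyperparameters above, the network can be reconstructed in a few lines of Keras 2.0 code. The sketch below is a plausible reconstruction, not the baseline's actual implementation; in particular the loss functions (categorical cross-entropy for the softmax output, binary cross-entropy for the sigmoid outputs) are assumptions, and n_classes depends on the task.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

def build_mlp(input_dim=200, n_classes=15, output_activation='softmax'):
    """Two 50-unit ReLU layers with 20% dropout, per the hyperparameter table."""
    model = Sequential()
    model.add(Dense(50, input_dim=input_dim,
                    kernel_initializer='uniform', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(50, kernel_initializer='uniform', activation='relu'))
    model.add(Dropout(0.2))
    # Softmax output for scene classification (Task 1),
    # sigmoid outputs for the SED tasks (Tasks 2 and 3).
    model.add(Dense(n_classes, kernel_initializer='uniform',
                    activation=output_activation))
    loss = ('categorical_crossentropy' if output_activation == 'softmax'
            else 'binary_crossentropy')
    model.compile(loss=loss, optimizer=Adam(lr=0.001))
    return model
```

Training would then call model.fit() with batch_size=256 for up to 200 epochs. Note that keras.callbacks.EarlyStopping covers the 10-epoch patience, but has no built-in option for delaying monitoring until epoch 100; that behaviour would need a small custom callback.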

GMM-based approach

A secondary system based on Gaussian mixture models (GMMs) is also included in the baseline in order to enable comparison with the traditional systems presented in the literature. The implementation of the GMM-based system closely follows the baseline system used in the DCASE2016 Challenge for Task 1 and Task 3. More details about the DCASE2016 system can be found in:

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. “TUT database for acoustic scene classification and sound event detection.” In Proceedings of the 24th European Signal Processing Conference (EUSIPCO 2016), Budapest, Hungary, 2016.

System details:

  • Acoustic features: 20 static MFCC coefficients (including the 0th) + 20 delta coefficients + 20 acceleration coefficients = 60 values per frame, calculated in a 40 ms analysis window with a 50% hop size (see the sketch after this list).
  • Machine learning: Gaussian mixture models, with 16 Gaussians per class model.
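
A hedged sketch of this front end with librosa: librosa.feature.mfcc produces the static coefficients and librosa.feature.delta the delta and acceleration streams, and dropping the 0th static coefficient yields the 59-dimensional variant used for the detection tasks (see the hyperparameter table below). The function name and sample rate are illustrative assumptions.

```python
import numpy as np
import librosa

def extract_mfcc_features(audio_path, include_zeroth=True):
    """20 static MFCCs + 20 deltas + 20 accelerations per 40 ms frame (20 ms hop)."""
    y, sr = librosa.load(audio_path, sr=44100, mono=True)    # average channels to mono
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_mels=40,
                                n_fft=int(0.04 * sr), hop_length=int(0.02 * sr))
    delta = librosa.feature.delta(mfcc, width=9)             # 9-frame delta window
    accel = librosa.feature.delta(mfcc, width=9, order=2)    # acceleration (2nd order)
    if not include_zeroth:
        mfcc = mfcc[1:, :]   # SED tasks omit the 0th static coefficient -> 59 values
    return np.vstack([mfcc, delta, accel]).T                 # (n_frames, 60 or 59)
```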

System hyperparameters

|                          | Scene Classification (Task 1) | Binary SED (Task 2) | Multiclass SED (Task 3) |
|--------------------------|-------------------------------|---------------------|-------------------------|
| **Audio input**          |                               |                     |                         |
| Channels                 | 1 (multi-channel audio averaged to mono) | 1 (averaged to mono) | 1 (averaged to mono) |
| Normalization            | none                          | none                | none                    |
| **Acoustic features (Librosa 0.5)** |                    |                     |                         |
| Type                     | MFCC (static, delta, acceleration) | MFCC (static, delta, acceleration) | MFCC (static, delta, acceleration) |
| Window length            | 40 ms                         | 40 ms               | 40 ms                   |
| Hop length               | 20 ms                         | 20 ms               | 20 ms                   |
| Mel bands                | 40                            | 40                  | 40                      |
| Number of coefficients   | 20                            | 20                  | 20                      |
| Delta window             | 9                             | 9                   | 9                       |
| **Feature vector**       |                               |                     |                         |
| Aggregation              | MFCC + delta + acceleration   | MFCC (0th omitted) + delta + acceleration | MFCC (0th omitted) + delta + acceleration |
| Length                   | 60                            | 59                  | 59                      |
| **Gaussian mixtures (scikit-learn GaussianMixture)** |   |                     |                         |
| Number of Gaussians      | 16                            | 16                  | 16                      |
| Covariance type          | diagonal                      | diagonal            | diagonal                |
| Number of parameters     | 1936                          | 1904                | 1904                    |
| Modelling                | one model per scene class     | model pair per event class (positive and negative model) | model pair per event class (positive and negative model) |
| Decision                 | likelihood accumulation + maximum | sliding likelihood accumulation + likelihood ratio + thresholding | sliding likelihood accumulation + likelihood ratio + thresholding |
| Accumulation window      | signal length                 | 0.5 s               | 1.0 s                   |
| Decision threshold       | –                             | 200                 | 100                     |
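
For the scene classification case, the modelling and decision rows above map onto scikit-learn in a few lines. This is a minimal sketch, assuming features are already extracted and normalized; the detection tasks would instead fit a positive/negative model pair per event class and threshold a sliding accumulated likelihood ratio.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_scene_models(features_per_class):
    """Fit one 16-component, diagonal-covariance GMM per scene class."""
    models = {}
    for label, feats in features_per_class.items():   # feats: (n_frames, n_dims)
        gmm = GaussianMixture(n_components=16, covariance_type='diag')
        models[label] = gmm.fit(feats)
    return models

def classify_scene(models, feats):
    """Accumulate frame log-likelihoods over the whole signal, pick the maximum."""
    scores = {label: gmm.score_samples(feats).sum()
              for label, gmm in models.items()}
    return max(scores, key=scores.get)
```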

Processing blocks

[Figure: Processing blocks of the system.]

The system implements the following basic processing blocks:

  1. Initialization, see initialize()
  • Prepares the dataset:
    • Downloads the dataset from the Internet if needed
    • Extracts the dataset package if needed
    • Makes sure the meta files are appropriately formatted
  2. Feature extraction, see feature_extraction()
  • Goes through all the training material and extracts the acoustic features
  • Features are stored file by file on the local disk (pickle files)
  3. Feature normalization, see feature_normalization()
  • Goes through the training material in the evaluation folds and calculates the global mean and standard deviation of the data per fold (see the sketch after this list)
  • Stores the normalization factors per fold (pickle files)
  4. System training, see system_training()
  • Loads the normalizers
  • Loads the training material file by file, forms the feature matrix, normalizes it, and optionally aggregates features
  • Trains the system with the features and metadata
  • Stores the trained acoustic models on the local disk (pickle files)
  5. System testing, see system_testing()
  • Goes through the testing material and performs the classification / detection
  • Stores the results (text files)
  6. System evaluation, see system_evaluation()
  • Reads the ground truth and the system output and calculates the evaluation metrics
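
As an illustration of step 3, the per-fold statistics can be accumulated file by file, so the whole fold never has to be held in memory at once. The class below is a hypothetical sketch, not the baseline's actual API.

```python
import pickle
import numpy as np

class Normalizer(object):
    """Accumulates per-fold feature statistics file by file (hypothetical sketch)."""

    def __init__(self, n_dims):
        self.n = 0                    # total frame count
        self.s1 = np.zeros(n_dims)    # running sum
        self.s2 = np.zeros(n_dims)    # running sum of squares

    def accumulate(self, feats):
        """feats: (n_frames, n_dims) feature matrix from one training file."""
        self.n += feats.shape[0]
        self.s1 += feats.sum(axis=0)
        self.s2 += (feats ** 2).sum(axis=0)

    def finalize(self):
        """Global mean and standard deviation over everything accumulated so far."""
        mean = self.s1 / self.n
        std = np.sqrt(self.s2 / self.n - mean ** 2)
        return mean, std

    def save(self, path):
        """Store the normalization factors for this fold as a pickle file."""
        with open(path, 'wb') as f:
            pickle.dump(self.finalize(), f)
```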