.. _sound_event:

Sound Event Detection
=====================

The task of sound event detection involves locating and classifying
sounds in audio recordings - estimating onset and offset for distinct
sound event instances and providing a textual descriptor for each. The
usual approach for this problem is supervised learning with sound event
classes defined in advance.

Metrics are defined for polyphonic sound event detection, in which the
ground truth and system output contain overlapping sound event instances.

Two types of metrics are implemented:

- **segment-based metrics**  - the ground truth and system output are compared in a fixed time grid; sound events are marked as active or inactive in each segment;
- **event-based metrics** - the ground truth and system output are compared at event instance level;

Intermediate statistics
-----------------------

Segment-based
^^^^^^^^^^^^^

- *true positive*: the ground truth and system output both indicate an event to be active in that segment
- *false positive*: the ground truth indicates an event to be inactive in that segment, but the system output indicates it as active
- *false negative*: the ground truth indicates an event to be active in that segment, but the system output indicates it as inactive.
- *true negative*: the ground truth and system output both indicate an event to be inactive.

Segment-based metrics implementation allow selecting the desired segment length for evaluation (see ``time_resolution`` parameter).

Event-based
^^^^^^^^^^^

- *true positive*: an event in the system output that has a temporal position overlapping with the temporal position of an event with the same label in the ground truth. A *collar* is usually allowed for the onset and offset, or a tolerance with respect to the ground truth event duration.
- *false positive*: an event in the system output that has no correspondence to an event with same label in the ground truth within the allowed tolerance;
- *false negative*: an event in the ground truth that has no correspondence to an event with same label in the system output within the allowed tolerance.
- *true negative*: event-based metrics have no meaningful true negatives.

Event-based metrics implementation allow selecting the desired collar size (see ``t_collar`` parameter) and use of onset only or onset and offset conditions for evaluation (see ``evaluate_onset`` and ``evaluate_offset`` parameters).

.. _averaging:

Averaging
^^^^^^^^^


**Micro-averaging** - intermediate statistics are aggregated over all test
data, then metrics are calculated; each instance has equal influence on
the final metric value;

**Macro-averaging** - intermediate statistics are aggregated class-wise,
class-based metrics are calculated, then average of class based metrics;
each class has equal influence on the final metric value.

Micro and macro averages can result in very different values when
classes are highly unbalanced or performance on individual classes is
very different.

Cross-validation
^^^^^^^^^^^^^^^^

Recommended calculation for a cross-validation setup is to run all
train/test folds and perform evaluation at the end (no fold-wise
evaluation!).
The reason is that folds are most often unbalanced due to the multilabel
nature of the problem, and this results in biases when averaging. For
more details, consult [1]_.


Implemented metrics
-------------------

Precision, Recall and F-score
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. math::

    P=\frac{TP}{TP+FP},\quad R=\frac{TP}{TP+FN},\quad F=\frac{2 \cdot P \cdot R}{P+R}


These can be calculated segment based or event based, micro or macro averaged.

Sensitivity and specificity
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. math::
    Sensitivity = \frac{TP}{TP+FN},\quad
    Specificity = \frac{TN}{TN+FP}

Accuracy
^^^^^^^^

.. math::
    accuracy = \frac{TP+TN}{TP+TN+FP+FN}

.. math::
    accuracy2 = \frac{TP}{TP+FP+FN}

Balanced accuracy
^^^^^^^^^^^^^^^^^

.. math::
    BACC = factor \cdot \frac{TP}{TP+FN} +(1-factor) \cdot \frac{TN}{TN+FP}

Specificity and accuracy variants are only calculated as segment-based
metrics.


Error Rate
^^^^^^^^^^

**Segment-based**

**Substitutions** in segment *k* - *S(k)* - the number of ground truth events
for which a correct event was not output, yet something else was. One
substitution is equivalent to having one false positives and one false
negatives in the same segment. There is no need to designate which
erroneous event substitutes which.

**Insertions** in segment *k* - *I(k)* - events in system output that are not
correct (false positives after substitutions are accounted for).

**Deletions** in segment *k* - *D(k)* - events in ground truth that are not
correct (false positives after substitutions are accounted for).

.. math::
    &S(k) = min(FN(k),FP(k)) \nonumber \\
    &D(k) = max(0,FN(k)-FP(k)) \\
    &I(k)= max(0,FP(k)-FN(k)) \nonumber

.. math::
    ER=\frac{\sum_{k=1}^K{S(k)}+\sum_{k=1}^K{D(k)}+\sum_{k=1}^K{I(k)}}{\sum_{k=1}^K{N(k)}}

*N(k)* is the number of events in segment *k* in ground truth.

**Event-based**

**Substitutions** - events in system output with correct temporal position
but incorrect class label

**Insertions** - events in system output unaccounted for as correct or
substituted

**Deletions** - events in ground truth unaccounted for as correct or substituted

.. math::
    ER=\frac{S + D + I}{N}

*N* is the total number of events in ground truth.

Code
----

.. automodule:: sed_eval.sound_event

Rerefences
----------

.. [1] Forman, G. and Scholz, M.
       "Apples-to-Apples in Cross-Validation Studies: Pitfalls in Classifier Performance Measurement".
       SIGKDD Explor. Newsl. 12, 1, November 2010, pp. 49-57.
       http://kdd.org/exploration_files/v12-1-p49-forman-sigkdd.pdf