Sound Event Detection

The task of sound event detection involves locating and classifying sounds in audio recordings - estimating onset and offset for distinct sound event instances and providing a textual descriptor for each. The usual approach for this problem is supervised learning with sound event classes defined in advance.

Metrics are defined for polyphonic sound event detection, in which the ground truth and system output may contain temporally overlapping sound event instances.

Two types of metrics are implemented:

  • segment-based metrics - the ground truth and system output are compared on a fixed time grid; sound events are marked as active or inactive in each segment;
  • event-based metrics - the ground truth and system output are compared at the event instance level.

Intermediate statistics

Segment-based

  • true positive: the ground truth and system output both indicate an event to be active in that segment;
  • false positive: the ground truth indicates an event to be inactive in that segment, but the system output indicates it as active;
  • false negative: the ground truth indicates an event to be active in that segment, but the system output indicates it as inactive;
  • true negative: the ground truth and system output both indicate an event to be inactive in that segment.

The segment-based metrics implementation allows selecting the desired segment length for the evaluation (see the time_resolution parameter).
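
As an illustration of how these counts are accumulated, the following sketch (plain Python, not sed_eval internals; the activity lists are made up) counts the intermediate statistics for a single event class on a fixed segment grid:

# Illustrative sketch: binary activity per segment for one event class,
# 1 = active, 0 = inactive.
reference = [1, 1, 0, 0, 1]   # ground truth activity in each segment
estimated = [1, 0, 0, 1, 1]   # system output activity in each segment

tp = sum(1 for r, e in zip(reference, estimated) if r == 1 and e == 1)  # both active
fp = sum(1 for r, e in zip(reference, estimated) if r == 0 and e == 1)  # only system active
fn = sum(1 for r, e in zip(reference, estimated) if r == 1 and e == 0)  # only ground truth active
tn = sum(1 for r, e in zip(reference, estimated) if r == 0 and e == 0)  # both inactive

print(tp, fp, fn, tn)  # 2 1 1 1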

Event-based

  • true positive: an event in the system output whose temporal position overlaps with the temporal position of an event with the same label in the ground truth. A collar is usually allowed for the onset and offset, or a tolerance with respect to the ground truth event duration;
  • false positive: an event in the system output that has no correspondence to an event with the same label in the ground truth within the allowed tolerance;
  • false negative: an event in the ground truth that has no correspondence to an event with the same label in the system output within the allowed tolerance;
  • true negative: event-based metrics have no meaningful true negatives.

The event-based metrics implementation allows selecting the desired collar size (see the t_collar parameter) and whether the onset only, or both the onset and offset, are used as matching conditions (see the evaluate_onset and evaluate_offset parameters).
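
The matching conditions can be sketched as follows (plain Python, not sed_eval internals; the tolerance logic mirrors the t_collar and percentage_of_length parameters documented further below, and the dict keys follow the usage examples on this page):

def onset_within_collar(ref, est, t_collar=0.2):
    # Onset condition: estimated onset within +/- t_collar of the reference onset.
    return abs(est['event_onset'] - ref['event_onset']) <= t_collar


def offset_within_tolerance(ref, est, t_collar=0.2, percentage_of_length=0.5):
    # Offset condition: the tolerance is the larger of the collar and a fraction
    # of the reference event duration.
    tolerance = max(t_collar,
                    percentage_of_length * (ref['event_offset'] - ref['event_onset']))
    return abs(est['event_offset'] - ref['event_offset']) <= tolerance


ref = {'event_label': 'car', 'event_onset': 0.0, 'event_offset': 2.5}
est = {'event_label': 'car', 'event_onset': 0.1, 'event_offset': 2.0}

print(onset_within_collar(ref, est), offset_within_tolerance(ref, est))  # True True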

Averaging

Micro-averaging - intermediate statistics are aggregated over all test data, then metrics are calculated; each instance has equal influence on the final metric value.

Macro-averaging - intermediate statistics are aggregated class-wise, class-wise metrics are calculated, and the final value is the average of the class-wise metrics; each class has equal influence on the final metric value.

Micro and macro averages can result in very different values when classes are highly unbalanced or performance on individual classes is very different.
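
A small numeric illustration of the difference, using precision as the metric (plain Python; the counts are made up):

# Illustrative counts for two classes with very different support.
stats = {'A': {'tp': 90, 'fp': 10},   # frequent class, good performance
         'B': {'tp': 1,  'fp': 9}}    # rare class, poor performance

# Micro-average: pool the counts over classes, then compute the metric.
tp = sum(s['tp'] for s in stats.values())
fp = sum(s['fp'] for s in stats.values())
micro_precision = tp / (tp + fp)                                   # 91/110 ≈ 0.83

# Macro-average: compute the metric per class, then average.
macro_precision = sum(s['tp'] / (s['tp'] + s['fp'])
                      for s in stats.values()) / len(stats)        # (0.9 + 0.1)/2 = 0.5

print(round(micro_precision, 2), round(macro_precision, 2))        # 0.83 0.5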

Cross-validation

The recommended procedure for a cross-validation setup is to run all train/test folds and perform the evaluation at the end, over all folds together (do not evaluate fold-wise and average). The reason is that folds are most often unbalanced due to the multilabel nature of the problem, which biases the fold-wise averages. For more details, consult [1].

Implemented metrics

Precision, Recall and F-score

\[P=\frac{TP}{TP+FP},\quad R=\frac{TP}{TP+FN},\quad F=\frac{2 \cdot P \cdot R}{P+R}\]

These can be calculated segment-based or event-based, and micro- or macro-averaged.
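
Computed directly from intermediate statistics (plain Python, illustrative counts):

# Illustrative intermediate statistics.
TP, FP, FN = 8, 2, 4

precision = TP / (TP + FP)                                # 0.8
recall = TP / (TP + FN)                                   # ≈ 0.667
f_score = 2 * precision * recall / (precision + recall)  # ≈ 0.727

print(round(precision, 3), round(recall, 3), round(f_score, 3))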

Sensitivity and specificity

\[Sensitivity = \frac{TP}{TP+FN},\quad Specificity = \frac{TN}{TN+FP}\]

Accuracy

\[accuracy = \frac{TP+TN}{TP+TN+FP+FN}\]
\[accuracy2 = \frac{TP}{TP+FP+FN}\]

Balanced accuracy

\[BACC = factor \cdot \frac{TP}{TP+FN} +(1-factor) \cdot \frac{TN}{TN+FP}\]

Specificity and accuracy variants are only calculated as segment-based metrics.
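
Worked through on illustrative counts (plain Python; the balance factor corresponds to the factor parameter of overall_accuracy in the API reference below):

# Illustrative segment-based counts.
TP, TN, FP, FN = 50, 30, 10, 10

sensitivity = TP / (TP + FN)                      # 50/60 ≈ 0.833
specificity = TN / (TN + FP)                      # 30/40 = 0.75
accuracy = (TP + TN) / (TP + TN + FP + FN)        # 80/100 = 0.8
accuracy2 = TP / (TP + FP + FN)                   # 50/70 ≈ 0.714

factor = 0.5                                      # balance factor
balanced_accuracy = factor * sensitivity + (1 - factor) * specificity  # ≈ 0.792

print(round(balanced_accuracy, 3))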

Error Rate

Segment-based

Substitutions in segment k - S(k) - the number of ground truth events for which a correct event was not output, yet something else was. One substitution is equivalent to having one false positive and one false negative in the same segment. There is no need to designate which erroneous event substitutes which.

Insertions in segment k - I(k) - events in the system output that are not correct (false positives remaining after substitutions are accounted for).

Deletions in segment k - D(k) - events in the ground truth that are not correct (false negatives remaining after substitutions are accounted for).

\[\begin{split}&S(k) = min(FN(k),FP(k)) \nonumber \\ &D(k) = max(0,FN(k)-FP(k)) \\ &I(k)= max(0,FP(k)-FN(k)) \nonumber\end{split}\]
\[ER=\frac{\sum_{k=1}^K{S(k)}+\sum_{k=1}^K{D(k)}+\sum_{k=1}^K{I(k)}}{\sum_{k=1}^K{N(k)}}\]

N(k) is the number of events in segment k in ground truth.
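
A sketch of the per-segment computation (plain Python, not sed_eval internals; the per-segment counts are made up):

# Per-segment false negatives, false positives, and reference event counts
# (illustrative values for K = 4 segments).
FN = [1, 0, 2, 0]
FP = [0, 1, 1, 0]
N  = [2, 1, 3, 1]   # number of reference events in each segment

S = [min(fn, fp) for fn, fp in zip(FN, FP)]        # substitutions
D = [max(0, fn - fp) for fn, fp in zip(FN, FP)]    # deletions
I = [max(0, fp - fn) for fn, fp in zip(FN, FP)]    # insertions

ER = (sum(S) + sum(D) + sum(I)) / sum(N)
print(S, D, I, ER)   # [0, 0, 1, 0] [1, 0, 1, 0] [0, 1, 0, 0] 4/7 ≈ 0.571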

Event-based

Substitutions - events in the system output with correct temporal position but incorrect class label.

Insertions - events in the system output not accounted for as correct or substituted.

Deletions - events in the ground truth not accounted for as correct or substituted.

\[ER=\frac{S + D + I}{N}\]

N is the total number of events in ground truth.
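
The same computation at the event level (illustrative totals):

# Illustrative event-level totals over an evaluation set.
S, D, I, N = 2, 3, 1, 20   # substitutions, deletions, insertions, reference events

ER = (S + D + I) / N
print(ER)                  # 0.3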

Code

The main classes are sed_eval.sound_event.SegmentBasedMetrics for segment-based metrics and sed_eval.sound_event.EventBasedMetrics for event-based metrics; their main functions are listed in the API reference below.

The evaluate methods (sed_eval.sound_event.SegmentBasedMetrics.evaluate and sed_eval.sound_event.EventBasedMetrics.evaluate) take event lists as parameters; use sed_eval.io.load_event_list to read event lists from a file.

Usage example when reading event lists from disk (you can run this example in the path tests/data/sound_event):

import sed_eval
import dcase_util

file_list = [
    {
     'reference_file': 'office_snr0_high_v2.txt',
     'estimated_file': 'office_snr0_high_v2_detected.txt'
    },
    {
     'reference_file': 'office_snr0_med_v2.txt',
     'estimated_file': 'office_snr0_med_v2_detected.txt'
    }
]

data = []

# Get used event labels
all_data = dcase_util.containers.MetaDataContainer()
for file_pair in file_list:
    reference_event_list = sed_eval.io.load_event_list(
        filename=file_pair['reference_file']
    )
    estimated_event_list = sed_eval.io.load_event_list(
        filename=file_pair['estimated_file']
    )

    data.append({'reference_event_list': reference_event_list,
                 'estimated_event_list': estimated_event_list})

    all_data += reference_event_list

event_labels = all_data.unique_event_labels

# Start evaluating

# Create metrics classes, define parameters
segment_based_metrics = sed_eval.sound_event.SegmentBasedMetrics(
    event_label_list=event_labels,
    time_resolution=1.0
)

event_based_metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=event_labels,
    t_collar=0.250
)

# Go through files
for file_pair in data:
    segment_based_metrics.evaluate(
        reference_event_list=file_pair['reference_event_list'],
        estimated_event_list=file_pair['estimated_event_list']
    )

    event_based_metrics.evaluate(
        reference_event_list=file_pair['reference_event_list'],
        estimated_event_list=file_pair['estimated_event_list']
    )

# Get only certain metrics
overall_segment_based_metrics = segment_based_metrics.results_overall_metrics()
print("Accuracy:", overall_segment_based_metrics['accuracy']['accuracy'])

# Or print all metrics as reports
print(segment_based_metrics)
print(event_based_metrics)

Usage example to evaluate results stored in variables:

import sed_eval
import dcase_util

reference_event_list = dcase_util.containers.MetaDataContainer(
    [
        {
            'event_label': 'car',
            'event_onset': 0.0,
            'event_offset': 2.5,
            'file': 'audio/street/b099.wav',
            'scene_label': 'street'
        },
        {
            'event_label': 'car',
            'event_onset': 2.8,
            'event_offset': 4.5,
            'file': 'audio/street/b099.wav',
            'scene_label': 'street'
        },
        {
            'event_label': 'car',
            'event_onset': 6.0,
            'event_offset': 10.0,
            'file': 'audio/street/b099.wav',
            'scene_label': 'street'
        }
    ]
)

estimated_event_list = dcase_util.containers.MetaDataContainer(
    [
        {
            'event_label': 'car',
            'event_onset': 1.0,
            'event_offset': 3.5,
            'file': 'audio/street/b099.wav',
            'scene_label': 'street'
        },
        {
            'event_label': 'car',
            'event_onset': 7.0,
            'event_offset': 8.0,
            'file': 'audio/street/b099.wav',
            'scene_label': 'street'
        }
    ]
)

segment_based_metrics = sed_eval.sound_event.SegmentBasedMetrics(
    event_label_list=reference_event_list.unique_event_labels,
    time_resolution=1.0
)
event_based_metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=reference_event_list.unique_event_labels,
    t_collar=0.250
)

for filename in reference_event_list.unique_files:
    reference_event_list_for_current_file = reference_event_list.filter(
        filename=filename
    )

    estimated_event_list_for_current_file = estimated_event_list.filter(
        filename=filename
    )

    segment_based_metrics.evaluate(
        reference_event_list=reference_event_list_for_current_file,
        estimated_event_list=estimated_event_list_for_current_file
    )

    event_based_metrics.evaluate(
        reference_event_list=reference_event_list_for_current_file,
        estimated_event_list=estimated_event_list_for_current_file
    )

# Get only certain metrics
overall_segment_based_metrics = segment_based_metrics.results_overall_metrics()
print("Accuracy:", overall_segment_based_metrics['accuracy']['accuracy'])

# Or print all metrics as reports
print(segment_based_metrics)
print(event_based_metrics)

Segment-based metrics

SegmentBasedMetrics(event_label_list[, ...]) Constructor
SegmentBasedMetrics.evaluate(...[, ...]) Evaluate file pair (reference and estimated)
SegmentBasedMetrics.results() All metrics
SegmentBasedMetrics.results_overall_metrics() Overall metrics
SegmentBasedMetrics.results_class_wise_metrics() Class-wise metrics
SegmentBasedMetrics.results_class_wise_average_metrics() Class-wise averaged metrics
SegmentBasedMetrics.result_report_parameters() Report metric parameters
SegmentBasedMetrics.result_report_overall() Report overall results
SegmentBasedMetrics.result_report_class_wise() Report class-wise results
SegmentBasedMetrics.result_report_class_wise_average() Report class-wise averages
SegmentBasedMetrics.reset() Reset internal state
class sed_eval.sound_event.SegmentBasedMetrics(event_label_list, time_resolution=1.0)[source]

Constructor

Parameters:

event_label_list : list, numpy.array

List of unique event labels

time_resolution : float (0,]

Segment size used in the evaluation, in seconds. Default value 1.0

evaluate(reference_event_list, estimated_event_list, evaluated_length_seconds=None)[source]

Evaluate file pair (reference and estimated)

Parameters:

reference_event_list : list of dict or dcase_util.containers.MetaDataContainer

Reference event list.

estimated_event_list : list of dict or dcase_util.containers.MetaDataContainer

Estimated event list.

evaluated_length_seconds : float, optional

Evaluated length in seconds. If none is given, the maximum offset is used. Default value None

Returns:

self

reset()[source]

Reset internal state

overall_f_measure()[source]

Overall f-measure metrics (f_measure, precision, and recall)

Returns:

dict

results in a dictionary format

overall_error_rate()[source]

Overall error rate metrics (error_rate, substitution_rate, deletion_rate, and insertion_rate)

Returns:

dict

results in a dictionary format

overall_accuracy(factor=0.5)[source]

Overall accuracy metrics (sensitivity, specificity, accuracy, and balanced_accuracy)

Parameters:

factor : float [0-1]

Balance factor. Default value 0.5

Returns:

dict

results in a dictionary format

class_wise_count(event_label)[source]

Class-wise counts (Nref and Nsys)

Returns:

dict

results in a dictionary format

class_wise_f_measure(event_label)[source]

Class-wise f-measure metrics (f_measure, precision, and recall)

Returns:

dict

results in a dictionary format

class_wise_error_rate(event_label)[source]

Class-wise error rate metrics (error_rate, deletion_rate, and insertion_rate)

Returns:

dict

results in a dictionary format

class_wise_accuracy(event_label, factor=0.5)[source]

Class-wise accuracy metrics (sensitivity, specificity, accuracy, and balanced_accuracy)

Returns:

dict

results in a dictionary format

result_report_parameters()[source]

Report metric parameters

Returns:

str

result report in string format

Event-based metrics

EventBasedMetrics(event_label_list[, ...]) Constructor
EventBasedMetrics.evaluate(...) Evaluate file pair (reference and estimated)
EventBasedMetrics.results() All metrics
EventBasedMetrics.results_overall_metrics() Overall metrics
EventBasedMetrics.results_class_wise_metrics() Class-wise metrics
EventBasedMetrics.results_class_wise_average_metrics() Class-wise averaged metrics
EventBasedMetrics.result_report_parameters() Report metric parameters
EventBasedMetrics.result_report_overall() Report overall results
EventBasedMetrics.result_report_class_wise() Report class-wise results
EventBasedMetrics.result_report_class_wise_average() Report class-wise averages
EventBasedMetrics.reset() Reset internal state
class sed_eval.sound_event.EventBasedMetrics(event_label_list, evaluate_onset=True, evaluate_offset=True, t_collar=0.2, percentage_of_length=0.5, event_matching_type='optimal', **kwargs)[source]

Constructor

Parameters:

event_label_list : list

List of unique event labels

evaluate_onset : bool

Evaluate onset. Default value True

evaluate_offset : bool

Evaluate offset. Default value True

t_collar : float (0,]

Time collar used when evaluating validity of the onset and offset, in seconds. Default value 0.2

percentage_of_length : float in [0, 1]

Second condition: percentage of the reference event length within which the estimated offset has to be in order to be considered a valid estimation. Default value 0.5

event_matching_type : str

Event matching type. Set ‘optimal’ for graph-based matching, or ‘greedy’ to always select the first found match. The greedy matching type is kept for backward compatibility. Both matching types produce very similar results; however, greedy matching can be sensitive to the order of reference events. Use the default ‘optimal’ matching unless you need to compare your results to old results. Default value ‘optimal’
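
For example, the constructor parameters above can be combined as follows (the label list here is purely illustrative):

import sed_eval

# Illustrative label list; 'optimal' is the default matching type, 'greedy' is
# kept for backward compatibility with older results.
event_based_metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=['car', 'speech'],
    evaluate_onset=True,
    evaluate_offset=False,        # onset-only evaluation
    t_collar=0.250,
    event_matching_type='greedy'
)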

evaluate(reference_event_list, estimated_event_list)[source]

Evaluate file pair (reference and estimated)

Parameters:

reference_event_list : event list

Reference event list

estimated_event_list : event list

Estimated event list

Returns:

self

reset()[source]

Reset internal state

static validate_onset(reference_event, estimated_event, t_collar=0.2)[source]

Validate estimated event based on event onset

Parameters:

reference_event : dict

Reference event.

estimated_event: dict

Estimated event.

t_collar : float > 0, seconds

Time collar within which the estimated onset has to be in order to be considered a valid estimation. Default value 0.2

Returns:

bool

static validate_offset(reference_event, estimated_event, t_collar=0.2, percentage_of_length=0.5)[source]

Validate estimated event based on event offset

Parameters:

reference_event : dict

Reference event.

estimated_event : dict

Estimated event.

t_collar : float > 0, seconds

First condition: time collar within which the estimated offset has to be in order to be considered a valid estimation. Default value 0.2

percentage_of_length : float in [0, 1]

Second condition: percentage of the reference event length within which the estimated offset has to be in order to be considered a valid estimation. Default value 0.5

Returns:

bool
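
A small usage sketch of these static helpers (the event dicts are illustrative and follow the key naming used in the usage examples above; passing plain dicts is assumed to be accepted here):

import sed_eval

# Illustrative events; a collar of 0.2 s is used for the onset, and the offset
# tolerance is the larger of the collar and 50% of the reference event length.
reference_event = {'event_label': 'car', 'event_onset': 0.0, 'event_offset': 2.5}
estimated_event = {'event_label': 'car', 'event_onset': 0.1, 'event_offset': 2.0}

onset_ok = sed_eval.sound_event.EventBasedMetrics.validate_onset(
    reference_event=reference_event,
    estimated_event=estimated_event,
    t_collar=0.2
)
offset_ok = sed_eval.sound_event.EventBasedMetrics.validate_offset(
    reference_event=reference_event,
    estimated_event=estimated_event,
    t_collar=0.2,
    percentage_of_length=0.5
)
print(onset_ok, offset_ok)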

overall_f_measure()[source]

Overall f-measure metrics (f_measure, precision, and recall)

Returns:

dict

results in a dictionary format

overall_error_rate()[source]

Overall error rate metrics (error_rate, substitution_rate, deletion_rate, and insertion_rate)

Returns:

dict

results in a dictionary format

class_wise_count(event_label)[source]

Class-wise counts (Nref and Nsys)

Returns:

dict

results in a dictionary format

class_wise_f_measure(event_label)[source]

Class-wise f-measure metrics (f_measure, precision, and recall)

Returns:

dict

results in a dictionary format

class_wise_accuracy(event_label)
class_wise_error_rate(event_label)[source]

Class-wise error rate metrics (error_rate, deletion_rate, and insertion_rate)

Returns:

dict

results in a dictionary format

overall_accuracy(factor=0.5)
result_report_class_wise()

Report class-wise results

Returns:

str

result report in string format

result_report_class_wise_average()

Report class-wise averages

Returns:

str

result report in string format

result_report_overall()

Report overall results

Returns:

str

result report in string format

results()

All metrics

Returns:

dict

results in a dictionary format

results_class_wise_average_metrics()

Class-wise averaged metrics

Returns:

dict

results in a dictionary format

results_class_wise_metrics()

Class-wise metrics

Returns:

dict

results in a dictionary format

results_overall_metrics()

Overall metrics

Returns:

dict

results in a dictionary format

result_report_parameters()[source]

Report metric parameters

Returns:

str

result report in string format

References

[1] Forman, G. and Scholz, M. “Apples-to-Apples in Cross-Validation Studies: Pitfalls in Classifier Performance Measurement”. SIGKDD Explorations Newsletter, 12(1), November 2010, pp. 49-57. http://kdd.org/exploration_files/v12-1-p49-forman-sigkdd.pdf