Sound Event Detection¶
The task of sound event detection involves locating and classifying sounds in audio recordings - estimating onset and offset for distinct sound event instances and providing a textual descriptor for each. The usual approach for this problem is supervised learning with sound event classes defined in advance.
Metrics are defined for polyphonic sound event detection, in which the ground truth and system output contain overlapping sound event instances.
Two types of metrics are implemented:
- segment-based metrics - the ground truth and system output are compared in a fixed time grid; sound events are marked as active or inactive in each segment;
- event-based metrics - the ground truth and system output are compared at event instance level.
Intermediate statistics¶
Segment-based¶
- true positive: the ground truth and system output both indicate an event to be active in that segment
- false positive: the ground truth indicates an event to be inactive in that segment, but the system output indicates it as active
- false negative: the ground truth indicates an event to be active in that segment, but the system output indicates it as inactive.
- true negative: the ground truth and system output both indicate an event to be inactive in that segment.
The segment-based metrics implementation allows selecting the desired segment length for evaluation (see the time_resolution parameter).
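As an illustration of how the per-segment counts above arise, consider binary activity masks on a fixed segment grid (a minimal sketch; the activity arrays for a single class are invented and this is not part of sed_eval):

    import numpy

    # Hypothetical per-segment activity for one event class (True = active).
    reference_active = numpy.array([True, True, False, False, True])
    estimated_active = numpy.array([True, False, True, False, True])

    tp = numpy.sum(reference_active & estimated_active)    # both active
    fp = numpy.sum(~reference_active & estimated_active)   # only system output active
    fn = numpy.sum(reference_active & ~estimated_active)   # only ground truth active
    tn = numpy.sum(~reference_active & ~estimated_active)  # both inactive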
Event-based¶
- true positive: an event in the system output that has a temporal position overlapping with the temporal position of an event with the same label in the ground truth. A collar is usually allowed for the onset and offset, or a tolerance with respect to the ground truth event duration.
- false positive: an event in the system output that has no correspondence to an event with same label in the ground truth within the allowed tolerance;
- false negative: an event in the ground truth that has no correspondence to an event with same label in the system output within the allowed tolerance.
- true negative: event-based metrics have no meaningful true negatives.
The event-based metrics implementation allows selecting the desired collar size (see the t_collar parameter) and whether onset only, or onset and offset, conditions are used for evaluation (see the evaluate_onset and evaluate_offset parameters).
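For example, the constructor can be configured for onset-only evaluation with a 250 ms collar (a minimal sketch; the event label list and parameter values are illustrative):

    import sed_eval

    # Onset-only evaluation with a 250 ms collar (labels and values are illustrative).
    event_based_metrics = sed_eval.sound_event.EventBasedMetrics(
        event_label_list=['car', 'speech'],
        evaluate_onset=True,
        evaluate_offset=False,
        t_collar=0.250
    )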
Averaging¶
Micro-averaging - intermediate statistics are aggregated over all test data, then metrics are calculated; each instance has equal influence on the final metric value.
Macro-averaging - intermediate statistics are aggregated class-wise, class-based metrics are calculated, and then the average of the class-based metrics is taken; each class has equal influence on the final metric value.
Micro and macro averages can result in very different values when classes are highly unbalanced or performance on individual classes is very different.
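A small numeric illustration of how the two averages diverge under class imbalance (the counts are invented; only the aggregation logic matters):

    # Hypothetical per-class true positive / false positive / false negative counts.
    counts = {
        'car':    {'tp': 90, 'fp': 10, 'fn': 10},   # frequent class, good performance
        'speech': {'tp': 1,  'fp': 9,  'fn': 9},    # rare class, poor performance
    }

    def f_score(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn)

    # Micro-average: pool the counts first, then compute the metric.
    tp = sum(c['tp'] for c in counts.values())
    fp = sum(c['fp'] for c in counts.values())
    fn = sum(c['fn'] for c in counts.values())
    micro_f = f_score(tp, fp, fn)                                       # ~0.83

    # Macro-average: compute the metric per class, then average.
    macro_f = sum(f_score(**c) for c in counts.values()) / len(counts)  # 0.50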
Cross-validation¶
Recommended calculation for a cross-validation setup is to run all train/test folds and perform evaluation at the end (no fold-wise evaluation!). The reason is that folds are most often unbalanced due to the multilabel nature of the problem, and this results in biases when averaging. For more details, consult [1].
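In practice this means keeping a single metrics object, accumulating intermediate statistics over all folds, and reading out the results only at the end (a sketch; the fold file lists and label list are illustrative placeholders):

    import sed_eval

    # Illustrative: list of folds, each a list of (reference_file, estimated_file) pairs.
    fold_file_pairs = [
        [('fold1_ref.txt', 'fold1_est.txt')],
        [('fold2_ref.txt', 'fold2_est.txt')],
    ]

    # One metrics object accumulates intermediate statistics across all folds.
    segment_based_metrics = sed_eval.sound_event.SegmentBasedMetrics(
        event_label_list=['car', 'speech'],   # illustrative label list
        time_resolution=1.0
    )

    for fold in fold_file_pairs:
        for reference_file, estimated_file in fold:
            segment_based_metrics.evaluate(
                reference_event_list=sed_eval.io.load_event_list(filename=reference_file),
                estimated_event_list=sed_eval.io.load_event_list(filename=estimated_file)
            )

    # Metrics are computed only once, after all folds have been accumulated.
    print(segment_based_metrics.results_overall_metrics())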
Implemented metrics¶
Precision, Recall and F-score¶
These can be calculated segment-based or event-based, micro- or macro-averaged.
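As a reminder, these follow the usual definitions computed from the accumulated counts:

    def f_measure(tp, fp, fn):
        # Standard precision / recall / F-score from accumulated counts.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_score = 2 * precision * recall / (precision + recall)
        return precision, recall, f_score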
Sensitivity and specificity¶
Balanced accuracy¶
Sensitivity, specificity, and the accuracy variants are calculated only as segment-based metrics, since event-based evaluation has no meaningful true negatives.
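For reference, the definitions can be sketched as follows (weighting balanced accuracy with a factor mirrors the factor parameter of overall_accuracy below; treating it as a weighted mean is an assumption):

    def accuracy_metrics(tp, tn, fp, fn, factor=0.5):
        sensitivity = tp / (tp + fn)                  # true positive rate
        specificity = tn / (tn + fp)                  # true negative rate
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        # Assumed interpretation of the balance factor: a weighted mean of
        # sensitivity and specificity (0.5 gives the usual balanced accuracy).
        balanced_accuracy = factor * sensitivity + (1 - factor) * specificity
        return sensitivity, specificity, accuracy, balanced_accuracy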
Error Rate¶
Segment-based
Substitutions in segment k - S(k) - the number of ground truth events for which a correct event was not output, yet something else was. One substitution is equivalent to having one false positive and one false negative in the same segment. There is no need to designate which erroneous event substitutes which.
Insertions in segment k - I(k) - events in system output that are not correct (false positives after substitutions are accounted for).
Deletions in segment k - D(k) - events in ground truth that are not correct (false negatives after substitutions are accounted for).
N(k) is the number of events in segment k in ground truth.
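A compact sketch of how the per-segment counts combine into the segment-based error rate, following the standard polyphonic SED definitions (fn, fp and n_ref are assumed per-segment count arrays, not sed_eval objects):

    def segment_based_error_rate(fn, fp, n_ref):
        # fn[k], fp[k]: false negatives / false positives in segment k,
        # n_ref[k]: number of reference events in segment k.
        S = sum(min(f_n, f_p) for f_n, f_p in zip(fn, fp))      # substitutions
        D = sum(max(0, f_n - f_p) for f_n, f_p in zip(fn, fp))  # deletions
        I = sum(max(0, f_p - f_n) for f_n, f_p in zip(fn, fp))  # insertions
        return (S + D + I) / sum(n_ref)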
Event-based
Substitutions - events in system output with correct temporal position but incorrect class label
Insertions - events in system output unaccounted for as correct or substituted
Deletions - events in ground truth unaccounted for as correct or substituted
N is the total number of events in ground truth.
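The event-based error rate combines the totals in the same way, relative to the number of reference events:

    def event_based_error_rate(S, D, I, N):
        # S, D, I: total substitutions, deletions and insertions; N: reference events.
        return (S + D + I) / N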
Code¶
Segment-based metrics, main functions:
- sed_eval.sound_event.SegmentBasedMetrics.evaluate : Calculate intermediate values for evaluation and accumulate them.
- sed_eval.sound_event.SegmentBasedMetrics.results : Calculate and return all metrics.
- sed_eval.sound_event.SegmentBasedMetrics.results_overall_metrics : Calculate and return overall metrics (micro-averaged).
- sed_eval.sound_event.SegmentBasedMetrics.results_class_wise_metrics : Calculate and return class-wise metrics.
- sed_eval.sound_event.SegmentBasedMetrics.results_class_wise_average_metrics : Calculate and return class-wise average metrics (macro-averaged).
Event-based metrics, main functions:
- sed_eval.sound_event.EventBasedMetrics.evaluate : Calculate intermediate values for evaluation and accumulate them.
- sed_eval.sound_event.EventBasedMetrics.results : Calculate and return all metrics.
- sed_eval.sound_event.EventBasedMetrics.results_overall_metrics : Calculate and return overall metrics (micro-averaged).
- sed_eval.sound_event.EventBasedMetrics.results_class_wise_metrics : Calculate and return class-wise metrics.
- sed_eval.sound_event.EventBasedMetrics.results_class_wise_average_metrics : Calculate and return class-wise average metrics (macro-averaged).
The functions sed_eval.sound_event.SegmentBasedMetrics.evaluate and sed_eval.sound_event.EventBasedMetrics.evaluate take event lists as parameters; use sed_eval.io.load_event_list to read them from a file.
Usage example when reading event lists from disk (you can run this example in the path tests/data/sound_event):
import sed_eval
import dcase_util

file_list = [
    {
        'reference_file': 'office_snr0_high_v2.txt',
        'estimated_file': 'office_snr0_high_v2_detected.txt'
    },
    {
        'reference_file': 'office_snr0_med_v2.txt',
        'estimated_file': 'office_snr0_med_v2_detected.txt'
    }
]

data = []

# Get used event labels
all_data = dcase_util.containers.MetaDataContainer()
for file_pair in file_list:
    reference_event_list = sed_eval.io.load_event_list(
        filename=file_pair['reference_file']
    )
    estimated_event_list = sed_eval.io.load_event_list(
        filename=file_pair['estimated_file']
    )
    data.append({'reference_event_list': reference_event_list,
                 'estimated_event_list': estimated_event_list})
    all_data += reference_event_list

event_labels = all_data.unique_event_labels

# Start evaluating

# Create metrics classes, define parameters
segment_based_metrics = sed_eval.sound_event.SegmentBasedMetrics(
    event_label_list=event_labels,
    time_resolution=1.0
)

event_based_metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=event_labels,
    t_collar=0.250
)

# Go through files
for file_pair in data:
    segment_based_metrics.evaluate(
        reference_event_list=file_pair['reference_event_list'],
        estimated_event_list=file_pair['estimated_event_list']
    )

    event_based_metrics.evaluate(
        reference_event_list=file_pair['reference_event_list'],
        estimated_event_list=file_pair['estimated_event_list']
    )

# Get only certain metrics
overall_segment_based_metrics = segment_based_metrics.results_overall_metrics()
print("Accuracy:", overall_segment_based_metrics['accuracy']['accuracy'])

# Or print all metrics as reports
print(segment_based_metrics)
print(event_based_metrics)
Usage example to evaluate results stored in variables:
import sed_eval
import dcase_util

reference_event_list = dcase_util.containers.MetaDataContainer(
    [
        {
            'event_label': 'car',
            'event_onset': 0.0,
            'event_offset': 2.5,
            'file': 'audio/street/b099.wav',
            'scene_label': 'street'
        },
        {
            'event_label': 'car',
            'event_onset': 2.8,
            'event_offset': 4.5,
            'file': 'audio/street/b099.wav',
            'scene_label': 'street'
        },
        {
            'event_label': 'car',
            'event_onset': 6.0,
            'event_offset': 10.0,
            'file': 'audio/street/b099.wav',
            'scene_label': 'street'
        }
    ]
)

estimated_event_list = dcase_util.containers.MetaDataContainer(
    [
        {
            'event_label': 'car',
            'event_onset': 1.0,
            'event_offset': 3.5,
            'file': 'audio/street/b099.wav',
            'scene_label': 'street'
        },
        {
            'event_label': 'car',
            'event_onset': 7.0,
            'event_offset': 8.0,
            'file': 'audio/street/b099.wav',
            'scene_label': 'street'
        }
    ]
)

segment_based_metrics = sed_eval.sound_event.SegmentBasedMetrics(
    event_label_list=reference_event_list.unique_event_labels,
    time_resolution=1.0
)

event_based_metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=reference_event_list.unique_event_labels,
    t_collar=0.250
)

for filename in reference_event_list.unique_files:
    reference_event_list_for_current_file = reference_event_list.filter(
        filename=filename
    )

    estimated_event_list_for_current_file = estimated_event_list.filter(
        filename=filename
    )

    segment_based_metrics.evaluate(
        reference_event_list=reference_event_list_for_current_file,
        estimated_event_list=estimated_event_list_for_current_file
    )

    event_based_metrics.evaluate(
        reference_event_list=reference_event_list_for_current_file,
        estimated_event_list=estimated_event_list_for_current_file
    )

# Get only certain metrics
overall_segment_based_metrics = segment_based_metrics.results_overall_metrics()
print("Accuracy:", overall_segment_based_metrics['accuracy']['accuracy'])

# Or print all metrics as reports
print(segment_based_metrics)
print(event_based_metrics)
Segment based metrics¶
SegmentBasedMetrics(event_label_list[, ...]) | Constructor
SegmentBasedMetrics.evaluate(...[, ...]) | Evaluate file pair (reference and estimated)
SegmentBasedMetrics.results() | All metrics
SegmentBasedMetrics.results_overall_metrics() | Overall metrics
SegmentBasedMetrics.results_class_wise_metrics() | Class-wise metrics
SegmentBasedMetrics.results_class_wise_average_metrics() | Class-wise averaged metrics
SegmentBasedMetrics.result_report_parameters() | Report metric parameters
SegmentBasedMetrics.result_report_overall() | Report overall results
SegmentBasedMetrics.result_report_class_wise() | Report class-wise results
SegmentBasedMetrics.result_report_class_wise_average() | Report class-wise averages
SegmentBasedMetrics.reset() | Reset internal state
class sed_eval.sound_event.SegmentBasedMetrics(event_label_list, time_resolution=1.0)[source]¶

Constructor

Parameters:
    event_label_list : list, numpy.array
        List of unique event labels
    time_resolution : float (0,]
        Segment size used in the evaluation, in seconds. Default value 1.0

evaluate(reference_event_list, estimated_event_list, evaluated_length_seconds=None)[source]¶

Evaluate file pair (reference and estimated)

Parameters:
    reference_event_list : list of dict or dcase_util.containers.MetaDataContainer
        Reference event list.
    estimated_event_list : list of dict or dcase_util.containers.MetaDataContainer
        Estimated event list.
    evaluated_length_seconds : float, optional
        Evaluated length. If none given, maximum offset is used. Default value None

Returns: self

overall_f_measure()[source]¶

Overall f-measure metrics (f_measure, precision, and recall)

Returns: dict
    Results in a dictionary format

overall_error_rate()[source]¶

Overall error rate metrics (error_rate, substitution_rate, deletion_rate, and insertion_rate)

Returns: dict
    Results in a dictionary format

overall_accuracy(factor=0.5)[source]¶

Overall accuracy metrics (sensitivity, specificity, accuracy, and balanced_accuracy)

Parameters:
    factor : float [0-1]
        Balance factor. Default value 0.5

Returns: dict
    Results in a dictionary format

class_wise_count(event_label)[source]¶

Class-wise counts (Nref and Nsys)

Returns: dict
    Results in a dictionary format

class_wise_f_measure(event_label)[source]¶

Class-wise f-measure metrics (f_measure, precision, and recall)

Returns: dict
    Results in a dictionary format

class_wise_error_rate(event_label)[source]¶

Class-wise error rate metrics (error_rate, deletion_rate, and insertion_rate)

Returns: dict
    Results in a dictionary format
Event based metrics¶
EventBasedMetrics(event_label_list[, ...]) | Constructor
EventBasedMetrics.evaluate(...) | Evaluate file pair (reference and estimated)
EventBasedMetrics.results() | All metrics
EventBasedMetrics.results_overall_metrics() | Overall metrics
EventBasedMetrics.results_class_wise_metrics() | Class-wise metrics
EventBasedMetrics.results_class_wise_average_metrics() | Class-wise averaged metrics
EventBasedMetrics.result_report_parameters() | Report metric parameters
EventBasedMetrics.result_report_overall() | Report overall results
EventBasedMetrics.result_report_class_wise() | Report class-wise results
EventBasedMetrics.result_report_class_wise_average() | Report class-wise averages
EventBasedMetrics.reset() | Reset internal state
class sed_eval.sound_event.EventBasedMetrics(event_label_list, evaluate_onset=True, evaluate_offset=True, t_collar=0.2, percentage_of_length=0.5, event_matching_type='optimal', **kwargs)[source]¶

Constructor

Parameters:
    event_label_list : list
        List of unique event labels
    evaluate_onset : bool
        Evaluate onset. Default value True
    evaluate_offset : bool
        Evaluate offset. Default value True
    t_collar : float (0,]
        Time collar used when evaluating the validity of the onset and offset, in seconds. Default value 0.2
    percentage_of_length : float in [0, 1]
        Second condition: percentage of the reference event length within which the estimated offset has to be in order to be considered a valid estimation. Default value 0.5
    event_matching_type : str
        Event matching type. Set 'optimal' for graph-based matching, or 'greedy' to always select the first found match. The greedy matching type is kept for backward compatibility. Both matching types produce very similar results, but greedy matching can be sensitive to the order of reference events. Use the default 'optimal' matching unless you need to compare your results against old results. Default value 'optimal'
evaluate(reference_event_list, estimated_event_list)[source]¶

Evaluate file pair (reference and estimated)

Parameters:
    reference_event_list : event list
        Reference event list
    estimated_event_list : event list
        Estimated event list

Returns: self
static validate_onset(reference_event, estimated_event, t_collar=0.2)[source]¶

Validate estimated event based on event onset

Parameters:
    reference_event : dict
        Reference event.
    estimated_event : dict
        Estimated event.
    t_collar : float > 0, seconds
        Time collar within which the estimated onset has to be in order to be considered a valid estimation. Default value 0.2

Returns: bool
static validate_offset(reference_event, estimated_event, t_collar=0.2, percentage_of_length=0.5)[source]¶

Validate estimated event based on event offset

Parameters:
    reference_event : dict
        Reference event.
    estimated_event : dict
        Estimated event.
    t_collar : float > 0, seconds
        First condition: time collar within which the estimated offset has to be in order to be considered a valid estimation. Default value 0.2
    percentage_of_length : float in [0, 1]
        Second condition: percentage of the reference event length within which the estimated offset has to be in order to be considered a valid estimation. Default value 0.5

Returns: bool
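The offset condition can be read as: the estimated offset is valid if it lies within t_collar of the reference offset, or within percentage_of_length times the reference event duration, whichever tolerance is larger. A sketch under that reading (not the library source):

    def offset_within_tolerance(reference_event, estimated_event,
                                t_collar=0.2, percentage_of_length=0.5):
        # Reference event duration defines the adaptive part of the tolerance.
        ref_length = reference_event['event_offset'] - reference_event['event_onset']
        tolerance = max(t_collar, percentage_of_length * ref_length)
        return abs(reference_event['event_offset'] - estimated_event['event_offset']) <= tolerance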
overall_f_measure()[source]¶

Overall f-measure metrics (f_measure, precision, and recall)

Returns: dict
    Results in a dictionary format

overall_error_rate()[source]¶

Overall error rate metrics (error_rate, substitution_rate, deletion_rate, and insertion_rate)

Returns: dict
    Results in a dictionary format

class_wise_count(event_label)[source]¶

Class-wise counts (Nref and Nsys)

Returns: dict
    Results in a dictionary format

class_wise_f_measure(event_label)[source]¶

Class-wise f-measure metrics (f_measure, precision, and recall)

Returns: dict
    Results in a dictionary format

class_wise_accuracy(event_label)¶

class_wise_error_rate(event_label)[source]¶

Class-wise error rate metrics (error_rate, deletion_rate, and insertion_rate)

Returns: dict
    Results in a dictionary format

overall_accuracy(factor=0.5)¶

result_report_class_wise()¶

Report class-wise results

Returns: str
    Result report in string format

result_report_class_wise_average()¶

Report class-wise averages

Returns: str
    Result report in string format

result_report_overall()¶

Report overall results

Returns: str
    Result report in string format

results()¶

All metrics

Returns: dict
    Results in a dictionary format

results_class_wise_average_metrics()¶

Class-wise averaged metrics

Returns: dict
    Results in a dictionary format

results_class_wise_metrics()¶

Class-wise metrics

Returns: dict
    Results in a dictionary format

results_overall_metrics()¶

Overall metrics

Returns: dict
    Results in a dictionary format
References¶
[1] Forman, G. and Scholz, M. "Apples-to-Apples in Cross-Validation Studies: Pitfalls in Classifier Performance Measurement". SIGKDD Explor. Newsl. 12(1), November 2010, pp. 49-57. http://kdd.org/exploration_files/v12-1-p49-forman-sigkdd.pdf