Audio Event Detection (AED) and Clustering Analysis


The Audio Event Detection (AED) and Clustering analyses aim to automatically detect and categorize sounds in large audio datasets without supervision.

The pipeline consists of two main steps that 1) automatically detect relevant sounds in raw field recordings and 2) cluster these sounds based on feature similarities. The resulting clusters can be explored and identified using visual inspection and audio playback.

This analysis provides an efficient way to summarize and explore the sound categories in an audio dataset. The potential uses include:

  • Quickly identifying communities of species
  • Estimating species richness and composition
  • Discovering unknown sound categories
  • Quickly searching for examples of a desired signal/call, without the need for any existing examples
  • Collecting training data for supervised audio recognition models
    • Pattern Matching can be used after AED & Clustering to efficiently detect more examples of a desired sound

Step 1. Audio Event Detection

You must create a playlist before starting Audio Event Detection (AED). 

For testing out AED parameters, please use a small playlist (e.g. 20-50 recordings) to conserve resources and keep Arbimon free.


Include both nocturnal and diurnal recordings to explore how AED performs across different taxonomic groups.


a) Go to Analysis > Audio Event Detection analysis and click the + New Job button

b) Set the parameters for the AED job

  • Job Name - A name to identify the job by.
  • Playlist - The name of the playlist to analyze. For testing different parameters, please use a small test playlist (e.g. 20-50 recordings) to conserve resources and keep Arbimon free.
  • Amplitude Threshold - The minimum amplitude of sound events to detect. This is likely the most critical parameter to tune. It is measured in standard deviations from the mean amplitude of the spectrogram after de-noising. Considering the empirical rule (for roughly normally distributed values, about 68%, 95%, and 99.7% fall within 1, 2, and 3 standard deviations of the mean), appropriate values are likely between 1 and 3.
  • Min/Max Frequency - The minimum and maximum frequency of sound events to detect (kHz).
  • Additional parameters (under Details)
    • Duration threshold - Minimum duration (seconds)
    • Bandwidth threshold - Minimum frequency bandwidth (kHz)
    • Area threshold - Minimum area of an audio event (kHz * seconds)
    • Filter size - The size of the filter used to de-noise the spectrogram (pixels). Increasing/decreasing this can expand/tighten the boundaries of events. For example, this can be used to determine whether adjacent syllables of a call are detected as a single event (increased filter size) or separate events (decreased filter size).
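
To build intuition for the Amplitude Threshold, the following Python sketch marks the cells of a small, made-up spectrogram whose amplitude lies more than a given number of standard deviations above the mean. It is only an illustration under simple assumptions: the function name `amplitude_mask`, the example values, and the use of a single global mean and standard deviation are hypothetical and are not taken from Arbimon's actual implementation.

```python
from statistics import mean, pstdev


def amplitude_mask(spectrogram, threshold_sd):
    """Return a grid of booleans: True where amplitude > mean + threshold_sd * SD.

    Assumption: one global mean and standard deviation are computed over
    the whole (already de-noised) spectrogram.
    """
    values = [v for row in spectrogram for v in row]
    cutoff = mean(values) + threshold_sd * pstdev(values)
    return [[v > cutoff for v in row] for row in spectrogram]


# Hypothetical de-noised spectrogram: rows are frequency bins,
# columns are time frames, values are amplitudes in arbitrary units.
spectrogram = [
    [0.1, 0.2, 0.1, 0.1, 0.2],
    [0.1, 0.9, 1.0, 0.2, 0.1],
    [0.2, 0.8, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.1, 0.2],
]

for sd in (1, 2, 3):
    mask = amplitude_mask(spectrogram, sd)
    kept = sum(cell for row in mask for cell in row)
    print(f"Amplitude Threshold = {sd} SD: {kept} cells above the cutoff")
```

Raising the threshold leaves fewer cells above the cutoff, which is why higher values tend to detect only the loudest sounds.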


Determine your main ecological questions. For example, are you interested in detecting all species/sounds or those in a specific frequency range?
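
If you are interested in a specific frequency range, it can help to think of each detected event as a box in time and frequency. The sketch below applies the Min/Max Frequency limits and the duration, bandwidth, and area thresholds to a few made-up boxes. The `AudioEvent` class, the `keep_event` function, the example values, and the rule that an event must lie entirely inside the frequency range are assumptions for illustration only; Arbimon may apply these parameters differently.

```python
from dataclasses import dataclass


@dataclass
class AudioEvent:
    """Hypothetical detection box: times in seconds, frequencies in kHz."""
    t_start: float
    t_end: float
    f_low: float
    f_high: float

    @property
    def duration(self):
        return self.t_end - self.t_start

    @property
    def bandwidth(self):
        return self.f_high - self.f_low

    @property
    def area(self):
        # Area of the box in kHz * seconds.
        return self.duration * self.bandwidth


def keep_event(ev, min_freq, max_freq, min_duration, min_bandwidth, min_area):
    """Return True if the event passes every threshold.

    Assumption: an event must lie entirely within [min_freq, max_freq].
    """
    in_band = ev.f_low >= min_freq and ev.f_high <= max_freq
    return (
        in_band
        and ev.duration >= min_duration
        and ev.bandwidth >= min_bandwidth
        and ev.area >= min_area
    )


# Made-up example events.
events = [
    AudioEvent(1.0, 1.5, 2.0, 4.0),     # 0.5 s long, 2 kHz wide
    AudioEvent(3.0, 3.0625, 2.5, 3.0),  # very short click
    AudioEvent(5.0, 6.0, 9.0, 11.0),    # above the chosen frequency range
]

kept = [
    ev for ev in events
    if keep_event(ev, min_freq=1.0, max_freq=8.0,
                  min_duration=0.125, min_bandwidth=0.25, min_area=0.0625)
]
print(f"{len(kept)} of {len(events)} events kept")
```

In this toy example the short click fails the duration threshold and the high-frequency event falls outside the chosen range, so only the first event remains.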

c) Click the green Run button to queue the job. You can check the status of your AED job on the Jobs page.

Once your AED job is completed, return to Analysis > Audio Event Detection. Click on the box icon at the end of the line (View in the Visualizer) to open the Visualizer.

Once in the Visualizer click on the Audio Events tab. You will see the list of AED jobs that have processed the current recording. You can activate/deactivate the eye icons next to each AED job to visualize the audio event detection boxes from different jobs. In this way, the effects of different parameters can be compared.

After finding parameters that are accurately detecting sounds of interest in a relatively small playlist, analyze your entire dataset.

Note: There is currently a playlist size limit of 2,000 recordings.
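
If your dataset contains more recordings than this limit, one option is to plan several playlists in advance. The sketch below only splits a list of file names into groups of at most 2,000; the function name `split_into_playlists` and the file names are hypothetical, and the playlists themselves would still need to be created in Arbimon.

```python
def split_into_playlists(recordings, max_size=2000):
    """Split a list of recordings into consecutive groups of at most max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [
        recordings[i:i + max_size]
        for i in range(0, len(recordings), max_size)
    ]


# Made-up file names standing in for a full dataset.
recordings = [f"recording_{i:05d}.wav" for i in range(4500)]
batches = split_into_playlists(recordings)

for number, batch in enumerate(batches, start=1):
    print(f"Playlist {number}: {len(batch)} recordings")
```

Each group can then be analyzed with the same AED parameters so that results remain comparable across playlists.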

Step 2. Clustering

When the AED job is completed, Clustering analysis can be applied to automatically group the detected events into categories based on their similarities. These categories can then be easily explored.


a) Go to Analysis > Clustering and click the + New Job button

b) Set the parameters for the Clustering job

  • Audio Event Detection Job - The Audio Event Detection job whose results you would like to cluster.
  • Distance Threshold - This parameter will likely have the most significant impact on results. It is the maximum allowed distance between neighboring points in a cluster. Smaller values result in smaller, more homogeneous clusters; larger values result in larger, more heterogeneous clusters. Based on the feature set and distance function used to compare audio events, appropriate values are likely within the range [0.05, 0.2]. See the table and figures under Parameter Examples below.
  • Min. Points - The minimum number of points required to form a cluster. If small clusters (e.g. rare sounds) are of interest, it is recommended to keep this at a low value (<= 5). Larger values will restrict the results to more densely populated clusters.

c) Click the Run button to queue the job.

Parameter Examples

Distance Threshold | Similarity of audio events in a cluster | Number of clusters | Size of clusters
Lower              | Higher                                  | Higher             | Lower
Higher             | Lower                                   | Lower              | Higher

Table: Effect of the Distance Threshold parameter

Figure: Example random 2-dimensional data with resulting clusters, varying Distance Threshold (Eps) from 0.05 to 0.15 and holding Min. Points = 5. Grey points are noise, other points are color-coded by cluster.

Figure: Example random 2-dimensional data with resulting clusters, varying Min. Points from 2 to 300 and holding Distance Threshold = 0.1. Grey points are noise, other points are color-coded by cluster.

For more details about the clustering algorithm and parameters, see the original DBSCAN paper.
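
To see how Distance Threshold and Min. Points interact, here is a minimal pure-Python sketch of the DBSCAN idea applied to a handful of made-up 2-dimensional points. It assumes plain Euclidean distance and counts a point as its own neighbor; the function names `dbscan` and `summarize` and the data are hypothetical, and the features and distance function Arbimon uses for audio events are not reproduced here.

```python
import math

NOISE = -1


def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise.

    eps plays the role of Distance Threshold; min_points of Min. Points.
    """
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


def summarize(labels):
    """Return (number of clusters, number of noise points)."""
    return len(set(labels) - {NOISE}), labels.count(NOISE)


# Made-up data: two tight groups and one isolated point.
group_a = [(0.10, 0.10), (0.13, 0.10), (0.10, 0.13), (0.13, 0.13), (0.16, 0.13)]
group_b = [(x + 0.15, y) for x, y in group_a]
points = group_a + group_b + [(0.90, 0.90)]

for eps in (0.05, 0.15):
    n_clusters, n_noise = summarize(dbscan(points, eps, min_points=3))
    print(f"Distance Threshold={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With these toy points, the smaller threshold keeps the two groups apart while the larger one merges them into a single cluster, matching the trend in the table above. Raising Min. Points above the number of neighbors any point has would leave every point labeled as noise.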

Inspecting Clusters

To visualize and explore the results of a Clustering job, go to Analysis > Clustering. Select Show Details next to a job in the list (right side).

This will open a scatter plot that visualizes the clustered audio events. Each point in the scatter plot represents an audio event, and its color indicates its assigned cluster (category). Points or clusters that lie close together are more similar than those that are far apart.

By selecting the filter icon in the upper right, the displayed audio events can be filtered to those within a specific frequency range. 

Click the lasso or box selection tools to select specific audio events/clusters in the scatter plot.

After selecting a region of the plot, you will have the option to move to the Context View or Grid View (seen in the upper right).

The Grid View is likely the most useful option: it displays images of the audio event spectrograms in a grid, grouped by cluster (see the figure below). Audio events can also be grouped by site and date. Alternatively, the Context View opens the set of recordings containing the selected audio events in the Visualizer.
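
The grouping options in the Grid View can be pictured as sorting the same set of events under different keys. The following sketch groups a few made-up events by cluster, site, or date; the record fields and the `group_events` function are hypothetical and do not reflect how Arbimon stores or exports its data.

```python
from collections import defaultdict

# Made-up audio events; field names are illustrative only.
events = [
    {"id": 1, "cluster": 0, "site": "Site A", "date": "2023-05-01"},
    {"id": 2, "cluster": 1, "site": "Site A", "date": "2023-05-02"},
    {"id": 3, "cluster": 0, "site": "Site B", "date": "2023-05-01"},
    {"id": 4, "cluster": 1, "site": "Site B", "date": "2023-05-02"},
    {"id": 5, "cluster": 0, "site": "Site B", "date": "2023-05-02"},
]


def group_events(events, key):
    """Group event ids by the given field ("cluster", "site", or "date")."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event["id"])
    return dict(groups)


for key in ("cluster", "site", "date"):
    print(key, group_events(events, key))
```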

Finally, all clusters can also be visualized in the Grid View right away by selecting View All Clusters (upper right).

Validating Results

In the Grid View, select the audio events you want to validate. Enter a species name and select a songtype from the dropdown next to the Validate as area.

A playlist of the recordings containing the selected audio events can also be saved by selecting the note symbol.

All validations can be visualized and explored in Arbimon Insights.