Concentrating Harder for Faster Audio Transformer

Schmidt L, Peters N (2025)

Publication Type: Conference contribution

Publication year: 2025

Journal

2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Publisher: Institute of Electrical and Electronics Engineers Inc.

Conference Proceedings Title: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Event location: Hyderabad, IND

ISBN: 9798350368741

DOI: 10.1109/ICASSP49660.2025.10887669

Abstract

Attention-based models have become tremendously successful in the last couple of years for tasks such as Acoustic Scene Classification, Event Classification, or Speaker Identification. They operate on a set of tokens extracted from audio features and scale quadratically in sequence length due to their pairwise operations in multi-head attention. In this paper, we propose to parameterize a Dirichlet prior with a Transformer model to jointly estimate label and token probabilities. A token bottleneck, provided by a Categorical-Dirichlet pair, forces the model to concentrate on a subset of tokens. This allows for improved interpretability and higher audio throughput during inference. We compare two different methods for token sampling: full knowledge of all tokens, and partial token sampling for reduced complexity. We evaluate and interpret typical audio datasets such as the Environmental Sound Classification (ESC-50) dataset, the TAU Urban Acoustic Scenes 2020 (TAU20) dataset, and the Speech Commands version 2 (SCv2) dataset. The results show that the token budget can be reduced without significant performance loss, especially for acoustic scene classification. We show that for the ESC-50 and SCv2 datasets, the token relevance can be well approximated with a partial token view. Finally, we show that a significant increase in throughput can be achieved with our proposed methods.
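The token-bottleneck idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the concentration values in `alpha`, the helper names `select_tokens` and `attention_pair_count`, and the choice of weighted sampling without replacement are all assumptions made for illustration. The sketch only shows how a Dirichlet draw yields token probabilities, how a Categorical-style draw keeps a fixed token budget, and why a smaller budget reduces the pairwise cost of attention.

```python
import numpy as np


def select_tokens(alpha, budget, rng):
    """Draw token probabilities from Dirichlet(alpha) and keep `budget` tokens.

    `alpha` stands in for concentration parameters that a model would predict;
    here it is supplied by hand (hypothetical values).
    """
    probs = rng.dirichlet(alpha)  # point on the probability simplex
    # Weighted sampling without replacement: each token is kept at most once.
    kept = rng.choice(len(alpha), size=budget, replace=False, p=probs)
    return np.sort(kept), probs


def attention_pair_count(num_tokens):
    """Number of pairwise interactions in one self-attention map."""
    return num_tokens * num_tokens


rng = np.random.default_rng(0)
num_tokens, budget = 128, 32

# Hypothetical concentrations: the first 16 tokens are marked as more relevant.
alpha = np.ones(num_tokens)
alpha[:16] = 20.0

kept, probs = select_tokens(alpha, budget, rng)
reduction = attention_pair_count(num_tokens) / attention_pair_count(budget)
print(f"kept {len(kept)} of {num_tokens} tokens, pairwise cost reduced {reduction:.0f}x")
```

With a budget of 32 out of 128 tokens, the number of pairwise interactions drops by a factor of 16, which is the kind of saving the quadratic scaling argument implies. How the paper actually samples tokens, and how much throughput it gains in practice, is described in the publication itself.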

Authors with CRIS profile

Lorenz Schmidt International Audio Laboratories Erlangen (AudioLabs)

Involved external institutions

Trinity College Dublin

Ireland (IE)

How to cite

APA:

Schmidt, L., & Peters, N. (2025). Concentrating Harder for Faster Audio Transformer. In Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta (Eds.), ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Hyderabad, IND: Institute of Electrical and Electronics Engineers Inc.

MLA:

Schmidt, Lorenz, and Nils Peters. "Concentrating Harder for Faster Audio Transformer." Proceedings of the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025, Hyderabad, IND. Ed. Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta, Institute of Electrical and Electronics Engineers Inc., 2025.
