Schmidt L, Peters N (2025)
Publication Type: Conference contribution
Publication year: 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Conference Proceedings Title: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Event location: Hyderabad, IND
ISBN: 9798350368741
DOI: 10.1109/ICASSP49660.2025.10887669
Attention-based models have become highly successful in recent years for tasks such as Acoustic Scene Classification, Event Classification, and Speaker Identification. They operate on a set of tokens extracted from audio features and scale quadratically in sequence length because multi-head attention compares tokens pairwise. In this paper, we propose to parameterize a Dirichlet prior with a Transformer model to jointly estimate label and token probabilities. A token bottleneck, provided by a Categorical-Dirichlet pair, forces the model to concentrate on a subset of tokens. This improves interpretability and allows higher audio throughput during inference. We compare two methods for token sampling: full knowledge of all tokens, and partial token sampling for reduced complexity. We evaluate and interpret the approach on typical audio datasets: the Environmental Sound Classification (ESC-50) dataset, the TAU Urban Acoustic Scenes 2020 (TAU20) dataset, and the Speech Commands version 2 (SCv2) dataset. The results show that the token budget can be reduced without significant performance loss, especially for acoustic scene classification. We show that for the ESC-50 and SCv2 datasets, the token relevance can be well approximated with a partial token view. Finally, we show that a significant increase in throughput can be achieved with our proposed methods.
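The token bottleneck described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the concentration values in `alpha`, the helper names `sample_dirichlet`, `select_tokens` and `attention_pairs`, and the budget of 16 out of 64 tokens are all assumptions chosen for illustration. In the paper the Dirichlet parameters are produced by a Transformer, whereas here they are fixed by hand. The sketch draws token relevances from a Dirichlet distribution, keeps a budgeted subset by categorical sampling without replacement, and compares the number of pairwise attention operations before and after.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw is a vector of independent Gamma(alpha_i, 1) draws, normalized.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def select_tokens(alpha, budget, rng):
    # Hypothetical bottleneck: sample relevances, then keep `budget` distinct tokens.
    probs = sample_dirichlet(alpha, rng)
    candidates = list(range(len(alpha)))
    weights = list(probs)
    kept = []
    for _ in range(budget):
        # Categorical draw over the remaining tokens, weighted by relevance.
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        kept.append(candidates.pop(pick))
        weights.pop(pick)
    return sorted(kept), probs

def attention_pairs(num_tokens):
    # Self-attention compares every token with every token.
    return num_tokens * num_tokens

rng = random.Random(0)
num_tokens, budget = 64, 16                              # assumed sizes
alpha = [0.5 + 0.1 * i for i in range(num_tokens)]       # assumed concentrations
kept, probs = select_tokens(alpha, budget, rng)
reduction = attention_pairs(num_tokens) / attention_pairs(budget)
```

Under these assumed sizes, keeping a quarter of the tokens reduces the pairwise attention count by a factor of 16, which is the kind of saving the quadratic scaling argument points to. Actual throughput gains depend on the full model and are reported in the paper, not by this sketch.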
APA:
Schmidt, L., & Peters, N. (2025). Concentrating Harder for Faster Audio Transformer. In Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta (Eds.), ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Hyderabad, IND: Institute of Electrical and Electronics Engineers Inc.
MLA:
Schmidt, Lorenz, and Nils Peters. "Concentrating Harder for Faster Audio Transformer." Proceedings of the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025, Hyderabad, IND. Ed. Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta, Institute of Electrical and Electronics Engineers Inc., 2025.