Concentrating Harder for Faster Audio Transformer

Schmidt L, Peters N (2025)


Publication Type: Conference contribution

Publication year: 2025

Publisher: Institute of Electrical and Electronics Engineers Inc.

Conference Proceedings Title: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Event location: Hyderabad, IND

ISBN: 9798350368741

DOI: 10.1109/ICASSP49660.2025.10887669

Abstract

Attention-based models have become tremendously successful in the last couple of years for tasks such as Acoustic Scene Classification, Event Classification, or Speaker Identification. They operate on a set of tokens extracted from audio features and scale quadratically in sequence length due to their pairwise operations in multi-head attention. In this paper, we propose to parameterize a Dirichlet prior with a Transformer model, to jointly estimate label and token probabilities. A token bottleneck, provided by a Categorical-Dirichlet pair, forces the model to concentrate on a subset of tokens. This allows for improved interpretability and higher audio throughput during inference. We compare two different methods for token sampling: full knowledge of all tokens, and partial token sampling for reduced complexity. We evaluate and interpret typical audio datasets such as the Environmental Sound Classification (ESC-50) dataset, the TAU Urban Acoustic Scenes 2020 (TAU20) dataset, and the Speech Command version 2 (SCv2) dataset. The results show that the token budget can be reduced without significant performance loss, especially for acoustic scene classification. We show that for the ESC-50 and SCv2 datasets, the token relevance can be well approximated with a partial token view. Finally, we show that a significant increase in throughput can be achieved with our proposed methods.

How to cite

APA:

Schmidt, L., & Peters, N. (2025). Concentrating Harder for Faster Audio Transformer. In Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta (Eds.), ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Hyderabad, IND: Institute of Electrical and Electronics Engineers Inc.

MLA:

Schmidt, Lorenz, and Nils Peters. "Concentrating Harder for Faster Audio Transformer." Proceedings of the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025, Hyderabad, IND. Ed. Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta, Institute of Electrical and Electronics Engineers Inc., 2025.
