Inventor(s)

D Shin

Abstract

Deep transformer decoder architectures utilized in large language models (LLMs) perform a series of non-linear operations on input tokens to predict the most likely next tokens. Executing a fixed number of decoder operations to compute the output tokens means that all queries incur the same or similar computational cost. This disclosure describes a language model framework that performs autoregressive output at several intermediate layers, with training that includes a net loss for token completion at all inference layers, enabling early abort for simple queries. An arbiter network is provided that analyzes the input query and outputs a one-hot vector whose index indicates a particular decoder block of the model. Decoder computations are performed only up to the identified block, and a response is provided based on the tokens output by that block. For a network with n decoder blocks, only complex queries require execution of all n blocks, while simpler queries are answered by executing fewer blocks. Due to the opportunistic early abort, the described framework can, on average, improve user query responsiveness and reduce computational costs. Scalability of LLMs is improved through adaptive computation based on query complexity.
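A minimal sketch of the described early-abort flow is shown below, assuming a PyTorch-style decoder-only model. The class names (ArbiterNetwork, EarlyExitDecoder), layer choices, and dimensions are illustrative assumptions, not details taken from the disclosure.

import torch
import torch.nn as nn

class ArbiterNetwork(nn.Module):
    # Analyzes the input query and emits a one-hot vector whose index
    # identifies the decoder block at which generation may stop.
    def __init__(self, d_model: int, n_blocks: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, n_blocks)

    def forward(self, query_embedding: torch.Tensor) -> torch.Tensor:
        logits = self.scorer(query_embedding.mean(dim=1))  # (batch, n_blocks)
        exit_index = logits.argmax(dim=-1)                 # (batch,)
        return nn.functional.one_hot(exit_index, logits.shape[-1])

class EarlyExitDecoder(nn.Module):
    # Decoder stack with a language-model head after every block so that
    # any block can produce output tokens (needed for early abort and for
    # the per-layer token-completion loss during training).
    def __init__(self, d_model: int, n_heads: int, n_blocks: int, vocab_size: int):
        super().__init__()
        # nn.TransformerEncoderLayer serves here as a stand-in
        # self-attention block for a decoder-only stack.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_blocks)
        )
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_blocks)
        )
        self.arbiter = ArbiterNetwork(d_model, n_blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The arbiter inspects the query once; simple queries map to an
        # early block, complex queries to a later (or the final) block.
        # For simplicity, this sketch uses one exit index per batch.
        exit_block = int(self.arbiter(x).argmax(dim=-1).max())
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == exit_block:
                # Early abort: skip the remaining blocks and decode here.
                return self.heads[i](x)
        return self.heads[-1](x)

# Usage sketch: logits come from whichever block the arbiter selected.
model = EarlyExitDecoder(d_model=64, n_heads=4, n_blocks=6, vocab_size=1000)
tokens = torch.randn(2, 10, 64)  # already-embedded input tokens
logits = model(tokens)           # shape: (2, 10, 1000)

During training, a combined (net) loss over the heads of all blocks would be computed so that every block learns to produce complete tokens; at inference, only the blocks up to the arbiter-selected index are executed.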

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
