Inventor(s)

D Shin

Abstract

Deep transformer decoder architectures utilized in large language models (LLMs) perform a series of non-linear operations on input tokens to predict the most likely next tokens. Executing a fixed number of decoder operations to compute the output tokens means that all queries incur the same or similar computational cost. This disclosure describes a language model framework that performs autoregressive output at several intermediate layers, with training that includes a net loss for token completion at all inference layers, enabling early abort for simple queries. An arbiter network is provided that analyzes the input query and outputs a one-hot vector whose index indicates a particular decoder block of the model. Decoder computations are performed only up to the identified block, and a response is provided based on the tokens output by that block. For a network with n decoder blocks, only complex queries require execution of all n blocks, while simpler queries are answered by executing fewer blocks. Due to the opportunistic early abort, the described framework can, on average, improve user query responsiveness and reduce computational costs. Scalability of LLMs is improved through adaptive computation based on query complexity.
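A minimal sketch of the described early-abort flow is shown below, assuming a PyTorch-style decoder-only model. The class names (ArbiterNetwork, EarlyExitDecoder), layer choices, and dimensions are illustrative assumptions, not details taken from the disclosure.

import torch
import torch.nn as nn

class ArbiterNetwork(nn.Module):
    # Analyzes the input query and emits a one-hot vector whose index
    # identifies the decoder block at which generation may stop.
    def __init__(self, d_model: int, n_blocks: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, n_blocks)

    def forward(self, query_embedding: torch.Tensor) -> torch.Tensor:
        logits = self.scorer(query_embedding.mean(dim=1))  # (batch, n_blocks)
        exit_index = logits.argmax(dim=-1)                 # (batch,)
        return nn.functional.one_hot(exit_index, logits.shape[-1])

class EarlyExitDecoder(nn.Module):
    # Decoder stack with a language-model head after every block so that
    # any block can produce output tokens (needed for early abort and for
    # the per-layer token-completion loss during training).
    def __init__(self, d_model: int, n_heads: int, n_blocks: int, vocab_size: int):
        super().__init__()
        # nn.TransformerEncoderLayer serves here as a stand-in
        # self-attention block for a decoder-only stack.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_blocks)
        )
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_blocks)
        )
        self.arbiter = ArbiterNetwork(d_model, n_blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The arbiter inspects the query once; simple queries map to an
        # early block, complex queries to a later (or the final) block.
        # For simplicity, this sketch uses one exit index per batch.
        exit_block = int(self.arbiter(x).argmax(dim=-1).max())
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == exit_block:
                # Early abort: skip the remaining blocks and decode here.
                return self.heads[i](x)
        return self.heads[-1](x)

# Usage sketch: logits come from whichever block the arbiter selected.
model = EarlyExitDecoder(d_model=64, n_heads=4, n_blocks=6, vocab_size=1000)
tokens = torch.randn(2, 10, 64)  # already-embedded input tokens
logits = model(tokens)           # shape: (2, 10, 1000)

During training, a combined (net) loss over the heads of all blocks would be computed so that every block learns to produce complete tokens; at inference, only the blocks up to the arbiter-selected index are executed.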

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
