The cost of requests in an LLM is determined by incoming and outgoing tokens:
Choose a model based on the task:
When sending a request to an LLM, you can set a limit on the number of tokens in the response. It’s important to keep in mind: if the model’s response exceeds the set token limit, the excess portion will be truncated, and the response may lose its meaning.
When sending requests to an LLM from a script, the token limit is specified as one of the request parameters. When working with AI agents, the limit is specified in the settings for a specific model under the LLM Provider section of the agent’s parameters.
Prompt engineering is an effective method for reducing the number of tokens and associated costs. By creating clear, concise, and unambiguous prompts, you can guide the model on how to generate more effective responses. Eliminate redundant wording and unnecessary context that can increase the token count. Consider explicitly specifying the desired output length to the model — for example, by adding phrases like “Limit the response to two sentences” or “Provide a brief summary.” These simple instructions can significantly reduce the number of tokens in the output while maintaining the quality and relevance of the generated content.
Prompt optimization is an effective method for reducing the number of tokens used.
When crafting a prompt, clearly structure the instructions and remove any unnecessary or repetitive information. The shorter the prompt, the fewer input tokens will be consumed by the request.
You can also set limits on the size of the LLM’s response by adding a phrase like “The response must fit into a maximum of two sentences” to the prompt—this will reduce the cost of outgoing tokens.
In some cases, you can significantly reduce the length of the response: for example, if you need to use an LLM to determine the category to which an input phrase belongs, you can specify in the prompt that the LLM should return only the category number instead of the full name, and prohibit it from providing any information other than the category number, while further processing the response using conditions in the bot’s script. In this case, such strict restrictions allow you to significantly reduce the number of output tokens used.
If the LLM’s task is to determine a category, identify the presence of information in a phrase, recognize intent, or perform another action where there is a limited number of possible outcomes, you should consider setting similar restrictions.
For certain tasks, you can use API requests to external systems—this may be more efficient than spending a large number of tokens on an LLM request.
When working with requests that return large amounts of data, you should limit the amount of data received, as all of it will be billed as incoming tokens—this can be done using the API request parameters (the availability of this feature depends on the API being used).
When using the Request to External System tool for AI agents, you can specify specific fields to retrieve from the server’s response using the Token Control tab in the request settings.
Also, when working with AI agents, you should provide precise instructions on when and under what conditions to execute requests to avoid unnecessary or duplicate requests, thereby reducing the cost of tokens that will be spent processing responses to requests.