OpenAI updates API with model distillation, prompt caching abilities

In what can only be seen as OpenAI’s efforts to catch up with rivals, the ChatGPT-maker released several updates to its API to help ease the development of generative AI-based applications.

These updates, introduced during its DevDay conference this week, include capabilities such as model distillation and prompt caching, which are already offered by rivals.

Model distillation to help reduce costs of gen AI applications

Model distillation, a derivative of knowledge distillation, is a technique used in large language model training. The technique is used to teach a smaller model desired or required knowledge from a larger model.

Model distillation is preferred by developers as it can maintain the performance of a model underpinning an application while reducing the computation requirements and in turn costs.

The rationale is that smaller models, which use less compute, are able to perform like a larger model in a specified field of knowledge or expertise.

Several experts claim that model distillation can be used effectively in real-time natural language processing tasks or in industry sectors such as finance and healthcare that need the model to have domain expertise.

The model distillation capability introduced inside OpenAI API includes three components — Stored Completions, Evals, and Fine-tuning — all of which can be accessed via the API.

In order to distill a model using the OpenAI API, developers need to create an evaluation, either manually or using the Evals component, which is in beta, to measure the performance of the smaller model.

The idea is to continuously monitor the model after distilling it to ensure that it is performing as desired, OpenAI explained.

Post creating the evaluation, developers can use Stored Completions to create a dataset of outputs from the larger model on the desired topic on which the smaller model is to be trained.

Stored Completions, according to OpenAI, is a new free feature inside the API that can be used to automatically capture and store input-output pairs generated by any of the LLMs provided by the company, like GPT-4o or o1-preview.

Once the dataset is created using Stored Completions, it can be reviewed, filtered, and then used to fine-tune the smaller model or can be used as an evaluation dataset.

After this, developers can conduct an evaluation of the smaller model to see if it is performing optimally or is close to the larger model, the company said.

Rivals Google, Anthropic, and AWS already offer model distillation capabilities. 

While Google previously offered the capability to create distilled models for PaLM and currently offers the capability to use Gemini to distill smaller models, AWS provides access to Llama 3.1-405B for synthetic data generation and distillation to fine-tune smaller models.

Model distillation as a feature inside OpenAI API is generally available, the company said, adding that any of its larger models can be used to distill smaller models.

Prompt Caching to reduce latency in gen AI applications

Alongside the distillation ability, OpenAI has also made available prompt caching capability for the latest versions of GPT-4o, GPT-4o mini, o1-preview, and o1-mini, as well as fine-tuned versions of those models.

Prompt caching is a technique used in the gen AI-based application development process that allows the model to understand natural language faster by storing and reusing contexts that are repetitively used while making API calls.

“Many developers use the same context repeatedly across multiple API calls when building AI applications, like when making edits to a codebase or having long, multi-turn conversations with a chatbot,” OpenAI explained, adding that the rationale is to reduce token consumption when sending a request to the LLM.

What that means is that when a new request comes in, the LLM checks if some parts of the request are cached. In case it is cached, it uses the cached version, otherwise it runs the full request.

OpenAI’s new prompt caching capability works on the same fundamental principle, which could help developers save on cost and time.

“By reusing recently seen input tokens, developers can get a 50% discount and faster prompt processing times,” OpenAI said.

Additionally, OpenAI has introduced a public beta of the Realtime API, an API that allows developers to build low-latency, multi-modal experiences including text and speech in apps.

Go to Source

Author: