diff --git a/model_zoo/index.html b/model_zoo/index.html index 5573000d..d64a4ef9 100644 --- a/model_zoo/index.html +++ b/model_zoo/index.html @@ -737,6 +737,18 @@

Public Model ZooUsage

diff --git a/search/search_index.json b/search/search_index.json index 3e45c5bc..dc15fdea 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Introduction","text":""},{"location":"#llm-engine","title":"LLM Engine","text":"

The open source engine for fine-tuning and serving large language models.

LLM Engine is the easiest way to customize and serve LLMs.

LLMs can be accessed via Scale's hosted version or by using the Helm charts in this repository to run model inference and fine-tuning in your own infrastructure.

"},{"location":"#quick-install","title":"Quick Install","text":"Install the python package
pip install scale-llm-engine\n
"},{"location":"#about","title":"About","text":"

Foundation models are emerging as the building blocks of AI. However, deploying these models to the cloud and fine-tuning them is an expensive operation that requires infrastructure and ML expertise. It is also difficult to maintain over time as new models are released and new techniques for both inference and fine-tuning are made available.

LLM Engine is a Python library and Helm chart that provides everything you need to serve and fine-tune foundation models, whether you use Scale's hosted infrastructure or do it in your own cloud infrastructure using Kubernetes.

"},{"location":"#key-features","title":"Key Features","text":"

Ready-to-use APIs for your favorite models: Deploy and serve open source foundation models - including Llama-2, MPT, and Falcon. Use Scale-hosted models or deploy to your own infrastructure.

Fine-tune the best open-source models: Fine-tune open-source foundation models like Llama-2, MPT, etc. with your own data for optimized performance.

Optimized Inference: LLM Engine provides inference APIs for streaming responses and dynamically batching inputs for higher throughput and lower latency.

Open-Source Integrations: Deploy any Hugging Face model with a single command.

Deploy from any Docker image: Turn any Docker image into an auto-scaling deployment with simple APIs.

"},{"location":"#features-coming-soon","title":"Features Coming Soon","text":"

Kubernetes Installation Enhancements: We are working hard to enhance the installation and maintenance of inference and fine-tuning functionality on your infrastructure. For now, our documentation covers experimental libraries to deploy language models on your infrastructure and libraries to access Scale's hosted infrastructure.

Fast Cold-Start Times: To prevent GPUs from idling, LLM Engine automatically scales your model to zero when it's not in use and scales up within seconds, even for large foundation models.

Cost Optimization: Deploy AI models cheaper than commercial ones, including cold-start and warm-down times.

"},{"location":"contributing/","title":"Contributing to LLM Engine","text":""},{"location":"contributing/#updating-llm-engine-documentation","title":"Updating LLM Engine Documentation","text":"

LLM Engine leverages mkdocs to create beautiful, community-oriented documentation.

"},{"location":"contributing/#step-1-clone-the-repository","title":"Step 1: Clone the Repository","text":"

Clone/Fork the LLM Engine Repository. Our documentation lives in the docs folder.

"},{"location":"contributing/#step-2-install-the-dependencies","title":"Step 2: Install the Dependencies","text":"

Dependencies are listed in requirements-docs.txt. Install them with

pip install -r requirements-docs.txt\n
"},{"location":"contributing/#step-3-install-the-python-client-locally","title":"Step 3: Install the Python client locally","text":"

Our Python client API reference is autogenerated from our client. You can install the client in editable mode with

pip install -e clients/python\n
"},{"location":"contributing/#step-4-run-locally","title":"Step 4: Run Locally","text":"

To run the documentation service locally, execute the following command:

mkdocs serve\n

This should kick off a locally running instance on http://127.0.0.1:8000/.

As you edit the content in the docs folder, the site will be automatically reloaded on each file save.

"},{"location":"contributing/#step-5-editing-navigation-and-settings","title":"Step 5: Editing Navigation and Settings","text":"

If you are less familiar with mkdocs, in addition to the markdown content in the docs folder, there is a top-level mkdocs.yml file as well that defines the navigation pane and other website settings. If you don't see your page where you think it should be, double-check the .yml file.

"},{"location":"contributing/#step-6-building-and-deploying","title":"Step 6: Building and Deploying","text":"

CircleCI (via .circleci/config.yml) handles the building and deployment of our documentation service for us.

"},{"location":"faq/","title":"Frequently Asked Questions","text":""},{"location":"getting_started/","title":"Getting Started","text":"

The fastest way to get started with LLM Engine is to use the Python client in this repository to run inference and fine-tuning on Scale's infrastructure. This path does not require you to install anything on your infrastructure, and Scale's free research preview gives you access to experimentation using open source LLMs.

To start, install LLM Engine via pip:

pip install scale-llm-engine\n
"},{"location":"getting_started/#scale-api-keys","title":"Scale API Keys","text":"

Next, you need a Scale Spellbook API key.

"},{"location":"getting_started/#retrieving-your-api-key","title":"Retrieving your API Key","text":"

To retrieve your API key, head to Scale Spellbook where you will get an API key on the settings page.

Different API Keys for different Scale Products

If you have leveraged Scale's platform for annotation work in the past, please note that your Spellbook API key will be different than the Scale Annotation API key. You will want to create a Spellbook API key before getting started.

"},{"location":"getting_started/#set-your-api-key","title":"Set your API Key","text":"

LLM Engine uses environment variables to access your API key.

Set this API key as the SCALE_API_KEY environment variable by running the following command in your terminal before you run your python application.

export SCALE_API_KEY=\"[Your API key]\"\n

You can also add the line above to your .zshrc or .bash_profile so that it is set automatically in future sessions.

Alternatively, you can also set your API key using either of the following patterns:

llmengine.api_engine.api_key = \"abc\"\nllmengine.api_engine.set_api_key(\"abc\")\n
These patterns are useful for Jupyter Notebook users to set API keys without the need for using os.environ.
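
A missing key usually surfaces only as an authentication error on the first request, so it can help to check for it up front. The following is a minimal sketch using only the standard library; `resolve_api_key` is a hypothetical helper and not part of llmengine. Only the `SCALE_API_KEY` variable name comes from the text above.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the API key to use, preferring an explicitly supplied value.

    Hypothetical helper for illustration; llmengine does not ship it.
    """
    # Fall back to the process environment unless a mapping is injected (handy in tests).
    env = os.environ if env is None else env
    key = (explicit_key or env.get("SCALE_API_KEY", "")).strip()
    if not key:
        raise RuntimeError(
            "No API key found: export SCALE_API_KEY or pass a key explicitly."
        )
    return key
```

The returned value could then be handed to either of the patterns shown above, for example in a notebook where exporting shell variables is inconvenient.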

"},{"location":"getting_started/#example-code","title":"Example Code","text":""},{"location":"getting_started/#sample-completion","title":"Sample Completion","text":"

With your API key set, you can now send LLM Engine requests using the Python client:

from llmengine import Completion\n\nresponse = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"I'm opening a pancake restaurant that specializes in unique pancake shapes, colors, and flavors. List 3 quirky names I could name my restaurant.\",\n    max_new_tokens=100,\n    temperature=0.2,\n)\n\nprint(response.output.text)\n
"},{"location":"getting_started/#with-streaming","title":"With Streaming","text":"
import sys\nfrom llmengine import Completion\n\nstream = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"Give me a 200 word summary on the current economic events in the US.\",\n    max_new_tokens=1000,\n    temperature=0.2,\n    stream=True,\n)\n\nfor response in stream:\n    if response.output:\n        print(response.output.text, end=\"\")\n        sys.stdout.flush()\n    else:  # an error occurred\n        print(response.error)  # print the error message out\n        break\n
"},{"location":"integrations/","title":"Integrations","text":""},{"location":"integrations/#weights-biases","title":"Weights & Biases","text":"

LLM Engine integrates with Weights & Biases to track metrics during fine tuning. To enable:

from llmengine import FineTune\n\nresponse = FineTune.create(\n    model=\"llama-2-7b\",\n    training_file=\"s3://my-bucket/path/to/training-file.csv\",\n    validation_file=\"s3://my-bucket/path/to/validation-file.csv\",\n    hyperparameters={\"report_to\": \"wandb\"},\n    wandb_config={\"api_key\":\"key\", \"project\":\"fine-tune project\"}\n)\n

Configs to specify:

"},{"location":"model_zoo/","title":"Public Model Zoo","text":"

Scale hosts the following models in the LLM Engine Model Zoo:

| Model Name | Inference APIs Available | Fine-tuning APIs Available | Inference Frameworks Available |
| --- | --- | --- | --- |
| llama-7b | \u2705 | \u2705 | deepspeed, text-generation-inference |
| llama-2-7b | \u2705 | \u2705 | text-generation-inference, vllm |
| llama-2-7b-chat | \u2705 | | text-generation-inference, vllm |
| llama-2-13b | \u2705 | | text-generation-inference, vllm |
| llama-2-13b-chat | \u2705 | | text-generation-inference, vllm |
| llama-2-70b | \u2705 | \u2705 | text-generation-inference, vllm |
| llama-2-70b-chat | \u2705 | | text-generation-inference, vllm |
| falcon-7b | \u2705 | | text-generation-inference, vllm |
| falcon-7b-instruct | \u2705 | | text-generation-inference, vllm |
| falcon-40b | \u2705 | | text-generation-inference, vllm |
| falcon-40b-instruct | \u2705 | | text-generation-inference, vllm |
| mpt-7b | \u2705 | | deepspeed, text-generation-inference, vllm |
| mpt-7b-instruct | \u2705 | \u2705 | deepspeed, text-generation-inference, vllm |
| flan-t5-xxl | \u2705 | | deepspeed, text-generation-inference |
| mistral-7b | \u2705 | \u2705 | vllm |
| mistral-7b-instruct | \u2705 | \u2705 | vllm |
| codellama-7b | \u2705 | \u2705 | text-generation-inference, vllm |
| codellama-7b-instruct | \u2705 | \u2705 | text-generation-inference, vllm |
| codellama-13b | \u2705 | \u2705 | text-generation-inference, vllm |
| codellama-13b-instruct | \u2705 | \u2705 | text-generation-inference, vllm |
| codellama-34b | \u2705 | \u2705 | text-generation-inference, vllm |
| codellama-34b-instruct | \u2705 | \u2705 | text-generation-inference, vllm |
"},{"location":"model_zoo/#usage","title":"Usage","text":"

Each of these models can be used with the Completion API.

The specified models can be fine-tuned with the FineTune API.

More information about the models can be found using the Model API.

"},{"location":"pricing/","title":"Pricing","text":"

LLM Engine is an open-source project and free self-hosting will always be an option.

A hosted option for LLM Engine is being offered initially as a free preview via Scale Spellbook.

"},{"location":"pricing/#self-hosted-models","title":"Self-Hosted Models","text":"

We are committed to supporting the open-source community. Self-hosting LLM Engine will remain free and open-source.

We would love contributions from the community to make this even more amazing!

"},{"location":"pricing/#hosted-models","title":"Hosted Models","text":"

Once the limited preview period has ended, billing for hosted models will be managed through the Scale Spellbook product.

Scale Spellbook leverages usage-based spending, billed to a credit card. Details on usage-based pricing will be shared with everyone before completing the limited preview.

"},{"location":"api/data_types/","title":"\ud83d\udc0d Python Client Data Type Reference","text":""},{"location":"api/data_types/#llmengine.CompletionOutput","title":"CompletionOutput","text":"

Bases: BaseModel

Represents the output of a completion request to a model.

"},{"location":"api/data_types/#llmengine.CompletionOutput.text","title":"text instance-attribute","text":"
text: str\n

The text of the completion.

"},{"location":"api/data_types/#llmengine.CompletionOutput.num_completion_tokens","title":"num_completion_tokens instance-attribute","text":"
num_completion_tokens: int\n

Number of tokens in the completion.

"},{"location":"api/data_types/#llmengine.CompletionStreamOutput","title":"CompletionStreamOutput","text":"

Bases: BaseModel

"},{"location":"api/data_types/#llmengine.CompletionStreamOutput.text","title":"text instance-attribute","text":"
text: str\n

The text of the completion.

"},{"location":"api/data_types/#llmengine.CompletionStreamOutput.finished","title":"finished instance-attribute","text":"
finished: bool\n

Whether the completion is finished.

"},{"location":"api/data_types/#llmengine.CompletionStreamOutput.num_completion_tokens","title":"num_completion_tokens class-attribute instance-attribute","text":"
num_completion_tokens: Optional[int] = None\n

Number of tokens in the completion.

"},{"location":"api/data_types/#llmengine.CompletionSyncResponse","title":"CompletionSyncResponse","text":"

Bases: BaseModel

Response object for a synchronous prompt completion.

"},{"location":"api/data_types/#llmengine.CompletionSyncResponse.request_id","title":"request_id instance-attribute","text":"
request_id: str\n

The unique ID of the corresponding Completion request. This request_id is generated on the server, and all logs associated with the request are grouped by the request_id, which allows for easier troubleshooting of errors as follows:

"},{"location":"api/data_types/#llmengine.CompletionSyncResponse.output","title":"output instance-attribute","text":"
output: CompletionOutput\n

Completion output.

"},{"location":"api/data_types/#llmengine.CompletionStreamResponse","title":"CompletionStreamResponse","text":"

Bases: BaseModel

Response object for a stream prompt completion task.

"},{"location":"api/data_types/#llmengine.CompletionStreamResponse.request_id","title":"request_id instance-attribute","text":"
request_id: str\n

The unique ID of the corresponding Completion request. This request_id is generated on the server, and all logs associated with the request are grouped by the request_id, which allows for easier troubleshooting of errors as follows:

"},{"location":"api/data_types/#llmengine.CompletionStreamResponse.output","title":"output class-attribute instance-attribute","text":"
output: Optional[CompletionStreamOutput] = None\n

Completion output.

"},{"location":"api/data_types/#llmengine.CreateFineTuneResponse","title":"CreateFineTuneResponse","text":"

Bases: BaseModel

Response object for creating a FineTune.

"},{"location":"api/data_types/#llmengine.CreateFineTuneResponse.id","title":"id class-attribute instance-attribute","text":"
id: str = Field(\n    ..., description=\"ID of the created fine-tuning job.\"\n)\n

The ID of the FineTune.

"},{"location":"api/data_types/#llmengine.GetFineTuneResponse","title":"GetFineTuneResponse","text":"

Bases: BaseModel

Response object for retrieving a FineTune.

"},{"location":"api/data_types/#llmengine.GetFineTuneResponse.id","title":"id class-attribute instance-attribute","text":"
id: str = Field(..., description=\"ID of the requested job.\")\n

The ID of the FineTune.

"},{"location":"api/data_types/#llmengine.GetFineTuneResponse.fine_tuned_model","title":"fine_tuned_model class-attribute instance-attribute","text":"
fine_tuned_model: Optional[str] = Field(\n    default=None,\n    description=\"Name of the resulting fine-tuned model. This can be plugged into the Completion API once the fine-tune is complete\",\n)\n

The name of the resulting fine-tuned model. This can be plugged into the Completion API once the fine-tune is complete.

"},{"location":"api/data_types/#llmengine.ListFineTunesResponse","title":"ListFineTunesResponse","text":"

Bases: BaseModel

Response object for listing FineTunes.

"},{"location":"api/data_types/#llmengine.ListFineTunesResponse.jobs","title":"jobs class-attribute instance-attribute","text":"
jobs: List[GetFineTuneResponse] = Field(\n    ...,\n    description=\"List of fine-tuning jobs and their statuses.\",\n)\n

A list of FineTunes, represented as GetFineTuneResponses.

"},{"location":"api/data_types/#llmengine.CancelFineTuneResponse","title":"CancelFineTuneResponse","text":"

Bases: BaseModel

Response object for cancelling a FineTune.

"},{"location":"api/data_types/#llmengine.CancelFineTuneResponse.success","title":"success class-attribute instance-attribute","text":"
success: bool = Field(\n    ..., description=\"Whether cancellation was successful.\"\n)\n

Whether the cancellation succeeded.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse","title":"GetLLMEndpointResponse","text":"

Bases: BaseModel

Response object for retrieving a Model.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.name","title":"name class-attribute instance-attribute","text":"
name: str = Field(\n    description=\"The name of the model. Use this for making inference requests to the model.\"\n)\n

The name of the model. Use this for making inference requests to the model.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.source","title":"source class-attribute instance-attribute","text":"
source: LLMSource = Field(\n    description=\"The source of the model, e.g. Hugging Face.\"\n)\n

The source of the model, e.g. Hugging Face.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.inference_framework","title":"inference_framework class-attribute instance-attribute","text":"
inference_framework: LLMInferenceFramework = Field(\n    description=\"The inference framework used by the model.\"\n)\n

(For self-hosted users) The inference framework used by the model.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.id","title":"id class-attribute instance-attribute","text":"
id: Optional[str] = Field(\n    default=None,\n    description=\"(For self-hosted users) The autogenerated ID of the model.\",\n)\n

(For self-hosted users) The autogenerated ID of the model.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.model_name","title":"model_name class-attribute instance-attribute","text":"
model_name: Optional[str] = Field(\n    default=None,\n    description=\"(For self-hosted users) For fine-tuned models, the base model. For base models, this will be the same as `name`.\",\n)\n

(For self-hosted users) For fine-tuned models, the base model. For base models, this will be the same as name.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.status","title":"status class-attribute instance-attribute","text":"
status: ModelEndpointStatus = Field(\n    description=\"The status of the model.\"\n)\n

The status of the model (can be one of \"READY\", \"UPDATE_PENDING\", \"UPDATE_IN_PROGRESS\", \"UPDATE_FAILED\", \"DELETE_IN_PROGRESS\").

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.inference_framework_tag","title":"inference_framework_tag class-attribute instance-attribute","text":"
inference_framework_tag: Optional[str] = Field(\n    default=None,\n    description=\"(For self-hosted users) The Docker image tag used to run the model.\",\n)\n

(For self-hosted users) The Docker image tag used to run the model.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.num_shards","title":"num_shards class-attribute instance-attribute","text":"
num_shards: Optional[int] = Field(\n    default=None,\n    description=\"(For self-hosted users) The number of shards.\",\n)\n

(For self-hosted users) The number of shards.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.quantize","title":"quantize class-attribute instance-attribute","text":"
quantize: Optional[Quantization] = Field(\n    default=None,\n    description=\"(For self-hosted users) The quantization method.\",\n)\n

(For self-hosted users) The quantization method.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.spec","title":"spec class-attribute instance-attribute","text":"
spec: Optional[GetModelEndpointResponse] = Field(\n    default=None,\n    description=\"(For self-hosted users) Model endpoint details.\",\n)\n

(For self-hosted users) Model endpoint details.

"},{"location":"api/data_types/#llmengine.ListLLMEndpointsResponse","title":"ListLLMEndpointsResponse","text":"

Bases: BaseModel

Response object for listing Models.

"},{"location":"api/data_types/#llmengine.ListLLMEndpointsResponse.model_endpoints","title":"model_endpoints class-attribute instance-attribute","text":"
model_endpoints: List[GetLLMEndpointResponse] = Field(\n    ..., description=\"The list of models.\"\n)\n

A list of Models, represented as GetLLMEndpointResponses.

"},{"location":"api/data_types/#llmengine.DeleteLLMEndpointResponse","title":"DeleteLLMEndpointResponse","text":"

Bases: BaseModel

Response object for deleting a Model.

"},{"location":"api/data_types/#llmengine.DeleteLLMEndpointResponse.deleted","title":"deleted class-attribute instance-attribute","text":"
deleted: bool = Field(\n    ..., description=\"Whether deletion was successful.\"\n)\n

Whether the deletion succeeded.

"},{"location":"api/data_types/#llmengine.ModelDownloadRequest","title":"ModelDownloadRequest","text":"

Bases: BaseModel

Request object for downloading a model.

"},{"location":"api/data_types/#llmengine.ModelDownloadRequest.model_name","title":"model_name class-attribute instance-attribute","text":"
model_name: str = Field(\n    ..., description=\"Name of the model to download.\"\n)\n
"},{"location":"api/data_types/#llmengine.ModelDownloadRequest.download_format","title":"download_format class-attribute instance-attribute","text":"
download_format: Optional[str] = Field(\n    default=\"hugging_face\",\n    description=\"Desired return format for downloaded model weights (default=hugging_face).\",\n)\n
"},{"location":"api/data_types/#llmengine.ModelDownloadResponse","title":"ModelDownloadResponse","text":"

Bases: BaseModel

Response object for downloading a model.

"},{"location":"api/data_types/#llmengine.ModelDownloadResponse.urls","title":"urls class-attribute instance-attribute","text":"
urls: Dict[str, str] = Field(\n    ...,\n    description=\"Dictionary of (file_name, url) pairs to download the model from.\",\n)\n
"},{"location":"api/data_types/#llmengine.UploadFileResponse","title":"UploadFileResponse","text":"

Bases: BaseModel

Response object for uploading a file.

"},{"location":"api/data_types/#llmengine.UploadFileResponse.id","title":"id class-attribute instance-attribute","text":"
id: str = Field(..., description=\"ID of the uploaded file.\")\n

ID of the uploaded file.

"},{"location":"api/data_types/#llmengine.GetFileResponse","title":"GetFileResponse","text":"

Bases: BaseModel

Response object for retrieving a file.

"},{"location":"api/data_types/#llmengine.GetFileResponse.id","title":"id class-attribute instance-attribute","text":"
id: str = Field(\n    ..., description=\"ID of the requested file.\"\n)\n

ID of the requested file.

"},{"location":"api/data_types/#llmengine.GetFileResponse.filename","title":"filename class-attribute instance-attribute","text":"
filename: str = Field(..., description='File name.')\n

File name.

"},{"location":"api/data_types/#llmengine.GetFileResponse.size","title":"size class-attribute instance-attribute","text":"
size: int = Field(\n    ..., description=\"Length of the file, in characters.\"\n)\n

Length of the file, in characters.

"},{"location":"api/data_types/#llmengine.GetFileContentResponse","title":"GetFileContentResponse","text":"

Bases: BaseModel

Response object for retrieving a file's content.

"},{"location":"api/data_types/#llmengine.GetFileContentResponse.id","title":"id class-attribute instance-attribute","text":"
id: str = Field(\n    ..., description=\"ID of the requested file.\"\n)\n

ID of the requested file.

"},{"location":"api/data_types/#llmengine.GetFileContentResponse.content","title":"content class-attribute instance-attribute","text":"
content: str = Field(..., description='File content.')\n

File content.

"},{"location":"api/data_types/#llmengine.ListFilesResponse","title":"ListFilesResponse","text":"

Bases: BaseModel

Response object for listing files.

"},{"location":"api/data_types/#llmengine.ListFilesResponse.files","title":"files class-attribute instance-attribute","text":"
files: List[GetFileResponse] = Field(\n    ..., description=\"List of file IDs, names, and sizes.\"\n)\n

List of file IDs, names, and sizes.

"},{"location":"api/data_types/#llmengine.DeleteFileResponse","title":"DeleteFileResponse","text":"

Bases: BaseModel

Response object for deleting a file.

"},{"location":"api/data_types/#llmengine.DeleteFileResponse.deleted","title":"deleted class-attribute instance-attribute","text":"
deleted: bool = Field(\n    ..., description=\"Whether deletion was successful.\"\n)\n

Whether deletion was successful.

"},{"location":"api/error_handling/","title":"Error handling","text":"

LLM Engine uses conventional HTTP response codes to indicate the success or failure of an API request. In general, codes in the 2xx range indicate success. Codes in the 4xx range indicate an error that failed given the information provided (e.g. a given Model was not found, or an invalid temperature was specified). Codes in the 5xx range indicate an error with the LLM Engine servers.

In the Python client, errors are presented via a set of corresponding Exception classes, which should be caught and handled by the user accordingly.

"},{"location":"api/error_handling/#llmengine.errors.BadRequestError","title":"BadRequestError","text":"
BadRequestError(message: str)\n

Bases: Exception

Corresponds to HTTP 400. Indicates that the request had inputs that were invalid. The user should not attempt to retry the request without changing the inputs.

"},{"location":"api/error_handling/#llmengine.errors.UnauthorizedError","title":"UnauthorizedError","text":"
UnauthorizedError(message: str)\n

Bases: Exception

Corresponds to HTTP 401. This means that no valid API key was provided.

"},{"location":"api/error_handling/#llmengine.errors.NotFoundError","title":"NotFoundError","text":"
NotFoundError(message: str)\n

Bases: Exception

Corresponds to HTTP 404. This means that the resource (e.g. a Model, FineTune, etc.) could not be found. Note that this can also be returned in some cases where the object might exist, but the user does not have access to the object. This is done to avoid leaking information about the existence or nonexistence of said object that the user does not have access to.

"},{"location":"api/error_handling/#llmengine.errors.RateLimitExceededError","title":"RateLimitExceededError","text":"
RateLimitExceededError(message: str)\n

Bases: Exception

Corresponds to HTTP 429. Too many requests hit the API too quickly. We recommend an exponential backoff for retries.
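
The recommended exponential backoff can be sketched as follows. This is an illustration under stated assumptions, not the client's built-in behavior: `call_with_backoff` and its parameters are hypothetical, and the exception class is a local stand-in so the snippet runs without llmengine installed. In real code you would import `RateLimitExceededError` from `llmengine.errors` instead.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitExceededError(Exception):
    """Local stand-in for llmengine.errors.RateLimitExceededError."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on rate-limit errors with exponentially growing waits."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitExceededError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            # The wait doubles on each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

A request would be wrapped in a zero-argument callable, for example `call_with_backoff(lambda: Completion.create(model="llama-2-7b", prompt="Hello"))`. The `sleep` parameter exists only so the waits can be replaced in tests.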

"},{"location":"api/error_handling/#llmengine.errors.ServerError","title":"ServerError","text":"
ServerError(status_code: int, message: str)\n

Bases: Exception

Corresponds to HTTP 5xx errors on the server.
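
The correspondence between status codes and exception classes described on this page can be summarized in a small sketch. The classes below are local stand-ins that mirror the documented names and constructor signatures so the snippet is self-contained; `error_for_status` is a hypothetical helper, and the fallback for unlisted codes is an assumption rather than documented behavior.

```python
from typing import Dict, Optional, Type


class BadRequestError(Exception):
    """Stand-in for llmengine.errors.BadRequestError (HTTP 400)."""


class UnauthorizedError(Exception):
    """Stand-in for llmengine.errors.UnauthorizedError (HTTP 401)."""


class NotFoundError(Exception):
    """Stand-in for llmengine.errors.NotFoundError (HTTP 404)."""


class RateLimitExceededError(Exception):
    """Stand-in for llmengine.errors.RateLimitExceededError (HTTP 429)."""


class ServerError(Exception):
    """Stand-in for llmengine.errors.ServerError (HTTP 5xx)."""

    def __init__(self, status_code: int, message: str):
        super().__init__(message)
        self.status_code = status_code


# Status codes that this page ties to a specific exception class.
_CLIENT_ERRORS: Dict[int, Type[Exception]] = {
    400: BadRequestError,
    401: UnauthorizedError,
    404: NotFoundError,
    429: RateLimitExceededError,
}


def error_for_status(status_code: int, message: str) -> Optional[Exception]:
    """Return the exception matching an HTTP status, or None for 2xx success."""
    if 200 <= status_code < 300:
        return None
    if status_code in _CLIENT_ERRORS:
        return _CLIENT_ERRORS[status_code](message)
    if 500 <= status_code < 600:
        return ServerError(status_code, message)
    # Assumption: codes not listed on this page fall back to a generic exception.
    return Exception(f"HTTP {status_code}: {message}")
```

Of these, only `RateLimitExceededError` and, arguably, `ServerError` are worth retrying; a `BadRequestError` will fail again until the inputs change.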

"},{"location":"api/langchain/","title":"\ud83e\udd9c Langchain","text":"

Coming soon!

"},{"location":"api/python_client/","title":"\ud83d\udc0d Python Client API Reference","text":""},{"location":"api/python_client/#llmengine.Completion","title":"Completion","text":"

Bases: APIEngine

Completion API. This API is used to generate text completions.

Language models are trained to understand natural language and predict text outputs as a response to their inputs. The inputs are called prompts and the outputs are referred to as completions. LLMs take the input prompts and chunk them into smaller units called tokens to process and generate language. Tokens may include trailing spaces and even sub-words; this process is language dependent.

The Completion API can be run either synchronously or asynchronously (via Python asyncio). For each of these modes, you can also choose whether to stream token responses or not.
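
For readers new to asyncio, the asynchronous streaming mode comes down to consuming an async iterator with `async for`. The sketch below shows only that pattern: `fake_token_stream` is a hypothetical stand-in that yields plain strings, so no network call or llmengine install is involved.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streamed response; yields one token at a time."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop between tokens
        yield token


async def collect(stream: AsyncIterator[str]) -> str:
    """Consume an async stream and join the pieces into one string."""
    parts = []
    async for token in stream:
        parts.append(token)
    return "".join(parts)


result = asyncio.run(collect(fake_token_stream(["Hello", ",", " world"])))
print(result)
```

With the real client, the async iterator would yield response objects rather than strings, and the text would be read from each chunk's output.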

"},{"location":"api/python_client/#llmengine.Completion.create","title":"create classmethod","text":"
create(\n    model: str,\n    prompt: str,\n    max_new_tokens: int = 20,\n    temperature: float = 0.2,\n    stop_sequences: Optional[List[str]] = None,\n    return_token_log_probs: Optional[bool] = False,\n    presence_penalty: Optional[float] = None,\n    frequency_penalty: Optional[float] = None,\n    top_k: Optional[int] = None,\n    top_p: Optional[float] = None,\n    timeout: int = COMPLETION_TIMEOUT,\n    stream: bool = False,\n) -> Union[\n    CompletionSyncResponse,\n    Iterator[CompletionStreamResponse],\n]\n

Creates a completion for the provided prompt and parameters synchronously.

This API can be used to get the LLM to generate a completion synchronously. It takes as parameters the model (see Model Zoo) and the prompt. Optionally it takes max_new_tokens, temperature, timeout and stream. It returns a CompletionSyncResponse if stream=False, or an iterator of CompletionStreamResponse objects with request_id and output fields if stream=True.

Parameters:

Name Type Description Default model str

Name of the model to use. See Model Zoo for a list of Models that are supported.

required prompt str

The prompt to generate completions for, encoded as a string.

required max_new_tokens int

The maximum number of tokens to generate in the completion.

The token count of your prompt plus max_new_tokens cannot exceed the model's context length. See Model Zoo for information on each supported model's context length.

20 temperature float

What sampling temperature to use, in the range [0, 1]. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. When temperature is 0, greedy search is used.

0.2 stop_sequences Optional[List[str]]

One or more sequences where the API will stop generating tokens for the current completion.

None return_token_log_probs Optional[bool]

Whether to return the log probabilities of generated tokens. When True, the response will include a list of tokens and their log probabilities.

False presence_penalty Optional[float]

Only supported in vllm, lightllm. Penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. https://platform.openai.com/docs/guides/gpt/parameter-details Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None frequency_penalty Optional[float]

Only supported in vllm, lightllm. Penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. https://platform.openai.com/docs/guides/gpt/parameter-details Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None top_k Optional[int]

Integer that controls the number of top tokens to consider. Range: [1, infinity). -1 means consider all tokens.

None top_p Optional[float]

Float that controls the cumulative probability of the top tokens to consider. Range: (0.0, 1.0]. 1.0 means consider all tokens.

None timeout int

Timeout in seconds. This is the maximum amount of time you are willing to wait for a response.

COMPLETION_TIMEOUT stream bool

Whether to stream the response. If true, the return type is an Iterator[CompletionStreamResponse]. Otherwise, the return type is a CompletionSyncResponse. When streaming, tokens will be sent as data-only server-sent events.

False

Returns:

Name Type Description response Union[CompletionSyncResponse, Iterator[CompletionStreamResponse]]

The generated response (if stream=False) or iterator of response chunks (if stream=True)
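
The ranges listed in the parameter table can be checked on the client side before a request is sent. The following is a minimal sketch: `validate_completion_params` is a hypothetical helper, not part of llmengine, and it assumes the caller already knows the prompt's token count and the model's context length.

```python
from typing import Optional


def validate_completion_params(
    prompt_tokens: int,
    max_new_tokens: int,
    context_length: int,
    temperature: float = 0.2,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
) -> None:
    """Raise ValueError if any argument falls outside its documented range."""
    # Prompt tokens plus generated tokens must fit in the model's context window.
    if prompt_tokens + max_new_tokens > context_length:
        raise ValueError("prompt tokens + max_new_tokens exceeds the context length")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be in [0, 1]")
    # top_k: any integer >= 1, or -1 to consider all tokens.
    if top_k is not None and top_k != -1 and top_k < 1:
        raise ValueError("top_k must be >= 1, or -1 for all tokens")
    # top_p: half-open range (0, 1].
    if top_p is not None and not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    penalties = (
        ("presence_penalty", presence_penalty),
        ("frequency_penalty", frequency_penalty),
    )
    for name, value in penalties:
        if value is not None and not 0.0 <= value <= 2.0:
            raise ValueError(f"{name} must be in [0, 2]")
```

Checking locally avoids a round trip that would otherwise end in an HTTP 400 for the same reason.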

Synchronous completion without token streaming in Python / Response in JSON
from llmengine import Completion\n\nresponse = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"Hello, my name is\",\n    max_new_tokens=10,\n    temperature=0.2,\n)\nprint(response.json())\n
{\n    \"request_id\": \"8bbd0e83-f94c-465b-a12b-aabad45750a9\",\n    \"output\": {\n        \"text\": \"_______ and I am a _______\",\n        \"num_completion_tokens\": 10\n    }\n}\n

Token streaming can be used to reduce perceived latency for applications. Here is how applications can use streaming:

Synchronous completion with token streaming in PythonResponse in JSON
from llmengine import Completion\n\nstream = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"why is the sky blue?\",\n    max_new_tokens=5,\n    temperature=0.2,\n    stream=True,\n)\n\nfor response in stream:\n    if response.output:\n        print(response.json())\n
{\"request_id\": \"ebbde00c-8c31-4c03-8306-24f37cd25fa2\", \"output\": {\"text\": \"\\n\", \"finished\": false, \"num_completion_tokens\": 1 } }\n{\"request_id\": \"ebbde00c-8c31-4c03-8306-24f37cd25fa2\", \"output\": {\"text\": \"I\", \"finished\": false, \"num_completion_tokens\": 2 } }\n{\"request_id\": \"ebbde00c-8c31-4c03-8306-24f37cd25fa2\", \"output\": {\"text\": \" don\", \"finished\": false, \"num_completion_tokens\": 3 } }\n{\"request_id\": \"ebbde00c-8c31-4c03-8306-24f37cd25fa2\", \"output\": {\"text\": \"\u2019\", \"finished\": false, \"num_completion_tokens\": 4 } }\n{\"request_id\": \"ebbde00c-8c31-4c03-8306-24f37cd25fa2\", \"output\": {\"text\": \"t\", \"finished\": true, \"num_completion_tokens\": 5 } }\n
"},{"location":"api/python_client/#llmengine.Completion.acreate","title":"acreate async classmethod","text":"
acreate(\n    model: str,\n    prompt: str,\n    max_new_tokens: int = 20,\n    temperature: float = 0.2,\n    stop_sequences: Optional[List[str]] = None,\n    return_token_log_probs: Optional[bool] = False,\n    presence_penalty: Optional[float] = None,\n    frequency_penalty: Optional[float] = None,\n    top_k: Optional[int] = None,\n    top_p: Optional[float] = None,\n    timeout: int = COMPLETION_TIMEOUT,\n    stream: bool = False,\n) -> Union[\n    CompletionSyncResponse,\n    AsyncIterable[CompletionStreamResponse],\n]\n

Creates a completion for the provided prompt and parameters asynchronously (with asyncio).

This API can be used to get the LLM to generate a completion asynchronously. It takes as parameters the model (see Model Zoo) and the prompt. Optionally, it takes max_new_tokens, temperature, timeout and stream. It returns a CompletionSyncResponse if stream=False, or an async iterator of CompletionStreamResponse objects if stream=True. Both response types carry request_id and output fields.

Parameters:

Name Type Description Default model str

Name of the model to use. See Model Zoo for a list of Models that are supported.

required prompt str

The prompt to generate completions for, encoded as a string.

required max_new_tokens int

The maximum number of tokens to generate in the completion.

The token count of your prompt plus max_new_tokens cannot exceed the model's context length. See Model Zoo for information on each supported model's context length.

20 temperature float

What sampling temperature to use, in the range [0, 1]. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. When temperature is 0 greedy search is used.

0.2 stop_sequences Optional[List[str]]

One or more sequences where the API will stop generating tokens for the current completion.

None return_token_log_probs Optional[bool]

Whether to return the log probabilities of generated tokens. When True, the response will include a list of tokens and their log probabilities.

False presence_penalty Optional[float]

Only supported in vllm and lightllm. Penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. See https://platform.openai.com/docs/guides/gpt/parameter-details. Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None frequency_penalty Optional[float]

Only supported in vllm and lightllm. Penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. See https://platform.openai.com/docs/guides/gpt/parameter-details. Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None top_k Optional[int]

Integer that controls the number of top tokens to consider. Range: [1, infinity). -1 means consider all tokens.

None top_p Optional[float]

Float that controls the cumulative probability of the top tokens to consider. Range: (0.0, 1.0]. 1.0 means consider all tokens.

None timeout int

Timeout in seconds. This is the maximum amount of time you are willing to wait for a response.

COMPLETION_TIMEOUT stream bool

Whether to stream the response. If true, the return type is an AsyncIterable[CompletionStreamResponse]. Otherwise, the return type is a CompletionSyncResponse. When streaming, tokens will be sent as data-only server-sent events.

False

Returns:

Name Type Description response Union[CompletionSyncResponse, AsyncIterable[CompletionStreamResponse]]

The generated response (if stream=False) or iterator of response chunks (if stream=True)

Asynchronous completion without token streaming in PythonResponse in JSON
import asyncio\nfrom llmengine import Completion\n\nasync def main():\n    response = await Completion.acreate(\n        model=\"llama-2-7b\",\n        prompt=\"Hello, my name is\",\n        max_new_tokens=10,\n        temperature=0.2,\n    )\n    print(response.json())\n\nasyncio.run(main())\n
{\n    \"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\",\n    \"output\": {\n        \"text\": \"_______ and I am a _______\",\n        \"num_completion_tokens\": 10\n}\n}\n

Token streaming can be used to reduce perceived latency for applications. Here is how applications can use streaming:

Asynchronous completion with token streaming in PythonResponse in JSON
import asyncio\nfrom llmengine import Completion\n\nasync def main():\n    stream = await Completion.acreate(\n        model=\"llama-2-7b\",\n        prompt=\"why is the sky blue?\",\n        max_new_tokens=5,\n        temperature=0.2,\n        stream=True,\n    )\n\n    async for response in stream:\n        if response.output:\n            print(response.json())\n\nasyncio.run(main())\n
{\"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\", \"output\": {\"text\": \"\\n\", \"finished\": false, \"num_completion_tokens\": 1}}\n{\"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\", \"output\": {\"text\": \"I\", \"finished\": false, \"num_completion_tokens\": 2}}\n{\"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\", \"output\": {\"text\": \" think\", \"finished\": false, \"num_completion_tokens\": 3}}\n{\"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\", \"output\": {\"text\": \" the\", \"finished\": false, \"num_completion_tokens\": 4}}\n{\"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\", \"output\": {\"text\": \" sky\", \"finished\": true, \"num_completion_tokens\": 5}}\n
"},{"location":"api/python_client/#llmengine.FineTune","title":"FineTune","text":"

Bases: APIEngine

FineTune API. This API is used to fine-tune models.

Fine-tuning is a process where the LLM is further trained on a task-specific dataset, allowing the model to adjust its parameters to better align with the task at hand. Fine-tuning is a supervised training phase, where prompt/response pairs are provided to optimize the performance of the LLM. LLM Engine currently uses LoRA for fine-tuning. Support for additional fine-tuning methods is upcoming.

LLM Engine provides APIs to create fine-tunes on a base model with training & validation datasets. APIs are also provided to list, cancel and retrieve fine-tuning jobs.

Creating a fine-tune will end with the creation of a Model, which you can view using Model.get(model_name) or delete using Model.delete(model_name).

"},{"location":"api/python_client/#llmengine.FineTune.create","title":"create classmethod","text":"
create(\n    model: str,\n    training_file: str,\n    validation_file: Optional[str] = None,\n    hyperparameters: Optional[\n        Dict[str, Union[str, int, float]]\n    ] = None,\n    wandb_config: Optional[Dict[str, Any]] = None,\n    suffix: Optional[str] = None,\n) -> CreateFineTuneResponse\n

Creates a job that fine-tunes a specified model with a given dataset.

This API can be used to fine-tune a model. The model is the name of the base model to fine-tune (see Model Zoo for available models). The training and validation files should consist of prompt and response pairs. training_file and validation_file must be either publicly accessible HTTP or HTTPS URLs, or file IDs of files uploaded to LLM Engine's Files API (these will have the file- prefix). The referenced files must be CSV files that include two columns: prompt and response. A maximum of 100,000 rows of data is currently supported. At least 200 rows of data are recommended to start seeing benefits from fine-tuning. Sequences longer than the native max_seq_length of the model will be truncated.
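The file requirements above can be checked locally before uploading. The following is a minimal sketch, not part of the llmengine client: the helper name check_fine_tune_csv and its return shape are hypothetical, and it only covers the column and row-count rules stated above.

```python
import csv
import io

MAX_ROWS = 100_000          # documented upper limit on rows
RECOMMENDED_MIN_ROWS = 200  # documented recommendation, not a hard limit


def check_fine_tune_csv(csv_text):
    # Hypothetical helper: returns (row_count, warnings) for CSV text and
    # raises ValueError when a documented hard requirement is not met.
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    if 'prompt' not in fields or 'response' not in fields:
        raise ValueError('CSV must include prompt and response columns')
    row_count = sum(1 for _ in reader)
    if row_count > MAX_ROWS:
        raise ValueError(f'{row_count} rows exceeds the limit of {MAX_ROWS}')
    warnings = []
    if row_count < RECOMMENDED_MIN_ROWS:
        warnings.append(
            f'only {row_count} rows; at least {RECOMMENDED_MIN_ROWS} are recommended'
        )
    return row_count, warnings
```

To check a local file, read it with open(path, newline='') and pass the text to the helper.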

A fine-tuning job can take roughly 30 minutes for a small dataset (~200 rows) and several hours for larger ones.

Parameters:

Name Type Description Default model `str`

The name of the base model to fine-tune. See Model Zoo for the list of available models to fine-tune.

required training_file `str`

Publicly accessible URL or file ID referencing a CSV file for training. When no validation_file is provided, one will automatically be created using a 10% split of the training_file data.

required validation_file `Optional[str]`

Publicly accessible URL or file ID referencing a CSV file for validation. The validation file is used to compute metrics which let LLM Engine pick the best fine-tuned checkpoint, which will be used for inference when fine-tuning is complete.

None hyperparameters `Optional[Dict[str, Union[str, int, float, Dict[str, Any]]]]`

A dict of hyperparameters to customize fine-tuning behavior.

Currently supported hyperparameters:

None wandb_config `Optional[Dict[str, Any]]`

A dict of configuration parameters for Weights & Biases. See Weights & Biases for more information. Set hyperparameter[\"report_to\"] to wandb to enable automatic finetune metrics logging. Must include api_key field which is the wandb API key. Also supports setting base_url to use a custom Weights & Biases server.

None suffix `Optional[str]`

A string that will be added to your fine-tuned model name. If present, the entire fine-tuned model name will be formatted like \"[model].[suffix].[YYMMDD-HHMMSS]\". If absent, the fine-tuned model name will be formatted \"[model].[YYMMDD-HHMMSS]\". For example, if suffix is \"my-experiment\", the fine-tuned model name could be \"llama-2-7b.my-experiment.230717-230150\". Note: suffix must be between 1 and 28 characters long, and can only contain alphanumeric characters and hyphens.

None

Returns:

Name Type Description CreateFineTuneResponse CreateFineTuneResponse

an object that contains the ID of the created fine-tuning job
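The suffix rules and name format described for the suffix parameter can be illustrated with a short sketch. The server generates the real name; this hypothetical helper only mirrors the documented format, and it assumes that alphanumeric means ASCII letters and digits.

```python
import re
from datetime import datetime

# Assumption: 1 to 28 ASCII letters, digits, or hyphens.
SUFFIX_PATTERN = re.compile('[A-Za-z0-9-]{1,28}')


def fine_tuned_model_name(model, suffix=None, now=None):
    # Hypothetical helper mirroring the documented naming format:
    # [model].[suffix].[YYMMDD-HHMMSS], or [model].[YYMMDD-HHMMSS] without a suffix.
    stamp = (now or datetime.now()).strftime('%y%m%d-%H%M%S')
    if suffix is None:
        return f'{model}.{stamp}'
    if not SUFFIX_PATTERN.fullmatch(suffix):
        raise ValueError('suffix must be 1-28 alphanumeric characters or hyphens')
    return f'{model}.{suffix}.{stamp}'
```

Validating the suffix client-side catches malformed values before a job is submitted.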

Here is an example script to create a 5-row CSV of properly formatted data for fine-tuning an airline question answering bot:

Formatting data in Python
import csv\n# Define data\ndata = [\n  (\"What is your policy on carry-on luggage?\", \"Our policy allows each passenger to bring one piece of carry-on luggage and one personal item such as a purse or briefcase. The maximum size for carry-on luggage is 22 x 14 x 9 inches.\"),\n  (\"How can I change my flight?\", \"You can change your flight through our website or mobile app. Go to 'Manage my booking' section, enter your booking reference and last name, then follow the prompts to change your flight.\"),\n  (\"What meals are available on my flight?\", \"We offer a variety of meals depending on the flight's duration and route. These can range from snacks and light refreshments to full-course meals on long-haul flights. Specific meal options can be viewed during the booking process.\"),\n  (\"How early should I arrive at the airport before my flight?\", \"We recommend arriving at least two hours before domestic flights and three hours before international flights.\"),\n  (\"Can I select my seat in advance?\", \"Yes, you can select your seat during the booking process or afterwards via the 'Manage my booking' section on our website or mobile app.\"),\n  ]\n\n# Write data to a CSV file\nwith open('customer_service_data.csv', 'w', newline='') as file:\n    writer = csv.writer(file)\n    writer.writerow([\"prompt\", \"response\"])\n    writer.writerows(data)\n

Currently, data needs to be uploaded to either a publicly accessible web URL or to LLM Engine's private file server so that it can be read for fine-tuning. Publicly accessible HTTP and HTTPS URLs are currently supported.

To privately share data with the LLM Engine API, use LLM Engine's File.upload API. You can upload data from a local file to LLM Engine's private file server and then use the returned file ID to reference your data in the FineTune API. The file ID is generally in the form of file-<random_string>, e.g. \"file-7DLVeLdN2Ty4M2m\".

Example code for fine-tuning:

Fine-tuning in PythonResponse in JSON
from llmengine import FineTune\n\nresponse = FineTune.create(\n    model=\"llama-2-7b\",\n    training_file=\"file-7DLVeLdN2Ty4M2m\",\n)\n\nprint(response.json())\n
{\n    \"fine_tune_id\": \"ft-cir3eevt71r003ks6il0\"\n}\n
"},{"location":"api/python_client/#llmengine.FineTune.get","title":"get classmethod","text":"
get(fine_tune_id: str) -> GetFineTuneResponse\n

Get status of a fine-tuning job.

This API can be used to get the status of an already running fine-tuning job. It takes as a single parameter the fine_tune_id and returns a GetFineTuneResponse object with the id and status (PENDING, STARTED, UNDEFINED, FAILURE or SUCCESS).

Parameters:

Name Type Description Default fine_tune_id `str`

ID of the fine-tuning job

required

Returns:

Name Type Description GetFineTuneResponse GetFineTuneResponse

an object that contains the ID and status of the requested job

Getting status of fine-tuning in PythonResponse in JSON
from llmengine import FineTune\n\nresponse = FineTune.get(\n    fine_tune_id=\"ft-cir3eevt71r003ks6il0\",\n)\n\nprint(response.json())\n
{\n    \"fine_tune_id\": \"ft-cir3eevt71r003ks6il0\",\n    \"status\": \"STARTED\"\n}\n
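Because fine-tuning jobs can run for a long time, callers often poll the status until the job finishes. The sketch below is one way to do that. The helper wait_for_fine_tune is hypothetical, treating SUCCESS and FAILURE as the only terminal statuses is an assumption, and the status lookup is passed in as a callable so the loop does not depend on the exact shape of the response object.

```python
import time

# Assumption: of the documented statuses, only these two are terminal.
TERMINAL_STATUSES = {'SUCCESS', 'FAILURE'}


def wait_for_fine_tune(get_status, fine_tune_id, poll_seconds=30.0,
                       max_polls=120, sleep=time.sleep):
    # get_status: any callable that maps a fine_tune_id to a status string.
    # Polls up to max_polls times, sleeping poll_seconds between polls.
    for attempt in range(max_polls):
        status = get_status(fine_tune_id)
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f'{fine_tune_id} did not finish after {max_polls} polls')


# Possible usage with the client, assuming the response exposes a status attribute:
# from llmengine import FineTune
# wait_for_fine_tune(lambda ft_id: FineTune.get(fine_tune_id=ft_id).status,
#                    'ft-cir3eevt71r003ks6il0')
```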
"},{"location":"api/python_client/#llmengine.FineTune.get_events","title":"get_events classmethod","text":"
get_events(fine_tune_id: str) -> GetFineTuneEventsResponse\n

Get events of a fine-tuning job.

This API can be used to get the list of detailed events for a fine-tuning job. It takes the fine_tune_id as a parameter and returns a response object containing a list of the events that have occurred for the fine-tuning job. Two kinds of events are logged periodically: an evaluation of the training loss, and an evaluation of the eval loss. This API will return all events for the fine-tuning job.

Parameters:

Name Type Description Default fine_tune_id `str`

ID of the fine-tuning job

required

Returns:

Name Type Description GetFineTuneEventsResponse GetFineTuneEventsResponse

an object that contains the list of events for the fine-tuning job

Getting events for fine-tuning jobs in PythonResponse in JSON
from llmengine import FineTune\n\nresponse = FineTune.get_events(fine_tune_id=\"ft-cir3eevt71r003ks6il0\")\nprint(response.json())\n
{\n    \"events\":\n    [\n        {\n            \"timestamp\": 1689665099.6704428,\n            \"message\": \"{'loss': 2.108, 'learning_rate': 0.002, 'epoch': 0.7}\",\n            \"level\": \"info\"\n},\n        {\n            \"timestamp\": 1689665100.1966307,\n            \"message\": \"{'eval_loss': 1.67730712890625, 'eval_runtime': 0.2023, 'eval_samples_per_second': 24.717, 'eval_steps_per_second': 4.943, 'epoch': 0.7}\",\n            \"level\": \"info\"\n},\n        {\n            \"timestamp\": 1689665105.6544185,\n            \"message\": \"{'loss': 1.8961, 'learning_rate': 0.0017071067811865474, 'epoch': 1.39}\",\n            \"level\": \"info\"\n},\n        {\n            \"timestamp\": 1689665106.159139,\n            \"message\": \"{'eval_loss': 1.513688564300537, 'eval_runtime': 0.2025, 'eval_samples_per_second': 24.696, 'eval_steps_per_second': 4.939, 'epoch': 1.39}\",\n            \"level\": \"info\"\n}\n    ]\n}\n
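In the sample response above, each message is a Python-style dict literal encoded as a string. Assuming that format holds (it is inferred only from this sample), the training and eval losses can be separated with a small helper. The function name parse_event_metrics is hypothetical.

```python
import ast


def parse_event_metrics(events):
    # events: list of dicts with timestamp, message and level keys,
    # shaped like the entries of the events list in the sample response.
    # Returns two lists of (epoch, loss) tuples: training and eval.
    train, evals = [], []
    for event in events:
        try:
            metrics = ast.literal_eval(event['message'])
        except (ValueError, SyntaxError):
            continue  # message is not a dict literal; skip it
        if not isinstance(metrics, dict):
            continue
        if 'eval_loss' in metrics:
            evals.append((metrics.get('epoch'), metrics['eval_loss']))
        elif 'loss' in metrics:
            train.append((metrics.get('epoch'), metrics['loss']))
    return train, evals
```

ast.literal_eval is used rather than json.loads because the sample messages use single-quoted keys, which are not valid JSON.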
"},{"location":"api/python_client/#llmengine.FineTune.list","title":"list classmethod","text":"
list() -> ListFineTunesResponse\n

List fine-tuning jobs.

This API can be used to list all the fine-tuning jobs. It returns a list of pairs of fine_tune_id and status for all existing jobs.

Returns:

Name Type Description ListFineTunesResponse ListFineTunesResponse

an object that contains a list of all fine-tuning jobs and their statuses

Listing fine-tuning jobs in PythonResponse in JSON
from llmengine import FineTune\n\nresponse = FineTune.list()\nprint(response.json())\n
{\n    \"jobs\": [\n        {\n            \"fine_tune_id\": \"ft-cir3eevt71r003ks6il0\",\n            \"status\": \"STARTED\"\n},\n        {\n            \"fine_tune_id\": \"ft_def456\",\n            \"status\": \"SUCCESS\"\n}\n    ]\n}\n
"},{"location":"api/python_client/#llmengine.FineTune.cancel","title":"cancel classmethod","text":"
cancel(fine_tune_id: str) -> CancelFineTuneResponse\n

Cancel a fine-tuning job.

This API can be used to cancel an existing fine-tuning job if it's no longer required. It takes the fine_tune_id as a parameter and returns a response object which has a success field confirming if the cancellation was successful.

Parameters:

Name Type Description Default fine_tune_id `str`

ID of the fine-tuning job

required

Returns:

Name Type Description CancelFineTuneResponse CancelFineTuneResponse

an object that contains whether the cancellation was successful

Cancelling fine-tuning job in PythonResponse in JSON
from llmengine import FineTune\n\nresponse = FineTune.cancel(fine_tune_id=\"ft-cir3eevt71r003ks6il0\")\nprint(response.json())\n
{\n    \"success\": true\n}\n
"},{"location":"api/python_client/#llmengine.Model","title":"Model","text":"

Bases: APIEngine

Model API. This API is used to get, list, and delete models. Models include both base models built into LLM Engine, and fine-tuned models that you create through the FineTune.create() API.

See Model Zoo for the list of publicly available base models.

"},{"location":"api/python_client/#llmengine.Model.create","title":"create classmethod","text":"
create(\n    name: str,\n    model: str,\n    inference_framework_image_tag: str,\n    source: LLMSource = LLMSource.HUGGING_FACE,\n    inference_framework: LLMInferenceFramework = LLMInferenceFramework.VLLM,\n    num_shards: int = 1,\n    quantize: Optional[Quantization] = None,\n    checkpoint_path: Optional[str] = None,\n    cpus: int = 8,\n    memory: str = \"24Gi\",\n    storage: str = \"40Gi\",\n    gpus: int = 1,\n    min_workers: int = 0,\n    max_workers: int = 1,\n    per_worker: int = 2,\n    endpoint_type: ModelEndpointType = ModelEndpointType.STREAMING,\n    gpu_type: Optional[str] = \"nvidia-ampere-a10\",\n    high_priority: Optional[bool] = False,\n    post_inference_hooks: Optional[\n        List[PostInferenceHooks]\n    ] = None,\n    default_callback_url: Optional[str] = None,\n    public_inference: Optional[bool] = True,\n    labels: Optional[Dict[str, str]] = None,\n) -> CreateLLMEndpointResponse\n

Create an LLM model. Note: This API is only available for self-hosted users.

Parameters:

Name Type Description Default name `str`

Name of the endpoint

required model `str`

Name of the base model

required inference_framework_image_tag `str`

Image tag for the inference framework

required source `LLMSource`

Source of the LLM. Currently only HuggingFace is supported

HUGGING_FACE inference_framework `LLMInferenceFramework`

Inference framework for the LLM. Current supported frameworks are LLMInferenceFramework.DEEPSPEED, LLMInferenceFramework.TEXT_GENERATION_INFERENCE, LLMInferenceFramework.VLLM and LLMInferenceFramework.LIGHTLLM

VLLM num_shards `int`

Number of shards for the LLM. When greater than 1, the LLM will be sharded across multiple GPUs. The number of GPUs must be equal to or larger than num_shards.

1 quantize `Optional[Quantization]`

Quantization method for the LLM. text_generation_inference supports bitsandbytes and vllm supports awq.

None checkpoint_path `Optional[str]`

Remote path to the checkpoint for the LLM. LLM engine must have permission to access the given path. Can be either a folder or a tar file. Folder is preferred since we don't need to untar and model loads faster. For model weights, safetensors are preferred but PyTorch checkpoints are also accepted (model loading will be longer).

None cpus `int`

Number of cpus each worker should get, e.g. 1, 2, etc. This must be greater than or equal to 1. The recommendation is to set it to 8 * GPU count.

8 memory `str`

Amount of memory each worker should get, e.g. \"4Gi\", \"512Mi\", etc. This must be a positive amount of memory. The recommendation is to set it to 24Gi * GPU count.

'24Gi' storage `str`

Amount of local ephemeral storage each worker should get, e.g. \"4Gi\", \"512Mi\", etc. This must be a positive amount of storage. The recommendation is 40Gi for 7B models, 80Gi for 13B models and 200Gi for 70B models.

'40Gi' gpus `int`

Number of gpus each worker should get, e.g. 0, 1, etc.

1 min_workers `int`

The minimum number of workers. Must be greater than or equal to 0. This should be determined by computing the minimum throughput of your workload and dividing it by the throughput of a single worker. When this number is 0, max_workers must be 1, and the endpoint will autoscale between 0 and 1 pods. When this number is greater than 0, max_workers can be any number greater than or equal to min_workers.

0 max_workers `int`

The maximum number of workers. Must be greater than or equal to 0, and also greater than or equal to min_workers. This should be determined by computing the maximum throughput of your workload and dividing it by the throughput of a single worker.

1 per_worker `int`

The maximum number of concurrent requests that an individual worker can service. LLM engine automatically scales the number of workers for the endpoint so that each worker is processing per_worker requests, subject to the limits defined by min_workers and max_workers. If the average number of concurrent requests per worker is lower than per_worker, the number of workers will be reduced; if it is higher than per_worker, the number of workers will be increased to meet the elevated traffic. Here is our recommendation for computing per_worker: 1. Compute min_workers and max_workers per your minimum and maximum throughput requirements. 2. Determine a value for the maximum number of concurrent requests in the workload and divide this number by max_workers. Doing this ensures that the number of workers will \"climb\" to max_workers.

2 endpoint_type `ModelEndpointType`

Currently only \"streaming\" endpoints are supported.

STREAMING gpu_type `Optional[str]`

If specifying a non-zero number of gpus, this controls the type of gpu requested. Here are the supported values:

'nvidia-ampere-a10' high_priority `Optional[bool]`

Either True or False. Enabling this will allow the created endpoint to leverage the shared pool of prewarmed nodes for faster spinup time

False post_inference_hooks `Optional[List[PostInferenceHooks]]`

List of hooks to trigger after inference tasks are served

None default_callback_url `Optional[str]`

The default callback url to use for sync completion requests. This can be overridden in the task parameters for each individual task. post_inference_hooks must contain \"callback\" for the callback to be triggered

None public_inference `Optional[bool]`

If True, this endpoint will be available to all user IDs for inference

True labels `Optional[Dict[str, str]]`

An optional dictionary of key/value pairs to associate with this endpoint

None

Returns: CreateLLMEndpointResponse: creation task ID of the created Model. Currently not used.
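The sizing guidance given for min_workers, max_workers and per_worker can be worked through as arithmetic. The sketch below is illustrative only: the function name and its inputs are hypothetical, throughput is assumed to be measured in requests per second, and rounding up to whole workers is an assumption. It also raises min_workers to 1 whenever more than one worker is needed, since the documentation states that max_workers must be 1 when min_workers is 0.

```python
import math


def suggest_worker_settings(min_rps, max_rps, worker_rps, max_concurrent_requests):
    # min_rps / max_rps: minimum and maximum workload throughput.
    # worker_rps: throughput of a single worker, in the same unit.
    if worker_rps <= 0:
        raise ValueError('worker_rps must be positive')
    min_workers = math.ceil(min_rps / worker_rps)
    max_workers = max(1, min_workers, math.ceil(max_rps / worker_rps))
    if min_workers == 0 and max_workers > 1:
        # Documented constraint: min_workers == 0 requires max_workers == 1.
        min_workers = 1
    # Divide peak concurrency by max_workers so scaling can reach max_workers.
    per_worker = max(1, math.ceil(max_concurrent_requests / max_workers))
    return {
        'min_workers': min_workers,
        'max_workers': max_workers,
        'per_worker': per_worker,
    }
```

For example, a workload of 2 to 10 requests per second, a worker that handles 2.5 requests per second, and a peak of 40 concurrent requests yields min_workers=1, max_workers=4 and per_worker=10.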

Create Llama 2 7B model in PythonCreate Llama 2 13B model in PythonCreate Llama 2 70B model with 8bit quantization in Python
from llmengine import Model\n\nresponse = Model.create(\n    name=\"llama-2-7b-test\",\n    model=\"llama-2-7b\",\n    inference_framework_image_tag=\"0.2.1.post1\",\n    inference_framework=LLMInferenceFramework.VLLM,\n    num_shards=1,\n    checkpoint_path=\"s3://path/to/checkpoint\",\n    cpus=8,\n    memory=\"24Gi\",\n    storage=\"40Gi\",\n    gpus=1,\n    min_workers=0,\n    max_workers=1,\n    per_worker=10,\n    endpoint_type=ModelEndpointType.STREAMING,\n    gpu_type=\"nvidia-ampere-a10\",\n    public_inference=False,\n)\n\nprint(response.json())\n
from llmengine import Model\n\nresponse = Model.create(\n    name=\"llama-2-13b-test\",\n    model=\"llama-2-13b\",\n    inference_framework_image_tag=\"0.2.1.post1\",\n    inference_framework=LLMInferenceFramework.VLLM,\n    num_shards=2,\n    checkpoint_path=\"s3://path/to/checkpoint\",\n    cpus=16,\n    memory=\"48Gi\",\n    storage=\"80Gi\",\n    gpus=2,\n    min_workers=0,\n    max_workers=1,\n    per_worker=10,\n    endpoint_type=ModelEndpointType.STREAMING,\n    gpu_type=\"nvidia-ampere-a10\",\n    public_inference=False,\n)\n\nprint(response.json())\n
from llmengine import Model\n\nresponse = Model.create(\n    name=\"llama-2-70b-test\",\n    model=\"llama-2-70b\",\n    inference_framework_image_tag=\"0.9.4\",\n    inference_framework=LLMInferenceFramework.TEXT_GENERATION_INFERENCE,\n    num_shards=4,\n    quantize=\"bitsandbytes\",\n    checkpoint_path=\"s3://path/to/checkpoint\",\n    cpus=40,\n    memory=\"96Gi\",\n    storage=\"200Gi\",\n    gpus=4,\n    min_workers=0,\n    max_workers=1,\n    per_worker=10,\n    endpoint_type=ModelEndpointType.STREAMING,\n    gpu_type=\"nvidia-ampere-a10\",\n    public_inference=False,\n)\n\nprint(response.json())\n
"},{"location":"api/python_client/#llmengine.Model.get","title":"get classmethod","text":"
get(model: str) -> GetLLMEndpointResponse\n

Get information about an LLM model.

This API can be used to get information about a Model's source and inference framework. For self-hosted users, it returns additional information about the number of shards, quantization, infra settings, etc. The function takes the model name as its single parameter and returns a GetLLMEndpointResponse object.

Parameters:

Name Type Description Default model `str`

Name of the model

required

Returns:

Name Type Description GetLLMEndpointResponse GetLLMEndpointResponse

object representing the LLM and configurations

Accessing model in PythonResponse in JSON
from llmengine import Model\n\nresponse = Model.get(\"llama-2-7b.suffix.2023-07-18-12-00-00\")\n\nprint(response.json())\n
{\n    \"id\": null,\n    \"name\": \"llama-2-7b.suffix.2023-07-18-12-00-00\",\n    \"model_name\": null,\n    \"source\": \"hugging_face\",\n    \"status\": \"READY\",\n    \"inference_framework\": \"text_generation_inference\",\n    \"inference_framework_tag\": null,\n    \"num_shards\": null,\n    \"quantize\": null,\n    \"spec\": null\n}\n
"},{"location":"api/python_client/#llmengine.Model.list","title":"list classmethod","text":"
list() -> ListLLMEndpointsResponse\n

List LLM models available to call inference on.

This API can be used to list all available models, including both publicly available models and user-created fine-tuned models. It returns a list of GetLLMEndpointResponse objects for all models. The most important field is the model name.

Returns:

Name Type Description ListLLMEndpointsResponse ListLLMEndpointsResponse

list of models

Listing available models in PythonResponse in JSON
from llmengine import Model\n\nresponse = Model.list()\nprint(response.json())\n
{\n    \"model_endpoints\": [\n        {\n            \"id\": null,\n            \"name\": \"llama-2-7b.suffix.2023-07-18-12-00-00\",\n            \"model_name\": null,\n            \"source\": \"hugging_face\",\n            \"inference_framework\": \"text_generation_inference\",\n            \"inference_framework_tag\": null,\n            \"num_shards\": null,\n            \"quantize\": null,\n            \"spec\": null\n},\n        {\n            \"id\": null,\n            \"name\": \"llama-2-7b\",\n            \"model_name\": null,\n            \"source\": \"hugging_face\",\n            \"inference_framework\": \"text_generation_inference\",\n            \"inference_framework_tag\": null,\n            \"num_shards\": null,\n            \"quantize\": null,\n            \"spec\": null\n},\n        {\n            \"id\": null,\n            \"name\": \"llama-13b-deepspeed-sync\",\n            \"model_name\": null,\n            \"source\": \"hugging_face\",\n            \"inference_framework\": \"deepspeed\",\n            \"inference_framework_tag\": null,\n            \"num_shards\": null,\n            \"quantize\": null,\n            \"spec\": null\n},\n        {\n            \"id\": null,\n            \"name\": \"falcon-40b\",\n            \"model_name\": null,\n            \"source\": \"hugging_face\",\n            \"inference_framework\": \"text_generation_inference\",\n            \"inference_framework_tag\": null,\n            \"num_shards\": null,\n            \"quantize\": null,\n            \"spec\": null\n}\n    ]\n}\n
"},{"location":"api/python_client/#llmengine.Model.delete","title":"delete classmethod","text":"
delete(\n    model_endpoint_name: str,\n) -> DeleteLLMEndpointResponse\n

Deletes an LLM model.

This API can be used to delete a fine-tuned model. It takes as parameter the name of the model and returns a response object which has a deleted field confirming if the deletion was successful. If called on a base model included with LLM Engine, an error will be thrown.

Parameters:

Name Type Description Default model_endpoint_name `str`

Name of the model endpoint to be deleted

required

Returns:

Name Type Description response DeleteLLMEndpointResponse

whether the model endpoint was successfully deleted

Deleting model in PythonResponse in JSON
from llmengine import Model\n\nresponse = Model.delete(\"llama-2-7b.suffix.2023-07-18-12-00-00\")\nprint(response.json())\n
{\n    \"deleted\": true\n}\n
"},{"location":"api/python_client/#llmengine.Model.download","title":"download classmethod","text":"
download(\n    model_name: str, download_format: str = \"hugging_face\"\n) -> ModelDownloadResponse\n

Download a fine-tuned model.

This API can be used to download the resulting model from a fine-tuning job. It takes the model_name and download_format as parameters and returns a response object which contains a dictionary of filename, url pairs associated with the fine-tuned model. The user can then download these urls to obtain the fine-tuned model. If called on a nonexistent model, an error will be thrown.

Parameters:

Name Type Description Default model_name `str`

name of the fine-tuned model

required download_format `str`

download format requested (default=hugging_face)

'hugging_face'

Returns: ModelDownloadResponse: an object that contains a dictionary of filenames and urls from which to download the model weights. The urls are presigned urls that grant temporary access and expire after an hour.

Downloading model in PythonResponse in JSON
from llmengine import Model\n\nresponse = Model.download(\"llama-2-7b.suffix.2023-07-18-12-00-00\", download_format=\"hugging_face\")\nprint(response.json())\n
{\n    \"urls\": {\"my_model_file\": \"https://url-to-my-model-weights\"}\n}\n
"},{"location":"api/python_client/#llmengine.File","title":"File","text":"

Bases: APIEngine

File API. This API is used to upload private files to LLM engine so that fine-tunes can access them for training and validation data.

Functions are provided to upload, get, list, and delete files, as well as to get the contents of a file.

"},{"location":"api/python_client/#llmengine.File.upload","title":"upload classmethod","text":"
upload(file: BufferedReader) -> UploadFileResponse\n

Uploads a file to LLM engine.

For use in FineTune creation, this should be a CSV file with two columns: prompt and response. A maximum of 100,000 rows of data is currently supported.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file | `BufferedReader` | A local file opened with open(file_path, \"r\") | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| UploadFileResponse | UploadFileResponse | an object that contains the ID of the uploaded file |

Uploading file in Python, followed by the response in JSON:
from llmengine import File\n\nresponse = File.upload(open(\"training_dataset.csv\", \"r\"))\n\nprint(response.json())\n
{\n    \"id\": \"file-abc123\"\n}\n
"},{"location":"api/python_client/#llmengine.File.get","title":"get classmethod","text":"
get(file_id: str) -> GetFileResponse\n

Get file metadata, including filename and size.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_id | `str` | ID of the file | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| GetFileResponse | GetFileResponse | an object that contains the ID, filename, and size of the requested file |

Getting metadata about file in Python, followed by the response in JSON:
from llmengine import File\n\nresponse = File.get(\n    file_id=\"file-abc123\",\n)\n\nprint(response.json())\n
{\n    \"id\": \"file-abc123\",\n    \"filename\": \"training_dataset.csv\",\n    \"size\": 100\n}\n
"},{"location":"api/python_client/#llmengine.File.download","title":"download classmethod","text":"
download(file_id: str) -> GetFileContentResponse\n

Get contents of a file, as a string. (If the uploaded file is in binary, a string encoding will be returned.)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_id | `str` | ID of the file | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| GetFileContentResponse | GetFileContentResponse | an object that contains the ID and content of the file |

Getting file content in Python, followed by the response in JSON:
from llmengine import File\n\nresponse = File.download(file_id=\"file-abc123\")\nprint(response.json())\n
{\n    \"id\": \"file-abc123\",\n    \"content\": \"Hello world!\"\n}\n
"},{"location":"api/python_client/#llmengine.File.list","title":"list classmethod","text":"
list() -> ListFilesResponse\n

List metadata about all files, e.g. their filenames and sizes.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ListFilesResponse | ListFilesResponse | an object that contains a list of all files and their filenames and sizes |

Listing files in Python, followed by the response in JSON:
from llmengine import File\n\nresponse = File.list()\nprint(response.json())\n
{\n    \"files\": [\n        {\n            \"id\": \"file-abc123\",\n            \"filename\": \"training_dataset.csv\",\n            \"size\": 100\n},\n        {\n            \"id\": \"file-def456\",\n            \"filename\": \"validation_dataset.csv\",\n            \"size\": 50\n}\n    ]\n}\n
"},{"location":"api/python_client/#llmengine.File.delete","title":"delete classmethod","text":"
delete(file_id: str) -> DeleteFileResponse\n

Deletes a file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_id | `str` | ID of the file | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DeleteFileResponse | DeleteFileResponse | an object that contains whether the deletion was successful |

Deleting file in Python, followed by the response in JSON:
from llmengine import File\n\nresponse = File.delete(file_id=\"file-abc123\")\nprint(response.json())\n
{\n    \"deleted\": true\n}\n
"},{"location":"guides/completions/","title":"Completions","text":"

Language Models are trained to predict natural language and provide text outputs as a response to their inputs. The inputs are called prompts and outputs are referred to as completions. LLMs take the input prompts and chunk them into smaller units called tokens to process and generate language. Tokens may include trailing spaces and even sub-words. This process is language dependent.

Scale's LLM Engine provides access to open source language models (see Model Zoo) that can be used for producing completions to prompts.

"},{"location":"guides/completions/#completion-api-call","title":"Completion API call","text":"

An example API call looks as follows:

Completion call in Python
from llmengine import Completion\n\nresponse = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"Hello, my name is\",\n    max_new_tokens=10,\n    temperature=0.2,\n)\n\nprint(response.json())\n# '{\"request_id\": \"c4bf0732-08e0-48a8-8b44-dfe8d4702fb0\", \"output\": {\"text\": \"________ and I am a ________\", \"num_completion_tokens\": 10}}'\nprint(response.output.text)\n# ________ and I am a ________\n

See the full Completion API reference documentation to learn more.

"},{"location":"guides/completions/#completion-api-response","title":"Completion API response","text":"

An example Completion API response looks as follows:

Response in JSON, followed by the response in Python:
    >>> print(response.json())\n    {\n      \"request_id\": \"c4bf0732-08e0-48a8-8b44-dfe8d4702fb0\",\n      \"output\": {\n        \"text\": \"_______ and I am a _______\",\n        \"num_completion_tokens\": 10\n      }\n    }\n
    >>> print(response.output.text)\n    _______ and I am a _______\n
"},{"location":"guides/completions/#token-streaming","title":"Token streaming","text":"

The Completions API supports token streaming to reduce perceived latency for certain applications. When streaming, tokens will be sent as data-only server-sent events.

To enable token streaming, pass stream=True to either Completion.create or Completion.acreate.

Note that errors from streaming calls are returned to the user as plain-text messages and currently need to be handled by the client.

An example of token streaming using the synchronous Completions API looks as follows:

Token streaming with synchronous API in python
import sys\nfrom llmengine import Completion\n\nstream = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"Give me a 200 word summary on the current economic events in the US.\",\n    max_new_tokens=1000,\n    temperature=0.2,\n    stream=True,\n)\n\nfor response in stream:\n    if response.output:\n        print(response.output.text, end=\"\")\n        sys.stdout.flush()\n    else:  # an error occurred\n        print(response.error)  # print the error message out\n        break\n
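When the full text is needed after streaming finishes, the chunks can be accumulated instead of printed. The sketch below is illustrative only: `collect_stream` is a hypothetical helper, and the stand-in chunks built with `SimpleNamespace` merely mimic the `output.text` and `error` fields used in the example above.

```python
from types import SimpleNamespace


def collect_stream(stream):
    # Returns (text, error): the concatenated output text and the first
    # error message seen, or None if the stream finished cleanly.
    pieces = []
    for response in stream:
        if response.output:
            pieces.append(response.output.text)
        else:
            # Stop at the first error but keep the text received so far.
            return ''.join(pieces), response.error
    return ''.join(pieces), None


# Stand-in chunks that mimic the shape of streamed responses.
fake_stream = [
    SimpleNamespace(output=SimpleNamespace(text='Hello'), error=None),
    SimpleNamespace(output=SimpleNamespace(text=' world'), error=None),
]
text, error = collect_stream(fake_stream)
print(text)
```

Returning the partial text together with the error lets the caller decide whether to retry or to use what was already generated.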
"},{"location":"guides/completions/#async-requests","title":"Async requests","text":"

The Python client supports asyncio for creating Completions. Use Completion.acreate instead of Completion.create to utilize async processing. The function signatures are otherwise identical.

An example of async Completions looks as follows:

Completions with asynchronous API in python
import asyncio\nfrom llmengine import Completion\n\nasync def main():\n    response = await Completion.acreate(\n        model=\"llama-2-7b\",\n        prompt=\"Hello, my name is\",\n        max_new_tokens=10,\n        temperature=0.2,\n    )\n    print(response.json())\n\nasyncio.run(main())\n
"},{"location":"guides/completions/#which-model-should-i-use","title":"Which model should I use?","text":"

See the Model Zoo for more information on best practices for which model to use for Completions.

"},{"location":"guides/endpoint_creation/","title":"Endpoint creation","text":"

When creating a model endpoint, you can periodically poll the model status field to track the status of your model endpoint. In general, you'll need to wait after the model creation step for the model endpoint to be ready and available for use. An example is provided below:

import time\n\nfrom llmengine import Model\n\nmodel_name = \"test_deploy\"\nmodel = Model.create(name=model_name, model=\"llama-2-7b\", inference_frame_image_tag=\"0.9.4\")\nresponse = Model.get(model_name)\nwhile response.status.name != \"READY\":\n    print(response.status.name)\n    time.sleep(60)\n    response = Model.get(model_name)\n

Once the endpoint status is ready, you can use your newly created model for inference.
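The polling loop above runs forever if the endpoint never becomes ready. A possible variant with a timeout is sketched below; `wait_until_ready` and its parameters are hypothetical names, and the status is supplied through a callable so the helper does not depend on the client library.

```python
import time


def wait_until_ready(get_status, timeout_s=1800, interval_s=60, sleep=time.sleep):
    # get_status: callable returning the current status name as a string.
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == 'READY':
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError('endpoint not ready, last status: ' + status)
        sleep(interval_s)
```

With the client, this might be called as `wait_until_ready(lambda: Model.get(model_name).status.name)`, assuming the response shape shown in the example above.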

"},{"location":"guides/fine_tuning/","title":"Fine-tuning","text":"

Learn how to customize your models on your data with fine-tuning. Or get started right away with our fine-tuning cookbook.

"},{"location":"guides/fine_tuning/#introduction","title":"Introduction","text":"

Fine-tuning helps improve model performance by training on specific examples of prompts and desired responses. LLMs are initially trained on data collected from the entire internet. With fine-tuning, LLMs can be optimized to perform better in a specific domain by learning from examples for that domain. Smaller LLMs that have been fine-tuned on a specific use case often outperform larger ones that were trained more generally.

Fine-tuning allows for:

  1. Higher quality results than prompt engineering alone
  2. Cost savings through shorter prompts
  3. The ability to reach equivalent accuracy with a smaller model
  4. Lower latency at inference time
  5. The chance to show an LLM more examples than can fit in a single context window

LLM Engine's fine-tuning API lets you fine-tune various open source LLMs on your own data and then make inference calls to the resulting LLM. For more specific details, see the fine-tuning API reference.

"},{"location":"guides/fine_tuning/#producing-high-quality-data-for-fine-tuning","title":"Producing high quality data for fine-tuning","text":"

The training data for fine-tuning should consist of prompt and response pairs.

As a rule of thumb, you should expect to see linear improvements in your fine-tuned model's quality with each doubling of the dataset size. Having high-quality data is also essential to improving performance. For every linear increase in the error rate in your training data, you may encounter a roughly quadratic increase in your fine-tuned model's error rate.

High quality data is critical to achieve improved model performance, and in several cases will require experts to generate and prepare data - the breadth and diversity of the data is highly critical. Scale's Data Engine can help prepare such high quality, diverse data sets - more information here.

"},{"location":"guides/fine_tuning/#preparing-data","title":"Preparing data","text":"

Your data must be formatted as a CSV file that includes two columns: prompt and response. A maximum of 100,000 rows of data is currently supported. At least 200 rows of data is recommended to start to see benefits from fine-tuning. LLM Engine supports fine-tuning with a training and validation dataset. If only a training dataset is provided, 10% of the data is randomly split to be used as validation.

Here is an example script to create a 50-row CSV of properly formatted data for fine-tuning an airline question-answering bot:

Creating a sample dataset
import csv\n# Define data\ndata = [\n    (\"What is your policy on carry-on luggage?\", \"Our policy allows each passenger to bring one piece of carry-on luggage and one personal item such as a purse or briefcase. The maximum size for carry-on luggage is 22 x 14 x 9 inches.\"),\n    (\"How can I change my flight?\", \"You can change your flight through our website or mobile app. Go to 'Manage my booking' section, enter your booking reference and last name, then follow the prompts to change your flight.\"),\n    (\"What meals are available on my flight?\", \"We offer a variety of meals depending on the flight's duration and route. These can range from snacks and light refreshments to full-course meals on long-haul flights. Specific meal options can be viewed during the booking process.\"),\n    (\"How early should I arrive at the airport before my flight?\", \"We recommend arriving at least two hours before domestic flights and three hours before international flights.\"),\n    (\"Can I select my seat in advance?\", \"Yes, you can select your seat during the booking process or afterwards via the 'Manage my booking' section on our website or mobile app.\"),\n    (\"What should I do if my luggage is lost?\", \"If your luggage is lost, please report this immediately at our 'Lost and Found' counter at the airport. We will assist you in tracking your luggage.\"),\n    (\"Do you offer special assistance for passengers with disabilities?\", \"Yes, we offer special assistance for passengers with disabilities. Please notify us of your needs at least 48 hours prior to your flight.\"),\n    (\"Can I bring my pet on the flight?\", \"Yes, we allow small pets in the cabin, and larger pets in the cargo hold. 
Please check our pet policy for more details.\"),\n    (\"What is your policy on flight cancellations?\", \"In case of flight cancellations, we aim to notify passengers as early as possible and offer either a refund or a rebooking on the next available flight.\"),\n    (\"Can I get a refund if I cancel my flight?\", \"Refunds depend on the type of ticket purchased. Please check our cancellation policy for details. Non-refundable tickets, however, are typically not eligible for refunds unless due to extraordinary circumstances.\"),\n    (\"How can I check-in for my flight?\", \"You can check-in for your flight either online, through our mobile app, or at the airport. Online and mobile app check-in opens 24 hours before departure and closes 90 minutes before.\"),\n    (\"Do you offer free meals on your flights?\", \"Yes, we serve free meals on all long-haul flights. For short-haul flights, we offer a complimentary drink and snack. Special meal requests should be made at least 48 hours before departure.\"),\n    (\"Can I use my electronic devices during the flight?\", \"Small electronic devices can be used throughout the flight in flight mode. Larger devices like laptops may be used above 10,000 feet.\"),\n    (\"How much baggage can I check-in?\", \"The checked baggage allowance depends on the class of travel and route. The details would be mentioned on your ticket, or you can check on our website.\"),\n    (\"How can I request for a wheelchair?\", \"To request a wheelchair or any other special assistance, please call our customer service at least 48 hours before your flight.\"),\n    (\"Do I get a discount for group bookings?\", \"Yes, we offer discounts on group bookings of 10 or more passengers. Please contact our group bookings team for more information.\"),\n    (\"Do you offer Wi-fi on your flights?\", \"Yes, we offer complimentary Wi-fi on select flights. 
You can check the availability during the booking process.\"),\n    (\"What is the minimum connecting time between flights?\", \"The minimum connecting time varies depending on the airport and whether your flight is international or domestic. Generally, it's recommended to allow at least 45-60 minutes for domestic connections and 60-120 minutes for international.\"),\n    (\"Do you offer duty-free shopping on international flights?\", \"Yes, we have a selection of duty-free items that you can pre-order on our website or purchase onboard on international flights.\"),\n    (\"Can I upgrade my ticket to business class?\", \"Yes, you can upgrade your ticket through the 'Manage my booking' section on our website or by contacting our customer service. The availability and costs depend on the specific flight.\"),\n    (\"Can unaccompanied minors travel on your flights?\", \"Yes, we do accommodate unaccompanied minors on our flights, with special services to ensure their safety and comfort. Please contact our customer service for more details.\"),\n    (\"What amenities do you provide in business class?\", \"In business class, you will enjoy additional legroom, reclining seats, premium meals, priority boarding and disembarkation, access to our business lounge, extra baggage allowance, and personalized service.\"),\n    (\"How much does extra baggage cost?\", \"Extra baggage costs vary based on flight route and the weight of the baggage. Please refer to our 'Extra Baggage' section on the website for specific rates.\"),\n    (\"Are there any specific rules for carrying liquids in carry-on?\", \"Yes, liquids carried in your hand luggage must be in containers of 100 ml or less and they should all fit into a single, transparent, resealable plastic bag of 20 cm x 20 cm.\"),\n    (\"What if I have a medical condition that requires special assistance during the flight?\", \"We aim to make the flight comfortable for all passengers. 
If you have a medical condition that may require special assistance, please contact our \u2018special services\u2019 team 48 hours before your flight.\"),\n    (\"What in-flight entertainment options are available?\", \"We offer a range of in-flight entertainment options including a selection of movies, TV shows, music, and games, available on your personal seat-back screen.\"),\n    (\"What types of payment methods do you accept?\", \"We accept credit/debit cards, PayPal, bank transfers, and various other forms of payment. The available options may vary depending on the country of departure.\"),\n    (\"How can I earn and redeem frequent flyer miles?\", \"You can earn miles for every journey you take with us or our partner airlines. These miles can be redeemed for flight tickets, upgrades, or various other benefits. To earn and redeem miles, you need to join our frequent flyer program.\"),\n    (\"Can I bring a stroller for my baby?\", \"Yes, you can bring a stroller for your baby. It can be checked in for free and will normally be given back to you at the aircraft door upon arrival.\"),\n    (\"What age does my child have to be to qualify as an unaccompanied minor?\", \"Children aged between 5 and 12 years who are traveling alone are considered unaccompanied minors. Our team provides special care for these children from departure to arrival.\"),\n    (\"What documents do I need to travel internationally?\", \"For international travel, you need a valid passport and may also require visas, depending on your destination and your country of residence. It's important to check the specific requirements before you travel.\"),\n    (\"What happens if I miss my flight?\", \"If you miss your flight, please contact our customer service immediately. 
Depending on the circumstances, you may be able to rebook on a later flight, but additional fees may apply.\"),\n    (\"Can I travel with my musical instrument?\", \"Yes, small musical instruments can be brought on board as your one carry-on item. Larger instruments must be transported in the cargo, or if small enough, a seat may be purchased for them.\"),\n    (\"Do you offer discounts for children or infants?\", \"Yes, children aged 2-11 traveling with an adult usually receive a discount on the fare. Infants under the age of 2 who do not occupy a seat can travel for a reduced fare or sometimes for free.\"),\n    (\"Is smoking allowed on your flights?\", \"No, all our flights are non-smoking for the comfort and safety of all passengers.\"),\n    (\"Do you have family seating?\", \"Yes, we offer the option to seat families together. You can select seats during booking or afterwards through the 'Manage my booking' section on the website.\"),\n    (\"Is there any discount for senior citizens?\", \"Some flights may offer a discount for senior citizens. Please check our website or contact customer service for accurate information.\"),\n    (\"What items are prohibited on your flights?\", \"Prohibited items include, but are not limited to, sharp objects, firearms, explosive materials, and certain chemicals. You can find a comprehensive list on our website under the 'Security Regulations' section.\"),\n    (\"Can I purchase a ticket for someone else?\", \"Yes, you can purchase a ticket for someone else. You'll need their correct name as it appears on their government-issued ID, and their correct travel dates.\"),\n    (\"What is the process for lost and found items on the plane?\", \"If you realize you forgot an item on the plane, report it as soon as possible to our lost and found counter. We will make every effort to locate and return your item.\"),\n    (\"Can I request a special meal?\", \"Yes, we offer a variety of special meals to accommodate dietary restrictions. 
Please request your preferred meal at least 48 hours prior to your flight.\"),\n    (\"Is there a weight limit for checked baggage?\", \"Yes, luggage weight limits depend on your ticket class and route. You can find the details on your ticket or by visiting our website.\"),\n    (\"Can I bring my sports equipment?\", \"Yes, certain types of sports equipment can be carried either as or in addition to your permitted baggage. Some equipment may require additional fees. It's best to check our policy on our website or contact us directly.\"),\n    (\"Do I need a visa to travel to certain countries?\", \"Yes, visa requirements depend on the country you are visiting and your nationality. We advise checking with the relevant embassy or consulate prior to travel.\"),\n    (\"How can I add extra baggage to my booking?\", \"You can add extra baggage to your booking through the 'Manage my booking' section on our website or by contacting our customer services.\"),\n    (\"Can I check-in at the airport?\", \"Yes, you can choose to check-in at the airport. However, we also offer online and mobile check-in, which may save you time.\"),\n    (\"How do I know if my flight is delayed or cancelled?\", \"In case of any changes to your flight, we will attempt to notify all passengers using the contact information given at the time of booking. You can also check your flight status on our website.\"),\n    (\"What is your policy on pregnant passengers?\", \"Pregnant passengers can travel up to the end of the 36th week for single pregnancies, and the end of the 32nd week for multiple pregnancies. We recommend consulting your doctor before any air travel.\"),\n    (\"Can children travel alone?\", \"Yes, children age 5 to 12 can travel alone as unaccompanied minors. We provide special care for these seats. 
Please contact our customer service for more information.\"),\n    (\"How can I pay for my booking?\", \"You can pay for your booking using a variety of methods including credit and debit cards, PayPal, or bank transfers. The options may vary depending on the country of departure.\"),\n]\n\n# Write data to a CSV file\nwith open('customer_service_data.csv', 'w', newline='') as file:\n    writer = csv.writer(file)\n    writer.writerow([\"prompt\", \"response\"])\n    writer.writerows(data)\n
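Before uploading, it can help to check a file against the format rules described in this section. The following is a rough sketch rather than the validation LLM Engine itself performs: `validate_fine_tune_csv` is a hypothetical helper, and rejecting rows with an empty prompt or response is an added assumption, not a documented rule.

```python
import csv

MAX_ROWS = 100000  # row limit stated in this guide


def validate_fine_tune_csv(path):
    # Returns the number of data rows, or raises ValueError on a problem.
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        missing = {'prompt', 'response'} - set(columns)
        if missing:
            raise ValueError('missing columns: ' + ', '.join(sorted(missing)))
        rows = 0
        for row in reader:
            rows += 1
            # Assumption: empty cells are treated as an error.
            if not row['prompt'] or not row['response']:
                raise ValueError('empty prompt or response in data row ' + str(rows))
    if rows > MAX_ROWS:
        raise ValueError('too many rows: ' + str(rows))
    return rows
```

For the script above, `validate_fine_tune_csv('customer_service_data.csv')` would be expected to return the number of rows written.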
"},{"location":"guides/fine_tuning/#making-your-data-accessible-to-llm-engine","title":"Making your data accessible to LLM Engine","text":"

Currently, data needs to be uploaded to either a publicly accessible web URL or to LLM Engine's private file server so that it can be read for fine-tuning. Publicly accessible HTTP and HTTPS URLs are currently supported.

To privately share data with the LLM Engine API, use LLM Engine's File.upload API. You can upload data from a local file to LLM Engine's private file server and then use the returned file ID to reference your data in the FineTune API. The file ID is generally in the form of file-<random_string>, e.g. \"file-7DLVeLdN2Ty4M2m\".

Upload to LLM Engine's private file server
from llmengine import File\n\nresponse = File.upload(open(\"customer_service_data.csv\", \"r\"))\nprint(response.json())\n
"},{"location":"guides/fine_tuning/#launching-the-fine-tune","title":"Launching the fine-tune","text":"

Once you have uploaded your data, you can use LLM Engine's FineTune.create API to launch a fine-tune. You will need to specify which base model to fine-tune, the locations of the training file and optional validation data file, an optional set of hyperparameters to customize the fine-tuning behavior, and an optional suffix to append to the name of the fine-tune. Sequences longer than the native max_seq_length of the model will be truncated.

If you specify a suffix, the fine-tune will be named model.suffix.<timestamp>. If you do not, the fine-tune will be named model.<timestamp>. The timestamp will be the time the fine-tune was launched. Note: the suffix must only contain alphanumeric characters and hyphens, and be at most 28 characters long.
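The suffix rule can be checked locally before launching a job. This is a small sketch under the constraints stated above (alphanumeric characters and hyphens, at most 28 characters); `is_valid_suffix` is a hypothetical helper, and rejecting an empty suffix is an assumption.

```python
import re

# Letters, digits and hyphens only, between 1 and 28 characters.
SUFFIX_PATTERN = re.compile('[A-Za-z0-9-]{1,28}')


def is_valid_suffix(suffix):
    return SUFFIX_PATTERN.fullmatch(suffix) is not None


print(is_valid_suffix('airlines'))
print(is_valid_suffix('my_suffix'))
```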

Hyper-parameters for fine-tune

- `lr`: Peak learning rate used during fine-tuning. It decays with a cosine schedule afterward. (Default: 2e-3)
- `warmup_ratio`: Ratio of training steps used for learning rate warmup. (Default: 0.03)
- `epochs`: Number of fine-tuning epochs. This should be less than 20. (Default: 5)
- `weight_decay`: Regularization penalty applied to learned weights. (Default: 0.001)

Create a fine-tune in python
from llmengine import FineTune\n\nresponse = FineTune.create(\n    model=\"llama-2-7b\",\n    training_file=\"file-AbCDeLdN2Ty4M2m\",\n    validation_file=\"file-ezSRpgtKQyItI26\",\n)\n\nprint(response.json())\n
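To make the interaction between `lr` and `warmup_ratio` concrete, the sketch below computes a learning rate per training step. It is an illustration only and not LLM Engine's training code: the linear shape of the warmup and the decay all the way down to zero are assumptions, and `learning_rate` is a hypothetical function.

```python
import math


def learning_rate(step, total_steps, lr=2e-3, warmup_ratio=0.03):
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Assumed linear warmup from 0 up to the peak learning rate.
        return lr * step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = (step - warmup_steps) / decay_steps
    return lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 15, 30, 515, 1000):
    print(step, learning_rate(step, total_steps=1000))
```

With 1000 total steps and the default `warmup_ratio`, the peak is reached at step 30 and the rate falls to half the peak midway through the remaining steps.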

See the Model Zoo to see which models have fine-tuning support.

See Integrations to see how to track fine-tuning metrics.

"},{"location":"guides/fine_tuning/#monitoring-the-fine-tune","title":"Monitoring the fine-tune","text":"

Once the fine-tune is launched, you can get its status and list the events that it produces.

from llmengine import FineTune\n\nfine_tune_id = \"ft-cabcdefghi1234567890\"\nfine_tune = FineTune.get(fine_tune_id)\nprint(fine_tune.status)  # BatchJobStatus.RUNNING\nprint(fine_tune.fine_tuned_model)  # \"llama-2-7b.700101-000000\"\nfine_tune_events = FineTune.get_events(fine_tune_id)\nfor event in fine_tune_events.events:\n    print(event)\n# Prints something like:\n# timestamp=1697590000.0 message=\"{'loss': 12.345, 'learning_rate': 0.0, 'epoch': 0.97}\" level='info'\n# timestamp=1697590000.0 message=\"{'eval_loss': 23.456, 'eval_runtime': 19.876, 'eval_samples_per_second': 4.9, 'eval_steps_per_second': 4.9, 'epoch': 0.97}\" level='info'\n# timestamp=1697590020.0 message=\"{'train_runtime': 421.234, 'train_samples_per_second': 2.042, 'train_steps_per_second': 0.042, 'total_flos': 123.45, 'train_loss': 34.567, 'epoch': 0.97}\" level='info'\n

The status of your fine-tune will give a high-level overview of the fine-tune's progress. The events of your fine-tune will give more detail, such as the training loss and validation loss at each epoch, as well as any errors that may have occurred. If you encounter any errors with your fine-tune, the events are a good place to start debugging. For example, if you see Unable to read training or validation dataset, you may need to make your files accessible to LLM Engine. If you see Invalid value received for lora parameter 'lora_alpha'!, you should check that your hyperparameters are valid.
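The event messages shown above look like printed Python dictionaries, so the losses can be pulled out for plotting or logging. The sketch below assumes that format and that each event exposes its text as a string; `losses_from_events` is a hypothetical helper, and messages that do not parse as a dictionary (such as plain-text errors) are skipped.

```python
import ast


def losses_from_events(messages):
    # messages: iterable of event message strings.
    # Returns two lists of (epoch, loss) pairs: training and evaluation.
    train, evals = [], []
    for message in messages:
        try:
            metrics = ast.literal_eval(message)
        except (ValueError, SyntaxError):
            continue  # not a metrics dictionary, e.g. a plain-text error
        if not isinstance(metrics, dict):
            continue
        if 'loss' in metrics:
            train.append((metrics.get('epoch'), metrics['loss']))
        if 'eval_loss' in metrics:
            evals.append((metrics.get('epoch'), metrics['eval_loss']))
    return train, evals
```

With the client, this might be called as `losses_from_events(e.message for e in fine_tune_events.events)`, assuming events carry a `message` attribute as the printed output suggests.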

"},{"location":"guides/fine_tuning/#making-inference-calls-to-your-fine-tune","title":"Making inference calls to your fine-tune","text":"

Once your fine-tune is finished, you can start making inference requests to the model. Use the fine_tuned_model returned from your FineTune.get API call to reference your fine-tuned model in the Completions API. Alternatively, list the available LLMs with Model.list to find the name of your fine-tuned model, then use that name to direct your completion requests. See the Completion API for more details. The fine-tune must be complete before you can use it with the Completions API; you can check its status with FineTune.get.

Inference with a fine-tuned model in python
from llmengine import Completion\n\nresponse = Completion.create(\n    model=\"llama-2-7b.airlines.2023-07-17-08-30-45\",\n    prompt=\"Do you offer in-flight Wi-fi?\",\n    max_new_tokens=100,\n    temperature=0.2,\n)\nprint(response.json())\n
"},{"location":"guides/rate_limits/","title":"Overview","text":""},{"location":"guides/rate_limits/#what-are-rate-limits","title":"What are rate limits?","text":"

A rate limit is a restriction that an API imposes on the number of times a user or client can access the server within a specified period of time.

"},{"location":"guides/rate_limits/#how-do-i-know-if-i-am-rate-limited","title":"How do I know if I am rate limited?","text":"

Per standard HTTP practices, your request will receive a response with HTTP status code of 429, Too Many Requests.

"},{"location":"guides/rate_limits/#what-are-the-rate-limits-for-our-api","title":"What are the rate limits for our API?","text":"

The LLM Engine API is currently in preview mode, so we do not yet have any advertised rate limits. As the API moves towards a production release, we will update this section with specific rate limits. For now, the API will return HTTP 429 on an as-needed basis.

"},{"location":"guides/rate_limits/#error-mitigation","title":"Error mitigation","text":""},{"location":"guides/rate_limits/#retrying-with-exponential-backoff","title":"Retrying with exponential backoff","text":"

One easy way to avoid rate limit errors is to automatically retry requests with a random exponential backoff. Retrying with exponential backoff means performing a short sleep when a rate limit error is hit, then retrying the unsuccessful request. If the request is still unsuccessful, the sleep length is increased and the process is repeated. This continues until the request is successful or until a maximum number of retries is reached. This approach has many benefits:

Below are a few example solutions for Python that use exponential backoff.

"},{"location":"guides/rate_limits/#example-1-using-the-tenacity-library","title":"Example #1: Using the tenacity library","text":"

Tenacity is an Apache 2.0 licensed general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything. To add exponential backoff to your requests, you can use the tenacity.retry decorator. The below example uses the tenacity.wait_random_exponential function to add random exponential backoff to a request.

Exponential backoff in python
import llmengine\nfrom tenacity import (\n    retry,\n    stop_after_attempt,\n    wait_random_exponential,\n)  # for exponential backoff\n@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))\ndef completion_with_backoff(**kwargs):\n    return llmengine.Completion.create(**kwargs)\n\ncompletion_with_backoff(model=\"llama-2-7b\", prompt=\"Why is the sky blue?\")\n
"},{"location":"guides/rate_limits/#example-2-using-the-backoff-library","title":"Example #2: Using the backoff library","text":"

Backoff is another Python library that provides function decorators which can be used to wrap a function such that it will be retried until some condition is met.

Decorators for backoff and retry in python
import llmengine\nimport backoff\n\n# Retry with exponentially growing delays whenever the rate limit is hit\n@backoff.on_exception(backoff.expo, llmengine.errors.RateLimitExceededError)\ndef completion_with_backoff(**kwargs):\n    return llmengine.Completion.create(**kwargs)\n\ncompletion_with_backoff(model=\"llama-2-7b\", prompt=\"Why is the sky blue?\")\n
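For comparison, the retry loop described at the start of this section can also be written without a third-party library. The following is a minimal sketch, not a drop-in replacement for either library: `retry_with_backoff` and its parameters are hypothetical, and sleeping for a random time between zero and the capped exponential delay is one common choice of jitter among several.

```python
import random
import time


def retry_with_backoff(func, is_retryable, max_retries=6, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    # func: zero-argument callable to run.
    # is_retryable: callable(exception) -> bool deciding whether to retry.
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries or not is_retryable(exc):
                raise
            # The delay cap doubles after every failed attempt.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: sleep a random time between 0 and the cap.
            sleep(random.uniform(0, delay))
```

With the client, `is_retryable` could test for the rate limit error used in the example above, for instance `lambda exc: isinstance(exc, llmengine.errors.RateLimitExceededError)`.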
"},{"location":"guides/self_hosting/","title":"Self Hosting [Experimental]","text":"

This guide is currently highly experimental. Instructions are subject to change as we improve support for self-hosting.

We provide a Helm chart that deploys LLM Engine to an Amazon Elastic Kubernetes Service (EKS) cluster in AWS. This Helm chart should be configured to connect to dependencies (such as a PostgreSQL database) that you may already have available in your environment.

The only portions of the Helm chart that are production ready are the parts that configure and manage the LLM Engine server itself (not PostgreSQL, IAM, etc.).

We first go over the AWS dependencies that must exist before we can run helm install in your EKS cluster.

"},{"location":"guides/self_hosting/#aws-dependencies","title":"AWS Dependencies","text":"

This section describes assumptions about existing AWS resources required to run the LLM Engine server.

"},{"location":"guides/self_hosting/#eks","title":"EKS","text":"

The LLM Engine server must be deployed in an EKS cluster environment. Currently only versions 1.23+ are supported. Below are the assumed requirements for the EKS cluster:

You will need to provision EKS node groups with GPUs to schedule model pods. These node groups must have the node-lifecycle: normal label on them. Additionally, they must have the k8s.amazonaws.com/accelerator label set appropriately depending on the instance type:

| Instance family | k8s.amazonaws.com/accelerator label |
| --- | --- |
| g4dn | nvidia-tesla-t4 |
| g5 | nvidia-tesla-a10 |
| p4d | nvidia-tesla-a100 |
| p4de | nvidia-tesla-a100e |

We also recommend setting the following taint on your GPU nodes to prevent pods that do not require GPU resources from being scheduled on them: - { key = \"nvidia.com/gpu\", value = \"true\", effect = \"NO_SCHEDULE\" }
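When node groups are generated from a script, the labels above can be derived from the instance type. The sketch below simply encodes the table and the `node-lifecycle` label from this section; `gpu_node_labels` is a hypothetical helper and not part of the Helm chart.

```python
# Accelerator label per instance family, copied from the table above.
ACCELERATOR_LABELS = {
    'g4dn': 'nvidia-tesla-t4',
    'g5': 'nvidia-tesla-a10',
    'p4d': 'nvidia-tesla-a100',
    'p4de': 'nvidia-tesla-a100e',
}


def gpu_node_labels(instance_type):
    # instance_type is an EC2 type such as g5.12xlarge;
    # the family is the part before the dot.
    family = instance_type.split('.', 1)[0]
    if family not in ACCELERATOR_LABELS:
        raise ValueError('no accelerator label known for instance family ' + family)
    return {
        'node-lifecycle': 'normal',
        'k8s.amazonaws.com/accelerator': ACCELERATOR_LABELS[family],
    }


print(gpu_node_labels('g5.12xlarge'))
```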

"},{"location":"guides/self_hosting/#postgresql","title":"PostgreSQL","text":"

The LLM Engine server requires a PostgreSQL database to store its data. LLM Engine currently supports PostgreSQL version 14. Create a PostgreSQL database (e.g. AWS RDS PostgreSQL) if you do not have an existing one you wish to connect LLM Engine to.

To enable LLM Engine to connect to the PostgreSQL database, we create a Kubernetes secret with the PostgreSQL URL. An example YAML is provided below:

apiVersion: v1\nkind: Secret\nmetadata:\n  name: llm-engine-database-credentials  # this name will be an input to our Helm Chart\nstringData:  # plain-text values; Kubernetes base64-encodes them into `data`\n  database_url: \"postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1&...]\"\n
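
If you prefer to generate the manifest programmatically, the sketch below builds an equivalent Secret using the data field, whose values Kubernetes requires to be base64-encoded. The function name build_database_secret is hypothetical, and any connection string you pass in is your own; only the secret name and the database_url key come from the example above.

```python
import base64


def build_database_secret(database_url, name="llm-engine-database-credentials"):
    """Return a Kubernetes Secret manifest (as a dict) holding the PostgreSQL URL."""
    encoded = base64.b64encode(database_url.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        # Values under `data` must be base64-encoded strings.
        "data": {"database_url": encoded},
    }
```

Serializing the returned dict with json.dumps gives a manifest that kubectl apply -f can read, since kubectl accepts JSON as well as YAML.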

"},{"location":"guides/self_hosting/#redis","title":"Redis","text":"

The LLM Engine server requires Redis for various caching/queue functionality. LLM Engine currently supports Redis version 6. Create a Redis cluster (e.g. Amazon ElastiCache for Redis) if you do not have an existing one you wish to connect LLM Engine to.

To enable LLM Engine to connect to Redis, fill out the Helm chart values with the Redis host and URL.

"},{"location":"guides/self_hosting/#amazon-s3","title":"Amazon S3","text":"

You will need to have an S3 bucket for LLM Engine to store various assets (e.g. model weights, prediction results). The ARN of this bucket should be provided in the Helm chart values.

"},{"location":"guides/self_hosting/#amazon-ecr","title":"Amazon ECR","text":"

You will need to provide an ECR repository for the deployment to store model containers. The ARN of this repository should be provided in the Helm chart values.

"},{"location":"guides/self_hosting/#amazon-sqs","title":"Amazon SQS","text":"

LLM Engine utilizes Amazon SQS to keep track of jobs. LLM Engine will create and use SQS queues as needed.

"},{"location":"guides/self_hosting/#identity-and-access-management-iam","title":"Identity and Access Management (IAM)","text":"

The LLM Engine server will need an IAM role to perform various AWS operations. This role will be assumed by the serviceaccount llm-engine in the launch namespace in the EKS cluster. The ARN of this role needs to be provided to the Helm chart, and the role needs to be granted the following permissions:

| Action | Resource |
| --- | --- |
| s3:Get*, s3:Put* | ${s3_bucket_arn}/* |
| s3:List* | ${s3_bucket_arn} |
| sqs:* | arn:aws:sqs:${region}:${account_id}:llm-engine-endpoint-id-* |
| sqs:ListQueues | * |
| ecr:BatchGetImage, ecr:DescribeImages, ecr:GetDownloadUrlForLayer, ecr:ListImages | ${ecr_repository_arn} |

"},{"location":"guides/self_hosting/#helm-chart","title":"Helm Chart","text":"

Now that all dependencies have been installed and configured, we can run the provided Helm chart. The values in the Helm chart will need to correspond with the resources described in the Dependencies section.

Ensure that Helm V3 is installed and can connect to the EKS cluster. You should then be able to install the chart with helm install llm-engine llm-engine -f llm-engine/values_sample.yaml -n <DESIRED_NAMESPACE>. Below are the configurations to specify in the values_sample.yaml file.

| Parameter | Description | Required |
| --- | --- | --- |
| tag | The LLM Engine docker image tag | Yes |
| context | A user-specified deployment tag | No |
| image.gatewayRepository | The docker repository to pull the LLM Engine gateway image from | Yes |
| image.builderRepository | The docker repository to pull the LLM Engine endpoint builder image from | Yes |
| image.cacherRepository | The docker repository to pull the LLM Engine cacher image from | Yes |
| image.forwarderRepository | The docker repository to pull the LLM Engine forwarder image from | Yes |
| image.pullPolicy | The docker image pull policy | No |
| secrets.kubernetesDatabaseSecretName | The name of the secret that contains the database credentials | Yes |
| serviceAccount.annotations.eks.amazonaws.com/role-arn | The ARN of the IAM role that the service account will assume | Yes |
| service.type | The service configuration for the main LLM Engine server | No |
| service.port | The service configuration for the main LLM Engine server | No |
| replicaCount | The amount of replica pods for each deployment | No |
| autoscaling | The autoscaling configuration for LLM Engine server deployments | No |
| resources.requests.cpu | The k8s resources for LLM Engine server deployments | No |
| nodeSelector | The node selector for LLM Engine server deployments | No |
| tolerations | The tolerations for LLM Engine server deployments | No |
| affinity | The affinity for LLM Engine server deployments | No |
| aws.configMap.name | The AWS configurations (by configMap) for LLM Engine server deployments | No |
| aws.configMap.create | The AWS configurations (by configMap) for LLM Engine server deployments | No |
| aws.profileName | The AWS configurations (by configMap) for LLM Engine server deployments | No |
| serviceTemplate.securityContext.capabilities.drop | Additional flags for model endpoints | No |
| serviceTemplate.mountInfraConfig | Additional flags for model endpoints | No |
| config.values.infra.k8s_cluster_name | The name of the k8s cluster | Yes |
| config.values.infra.dns_host_domain | The domain name of the k8s cluster | Yes |
| config.values.infra.default_region | The default AWS region for various resources | Yes |
| config.values.infra.ml_account_id | The AWS account ID for various resources | Yes |
| config.values.infra.docker_repo_prefix | The prefix for AWS ECR repositories | Yes |
| config.values.infra.redis_host | The hostname of the redis cluster you wish to connect | Yes |
| config.values.infra.s3_bucket | The S3 bucket you wish to connect | Yes |
| config.values.llm_engine.endpoint_namespace | K8s namespace the endpoints will be created in | Yes |
| config.values.llm_engine.cache_redis_url | The full url for the redis cluster you wish to connect | Yes |
| config.values.llm_engine.s3_file_llm_fine_tuning_job_repository | The S3 URI for the S3 bucket/key that you wish to save fine-tuned assets | Yes |
| config.values.dd_trace_enabled | Whether to enable datadog tracing, datadog must be installed in the cluster | No |

"},{"location":"guides/self_hosting/#play-with-it","title":"Play With It","text":"

Once helm install succeeds, you can forward port 5000 from an llm-engine pod and test sending requests to it.

First, see a list of pods in the namespace that you performed helm install in:

$ kubectl get pods -n <NAMESPACE_WHERE_LLM_ENGINE_IS_INSTALLED>\nNAME                                           READY   STATUS             RESTARTS      AGE\nllm-engine-668679554-9q4wj                     1/1     Running            0             18m\nllm-engine-668679554-xfhxx                     1/1     Running            0             18m\nllm-engine-cacher-5f8b794585-fq7dj             1/1     Running            0             18m\nllm-engine-endpoint-builder-5cd6bf5bbc-sm254   1/1     Running            0             18m\nllm-engine-image-cache-a10-sw4pg               1/1     Running            0             18m \n
Note the pod names you see may be different.

Forward a port from an llm-engine pod:

$ kubectl port-forward pod/llm-engine-<REST_OF_POD_NAME> 5000:5000 -n <NAMESPACE_WHERE_LLM_ENGINE_IS_INSTALLED>\n

Then, try sending a request to get LLM model endpoints for test-user-id:

$ curl -X GET -H \"Content-Type: application/json\" -u \"test-user-id:\" \"http://localhost:5000/v1/llm/model-endpoints\"\n

You should get the following response:

{\"model_endpoints\":[]}\n

Next, let's create an LLM endpoint using llama-7b:

$ curl -X POST 'http://localhost:5000/v1/llm/model-endpoints' \\\n    -H 'Content-Type: application/json' \\\n    -d '{\n        \"name\": \"llama-7b\",\n        \"model_name\": \"llama-7b\",\n        \"source\": \"hugging_face\",\n        \"inference_framework\": \"text_generation_inference\",\n        \"inference_framework_image_tag\": \"0.9.3\",\n        \"num_shards\": 4,\n        \"endpoint_type\": \"streaming\",\n        \"cpus\": 32,\n        \"gpus\": 4,\n        \"memory\": \"40Gi\",\n        \"storage\": \"40Gi\",\n        \"gpu_type\": \"nvidia-ampere-a10\",\n        \"min_workers\": 1,\n        \"max_workers\": 12,\n        \"per_worker\": 1,\n        \"labels\": {},\n        \"metadata\": {}\n    }' \\\n    -u test-user-id:\n

It should output something like:

{\"endpoint_creation_task_id\":\"8d323344-b1b5-497d-a851-6d6284d2f8e4\"}\n

Wait a few minutes for the endpoint to be ready. You can tell that it's ready by listing pods and checking that all containers in the LLM endpoint pod are ready:

$ kubectl get pods -n <endpoint_namespace specified in values_sample.yaml>\nNAME                                                              READY   STATUS    RESTARTS        AGE\nllm-engine-endpoint-id-end-cismpd08agn003rr2kc0-7f86ff64f9qj9xp   2/2     Running   1 (4m41s ago)   7m26s\n
Note the endpoint name could be different.
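
The readiness check above can also be scripted. The helper below is a minimal sketch that assumes the default kubectl get pods column layout shown in the output (NAME, READY, STATUS, ...); the function name all_containers_ready is hypothetical.

```python
def all_containers_ready(kubectl_output):
    """Return True if every pod row reports READY as n/n and STATUS as Running."""
    rows = kubectl_output.strip().splitlines()[1:]  # skip the header row
    if not rows:
        return False  # no pods listed yet
    for row in rows:
        fields = row.split()
        ready, total = fields[1].split("/")
        if ready != total or fields[2] != "Running":
            return False
    return True
```

You could feed it the captured output of the kubectl get pods command and poll until it returns True.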

Then, you can send an inference request to the endpoint:

$ curl -X POST 'http://localhost:5000/v1/llm/completions-sync?model_endpoint_name=llama-7b' \\\n    -H 'Content-Type: application/json' \\\n    -d '{\n        \"prompts\": [\"Tell me a joke about AI\"],\n        \"max_new_tokens\": 30,\n        \"temperature\": 0.1\n    }' \\\n    -u test-user-id:\n

You should get a response similar to:

{\"status\":\"SUCCESS\",\"outputs\":[{\"text\":\". Tell me a joke about AI. Tell me a joke about AI. Tell me a joke about AI. Tell me\",\"num_completion_tokens\":30}],\"traceback\":null}\n
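
The same inference request can be built from Python with only the standard library. This is a sketch rather than an official client: build_completion_request is a hypothetical helper, and it reuses the path, payload fields, and test-user-id credentials from the curl example above.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_completion_request(base_url, endpoint_name, prompt, user_id="test-user-id"):
    """Build (but do not send) the POST request shown in the curl example."""
    query = urllib.parse.urlencode({"model_endpoint_name": endpoint_name})
    url = f"{base_url}/v1/llm/completions-sync?{query}"
    body = json.dumps(
        {"prompts": [prompt], "max_new_tokens": 30, "temperature": 0.1}
    ).encode("utf-8")
    # `curl -u test-user-id:` sends HTTP basic auth with an empty password.
    token = base64.b64encode(f"{user_id}:".encode("utf-8")).decode("ascii")
    headers = {"Content-Type": "application/json", "Authorization": f"Basic {token}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

With the port-forward still running, passing the returned object to urllib.request.urlopen sends the request.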

"},{"location":"guides/self_hosting/#pointing-llm-engine-client-to-use-self-hosted-infrastructure","title":"Pointing LLM Engine client to use self-hosted infrastructure","text":"

The llmengine client makes requests to Scale AI's hosted infrastructure by default. You can have the llmengine client make requests to your own self-hosted infrastructure by setting the LLM_ENGINE_BASE_PATH environment variable to the URL of the llm-engine service.

The exact URL of llm-engine service depends on your Kubernetes cluster networking setup. The domain is specified at config.values.infra.dns_host_domain in the helm chart values config file. Using charts/llm-engine/values_sample.yaml as an example, you would do:

export LLM_ENGINE_BASE_PATH=https://llm-engine.domain.com\n

"},{"location":"guides/token_streaming/","title":"Token streaming","text":"

The Completions APIs support a stream boolean parameter that, when True, returns a streamed response of token-by-token server-sent events (SSEs) instead of waiting for model generation to finish before returning the full response. This reduces the time until you receive the first tokens of the response.

The response will consist of SSEs of the form {\"token\": dict, \"generated_text\": str | null, \"details\": dict | null}, where the dictionary for each token will contain log probability information in addition to the generated string; the generated_text field will be null for all but the last SSE, for which it will contain the full generated response.
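
To make the event shape concrete, here is a minimal sketch of consuming such a stream. It assumes each event arrives as a standard SSE data: line and that the token dictionary stores its string under a text key; both the key name and the collect_stream helper are assumptions for illustration, not confirmed API details.

```python
import json


def collect_stream(lines):
    """Accumulate token text from SSE lines and return (tokens, final_text)."""
    tokens = []
    final_text = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        event = json.loads(line[len("data:"):])
        # Assumed key: the token dict stores its string under "text".
        tokens.append(event["token"].get("text", ""))
        if event.get("generated_text") is not None:
            final_text = event["generated_text"]  # only set on the last event
    return tokens, final_text
```

The tokens list can be printed incrementally as events arrive, while final_text stays None until the last event delivers the full generated response.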

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Introduction","text":""},{"location":"#llm-engine","title":"LLM Engine","text":"

The open source engine for fine-tuning and serving large language models.

LLM Engine is the easiest way to customize and serve LLMs.

LLMs can be accessed via Scale's hosted version or by using the Helm charts in this repository to run model inference and fine-tuning in your own infrastructure.

"},{"location":"#quick-install","title":"Quick Install","text":"Install the python package
pip install scale-llm-engine\n
"},{"location":"#about","title":"About","text":"

Foundation models are emerging as the building blocks of AI. However, deploying these models to the cloud and fine-tuning them are expensive operations that require infrastructure and ML expertise. They are also difficult to maintain over time as new models are released and new techniques for both inference and fine-tuning are made available.

LLM Engine is a Python library and Helm chart that provides everything you need to serve and fine-tune foundation models, whether you use Scale's hosted infrastructure or do it in your own cloud infrastructure using Kubernetes.

"},{"location":"#key-features","title":"Key Features","text":"

Ready-to-use APIs for your favorite models: Deploy and serve open source foundation models - including Llama-2, MPT, and Falcon. Use Scale-hosted models or deploy to your own infrastructure.

Fine-tune the best open-source models: Fine-tune open-source foundation models like Llama-2, MPT, etc. with your own data for optimized performance.

Optimized Inference: LLM Engine provides inference APIs for streaming responses and dynamically batching inputs for higher throughput and lower latency.

Open-Source Integrations: Deploy any Hugging Face model with a single command.

Deploying from any docker image: Turn any Docker image into an auto-scaling deployment with simple APIs.

"},{"location":"#features-coming-soon","title":"Features Coming Soon","text":"

Kubernetes Installation Enhancements: We are working hard to enhance the installation and maintenance of inference and fine-tuning functionality on your infrastructure. For now, our documentation covers experimental libraries to deploy language models on your infrastructure and libraries to access Scale's hosted infrastructure.

Fast Cold-Start Times: To prevent GPUs from idling, LLM Engine automatically scales your model to zero when it's not in use and scales up within seconds, even for large foundation models.

Cost Optimization: Deploy AI models cheaper than commercial ones, including cold-start and warm-down times.

"},{"location":"contributing/","title":"Contributing to LLM Engine","text":""},{"location":"contributing/#updating-llm-engine-documentation","title":"Updating LLM Engine Documentation","text":"

LLM Engine leverages mkdocs to create beautiful, community-oriented documentation.

"},{"location":"contributing/#step-1-clone-the-repository","title":"Step 1: Clone the Repository","text":"

Clone/Fork the LLM Engine Repository. Our documentation lives in the docs folder.

"},{"location":"contributing/#step-2-install-the-dependencies","title":"Step 2: Install the Dependencies","text":"

Dependencies are listed in requirements-docs.txt. Install them with

pip install -r requirements-docs.txt\n
"},{"location":"contributing/#step-3-install-the-python-client-locally","title":"Step 3: Install the Python client locally","text":"

Our Python client API reference is autogenerated from our client. You can install the client in editable mode with

pip install -e clients/python\n
"},{"location":"contributing/#step-4-run-locally","title":"Step 4: Run Locally","text":"

To run the documentation service locally, execute the following command:

mkdocs serve\n

This should kick off a locally running instance on http://127.0.0.1:8000/.

As you edit the content in the docs folder, the site will be automatically reloaded on each file save.

"},{"location":"contributing/#step-5-editing-navigation-and-settings","title":"Step 5: Editing Navigation and Settings","text":"

If you are less familiar with mkdocs, in addition to the markdown content in the docs folder, there is a top-level mkdocs.yml file as well that defines the navigation pane and other website settings. If you don't see your page where you think it should be, double-check the .yml file.

"},{"location":"contributing/#step-6-building-and-deploying","title":"Step 6: Building and Deploying","text":"

CircleCI (via .circleci/config.yml) handles the building and deployment of our documentation service for us.

"},{"location":"faq/","title":"Frequently Asked Questions","text":""},{"location":"getting_started/","title":"Getting Started","text":"

The fastest way to get started with LLM Engine is to use the Python client in this repository to run inference and fine-tuning on Scale's infrastructure. This path does not require you to install anything on your infrastructure, and Scale's free research preview gives you access to experimentation using open source LLMs.

To start, install LLM Engine via pip:

pip install scale-llm-engine\n
"},{"location":"getting_started/#scale-api-keys","title":"Scale API Keys","text":"

Next, you need a Scale Spellbook API key.

"},{"location":"getting_started/#retrieving-your-api-key","title":"Retrieving your API Key","text":"

To retrieve your API key, head to Scale Spellbook where you will get an API key on the settings page.

Different API Keys for different Scale Products

If you have leveraged Scale's platform for annotation work in the past, please note that your Spellbook API key will be different than the Scale Annotation API key. You will want to create a Spellbook API key before getting started.

"},{"location":"getting_started/#set-your-api-key","title":"Set your API Key","text":"

LLM Engine uses environment variables to access your API key.

Set this API key as the SCALE_API_KEY environment variable by running the following command in your terminal before you run your python application.

export SCALE_API_KEY=\"[Your API key]\"\n

You can also add in the line above to your .zshrc or .bash_profile so it's automatically set for future sessions.

Alternatively, you can also set your API key using either of the following patterns:

llmengine.api_engine.api_key = \"abc\"\nllmengine.api_engine.set_api_key(\"abc\")\n
These patterns are useful for Jupyter Notebook users to set API keys without the need for using os.environ.

"},{"location":"getting_started/#example-code","title":"Example Code","text":""},{"location":"getting_started/#sample-completion","title":"Sample Completion","text":"

With your API key set, you can now send LLM Engine requests using the Python client:

from llmengine import Completion\n\nresponse = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"I'm opening a pancake restaurant that specializes in unique pancake shapes, colors, and flavors. List 3 quirky names I could name my restaurant.\",\n    max_new_tokens=100,\n    temperature=0.2,\n)\n\nprint(response.output.text)\n
"},{"location":"getting_started/#with-streaming","title":"With Streaming","text":"
import sys\nfrom llmengine import Completion\n\nstream = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"Give me a 200 word summary on the current economic events in the US.\",\n    max_new_tokens=1000,\n    temperature=0.2,\n    stream=True,\n)\n\nfor response in stream:\n    if response.output:\n        print(response.output.text, end=\"\")\n        sys.stdout.flush()\n    else: # an error occurred\n        print(response.error) # print the error message out \n        break\n
"},{"location":"integrations/","title":"Integrations","text":""},{"location":"integrations/#weights-biases","title":"Weights & Biases","text":"

LLM Engine integrates with Weights & Biases to track metrics during fine tuning. To enable:

from llmengine import FineTune\n\nresponse = FineTune.create(\n    model=\"llama-2-7b\",\n    training_file=\"s3://my-bucket/path/to/training-file.csv\",\n    validation_file=\"s3://my-bucket/path/to/validation-file.csv\",\n    hyperparameters={\"report_to\": \"wandb\"},\n    wandb_config={\"api_key\":\"key\", \"project\":\"fine-tune project\"}\n)\n

Configs to specify:

"},{"location":"model_zoo/","title":"Public Model Zoo","text":"

Scale hosts the following models in the LLM Engine Model Zoo:

| Model Name | Inference APIs Available | Fine-tuning APIs Available | Inference Frameworks Available |
| --- | --- | --- | --- |
| llama-7b | \u2705 | \u2705 | deepspeed, text-generation-inference |
| llama-2-7b | \u2705 | \u2705 | text-generation-inference, vllm |
| llama-2-7b-chat | \u2705 | | text-generation-inference, vllm |
| llama-2-13b | \u2705 | | text-generation-inference, vllm |
| llama-2-13b-chat | \u2705 | | text-generation-inference, vllm |
| llama-2-70b | \u2705 | \u2705 | text-generation-inference, vllm |
| llama-2-70b-chat | \u2705 | | text-generation-inference, vllm |
| falcon-7b | \u2705 | | text-generation-inference, vllm |
| falcon-7b-instruct | \u2705 | | text-generation-inference, vllm |
| falcon-40b | \u2705 | | text-generation-inference, vllm |
| falcon-40b-instruct | \u2705 | | text-generation-inference, vllm |
| mpt-7b | \u2705 | | deepspeed, text-generation-inference, vllm |
| mpt-7b-instruct | \u2705 | \u2705 | deepspeed, text-generation-inference, vllm |
| flan-t5-xxl | \u2705 | | deepspeed, text-generation-inference |
| mistral-7b | \u2705 | \u2705 | vllm |
| mistral-7b-instruct | \u2705 | \u2705 | vllm |
| codellama-7b | \u2705 | \u2705 | text-generation-inference, vllm |
| codellama-7b-instruct | \u2705 | \u2705 | text-generation-inference, vllm |
| codellama-13b | \u2705 | \u2705 | text-generation-inference, vllm |
| codellama-13b-instruct | \u2705 | \u2705 | text-generation-inference, vllm |
| codellama-34b | \u2705 | \u2705 | text-generation-inference, vllm |
| codellama-34b-instruct | \u2705 | \u2705 | text-generation-inference, vllm |
| zephyr-7b-alpha | \u2705 | | text-generation-inference, vllm |
| zephyr-7b-beta | \u2705 | | text-generation-inference, vllm |

"},{"location":"model_zoo/#usage","title":"Usage","text":"

Each of these models can be used with the Completion API.

The specified models can be fine-tuned with the FineTune API.

More information about the models can be found using the Model API.

"},{"location":"pricing/","title":"Pricing","text":"

LLM Engine is an open-source project and free self-hosting will always be an option.

A hosted option for LLM Engine is being offered initially as a free preview via Scale Spellbook.

"},{"location":"pricing/#self-hosted-models","title":"Self-Hosted Models","text":"

We are committed to supporting the open-source community. Self-hosting LLM Engine will remain free and open-source.

We would love contributions from the community to make this even more amazing!

"},{"location":"pricing/#hosted-models","title":"Hosted Models","text":"

Once the limited preview period has ended, billing for hosted models will be managed through the Scale Spellbook product.

Scale Spellbook leverages usage-based spending, billed to a credit card. Details on usage-based pricing will be shared with everyone before completing the limited preview.

"},{"location":"api/data_types/","title":"\ud83d\udc0d Python Client Data Type Reference","text":""},{"location":"api/data_types/#llmengine.CompletionOutput","title":"CompletionOutput","text":"

Bases: BaseModel

Represents the output of a completion request to a model.

"},{"location":"api/data_types/#llmengine.CompletionOutput.text","title":"text instance-attribute","text":"
text: str\n

The text of the completion.

"},{"location":"api/data_types/#llmengine.CompletionOutput.num_completion_tokens","title":"num_completion_tokens instance-attribute","text":"
num_completion_tokens: int\n

Number of tokens in the completion.

"},{"location":"api/data_types/#llmengine.CompletionStreamOutput","title":"CompletionStreamOutput","text":"

Bases: BaseModel

"},{"location":"api/data_types/#llmengine.CompletionStreamOutput.text","title":"text instance-attribute","text":"
text: str\n

The text of the completion.

"},{"location":"api/data_types/#llmengine.CompletionStreamOutput.finished","title":"finished instance-attribute","text":"
finished: bool\n

Whether the completion is finished.

"},{"location":"api/data_types/#llmengine.CompletionStreamOutput.num_completion_tokens","title":"num_completion_tokens class-attribute instance-attribute","text":"
num_completion_tokens: Optional[int] = None\n

Number of tokens in the completion.

"},{"location":"api/data_types/#llmengine.CompletionSyncResponse","title":"CompletionSyncResponse","text":"

Bases: BaseModel

Response object for a synchronous prompt completion.

"},{"location":"api/data_types/#llmengine.CompletionSyncResponse.request_id","title":"request_id instance-attribute","text":"
request_id: str\n

The unique ID of the corresponding Completion request. This request_id is generated on the server, and all logs associated with the request are grouped by the request_id, which allows for easier troubleshooting of errors as follows:

"},{"location":"api/data_types/#llmengine.CompletionSyncResponse.output","title":"output instance-attribute","text":"
output: CompletionOutput\n

Completion output.

"},{"location":"api/data_types/#llmengine.CompletionStreamResponse","title":"CompletionStreamResponse","text":"

Bases: BaseModel

Response object for a stream prompt completion task.

"},{"location":"api/data_types/#llmengine.CompletionStreamResponse.request_id","title":"request_id instance-attribute","text":"
request_id: str\n

The unique ID of the corresponding Completion request. This request_id is generated on the server, and all logs associated with the request are grouped by the request_id, which allows for easier troubleshooting of errors as follows:

"},{"location":"api/data_types/#llmengine.CompletionStreamResponse.output","title":"output class-attribute instance-attribute","text":"
output: Optional[CompletionStreamOutput] = None\n

Completion output.

"},{"location":"api/data_types/#llmengine.CreateFineTuneResponse","title":"CreateFineTuneResponse","text":"

Bases: BaseModel

Response object for creating a FineTune.

"},{"location":"api/data_types/#llmengine.CreateFineTuneResponse.id","title":"id class-attribute instance-attribute","text":"
id: str = Field(\n    ..., description=\"ID of the created fine-tuning job.\"\n)\n

The ID of the FineTune.

"},{"location":"api/data_types/#llmengine.GetFineTuneResponse","title":"GetFineTuneResponse","text":"

Bases: BaseModel

Response object for retrieving a FineTune.

"},{"location":"api/data_types/#llmengine.GetFineTuneResponse.id","title":"id class-attribute instance-attribute","text":"
id: str = Field(..., description=\"ID of the requested job.\")\n

The ID of the FineTune.

"},{"location":"api/data_types/#llmengine.GetFineTuneResponse.fine_tuned_model","title":"fine_tuned_model class-attribute instance-attribute","text":"
fine_tuned_model: Optional[str] = Field(\n    default=None,\n    description=\"Name of the resulting fine-tuned model. This can be plugged into the Completion API once the fine-tune is complete\",\n)\n

The name of the resulting fine-tuned model. This can be plugged into the Completion API once the fine-tune is complete.

"},{"location":"api/data_types/#llmengine.ListFineTunesResponse","title":"ListFineTunesResponse","text":"

Bases: BaseModel

Response object for listing FineTunes.

"},{"location":"api/data_types/#llmengine.ListFineTunesResponse.jobs","title":"jobs class-attribute instance-attribute","text":"
jobs: List[GetFineTuneResponse] = Field(\n    ...,\n    description=\"List of fine-tuning jobs and their statuses.\",\n)\n

A list of FineTunes, represented as GetFineTuneResponses.

"},{"location":"api/data_types/#llmengine.CancelFineTuneResponse","title":"CancelFineTuneResponse","text":"

Bases: BaseModel

Response object for cancelling a FineTune.

"},{"location":"api/data_types/#llmengine.CancelFineTuneResponse.success","title":"success class-attribute instance-attribute","text":"
success: bool = Field(\n    ..., description=\"Whether cancellation was successful.\"\n)\n

Whether the cancellation succeeded.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse","title":"GetLLMEndpointResponse","text":"

Bases: BaseModel

Response object for retrieving a Model.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.name","title":"name class-attribute instance-attribute","text":"
name: str = Field(\n    description=\"The name of the model. Use this for making inference requests to the model.\"\n)\n

The name of the model. Use this for making inference requests to the model.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.source","title":"source class-attribute instance-attribute","text":"
source: LLMSource = Field(\n    description=\"The source of the model, e.g. Hugging Face.\"\n)\n

The source of the model, e.g. Hugging Face.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.inference_framework","title":"inference_framework class-attribute instance-attribute","text":"
inference_framework: LLMInferenceFramework = Field(\n    description=\"The inference framework used by the model.\"\n)\n

(For self-hosted users) The inference framework used by the model.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.id","title":"id class-attribute instance-attribute","text":"
id: Optional[str] = Field(\n    default=None,\n    description=\"(For self-hosted users) The autogenerated ID of the model.\",\n)\n

(For self-hosted users) The autogenerated ID of the model.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.model_name","title":"model_name class-attribute instance-attribute","text":"
model_name: Optional[str] = Field(\n    default=None,\n    description=\"(For self-hosted users) For fine-tuned models, the base model. For base models, this will be the same as `name`.\",\n)\n

(For self-hosted users) For fine-tuned models, the base model. For base models, this will be the same as name.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.status","title":"status class-attribute instance-attribute","text":"
status: ModelEndpointStatus = Field(\n    description=\"The status of the model.\"\n)\n

The status of the model (can be one of \"READY\", \"UPDATE_PENDING\", \"UPDATE_IN_PROGRESS\", \"UPDATE_FAILED\", \"DELETE_IN_PROGRESS\").

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.inference_framework_tag","title":"inference_framework_tag class-attribute instance-attribute","text":"
inference_framework_tag: Optional[str] = Field(\n    default=None,\n    description=\"(For self-hosted users) The Docker image tag used to run the model.\",\n)\n

(For self-hosted users) The Docker image tag used to run the model.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.num_shards","title":"num_shards class-attribute instance-attribute","text":"
num_shards: Optional[int] = Field(\n    default=None,\n    description=\"(For self-hosted users) The number of shards.\",\n)\n

(For self-hosted users) The number of shards.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.quantize","title":"quantize class-attribute instance-attribute","text":"
quantize: Optional[Quantization] = Field(\n    default=None,\n    description=\"(For self-hosted users) The quantization method.\",\n)\n

(For self-hosted users) The quantization method.

"},{"location":"api/data_types/#llmengine.GetLLMEndpointResponse.spec","title":"spec class-attribute instance-attribute","text":"
spec: Optional[GetModelEndpointResponse] = Field(\n    default=None,\n    description=\"(For self-hosted users) Model endpoint details.\",\n)\n

(For self-hosted users) Model endpoint details.

"},{"location":"api/data_types/#llmengine.ListLLMEndpointsResponse","title":"ListLLMEndpointsResponse","text":"

Bases: BaseModel

Response object for listing Models.

"},{"location":"api/data_types/#llmengine.ListLLMEndpointsResponse.model_endpoints","title":"model_endpoints class-attribute instance-attribute","text":"
model_endpoints: List[GetLLMEndpointResponse] = Field(\n    ..., description=\"The list of models.\"\n)\n

A list of Models, represented as GetLLMEndpointResponses.

"},{"location":"api/data_types/#llmengine.DeleteLLMEndpointResponse","title":"DeleteLLMEndpointResponse","text":"

Bases: BaseModel

Response object for deleting a Model.

"},{"location":"api/data_types/#llmengine.DeleteLLMEndpointResponse.deleted","title":"deleted class-attribute instance-attribute","text":"
deleted: bool = Field(\n    ..., description=\"Whether deletion was successful.\"\n)\n

Whether the deletion succeeded.

"},{"location":"api/data_types/#llmengine.ModelDownloadRequest","title":"ModelDownloadRequest","text":"

Bases: BaseModel

Request object for downloading a model.

"},{"location":"api/data_types/#llmengine.ModelDownloadRequest.model_name","title":"model_name class-attribute instance-attribute","text":"
model_name: str = Field(\n    ..., description=\"Name of the model to download.\"\n)\n
"},{"location":"api/data_types/#llmengine.ModelDownloadRequest.download_format","title":"download_format class-attribute instance-attribute","text":"
download_format: Optional[str] = Field(\n    default=\"hugging_face\",\n    description=\"Desired return format for downloaded model weights (default=hugging_face).\",\n)\n
"},{"location":"api/data_types/#llmengine.ModelDownloadResponse","title":"ModelDownloadResponse","text":"

Bases: BaseModel

Response object for downloading a model.

"},{"location":"api/data_types/#llmengine.ModelDownloadResponse.urls","title":"urls class-attribute instance-attribute","text":"
urls: Dict[str, str] = Field(\n    ...,\n    description=\"Dictionary of (file_name, url) pairs to download the model from.\",\n)\n
"},{"location":"api/data_types/#llmengine.UploadFileResponse","title":"UploadFileResponse","text":"

Bases: BaseModel

Response object for uploading a file.

"},{"location":"api/data_types/#llmengine.UploadFileResponse.id","title":"id class-attribute instance-attribute","text":"
id: str = Field(..., description=\"ID of the uploaded file.\")\n

ID of the uploaded file.

"},{"location":"api/data_types/#llmengine.GetFileResponse","title":"GetFileResponse","text":"

Bases: BaseModel

Response object for retrieving a file.

"},{"location":"api/data_types/#llmengine.GetFileResponse.id","title":"id class-attribute instance-attribute","text":"
id: str = Field(\n    ..., description=\"ID of the requested file.\"\n)\n

ID of the requested file.

"},{"location":"api/data_types/#llmengine.GetFileResponse.filename","title":"filename class-attribute instance-attribute","text":"
filename: str = Field(..., description='File name.')\n

File name.

"},{"location":"api/data_types/#llmengine.GetFileResponse.size","title":"size class-attribute instance-attribute","text":"
size: int = Field(\n    ..., description=\"Length of the file, in characters.\"\n)\n

Length of the file, in characters.

"},{"location":"api/data_types/#llmengine.GetFileContentResponse","title":"GetFileContentResponse","text":"

Bases: BaseModel

Response object for retrieving a file's content.

"},{"location":"api/data_types/#llmengine.GetFileContentResponse.id","title":"id class-attribute instance-attribute","text":"
id: str = Field(\n    ..., description=\"ID of the requested file.\"\n)\n

ID of the requested file.

"},{"location":"api/data_types/#llmengine.GetFileContentResponse.content","title":"content class-attribute instance-attribute","text":"
content: str = Field(..., description='File content.')\n

File content.

"},{"location":"api/data_types/#llmengine.ListFilesResponse","title":"ListFilesResponse","text":"

Bases: BaseModel

Response object for listing files.

"},{"location":"api/data_types/#llmengine.ListFilesResponse.files","title":"files class-attribute instance-attribute","text":"
files: List[GetFileResponse] = Field(\n    ..., description=\"List of file IDs, names, and sizes.\"\n)\n

List of file IDs, names, and sizes.

"},{"location":"api/data_types/#llmengine.DeleteFileResponse","title":"DeleteFileResponse","text":"

Bases: BaseModel

Response object for deleting a file.

"},{"location":"api/data_types/#llmengine.DeleteFileResponse.deleted","title":"deleted class-attribute instance-attribute","text":"
deleted: bool = Field(\n    ..., description=\"Whether deletion was successful.\"\n)\n

Whether deletion was successful.

"},{"location":"api/error_handling/","title":"Error handling","text":"

LLM Engine uses conventional HTTP response codes to indicate the success or failure of an API request. In general, codes in the 2xx range indicate success. Codes in the 4xx range indicate a request that failed given the information provided (e.g. a given Model was not found, or an invalid temperature was specified). Codes in the 5xx range indicate an error with the LLM Engine servers.

In the Python client, errors are presented via a set of corresponding Exception classes, which should be caught and handled by the user accordingly.

"},{"location":"api/error_handling/#llmengine.errors.BadRequestError","title":"BadRequestError","text":"
BadRequestError(message: str)\n

Bases: Exception

Corresponds to HTTP 400. Indicates that the request had inputs that were invalid. The user should not attempt to retry the request without changing the inputs.

"},{"location":"api/error_handling/#llmengine.errors.UnauthorizedError","title":"UnauthorizedError","text":"
UnauthorizedError(message: str)\n

Bases: Exception

Corresponds to HTTP 401. This means that no valid API key was provided.

"},{"location":"api/error_handling/#llmengine.errors.NotFoundError","title":"NotFoundError","text":"
NotFoundError(message: str)\n

Bases: Exception

Corresponds to HTTP 404. This means that the resource (e.g. a Model, FineTune, etc.) could not be found. Note that this can also be returned in some cases where the object might exist, but the user does not have access to the object. This is done to avoid leaking information about the existence or nonexistence of said object that the user does not have access to.

"},{"location":"api/error_handling/#llmengine.errors.RateLimitExceededError","title":"RateLimitExceededError","text":"
RateLimitExceededError(message: str)\n

Bases: Exception

Corresponds to HTTP 429. Too many requests hit the API too quickly. We recommend an exponential backoff for retries.
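
The backoff recommendation above can be sketched roughly as follows. This is a minimal illustration, not part of the client: `call_with_backoff` and its parameters are hypothetical names, and the local `RateLimitExceededError` class is only a stand-in so the sketch runs without the library installed. In real code you would presumably pass the class from `llmengine.errors` instead.

```python
import random
import time


class RateLimitExceededError(Exception):
    """Local stand-in for llmengine.errors.RateLimitExceededError (assumption)."""


def call_with_backoff(fn, retry_on=(RateLimitExceededError,), max_attempts=5,
                      base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and jitter on retry_on errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The delay doubles on every attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter spreads out retries from concurrent clients.
            sleep(delay * random.uniform(0.5, 1.0))
```

A call such as `call_with_backoff(lambda: Completion.create(model="llama-2-7b", prompt="Hello"))` would then retry only on rate-limit errors. Errors like BadRequestError are deliberately not retried, since the request would fail again with the same inputs.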

"},{"location":"api/error_handling/#llmengine.errors.ServerError","title":"ServerError","text":"
ServerError(status_code: int, message: str)\n

Bases: Exception

Corresponds to HTTP 5xx errors on the server.

"},{"location":"api/langchain/","title":"\ud83e\udd9c Langchain","text":"

Coming soon!

"},{"location":"api/python_client/","title":"\ud83d\udc0d Python Client API Reference","text":""},{"location":"api/python_client/#llmengine.Completion","title":"Completion","text":"

Bases: APIEngine

Completion API. This API is used to generate text completions.

Language models are trained to understand natural language and predict text outputs as a response to their inputs. The inputs are called prompts and the outputs are referred to as completions. LLMs take the input prompts and chunk them into smaller units called tokens to process and generate language. Tokens may include trailing spaces and even sub-words; this process is language dependent.

The Completion API can be run either synchronously or asynchronously (via Python asyncio). In either mode, you can also choose whether to stream token responses.

"},{"location":"api/python_client/#llmengine.Completion.create","title":"create classmethod","text":"
create(\n    model: str,\n    prompt: str,\n    max_new_tokens: int = 20,\n    temperature: float = 0.2,\n    stop_sequences: Optional[List[str]] = None,\n    return_token_log_probs: Optional[bool] = False,\n    presence_penalty: Optional[float] = None,\n    frequency_penalty: Optional[float] = None,\n    top_k: Optional[int] = None,\n    top_p: Optional[float] = None,\n    timeout: int = COMPLETION_TIMEOUT,\n    stream: bool = False,\n) -> Union[\n    CompletionSyncResponse,\n    Iterator[CompletionStreamResponse],\n]\n

Creates a completion for the provided prompt and parameters synchronously.

This API can be used to get the LLM to generate a completion synchronously. It takes as parameters the model (see Model Zoo) and the prompt. Optionally it takes max_new_tokens, temperature, timeout and stream. It returns a CompletionSyncResponse if stream=False or an iterator of CompletionStreamResponse with request_id and output fields if stream=True.

Parameters:

Name Type Description Default model str

Name of the model to use. See Model Zoo for a list of Models that are supported.

required prompt str

The prompt to generate completions for, encoded as a string.

required max_new_tokens int

The maximum number of tokens to generate in the completion.

The token count of your prompt plus max_new_tokens cannot exceed the model's context length. See Model Zoo for information on each supported model's context length.

20 temperature float

What sampling temperature to use, in the range [0, 1]. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. When temperature is 0 greedy search is used.

0.2 stop_sequences Optional[List[str]]

One or more sequences where the API will stop generating tokens for the current completion.

None return_token_log_probs Optional[bool]

Whether to return the log probabilities of generated tokens. When True, the response will include a list of tokens and their log probabilities.

False presence_penalty Optional[float]

Only supported in vllm and lightllm. Penalizes new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. See https://platform.openai.com/docs/guides/gpt/parameter-details. Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None frequency_penalty Optional[float]

Only supported in vllm and lightllm. Penalizes new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. See https://platform.openai.com/docs/guides/gpt/parameter-details. Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None top_k Optional[int]

Integer that controls the number of top tokens to consider. Range: [1, infinity). -1 means consider all tokens.

None top_p Optional[float]

Float that controls the cumulative probability of the top tokens to consider. Range: (0.0, 1.0]. 1.0 means consider all tokens.

None timeout int

Timeout in seconds. This is the maximum amount of time you are willing to wait for a response.

COMPLETION_TIMEOUT stream bool

Whether to stream the response. If true, the return type is an Iterator[CompletionStreamResponse]. Otherwise, the return type is a CompletionSyncResponse. When streaming, tokens will be sent as data-only server-sent events.

False

Returns:

Name Type Description response Union[CompletionSyncResponse, Iterator[CompletionStreamResponse]]

The generated response (if stream=False) or iterator of response chunks (if stream=True)

Synchronous completion without token streaming in PythonResponse in JSON
from llmengine import Completion\n\nresponse = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"Hello, my name is\",\n    max_new_tokens=10,\n    temperature=0.2,\n)\nprint(response.json())\n
{\n    \"request_id\": \"8bbd0e83-f94c-465b-a12b-aabad45750a9\",\n    \"output\": {\n        \"text\": \"_______ and I am a _______\",\n        \"num_completion_tokens\": 10\n    }\n}\n

Token streaming can be used to reduce perceived latency for applications. Here is how applications can use streaming:

Synchronous completion with token streaming in PythonResponse in JSON
from llmengine import Completion\n\nstream = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"why is the sky blue?\",\n    max_new_tokens=5,\n    temperature=0.2,\n    stream=True,\n)\n\nfor response in stream:\n    if response.output:\n        print(response.json())\n
{\"request_id\": \"ebbde00c-8c31-4c03-8306-24f37cd25fa2\", \"output\": {\"text\": \"\\n\", \"finished\": false, \"num_completion_tokens\": 1 } }\n{\"request_id\": \"ebbde00c-8c31-4c03-8306-24f37cd25fa2\", \"output\": {\"text\": \"I\", \"finished\": false, \"num_completion_tokens\": 2 } }\n{\"request_id\": \"ebbde00c-8c31-4c03-8306-24f37cd25fa2\", \"output\": {\"text\": \" don\", \"finished\": false, \"num_completion_tokens\": 3 } }\n{\"request_id\": \"ebbde00c-8c31-4c03-8306-24f37cd25fa2\", \"output\": {\"text\": \"\u2019\", \"finished\": false, \"num_completion_tokens\": 4 } }\n{\"request_id\": \"ebbde00c-8c31-4c03-8306-24f37cd25fa2\", \"output\": {\"text\": \"t\", \"finished\": true, \"num_completion_tokens\": 5 } }\n
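
Judging from the sample response above, each streamed chunk appears to carry only the newly generated text, so a caller has to concatenate the chunks to recover the full completion. The sketch below illustrates that on JSON lines shaped like the sample. `assemble_stream` and the shortened `req-1` request ID are hypothetical, and the delta behaviour is an assumption drawn from the sample rather than a documented guarantee.

```python
import json


def assemble_stream(lines):
    """Concatenate the text of streamed chunks until one is marked finished."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        output = chunk.get("output")
        if not output:
            continue  # skip chunks that carry no output
        parts.append(output["text"])
        if output.get("finished"):
            break  # ignore anything after the final chunk
    return "".join(parts)


# Chunks shaped like the sample response above (request ID shortened).
sample_lines = [
    r'{"request_id": "req-1", "output": {"text": "\n", "finished": false, "num_completion_tokens": 1}}',
    r'{"request_id": "req-1", "output": {"text": "I", "finished": false, "num_completion_tokens": 2}}',
    r'{"request_id": "req-1", "output": {"text": " don", "finished": false, "num_completion_tokens": 3}}',
    r'{"request_id": "req-1", "output": {"text": "\u2019", "finished": false, "num_completion_tokens": 4}}',
    r'{"request_id": "req-1", "output": {"text": "t", "finished": true, "num_completion_tokens": 5}}',
]

print(repr(assemble_stream(sample_lines)))
```

With the Python client, the same idea would apply to the `response.output.text` of each item yielded by the stream, without the JSON parsing step.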
"},{"location":"api/python_client/#llmengine.Completion.acreate","title":"acreate async classmethod","text":"
acreate(\n    model: str,\n    prompt: str,\n    max_new_tokens: int = 20,\n    temperature: float = 0.2,\n    stop_sequences: Optional[List[str]] = None,\n    return_token_log_probs: Optional[bool] = False,\n    presence_penalty: Optional[float] = None,\n    frequency_penalty: Optional[float] = None,\n    top_k: Optional[int] = None,\n    top_p: Optional[float] = None,\n    timeout: int = COMPLETION_TIMEOUT,\n    stream: bool = False,\n) -> Union[\n    CompletionSyncResponse,\n    AsyncIterable[CompletionStreamResponse],\n]\n

Creates a completion for the provided prompt and parameters asynchronously (with asyncio).

This API can be used to get the LLM to generate a completion asynchronously. It takes as parameters the model (see Model Zoo) and the prompt. Optionally it takes max_new_tokens, temperature, timeout and stream. It returns a CompletionSyncResponse if stream=False or an async iterator of CompletionStreamResponse with request_id and output fields if stream=True.

Parameters:

Name Type Description Default model str

Name of the model to use. See Model Zoo for a list of Models that are supported.

required prompt str

The prompt to generate completions for, encoded as a string.

required max_new_tokens int

The maximum number of tokens to generate in the completion.

The token count of your prompt plus max_new_tokens cannot exceed the model's context length. See Model Zoo for information on each supported model's context length.

20 temperature float

What sampling temperature to use, in the range [0, 1]. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. When temperature is 0 greedy search is used.

0.2 stop_sequences Optional[List[str]]

One or more sequences where the API will stop generating tokens for the current completion.

None return_token_log_probs Optional[bool]

Whether to return the log probabilities of generated tokens. When True, the response will include a list of tokens and their log probabilities.

False presence_penalty Optional[float]

Only supported in vllm and lightllm. Penalizes new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. See https://platform.openai.com/docs/guides/gpt/parameter-details. Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None frequency_penalty Optional[float]

Only supported in vllm and lightllm. Penalizes new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. See https://platform.openai.com/docs/guides/gpt/parameter-details. Range: [0.0, 2.0]. Higher values encourage the model to use new tokens.

None top_k Optional[int]

Integer that controls the number of top tokens to consider. Range: [1, infinity). -1 means consider all tokens.

None top_p Optional[float]

Float that controls the cumulative probability of the top tokens to consider. Range: (0.0, 1.0]. 1.0 means consider all tokens.

None timeout int

Timeout in seconds. This is the maximum amount of time you are willing to wait for a response.

COMPLETION_TIMEOUT stream bool

Whether to stream the response. If true, the return type is an AsyncIterable[CompletionStreamResponse]. Otherwise, the return type is a CompletionSyncResponse. When streaming, tokens will be sent as data-only server-sent events.

False

Returns:

Name Type Description response Union[CompletionSyncResponse, AsyncIterable[CompletionStreamResponse]]

The generated response (if stream=False) or iterator of response chunks (if stream=True)

Asynchronous completion without token streaming in PythonResponse in JSON
import asyncio\nfrom llmengine import Completion\n\nasync def main():\n    response = await Completion.acreate(\n        model=\"llama-2-7b\",\n        prompt=\"Hello, my name is\",\n        max_new_tokens=10,\n        temperature=0.2,\n    )\n    print(response.json())\n\nasyncio.run(main())\n
{\n    \"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\",\n    \"output\": {\n        \"text\": \"_______ and I am a _______\",\n        \"num_completion_tokens\": 10\n    }\n}\n

Token streaming can be used to reduce perceived latency for applications. Here is how applications can use streaming:

Asynchronous completion with token streaming in PythonResponse in JSON
import asyncio\nfrom llmengine import Completion\n\nasync def main():\n    stream = await Completion.acreate(\n        model=\"llama-2-7b\",\n        prompt=\"why is the sky blue?\",\n        max_new_tokens=5,\n        temperature=0.2,\n        stream=True,\n    )\n\n    async for response in stream:\n        if response.output:\n            print(response.json())\n\nasyncio.run(main())\n
{\"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\", \"output\": {\"text\": \"\\n\", \"finished\": false, \"num_completion_tokens\": 1}}\n{\"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\", \"output\": {\"text\": \"I\", \"finished\": false, \"num_completion_tokens\": 2}}\n{\"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\", \"output\": {\"text\": \" think\", \"finished\": false, \"num_completion_tokens\": 3}}\n{\"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\", \"output\": {\"text\": \" the\", \"finished\": false, \"num_completion_tokens\": 4}}\n{\"request_id\": \"9cfe4d5a-f86f-4094-a935-87f871d90ec0\", \"output\": {\"text\": \" sky\", \"finished\": true, \"num_completion_tokens\": 5}}\n
"},{"location":"api/python_client/#llmengine.FineTune","title":"FineTune","text":"

Bases: APIEngine

FineTune API. This API is used to fine-tune models.

Fine-tuning is a process where the LLM is further trained on a task-specific dataset, allowing the model to adjust its parameters to better align with the task at hand. Fine-tuning is a supervised training phase, where prompt/response pairs are provided to optimize the performance of the LLM. LLM Engine currently uses LoRA for fine-tuning. Support for additional fine-tuning methods is upcoming.

LLM Engine provides APIs to create fine-tunes on a base model with training & validation datasets. APIs are also provided to list, cancel and retrieve fine-tuning jobs.

Creating a fine-tune will end with the creation of a Model, which you can view using Model.get(model_name) or delete using Model.delete(model_name).

"},{"location":"api/python_client/#llmengine.FineTune.create","title":"create classmethod","text":"
create(\n    model: str,\n    training_file: str,\n    validation_file: Optional[str] = None,\n    hyperparameters: Optional[\n        Dict[str, Union[str, int, float]]\n    ] = None,\n    wandb_config: Optional[Dict[str, Any]] = None,\n    suffix: Optional[str] = None,\n) -> CreateFineTuneResponse\n

Creates a job that fine-tunes a specified model with a given dataset.

This API can be used to fine-tune a model. The model is the name of the base model to fine-tune (see Model Zoo for available models). The training and validation files should consist of prompt and response pairs. training_file and validation_file must be either publicly accessible HTTP or HTTPS URLs, or file IDs of files uploaded to LLM Engine's Files API (these will have the file- prefix). The referenced files must be CSV files that include two columns: prompt and response. A maximum of 100,000 rows of data is currently supported. At least 200 rows of data are recommended to start seeing benefits from fine-tuning. Sequences longer than the native max_seq_length of the model will be truncated.

A fine-tuning job can take roughly 30 minutes for a small dataset (~200 rows) and several hours for larger ones.
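
Before submitting a job, it can help to check a dataset against the constraints described above. The following is a minimal local sketch, not something the client provides: `check_fine_tune_csv` is a hypothetical helper, and the two limits are taken directly from the text above.

```python
import csv
import io

MAX_ROWS = 100_000          # documented maximum number of rows
RECOMMENDED_MIN_ROWS = 200  # documented recommendation, not a hard limit


def check_fine_tune_csv(text):
    """Return (row_count, warnings) for CSV text; raise ValueError if unusable."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = {"prompt", "response"} - set(columns)
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")

    # Count data rows; the header row is consumed by DictReader.
    row_count = sum(1 for _ in reader)
    if row_count > MAX_ROWS:
        raise ValueError(f"{row_count} rows exceeds the {MAX_ROWS} row limit")

    warnings = []
    if row_count < RECOMMENDED_MIN_ROWS:
        warnings.append(
            f"only {row_count} rows; at least {RECOMMENDED_MIN_ROWS} are recommended"
        )
    return row_count, warnings
```

For a file on disk, `check_fine_tune_csv(open(path, newline="").read())` would run the same checks before the file is uploaded or published.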

Parameters:

Name Type Description Default model `str`

The name of the base model to fine-tune. See Model Zoo for the list of available models to fine-tune.

required training_file `str`

Publicly accessible URL or file ID referencing a CSV file for training. When no validation_file is provided, one will automatically be created using a 10% split of the training_file data.

required validation_file `Optional[str]`

Publicly accessible URL or file ID referencing a CSV file for validation. The validation file is used to compute metrics which let LLM Engine pick the best fine-tuned checkpoint, which will be used for inference when fine-tuning is complete.

None hyperparameters `Optional[Dict[str, Union[str, int, float, Dict[str, Any]]]]`

A dict of hyperparameters to customize fine-tuning behavior.

Currently supported hyperparameters:

None wandb_config `Optional[Dict[str, Any]]`

A dict of configuration parameters for Weights & Biases. See Weights & Biases for more information. Set hyperparameters[\"report_to\"] to wandb to enable automatic fine-tune metrics logging. Must include an api_key field containing the wandb API key. Also supports setting base_url to use a custom Weights & Biases server.

None suffix `Optional[str]`

A string that will be added to your fine-tuned model name. If present, the entire fine-tuned model name will be formatted like \"[model].[suffix].[YYMMDD-HHMMSS]\". If absent, the fine-tuned model name will be formatted \"[model].[YYMMDD-HHMMSS]\". For example, if suffix is \"my-experiment\", the fine-tuned model name could be \"llama-2-7b.my-experiment.230717-230150\". Note: suffix must be between 1 and 28 characters long, and can only contain alphanumeric characters and hyphens.

None

Returns:

Name Type Description CreateFineTuneResponse CreateFineTuneResponse

an object that contains the ID of the created fine-tuning job
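
The suffix rules and name format described for the suffix parameter can be illustrated with a short sketch. Both helpers are hypothetical and purely local. The actual name, including its timestamp, is assigned by LLM Engine, so `expected_model_name` only shows the documented shape of the result.

```python
import re
from datetime import datetime

# 1 to 28 characters, alphanumeric or hyphen, per the suffix description above.
SUFFIX_PATTERN = re.compile(r"[A-Za-z0-9-]{1,28}")


def is_valid_suffix(suffix):
    """Return True if suffix satisfies the documented length and character rules."""
    return SUFFIX_PATTERN.fullmatch(suffix) is not None


def expected_model_name(model, suffix=None, when=None):
    """Build a name shaped like [model].[suffix].[YYMMDD-HHMMSS] (illustrative only)."""
    when = when or datetime.now()
    stamp = when.strftime("%y%m%d-%H%M%S")
    parts = [model, suffix, stamp] if suffix else [model, stamp]
    return ".".join(parts)
```

Checking the suffix locally before calling FineTune.create avoids a round trip for a request that would presumably be rejected.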

Here is an example script to create a 5-row CSV of properly formatted data for fine-tuning an airline question answering bot:

Formatting data in Python
import csv\n# Define data\ndata = [\n  (\"What is your policy on carry-on luggage?\", \"Our policy allows each passenger to bring one piece of carry-on luggage and one personal item such as a purse or briefcase. The maximum size for carry-on luggage is 22 x 14 x 9 inches.\"),\n  (\"How can I change my flight?\", \"You can change your flight through our website or mobile app. Go to 'Manage my booking' section, enter your booking reference and last name, then follow the prompts to change your flight.\"),\n  (\"What meals are available on my flight?\", \"We offer a variety of meals depending on the flight's duration and route. These can range from snacks and light refreshments to full-course meals on long-haul flights. Specific meal options can be viewed during the booking process.\"),\n  (\"How early should I arrive at the airport before my flight?\", \"We recommend arriving at least two hours before domestic flights and three hours before international flights.\"),\n  (\"Can I select my seat in advance?\", \"Yes, you can select your seat during the booking process or afterwards via the 'Manage my booking' section on our website or mobile app.\"),\n  ]\n\n# Write data to a CSV file\nwith open('customer_service_data.csv', 'w', newline='') as file:\n    writer = csv.writer(file)\n    writer.writerow([\"prompt\", \"response\"])\n    writer.writerows(data)\n

Currently, data needs to be uploaded to either a publicly accessible web URL or to LLM Engine's private file server so that it can be read for fine-tuning. Publicly accessible HTTP and HTTPS URLs are currently supported.

To privately share data with the LLM Engine API, use LLM Engine's File.upload API. You can upload data from a local file to LLM Engine's private file server and then use the returned file ID to reference your data in the FineTune API. The file ID is generally of the form file-<random_string>, e.g. \"file-7DLVeLdN2Ty4M2m\".

Example code for fine-tuning:

Fine-tuning in PythonResponse in JSON
from llmengine import FineTune\n\nresponse = FineTune.create(\n    model=\"llama-2-7b\",\n    training_file=\"file-7DLVeLdN2Ty4M2m\",\n)\n\nprint(response.json())\n
{\n    \"fine_tune_id\": \"ft-cir3eevt71r003ks6il0\"\n}\n
"},{"location":"api/python_client/#llmengine.FineTune.get","title":"get classmethod","text":"
get(fine_tune_id: str) -> GetFineTuneResponse\n

Get status of a fine-tuning job.

This API can be used to get the status of an already running fine-tuning job. It takes as a single parameter the fine_tune_id and returns a GetFineTuneResponse object with the id and status (PENDING, STARTED, UNDEFINED, FAILURE or SUCCESS).

Parameters:

Name Type Description Default fine_tune_id `str`

ID of the fine-tuning job

required

Returns:

Name Type Description GetFineTuneResponse GetFineTuneResponse

an object that contains the ID and status of the requested job

Getting status of fine-tuning in PythonResponse in JSON
from llmengine import FineTune\n\nresponse = FineTune.get(\n    fine_tune_id=\"ft-cir3eevt71r003ks6il0\",\n)\n\nprint(response.json())\n
{\n    \"fine_tune_id\": \"ft-cir3eevt71r003ks6il0\",\n    \"status\": \"STARTED\"\n}\n
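
Since a fine-tuning job can run for a long time, callers typically poll this API until the job settles. The sketch below shows one way to do that, with the status lookup passed in as a callable so that it runs without the client. `wait_for_fine_tune` and its parameters are hypothetical, and treating only SUCCESS and FAILURE as terminal is an assumption based on the statuses listed above.

```python
import time

# Assumption: of the documented statuses, only these two are final.
TERMINAL_STATUSES = {"SUCCESS", "FAILURE"}


def wait_for_fine_tune(get_status, poll_seconds=60.0, max_polls=240,
                       sleep=time.sleep):
    """Poll get_status() until a terminal status is seen or max_polls is reached."""
    status = None
    for attempt in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before asking again
    return status  # last observed status; the job may still be running
```

With the client, `get_status` could be something like `lambda: FineTune.get(fine_tune_id="ft-cir3eevt71r003ks6il0").status`, assuming the response exposes the status shown in the JSON above as an attribute that compares equal to these strings.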
"},{"location":"api/python_client/#llmengine.FineTune.get_events","title":"get_events classmethod","text":"
get_events(fine_tune_id: str) -> GetFineTuneEventsResponse\n

Get events of a fine-tuning job.

This API can be used to get the list of detailed events for a fine-tuning job. It takes the fine_tune_id as a parameter and returns a response object containing the list of events that have occurred for the fine-tuning job. Two kinds of events are logged periodically: an evaluation of the training loss, and an evaluation of the eval loss. This API returns all events for the fine-tuning job.

Parameters:

Name Type Description Default fine_tune_id `str`

ID of the fine-tuning job

required

Returns:

Name Type Description GetFineTuneEventsResponse GetFineTuneEventsResponse

an object that contains the list of events for the fine-tuning job

Getting events for fine-tuning jobs in PythonResponse in JSON
from llmengine import FineTune\n\nresponse = FineTune.get_events(fine_tune_id=\"ft-cir3eevt71r003ks6il0\")\nprint(response.json())\n
{\n    \"events\":\n    [\n        {\n            \"timestamp\": 1689665099.6704428,\n            \"message\": \"{'loss': 2.108, 'learning_rate': 0.002, 'epoch': 0.7}\",\n            \"level\": \"info\"\n        },\n        {\n            \"timestamp\": 1689665100.1966307,\n            \"message\": \"{'eval_loss': 1.67730712890625, 'eval_runtime': 0.2023, 'eval_samples_per_second': 24.717, 'eval_steps_per_second': 4.943, 'epoch': 0.7}\",\n            \"level\": \"info\"\n        },\n        {\n            \"timestamp\": 1689665105.6544185,\n            \"message\": \"{'loss': 1.8961, 'learning_rate': 0.0017071067811865474, 'epoch': 1.39}\",\n            \"level\": \"info\"\n        },\n        {\n            \"timestamp\": 1689665106.159139,\n            \"message\": \"{'eval_loss': 1.513688564300537, 'eval_runtime': 0.2025, 'eval_samples_per_second': 24.696, 'eval_steps_per_second': 4.939, 'epoch': 1.39}\",\n            \"level\": \"info\"\n        }\n    ]\n}\n
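
In the sample response above, each event's message is a string that looks like a printed Python dict rather than JSON. Under that assumption, the training and eval losses can be pulled out with `ast.literal_eval`, as in this sketch. `extract_losses` is a hypothetical helper, and the message format is inferred from the sample only.

```python
import ast


def extract_losses(events):
    """Split events into lists of (epoch, loss) and (epoch, eval_loss) pairs."""
    train, evals = [], []
    for event in events:
        try:
            # Safely parse the dict-like message without executing code.
            metrics = ast.literal_eval(event["message"])
        except (ValueError, SyntaxError):
            continue  # free-form message, not a metrics dict
        if not isinstance(metrics, dict):
            continue
        if "loss" in metrics:
            train.append((metrics.get("epoch"), metrics["loss"]))
        if "eval_loss" in metrics:
            evals.append((metrics.get("epoch"), metrics["eval_loss"]))
    return train, evals


# Two events copied from the sample response above.
sample_events = [
    {"timestamp": 1689665099.6704428,
     "message": "{'loss': 2.108, 'learning_rate': 0.002, 'epoch': 0.7}",
     "level": "info"},
    {"timestamp": 1689665100.1966307,
     "message": "{'eval_loss': 1.67730712890625, 'eval_runtime': 0.2023, "
                "'eval_samples_per_second': 24.717, 'eval_steps_per_second': 4.943, "
                "'epoch': 0.7}",
     "level": "info"},
]

train_losses, eval_losses = extract_losses(sample_events)
```

The resulting pairs can be plotted or compared across epochs to see whether the eval loss is still improving.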
"},{"location":"api/python_client/#llmengine.FineTune.list","title":"list classmethod","text":"
list() -> ListFineTunesResponse\n

List fine-tuning jobs.

This API can be used to list all the fine-tuning jobs. It returns a list of pairs of fine_tune_id and status for all existing jobs.

Returns:

Name Type Description ListFineTunesResponse ListFineTunesResponse

an object that contains a list of all fine-tuning jobs and their statuses

Listing fine-tuning jobs in PythonResponse in JSON
from llmengine import FineTune\n\nresponse = FineTune.list()\nprint(response.json())\n
{\n    \"jobs\": [\n        {\n            \"fine_tune_id\": \"ft-cir3eevt71r003ks6il0\",\n            \"status\": \"STARTED\"\n        },\n        {\n            \"fine_tune_id\": \"ft_def456\",\n            \"status\": \"SUCCESS\"\n        }\n    ]\n}\n
"},{"location":"api/python_client/#llmengine.FineTune.cancel","title":"cancel classmethod","text":"
cancel(fine_tune_id: str) -> CancelFineTuneResponse\n

Cancel a fine-tuning job.

This API can be used to cancel an existing fine-tuning job if it's no longer required. It takes the fine_tune_id as a parameter and returns a response object which has a success field confirming if the cancellation was successful.

Parameters:

Name Type Description Default fine_tune_id `str`

ID of the fine-tuning job

required

Returns:

Name Type Description CancelFineTuneResponse CancelFineTuneResponse

an object that contains whether the cancellation was successful

Cancelling fine-tuning job in PythonResponse in JSON
from llmengine import FineTune\n\nresponse = FineTune.cancel(fine_tune_id=\"ft-cir3eevt71r003ks6il0\")\nprint(response.json())\n
{\n    \"success\": true\n}\n
"},{"location":"api/python_client/#llmengine.Model","title":"Model","text":"

Bases: APIEngine

Model API. This API is used to get, list, and delete models. Models include both base models built into LLM Engine, and fine-tuned models that you create through the FineTune.create() API.

See Model Zoo for the list of publicly available base models.

"},{"location":"api/python_client/#llmengine.Model.create","title":"create classmethod","text":"
create(\n    name: str,\n    model: str,\n    inference_framework_image_tag: str,\n    source: LLMSource = LLMSource.HUGGING_FACE,\n    inference_framework: LLMInferenceFramework = LLMInferenceFramework.VLLM,\n    num_shards: int = 1,\n    quantize: Optional[Quantization] = None,\n    checkpoint_path: Optional[str] = None,\n    cpus: int = 8,\n    memory: str = \"24Gi\",\n    storage: str = \"40Gi\",\n    gpus: int = 1,\n    min_workers: int = 0,\n    max_workers: int = 1,\n    per_worker: int = 2,\n    endpoint_type: ModelEndpointType = ModelEndpointType.STREAMING,\n    gpu_type: Optional[str] = \"nvidia-ampere-a10\",\n    high_priority: Optional[bool] = False,\n    post_inference_hooks: Optional[\n        List[PostInferenceHooks]\n    ] = None,\n    default_callback_url: Optional[str] = None,\n    public_inference: Optional[bool] = True,\n    labels: Optional[Dict[str, str]] = None,\n) -> CreateLLMEndpointResponse\n

Create an LLM model. Note: This API is only available for self-hosted users.

Parameters:

Name Type Description Default name `str`

Name of the endpoint

required model `str`

Name of the base model

required inference_framework_image_tag `str`

Image tag for the inference framework

required source `LLMSource`

Source of the LLM. Currently only HuggingFace is supported

HUGGING_FACE inference_framework `LLMInferenceFramework`

Inference framework for the LLM. Currently supported frameworks are LLMInferenceFramework.DEEPSPEED, LLMInferenceFramework.TEXT_GENERATION_INFERENCE, LLMInferenceFramework.VLLM and LLMInferenceFramework.LIGHTLLM

VLLM num_shards `int`

Number of shards for the LLM. When greater than 1, the LLM will be sharded across multiple GPUs. The number of GPUs must be equal to or larger than num_shards.

1 quantize `Optional[Quantization]`

Quantization method for the LLM. text_generation_inference supports bitsandbytes and vllm supports awq.

None checkpoint_path `Optional[str]`

Remote path to the checkpoint for the LLM. LLM Engine must have permission to access the given path. Can be either a folder or a tar file. A folder is preferred since it does not need to be untarred, so the model loads faster. For model weights, safetensors are preferred but PyTorch checkpoints are also accepted (model loading will take longer).

None cpus `int`

Number of cpus each worker should get, e.g. 1, 2, etc. This must be greater than or equal to 1. We recommend setting it to 8 * GPU count.

8 memory `str`

Amount of memory each worker should get, e.g. \"4Gi\", \"512Mi\", etc. This must be a positive amount of memory. We recommend setting it to 24Gi * GPU count.

'24Gi' storage `str`

Amount of local ephemeral storage each worker should get, e.g. \"4Gi\", \"512Mi\", etc. This must be a positive amount of storage. We recommend 40Gi for 7B models, 80Gi for 13B models and 200Gi for 70B models.

'40Gi' gpus `int`

Number of gpus each worker should get, e.g. 0, 1, etc.

1 min_workers `int`

The minimum number of workers. Must be greater than or equal to 0. This should be determined by computing the minimum throughput of your workload and dividing it by the throughput of a single worker. When this number is 0, max_workers must be 1, and the endpoint will autoscale between 0 and 1 pods. When this number is greater than 0, max_workers can be any number greater than or equal to min_workers.

0 max_workers `int`

The maximum number of workers. Must be greater than or equal to 0, and also greater than or equal to min_workers. This should be determined by computing the maximum throughput of your workload and dividing it by the throughput of a single worker.

1 per_worker `int`

The maximum number of concurrent requests that an individual worker can service. LLM engine automatically scales the number of workers for the endpoint so that each worker is processing per_worker requests, subject to the limits defined by min_workers and max_workers. If the average number of concurrent requests per worker is lower than per_worker, the number of workers is reduced. If it is higher than per_worker, the number of workers is increased to meet the elevated traffic. Here is our recommendation for computing per_worker: 1. Compute min_workers and max_workers per your minimum and maximum throughput requirements. 2. Determine a value for the maximum number of concurrent requests in the workload and divide it by max_workers. Doing this ensures that the number of workers will \"climb\" to max_workers.

2 endpoint_type `ModelEndpointType`

Currently only \"streaming\" endpoints are supported.

STREAMING gpu_type `Optional[str]`

If specifying a non-zero number of gpus, this controls the type of gpu requested. Here are the supported values:

'nvidia-ampere-a10' high_priority `Optional[bool]`

Either True or False. Enabling this will allow the created endpoint to leverage the shared pool of prewarmed nodes for faster spinup time

False post_inference_hooks `Optional[List[PostInferenceHooks]]`

List of hooks to trigger after inference tasks are served

None default_callback_url `Optional[str]`

The default callback url to use for sync completion requests. This can be overridden in the task parameters for each individual task. post_inference_hooks must contain \"callback\" for the callback to be triggered

None public_inference `Optional[bool]`

If True, this endpoint will be available to all user IDs for inference

True labels `Optional[Dict[str, str]]`

An optional dictionary of key/value pairs to associate with this endpoint

None

Returns: CreateLLMEndpointResponse: creation task ID of the created Model. Currently not used.
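The sizing rules of thumb in the parameter table can be sketched as plain arithmetic. This is only an illustration under stated assumptions: the helper names `recommend_resources` and `recommend_workers` are hypothetical and not part of the llmengine client, throughput is assumed to be measured in requests per second, and the rounding choices (round worker counts up, round per_worker down) are assumptions the table does not spell out.

```python
import math

# Storage recommendations quoted in the parameter table, keyed by model size.
STORAGE_BY_MODEL_SIZE = {"7b": "40Gi", "13b": "80Gi", "70b": "200Gi"}


def recommend_resources(gpus: int, model_size: str) -> dict:
    """Apply the per-GPU rules of thumb: 8 cpus and 24Gi of memory per GPU."""
    if gpus < 1:
        raise ValueError("this sketch only covers GPU-backed endpoints")
    return {
        "gpus": gpus,
        "cpus": 8 * gpus,
        "memory": f"{24 * gpus}Gi",
        "storage": STORAGE_BY_MODEL_SIZE[model_size],
    }


def recommend_workers(
    min_rps: float,
    max_rps: float,
    worker_rps: float,
    max_concurrent_requests: int,
) -> dict:
    """Derive min_workers, max_workers and per_worker from workload throughput."""
    # Workload throughput divided by single-worker throughput, rounded up (assumption).
    min_workers = math.ceil(min_rps / worker_rps)
    max_workers = max(1, min_workers, math.ceil(max_rps / worker_rps))
    if min_workers == 0:
        # The table states that max_workers must be 1 when min_workers is 0.
        max_workers = 1
    # Peak concurrency divided by max_workers, rounded down (assumption) so that
    # the autoscaler can still climb to max_workers at peak load.
    per_worker = max(1, max_concurrent_requests // max_workers)
    return {
        "min_workers": min_workers,
        "max_workers": max_workers,
        "per_worker": per_worker,
    }
```

For example, `recommend_resources(2, "13b")` reproduces the cpus, memory and storage values used in the Llama 2 13B example.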

Create Llama 2 7B model in PythonCreate Llama 2 13B model in PythonCreate Llama 2 70B model with 8bit quantization in Python
from llmengine import Model\nfrom llmengine.data_types import LLMInferenceFramework, ModelEndpointType\n\nresponse = Model.create(\n    name=\"llama-2-7b-test\",\n    model=\"llama-2-7b\",\n    inference_framework_image_tag=\"0.2.1.post1\",\n    inference_framework=LLMInferenceFramework.VLLM,\n    num_shards=1,\n    checkpoint_path=\"s3://path/to/checkpoint\",\n    cpus=8,\n    memory=\"24Gi\",\n    storage=\"40Gi\",\n    gpus=1,\n    min_workers=0,\n    max_workers=1,\n    per_worker=10,\n    endpoint_type=ModelEndpointType.STREAMING,\n    gpu_type=\"nvidia-ampere-a10\",\n    public_inference=False,\n)\n\nprint(response.json())\n
from llmengine import Model\nfrom llmengine.data_types import LLMInferenceFramework, ModelEndpointType\n\nresponse = Model.create(\n    name=\"llama-2-13b-test\",\n    model=\"llama-2-13b\",\n    inference_framework_image_tag=\"0.2.1.post1\",\n    inference_framework=LLMInferenceFramework.VLLM,\n    num_shards=2,\n    checkpoint_path=\"s3://path/to/checkpoint\",\n    cpus=16,\n    memory=\"48Gi\",\n    storage=\"80Gi\",\n    gpus=2,\n    min_workers=0,\n    max_workers=1,\n    per_worker=10,\n    endpoint_type=ModelEndpointType.STREAMING,\n    gpu_type=\"nvidia-ampere-a10\",\n    public_inference=False,\n)\n\nprint(response.json())\n
from llmengine import Model\nfrom llmengine.data_types import LLMInferenceFramework, ModelEndpointType\n\nresponse = Model.create(\n    name=\"llama-2-70b-test\",\n    model=\"llama-2-70b\",\n    inference_framework_image_tag=\"0.9.4\",\n    inference_framework=LLMInferenceFramework.TEXT_GENERATION_INFERENCE,\n    num_shards=4,\n    quantize=\"bitsandbytes\",\n    checkpoint_path=\"s3://path/to/checkpoint\",\n    cpus=40,\n    memory=\"96Gi\",\n    storage=\"200Gi\",\n    gpus=4,\n    min_workers=0,\n    max_workers=1,\n    per_worker=10,\n    endpoint_type=ModelEndpointType.STREAMING,\n    gpu_type=\"nvidia-ampere-a10\",\n    public_inference=False,\n)\n\nprint(response.json())\n
"},{"location":"api/python_client/#llmengine.Model.get","title":"get classmethod","text":"
get(model: str) -> GetLLMEndpointResponse\n

Get information about an LLM model.

This API can be used to get information about a Model's source and inference framework. For self-hosted users, it returns additional information about the number of shards, quantization, infra settings, etc. The function takes a single parameter, the name of the model, and returns a GetLLMEndpointResponse object.

Parameters:

Name Type Description Default model `str`

Name of the model

required

Returns:

Name Type Description GetLLMEndpointResponse GetLLMEndpointResponse

object representing the LLM and configurations

Accessing model in PythonResponse in JSON
from llmengine import Model\n\nresponse = Model.get(\"llama-2-7b.suffix.2023-07-18-12-00-00\")\n\nprint(response.json())\n
{\n    \"id\": null,\n    \"name\": \"llama-2-7b.suffix.2023-07-18-12-00-00\",\n    \"model_name\": null,\n    \"source\": \"hugging_face\",\n    \"status\": \"READY\",\n    \"inference_framework\": \"text_generation_inference\",\n    \"inference_framework_tag\": null,\n    \"num_shards\": null,\n    \"quantize\": null,\n    \"spec\": null\n}\n
"},{"location":"api/python_client/#llmengine.Model.list","title":"list classmethod","text":"
list() -> ListLLMEndpointsResponse\n

List LLM models available to call inference on.

This API can be used to list all available models, including both publicly available models and user-created fine-tuned models. It returns a list of GetLLMEndpointResponse objects for all models. The most important field is the model name.

Returns:

Name Type Description ListLLMEndpointsResponse ListLLMEndpointsResponse

list of models

Listing available models in PythonResponse in JSON
from llmengine import Model\n\nresponse = Model.list()\nprint(response.json())\n
{\n    \"model_endpoints\": [\n        {\n            \"id\": null,\n            \"name\": \"llama-2-7b.suffix.2023-07-18-12-00-00\",\n            \"model_name\": null,\n            \"source\": \"hugging_face\",\n            \"inference_framework\": \"text_generation_inference\",\n            \"inference_framework_tag\": null,\n            \"num_shards\": null,\n            \"quantize\": null,\n            \"spec\": null\n},\n        {\n            \"id\": null,\n            \"name\": \"llama-2-7b\",\n            \"model_name\": null,\n            \"source\": \"hugging_face\",\n            \"inference_framework\": \"text_generation_inference\",\n            \"inference_framework_tag\": null,\n            \"num_shards\": null,\n            \"quantize\": null,\n            \"spec\": null\n},\n        {\n            \"id\": null,\n            \"name\": \"llama-13b-deepspeed-sync\",\n            \"model_name\": null,\n            \"source\": \"hugging_face\",\n            \"inference_framework\": \"deepspeed\",\n            \"inference_framework_tag\": null,\n            \"num_shards\": null,\n            \"quantize\": null,\n            \"spec\": null\n},\n        {\n            \"id\": null,\n            \"name\": \"falcon-40b\",\n            \"model_name\": null,\n            \"source\": \"hugging_face\",\n            \"inference_framework\": \"text_generation_inference\",\n            \"inference_framework_tag\": null,\n            \"num_shards\": null,\n            \"quantize\": null,\n            \"spec\": null\n}\n    ]\n}\n
"},{"location":"api/python_client/#llmengine.Model.delete","title":"delete classmethod","text":"
delete(\n    model_endpoint_name: str,\n) -> DeleteLLMEndpointResponse\n

Deletes an LLM model.

This API can be used to delete a fine-tuned model. It takes the name of the model as its parameter and returns a response object with a deleted field that confirms whether the deletion was successful. If called on a base model included with LLM Engine, an error will be thrown.

Parameters:

Name Type Description Default model_endpoint_name `str`

Name of the model endpoint to be deleted

required

Returns:

Name Type Description response DeleteLLMEndpointResponse

whether the model endpoint was successfully deleted

Deleting model in PythonResponse in JSON
from llmengine import Model\n\nresponse = Model.delete(\"llama-2-7b.suffix.2023-07-18-12-00-00\")\nprint(response.json())\n
{\n    \"deleted\": true\n}\n
"},{"location":"api/python_client/#llmengine.Model.download","title":"download classmethod","text":"
download(\n    model_name: str, download_format: str = \"hugging_face\"\n) -> ModelDownloadResponse\n

Download a fine-tuned model.

This API can be used to download the resulting model from a fine-tuning job. It takes the model_name and download_format as parameters and returns a response object that contains a dictionary of filename, url pairs associated with the fine-tuned model. The user can then download these urls to obtain the fine-tuned model. If called on a nonexistent model, an error will be thrown.

Parameters:

Name Type Description Default model_name `str`

name of the fine-tuned model

required download_format `str`

download format requested (default=hugging_face)

'hugging_face'

Returns: ModelDownloadResponse: an object that contains a dictionary of filenames, urls from which to download the model weights. The urls are presigned urls that grant temporary access and expire after an hour.

Downloading model in PythonResponse in JSON
from llmengine import Model\n\nresponse = Model.download(\"llama-2-7b.suffix.2023-07-18-12-00-00\", download_format=\"hugging_face\")\nprint(response.json())\n
{\n    \"urls\": {\"my_model_file\": \"https://url-to-my-model-weights\"}\n}\n
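Fetching the returned urls could look roughly like the sketch below. It uses only the Python standard library; the helper name `download_files` and the destination directory are hypothetical, and the input is assumed to be the filename-to-url dictionary shown in the `urls` field of the response above. Because the urls are presigned and expire after an hour, the download should start soon after calling Model.download.

```python
import urllib.request
from pathlib import Path


def download_files(urls: dict, dest_dir: str) -> list:
    """Download every url in a {filename: url} mapping into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for filename, url in urls.items():
        # Keep only the final path component so a filename cannot escape dest_dir.
        target = dest / Path(filename).name
        urllib.request.urlretrieve(url, str(target))
        saved.append(target)
    return saved
```

A call might then be `download_files(response.urls, "my-fine-tuned-model")`, assuming the response object exposes the `urls` field shown in the JSON example.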
"},{"location":"api/python_client/#llmengine.File","title":"File","text":"

Bases: APIEngine

File API. This API is used to upload private files to LLM engine so that fine-tunes can access them for training and validation data.

Functions are provided to upload, get, list, and delete files, as well as to get the contents of a file.

"},{"location":"api/python_client/#llmengine.File.upload","title":"upload classmethod","text":"
upload(file: BufferedReader) -> UploadFileResponse\n

Uploads a file to LLM engine.

For use in FineTune creation, this should be a CSV file with two columns: prompt and response. A maximum of 100,000 rows of data is currently supported.

Parameters:

Name Type Description Default file `BufferedReader`

A local file opened with open(file_path, \"r\")

required

Returns:

Name Type Description UploadFileResponse UploadFileResponse

an object that contains the ID of the uploaded file

Uploading file in PythonResponse in JSON
from llmengine import File\n\nresponse = File.upload(open(\"training_dataset.csv\", \"r\"))\n\nprint(response.json())\n
{\n    \"id\": \"file-abc123\"\n}\n
"},{"location":"api/python_client/#llmengine.File.get","title":"get classmethod","text":"
get(file_id: str) -> GetFileResponse\n

Get file metadata, including filename and size.

Parameters:

Name Type Description Default file_id `str`

ID of the file

required

Returns:

Name Type Description GetFileResponse GetFileResponse

an object that contains the ID, filename, and size of the requested file

Getting metadata about file in PythonResponse in JSON
from llmengine import File\n\nresponse = File.get(\n    file_id=\"file-abc123\",\n)\n\nprint(response.json())\n
{\n    \"id\": \"file-abc123\",\n    \"filename\": \"training_dataset.csv\",\n    \"size\": 100\n}\n
"},{"location":"api/python_client/#llmengine.File.download","title":"download classmethod","text":"
download(file_id: str) -> GetFileContentResponse\n

Get contents of a file, as a string. (If the uploaded file is in binary, a string encoding will be returned.)

Parameters:

Name Type Description Default file_id `str`

ID of the file

required

Returns:

Name Type Description GetFileContentResponse GetFileContentResponse

an object that contains the ID and content of the file

Getting file content in PythonResponse in JSON
from llmengine import File\n\nresponse = File.download(file_id=\"file-abc123\")\nprint(response.json())\n
{\n    \"id\": \"file-abc123\",\n    \"content\": \"Hello world!\"\n}\n
"},{"location":"api/python_client/#llmengine.File.list","title":"list classmethod","text":"
list() -> ListFilesResponse\n

List metadata about all files, e.g. their filenames and sizes.

Returns:

Name Type Description ListFilesResponse ListFilesResponse

an object that contains a list of all files and their filenames and sizes

Listing files in PythonResponse in JSON
from llmengine import File\n\nresponse = File.list()\nprint(response.json())\n
{\n    \"files\": [\n        {\n            \"id\": \"file-abc123\",\n            \"filename\": \"training_dataset.csv\",\n            \"size\": 100\n},\n        {\n            \"id\": \"file-def456\",\n            \"filename\": \"validation_dataset.csv\",\n            \"size\": 50\n}\n    ]\n}\n
"},{"location":"api/python_client/#llmengine.File.delete","title":"delete classmethod","text":"
delete(file_id: str) -> DeleteFileResponse\n

Deletes a file.

Parameters:

Name Type Description Default file_id `str`

ID of the file

required

Returns:

Name Type Description DeleteFileResponse DeleteFileResponse

an object that contains whether the deletion was successful

Deleting file in PythonResponse in JSON
from llmengine import File\n\nresponse = File.delete(file_id=\"file-abc123\")\nprint(response.json())\n
{\n    \"deleted\": true\n}\n
"},{"location":"guides/completions/","title":"Completions","text":"

Language Models are trained to predict natural language and provide text outputs as a response to their inputs. The inputs are called prompts and outputs are referred to as completions. LLMs take the input prompts and chunk them into smaller units called tokens to process and generate language. Tokens may include trailing spaces and even sub-words. This process is language dependent.

Scale's LLM Engine provides access to open source language models (see Model Zoo) that can be used for producing completions to prompts.

"},{"location":"guides/completions/#completion-api-call","title":"Completion API call","text":"

An example API call looks as follows:

Completion call in Python
from llmengine import Completion\n\nresponse = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"Hello, my name is\",\n    max_new_tokens=10,\n    temperature=0.2,\n)\n\nprint(response.json())\n# '{\"request_id\": \"c4bf0732-08e0-48a8-8b44-dfe8d4702fb0\", \"output\": {\"text\": \"________ and I am a ________\", \"num_completion_tokens\": 10}}'\nprint(response.output.text)\n# ________ and I am a ________\n

See the full Completion API reference documentation to learn more.

"},{"location":"guides/completions/#completion-api-response","title":"Completion API response","text":"

An example Completion API response looks as follows:

Response in JSONResponse in Python
    >>> print(response.json())\n    {\n      \"request_id\": \"c4bf0732-08e0-48a8-8b44-dfe8d4702fb0\",\n      \"output\": {\n        \"text\": \"_______ and I am a _______\",\n        \"num_completion_tokens\": 10\n      }\n    }\n
    >>> print(response.output.text)\n    _______ and I am a _______\n
"},{"location":"guides/completions/#token-streaming","title":"Token streaming","text":"

The Completions API supports token streaming to reduce perceived latency for certain applications. When streaming, tokens will be sent as data-only server-sent events.

To enable token streaming, pass stream=True to either Completion.create or Completion.acreate.

Note that errors from streaming calls are returned to the user as plain-text messages and currently need to be handled by the client.

An example of token streaming using the synchronous Completions API looks as follows:

Token streaming with synchronous API in python
import sys\nfrom llmengine import Completion\n\nstream = Completion.create(\n    model=\"llama-2-7b\",\n    prompt=\"Give me a 200 word summary on the current economic events in the US.\",\n    max_new_tokens=1000,\n    temperature=0.2,\n    stream=True,\n)\n\nfor response in stream:\n    if response.output:\n        print(response.output.text, end=\"\")\n        sys.stdout.flush()\n    else:  # an error occurred\n        print(response.error)  # print the error message out\n        break\n
"},{"location":"guides/completions/#async-requests","title":"Async requests","text":"

The Python client supports asyncio for creating Completions. Use Completion.acreate instead of Completion.create to utilize async processing. The function signatures are otherwise identical.

An example of async Completions looks as follows:

Completions with asynchronous API in python
import asyncio\nfrom llmengine import Completion\n\nasync def main():\n    response = await Completion.acreate(\n        model=\"llama-2-7b\",\n        prompt=\"Hello, my name is\",\n        max_new_tokens=10,\n        temperature=0.2,\n    )\n    print(response.json())\n\nasyncio.run(main())\n
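One place where the async client helps is issuing several requests concurrently. The sketch below shows a generic pattern with a concurrency cap; `complete_many` and its `complete` argument are hypothetical names, where `complete` stands in for any coroutine function that takes a prompt, for example a small wrapper around Completion.acreate.

```python
import asyncio


async def complete_many(prompts, complete, max_concurrency=4):
    """Run complete(prompt) for every prompt, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt):
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return await complete(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(run_one(prompt) for prompt in prompts))
```

The cap is a design choice to avoid sending an unbounded burst of requests to a single endpoint; a reasonable value depends on the endpoint's per_worker and max_workers settings.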
"},{"location":"guides/completions/#which-model-should-i-use","title":"Which model should I use?","text":"

See the Model Zoo for more information on best practices for which model to use for Completions.

"},{"location":"guides/endpoint_creation/","title":"Endpoint creation","text":"

When creating a model endpoint, you can periodically poll the model status field to track the status of your model endpoint. In general, you'll need to wait after the model creation step for the model endpoint to be ready and available for use. An example is provided below:

import time\n\nfrom llmengine import Model\n\nmodel_name = \"test_deploy\"\nmodel = Model.create(name=model_name, model=\"llama-2-7b\", inference_framework_image_tag=\"0.9.4\")\nresponse = Model.get(model_name)\n# Poll once a minute until the endpoint reports READY\nwhile response.status.name != \"READY\":\n    print(response.status.name)\n    time.sleep(60)\n    response = Model.get(model_name)\n

Once the endpoint status is ready, you can use your newly created model for inference.
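A polling loop like this can wait forever if the endpoint never becomes ready, so a bounded variant may be useful. The following is a minimal sketch, not part of the llmengine client: `wait_until_ready` is a hypothetical helper, the 30-minute default timeout is an arbitrary assumption, and the status is read through a caller-supplied function so the helper does not depend on any particular response type.

```python
import time


def wait_until_ready(get_status, timeout_s=1800.0, poll_interval_s=60.0, sleep=time.sleep):
    """Poll get_status() until it returns "READY" or timeout_s seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "READY":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still {status!r} after {timeout_s} seconds")
        # Wait before the next poll; sleep is injectable to make testing easy.
        sleep(poll_interval_s)
```

With the client, this might be called as `wait_until_ready(lambda: Model.get(model_name).status.name)`.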

"},{"location":"guides/fine_tuning/","title":"Fine-tuning","text":"

Learn how to customize your models on your data with fine-tuning. Or get started right away with our fine-tuning cookbook.

"},{"location":"guides/fine_tuning/#introduction","title":"Introduction","text":"

Fine-tuning helps improve model performance by training on specific examples of prompts and desired responses. LLMs are initially trained on data collected from the entire internet. With fine-tuning, LLMs can be optimized to perform better in a specific domain by learning from examples for that domain. Smaller LLMs that have been fine-tuned on a specific use case often outperform larger ones that were trained more generally.

Fine-tuning allows for:

  1. Higher quality results than prompt engineering alone
  2. Cost savings through shorter prompts
  3. The ability to reach equivalent accuracy with a smaller model
  4. Lower latency at inference time
  5. The chance to show an LLM more examples than can fit in a single context window

LLM Engine's fine-tuning API lets you fine-tune various open source LLMs on your own data and then make inference calls to the resulting LLM. For more specific details, see the fine-tuning API reference.

"},{"location":"guides/fine_tuning/#producing-high-quality-data-for-fine-tuning","title":"Producing high quality data for fine-tuning","text":"

The training data for fine-tuning should consist of prompt and response pairs.

As a rule of thumb, you should expect to see linear improvements in your fine-tuned model's quality with each doubling of the dataset size. Having high-quality data is also essential to improving performance. For every linear increase in the error rate in your training data, you may encounter a roughly quadratic increase in your fine-tuned model's error rate.

High quality data is critical to achieve improved model performance, and in several cases will require experts to generate and prepare data - the breadth and diversity of the data is highly critical. Scale's Data Engine can help prepare such high quality, diverse data sets - more information here.

"},{"location":"guides/fine_tuning/#preparing-data","title":"Preparing data","text":"

Your data must be formatted as a CSV file that includes two columns: prompt and response. A maximum of 100,000 rows of data is currently supported. At least 200 rows of data are recommended to start seeing benefits from fine-tuning. LLM Engine supports fine-tuning with a training and validation dataset. If only a training dataset is provided, 10% of the data is randomly split off to be used as validation.
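Before uploading, it can help to check a file against these constraints locally. The sketch below is one possible check using the standard csv module; the function name `check_fine_tuning_csv` is hypothetical, and rejecting rows with an empty prompt or response is an extra assumption that the requirements above do not state.

```python
import csv

MAX_ROWS = 100_000          # documented upper limit
RECOMMENDED_MIN_ROWS = 200  # documented recommendation, not a hard limit


def check_fine_tuning_csv(path):
    """Return the number of data rows, raising ValueError on format problems."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = {"prompt", "response"} - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing required columns: {sorted(missing)}")
        rows = 0
        for row in reader:
            rows += 1
            # Assumption: an empty prompt or response is treated as an error.
            if not row["prompt"] or not row["response"]:
                raise ValueError(f"row {rows} has an empty prompt or response")
    if rows > MAX_ROWS:
        raise ValueError(f"{rows} rows exceeds the maximum of {MAX_ROWS}")
    if rows < RECOMMENDED_MIN_ROWS:
        print(f"warning: {rows} rows is below the recommended {RECOMMENDED_MIN_ROWS}")
    return rows
```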

Here is an example script that creates a 50-row CSV of properly formatted data for fine-tuning an airline question-answering bot:

Creating a sample dataset
import csv\n# Define data\ndata = [\n    (\"What is your policy on carry-on luggage?\", \"Our policy allows each passenger to bring one piece of carry-on luggage and one personal item such as a purse or briefcase. The maximum size for carry-on luggage is 22 x 14 x 9 inches.\"),\n    (\"How can I change my flight?\", \"You can change your flight through our website or mobile app. Go to 'Manage my booking' section, enter your booking reference and last name, then follow the prompts to change your flight.\"),\n    (\"What meals are available on my flight?\", \"We offer a variety of meals depending on the flight's duration and route. These can range from snacks and light refreshments to full-course meals on long-haul flights. Specific meal options can be viewed during the booking process.\"),\n    (\"How early should I arrive at the airport before my flight?\", \"We recommend arriving at least two hours before domestic flights and three hours before international flights.\"),\n    (\"Can I select my seat in advance?\", \"Yes, you can select your seat during the booking process or afterwards via the 'Manage my booking' section on our website or mobile app.\"),\n    (\"What should I do if my luggage is lost?\", \"If your luggage is lost, please report this immediately at our 'Lost and Found' counter at the airport. We will assist you in tracking your luggage.\"),\n    (\"Do you offer special assistance for passengers with disabilities?\", \"Yes, we offer special assistance for passengers with disabilities. Please notify us of your needs at least 48 hours prior to your flight.\"),\n    (\"Can I bring my pet on the flight?\", \"Yes, we allow small pets in the cabin, and larger pets in the cargo hold. 
Please check our pet policy for more details.\"),\n    (\"What is your policy on flight cancellations?\", \"In case of flight cancellations, we aim to notify passengers as early as possible and offer either a refund or a rebooking on the next available flight.\"),\n    (\"Can I get a refund if I cancel my flight?\", \"Refunds depend on the type of ticket purchased. Please check our cancellation policy for details. Non-refundable tickets, however, are typically not eligible for refunds unless due to extraordinary circumstances.\"),\n    (\"How can I check-in for my flight?\", \"You can check-in for your flight either online, through our mobile app, or at the airport. Online and mobile app check-in opens 24 hours before departure and closes 90 minutes before.\"),\n    (\"Do you offer free meals on your flights?\", \"Yes, we serve free meals on all long-haul flights. For short-haul flights, we offer a complimentary drink and snack. Special meal requests should be made at least 48 hours before departure.\"),\n    (\"Can I use my electronic devices during the flight?\", \"Small electronic devices can be used throughout the flight in flight mode. Larger devices like laptops may be used above 10,000 feet.\"),\n    (\"How much baggage can I check-in?\", \"The checked baggage allowance depends on the class of travel and route. The details would be mentioned on your ticket, or you can check on our website.\"),\n    (\"How can I request for a wheelchair?\", \"To request a wheelchair or any other special assistance, please call our customer service at least 48 hours before your flight.\"),\n    (\"Do I get a discount for group bookings?\", \"Yes, we offer discounts on group bookings of 10 or more passengers. Please contact our group bookings team for more information.\"),\n    (\"Do you offer Wi-fi on your flights?\", \"Yes, we offer complimentary Wi-fi on select flights. 
You can check the availability during the booking process.\"),\n    (\"What is the minimum connecting time between flights?\", \"The minimum connecting time varies depending on the airport and whether your flight is international or domestic. Generally, it's recommended to allow at least 45-60 minutes for domestic connections and 60-120 minutes for international.\"),\n    (\"Do you offer duty-free shopping on international flights?\", \"Yes, we have a selection of duty-free items that you can pre-order on our website or purchase onboard on international flights.\"),\n    (\"Can I upgrade my ticket to business class?\", \"Yes, you can upgrade your ticket through the 'Manage my booking' section on our website or by contacting our customer service. The availability and costs depend on the specific flight.\"),\n    (\"Can unaccompanied minors travel on your flights?\", \"Yes, we do accommodate unaccompanied minors on our flights, with special services to ensure their safety and comfort. Please contact our customer service for more details.\"),\n    (\"What amenities do you provide in business class?\", \"In business class, you will enjoy additional legroom, reclining seats, premium meals, priority boarding and disembarkation, access to our business lounge, extra baggage allowance, and personalized service.\"),\n    (\"How much does extra baggage cost?\", \"Extra baggage costs vary based on flight route and the weight of the baggage. Please refer to our 'Extra Baggage' section on the website for specific rates.\"),\n    (\"Are there any specific rules for carrying liquids in carry-on?\", \"Yes, liquids carried in your hand luggage must be in containers of 100 ml or less and they should all fit into a single, transparent, resealable plastic bag of 20 cm x 20 cm.\"),\n    (\"What if I have a medical condition that requires special assistance during the flight?\", \"We aim to make the flight comfortable for all passengers. 
If you have a medical condition that may require special assistance, please contact our \u2018special services\u2019 team 48 hours before your flight.\"),\n    (\"What in-flight entertainment options are available?\", \"We offer a range of in-flight entertainment options including a selection of movies, TV shows, music, and games, available on your personal seat-back screen.\"),\n    (\"What types of payment methods do you accept?\", \"We accept credit/debit cards, PayPal, bank transfers, and various other forms of payment. The available options may vary depending on the country of departure.\"),\n    (\"How can I earn and redeem frequent flyer miles?\", \"You can earn miles for every journey you take with us or our partner airlines. These miles can be redeemed for flight tickets, upgrades, or various other benefits. To earn and redeem miles, you need to join our frequent flyer program.\"),\n    (\"Can I bring a stroller for my baby?\", \"Yes, you can bring a stroller for your baby. It can be checked in for free and will normally be given back to you at the aircraft door upon arrival.\"),\n    (\"What age does my child have to be to qualify as an unaccompanied minor?\", \"Children aged between 5 and 12 years who are traveling alone are considered unaccompanied minors. Our team provides special care for these children from departure to arrival.\"),\n    (\"What documents do I need to travel internationally?\", \"For international travel, you need a valid passport and may also require visas, depending on your destination and your country of residence. It's important to check the specific requirements before you travel.\"),\n    (\"What happens if I miss my flight?\", \"If you miss your flight, please contact our customer service immediately. 
Depending on the circumstances, you may be able to rebook on a later flight, but additional fees may apply.\"),\n    (\"Can I travel with my musical instrument?\", \"Yes, small musical instruments can be brought on board as your one carry-on item. Larger instruments must be transported in the cargo, or if small enough, a seat may be purchased for them.\"),\n    (\"Do you offer discounts for children or infants?\", \"Yes, children aged 2-11 traveling with an adult usually receive a discount on the fare. Infants under the age of 2 who do not occupy a seat can travel for a reduced fare or sometimes for free.\"),\n    (\"Is smoking allowed on your flights?\", \"No, all our flights are non-smoking for the comfort and safety of all passengers.\"),\n    (\"Do you have family seating?\", \"Yes, we offer the option to seat families together. You can select seats during booking or afterwards through the 'Manage my booking' section on the website.\"),\n    (\"Is there any discount for senior citizens?\", \"Some flights may offer a discount for senior citizens. Please check our website or contact customer service for accurate information.\"),\n    (\"What items are prohibited on your flights?\", \"Prohibited items include, but are not limited to, sharp objects, firearms, explosive materials, and certain chemicals. You can find a comprehensive list on our website under the 'Security Regulations' section.\"),\n    (\"Can I purchase a ticket for someone else?\", \"Yes, you can purchase a ticket for someone else. You'll need their correct name as it appears on their government-issued ID, and their correct travel dates.\"),\n    (\"What is the process for lost and found items on the plane?\", \"If you realize you forgot an item on the plane, report it as soon as possible to our lost and found counter. We will make every effort to locate and return your item.\"),\n    (\"Can I request a special meal?\", \"Yes, we offer a variety of special meals to accommodate dietary restrictions. 
Please request your preferred meal at least 48 hours prior to your flight.\"),\n    (\"Is there a weight limit for checked baggage?\", \"Yes, luggage weight limits depend on your ticket class and route. You can find the details on your ticket or by visiting our website.\"),\n    (\"Can I bring my sports equipment?\", \"Yes, certain types of sports equipment can be carried either as or in addition to your permitted baggage. Some equipment may require additional fees. It's best to check our policy on our website or contact us directly.\"),\n    (\"Do I need a visa to travel to certain countries?\", \"Yes, visa requirements depend on the country you are visiting and your nationality. We advise checking with the relevant embassy or consulate prior to travel.\"),\n    (\"How can I add extra baggage to my booking?\", \"You can add extra baggage to your booking through the 'Manage my booking' section on our website or by contacting our customer services.\"),\n    (\"Can I check-in at the airport?\", \"Yes, you can choose to check-in at the airport. However, we also offer online and mobile check-in, which may save you time.\"),\n    (\"How do I know if my flight is delayed or cancelled?\", \"In case of any changes to your flight, we will attempt to notify all passengers using the contact information given at the time of booking. You can also check your flight status on our website.\"),\n    (\"What is your policy on pregnant passengers?\", \"Pregnant passengers can travel up to the end of the 36th week for single pregnancies, and the end of the 32nd week for multiple pregnancies. We recommend consulting your doctor before any air travel.\"),\n    (\"Can children travel alone?\", \"Yes, children age 5 to 12 can travel alone as unaccompanied minors. We provide special care for these seats. 
Please contact our customer service for more information.\"),\n    (\"How can I pay for my booking?\", \"You can pay for your booking using a variety of methods including credit and debit cards, PayPal, or bank transfers. The options may vary depending on the country of departure.\"),\n]\n\n# Write data to a CSV file\nwith open('customer_service_data.csv', 'w', newline='') as file:\n    writer = csv.writer(file)\n    writer.writerow([\"prompt\", \"response\"])\n    writer.writerows(data)\n
"},{"location":"guides/fine_tuning/#making-your-data-accessible-to-llm-engine","title":"Making your data accessible to LLM Engine","text":"

Currently, data needs to be uploaded to either a publicly accessible web URL or to LLM Engine's private file server so that it can be read for fine-tuning. Publicly accessible HTTP and HTTPS URLs are currently supported.

To privately share data with the LLM Engine API, use LLM Engine's File.upload API. You can upload data from a local file to LLM Engine's private file server and then use the returned file ID to reference your data in the FineTune API. The file ID is generally in the form of file-<random_string>, e.g. \"file-7DLVeLdN2Ty4M2m\".

Upload to LLM Engine's private file server
from llmengine import File\n\nresponse = File.upload(open(\"customer_service_data.csv\", \"r\"))\nprint(response.json())\n
"},{"location":"guides/fine_tuning/#launching-the-fine-tune","title":"Launching the fine-tune","text":"

Once you have uploaded your data, you can use LLM Engine's FineTune.create API to launch a fine-tune. You will need to specify which base model to fine-tune, the locations of the training file and optional validation data file, an optional set of hyperparameters to customize the fine-tuning behavior, and an optional suffix to append to the name of the fine-tune. Sequences longer than the model's native max_seq_length will be truncated.

If you specify a suffix, the fine-tune will be named model.suffix.<timestamp>. If you do not, the fine-tune will be named model.<timestamp>. The timestamp will be the time the fine-tune was launched. Note: the suffix must only contain alphanumeric characters and hyphens, and be at most 28 characters long.
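
The naming rule above can be checked client-side before launching a job. The following is a minimal sketch, not part of the llmengine client: the helper names `is_valid_suffix` and `expected_name` are hypothetical, the sketch assumes an empty suffix is treated the same as no suffix, and the timestamp is passed in as an opaque string because its format is set by the server.

```python
import re

# Alphanumeric characters and hyphens only, between 1 and 28 characters.
SUFFIX_PATTERN = re.compile('[A-Za-z0-9-]{1,28}')


def is_valid_suffix(suffix):
    # fullmatch requires the whole string to satisfy the pattern.
    return SUFFIX_PATTERN.fullmatch(suffix) is not None


def expected_name(model, timestamp, suffix=None):
    # Mirrors the documented pattern: model.suffix.<timestamp> or model.<timestamp>.
    if suffix:
        if not is_valid_suffix(suffix):
            raise ValueError('suffix must be alphanumeric or hyphens, at most 28 characters')
        return f'{model}.{suffix}.{timestamp}'
    return f'{model}.{timestamp}'
```

Validating locally only saves a round trip; the server remains the authority on which names are accepted.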

Hyper-parameters for fine-tune
- `lr`: Peak learning rate used during fine-tuning. It decays with a cosine schedule afterward. (Default: 2e-3)
- `warmup_ratio`: Ratio of training steps used for learning rate warmup. (Default: 0.03)
- `epochs`: Number of fine-tuning epochs. This should be less than 20. (Default: 5)
- `weight_decay`: Regularization penalty applied to learned weights. (Default: 0.001)

Create a fine-tune in python
from llmengine import FineTune\n\nresponse = FineTune.create(\n    model=\"llama-2-7b\",\n    training_file=\"file-AbCDeLdN2Ty4M2m\",\n    validation_file=\"file-ezSRpgtKQyItI26\",\n)\n\nprint(response.json())\n
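
To build intuition for how `lr` and `warmup_ratio` interact, here is a rough sketch of a warmup-then-cosine schedule. It is not LLM Engine's implementation: the documentation only states that `lr` is the peak rate and that it decays with a cosine schedule, so the linear warmup shape, the decay to zero, and the function name `learning_rate` are all assumptions.

```python
import math


def learning_rate(step, total_steps, lr=2e-3, warmup_ratio=0.03):
    # Assumption: linear warmup from near 0 up to the peak learning rate.
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return lr * (step + 1) / warmup_steps
    # Assumption: cosine decay from the peak down to 0 over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = (step - warmup_steps) / decay_steps
    return lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With the defaults and 1000 total steps, the first 30 steps ramp up to the peak and the rate then falls smoothly for the rest of training.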

See the Model Zoo to see which models have fine-tuning support.

See Integrations to see how to track fine-tuning metrics.

"},{"location":"guides/fine_tuning/#monitoring-the-fine-tune","title":"Monitoring the fine-tune","text":"

Once the fine-tune is launched, you can get its status and list the events that it produces.

from llmengine import FineTune\n\nfine_tune_id = \"ft-cabcdefghi1234567890\"\nfine_tune = FineTune.get(fine_tune_id)\nprint(fine_tune.status)  # BatchJobStatus.RUNNING\nprint(fine_tune.fine_tuned_model)  # \"llama-2-7b.700101-000000\"\nfine_tune_events = FineTune.get_events(fine_tune_id)\nfor event in fine_tune_events.events:\n    print(event)\n# Prints something like:\n# timestamp=1697590000.0 message=\"{'loss': 12.345, 'learning_rate': 0.0, 'epoch': 0.97}\" level='info'\n# timestamp=1697590000.0 message=\"{'eval_loss': 23.456, 'eval_runtime': 19.876, 'eval_samples_per_second': 4.9, 'eval_steps_per_second': 4.9, 'epoch': 0.97}\" level='info'\n# timestamp=1697590020.0 message=\"{'train_runtime': 421.234, 'train_samples_per_second': 2.042, 'train_steps_per_second': 0.042, 'total_flos': 123.45, 'train_loss': 34.567, 'epoch': 0.97}\" level='info'\n

The status of your fine-tune will give a high-level overview of the fine-tune's progress. The events of your fine-tune will give more detail, such as the training loss and validation loss at each epoch, as well as any errors that may have occurred. If you encounter any errors with your fine-tune, the events are a good place to start debugging. For example, if you see Unable to read training or validation dataset, you may need to make your files accessible to LLM Engine. If you see Invalid value received for lora parameter 'lora_alpha'!, you should check that your hyperparameters are valid.
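
The metric messages in the sample output above look like Python dict literals, so they can be turned back into dictionaries for plotting or alerting. This is a minimal sketch under that assumption; the helper name `parse_metrics` is hypothetical, and messages that are plain text (such as the error strings mentioned above) are returned as `None`.

```python
import ast


def parse_metrics(message):
    # literal_eval reads dict literals without executing arbitrary code.
    try:
        value = ast.literal_eval(message)
    except (ValueError, SyntaxError):
        return None  # Plain-text messages such as error descriptions.
    return value if isinstance(value, dict) else None


# Stand-in for the message field of one event from the sample output.
sample = str({'loss': 12.345, 'learning_rate': 0.0, 'epoch': 0.97})
metrics = parse_metrics(sample)
print(metrics['loss'], metrics['epoch'])
```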

"},{"location":"guides/fine_tuning/#making-inference-calls-to-your-fine-tune","title":"Making inference calls to your fine-tune","text":"

Once your fine-tune is finished, you can start making inference requests to the model. The fine-tune must be complete before you can use it in the Completions API; you can check its status with FineTune.get. Use the fine_tuned_model returned from your FineTune.get API call to reference your fine-tuned model in the Completions API. Alternatively, list the available LLMs with Model.list to find the name of your fine-tuned model, then use that name to direct your completion requests. See the Completion API for more details.
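
Because inference has to wait for the fine-tune to finish, a small polling loop can be handy. The sketch below is illustrative only: `wait_for_fine_tune` is a hypothetical helper, the status getter is injected so the loop can be tried without network access, and the set of terminal status names is an assumption that should be checked against the statuses your client version reports.

```python
import time


def wait_for_fine_tune(get_status, poll_seconds=30.0, max_polls=1000, sleep=time.sleep):
    # Assumption: these strings name the states in which a job has stopped.
    terminal = {'SUCCESS', 'FAILURE', 'CANCELLED'}
    for _ in range(max_polls):
        status = get_status()
        if status in terminal:
            return status
        sleep(poll_seconds)
    raise TimeoutError('fine-tune did not reach a terminal status in time')
```

In practice `get_status` could wrap a FineTune.get call and convert the returned status into a string that matches the names used in `terminal`.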

Inference with a fine-tuned model in python
from llmengine import Completion\n\nresponse = Completion.create(\n    model=\"llama-2-7b.airlines.2023-07-17-08-30-45\",\n    prompt=\"Do you offer in-flight Wi-fi?\",\n    max_new_tokens=100,\n    temperature=0.2,\n)\nprint(response.json())\n
"},{"location":"guides/rate_limits/","title":"Overview","text":""},{"location":"guides/rate_limits/#what-are-rate-limits","title":"What are rate limits?","text":"

A rate limit is a restriction that an API imposes on the number of times a user or client can access the server within a specified period of time.

"},{"location":"guides/rate_limits/#how-do-i-know-if-i-am-rate-limited","title":"How do I know if I am rate limited?","text":"

Per standard HTTP practices, a rate-limited request will receive a response with an HTTP status code of 429 (Too Many Requests).

"},{"location":"guides/rate_limits/#what-are-the-rate-limits-for-our-api","title":"What are the rate limits for our API?","text":"

The LLM Engine API is currently in a preview mode, and therefore we currently do not have any advertised rate limits. As the API moves towards a production release, we will update this section with specific rate limits. For now, the API will return HTTP 429 on an as-needed basis.

"},{"location":"guides/rate_limits/#error-mitigation","title":"Error mitigation","text":""},{"location":"guides/rate_limits/#retrying-with-exponential-backoff","title":"Retrying with exponential backoff","text":"

One easy way to avoid rate limit errors is to automatically retry requests with a random exponential backoff. Retrying with exponential backoff means performing a short sleep when a rate limit error is hit, then retrying the unsuccessful request. If the request is still unsuccessful, the sleep length is increased and the process is repeated. This continues until the request is successful or until a maximum number of retries is reached. This approach has many benefits:

Below are a few example solutions for Python that use exponential backoff.
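
Before reaching for a library, it can help to see the retry loop described above written out by hand. This is a minimal sketch rather than a recommended implementation: the function name `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the loop can be exercised without real delays.

```python
import random
import time


def retry_with_backoff(func, retry_on=(Exception,), max_retries=6,
                       base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    # Call func(); on a retryable error, sleep for a random, exponentially
    # growing delay and try again, up to max_retries retries.
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error.
            # The cap doubles on each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter spreads out retries from concurrent clients.
            sleep(random.uniform(0, cap))
```

To use it with the client, pass a zero-argument callable such as `lambda: llmengine.Completion.create(model='llama-2-7b', prompt='Why is the sky blue?')` and set `retry_on=(llmengine.errors.RateLimitExceededError,)` so that only rate limit errors are retried.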

"},{"location":"guides/rate_limits/#example-1-using-the-tenacity-library","title":"Example #1: Using the tenacity library","text":"

Tenacity is an Apache 2.0 licensed general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything. To add exponential backoff to your requests, you can use the tenacity.retry decorator. The below example uses the tenacity.wait_random_exponential function to add random exponential backoff to a request.

Exponential backoff in python
import llmengine\nfrom tenacity import (\n    retry,\n    stop_after_attempt,\n    wait_random_exponential,\n)  # for exponential backoff\n@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))\ndef completion_with_backoff(**kwargs):\n    return llmengine.Completion.create(**kwargs)\n\ncompletion_with_backoff(model=\"llama-2-7b\", prompt=\"Why is the sky blue?\")\n
"},{"location":"guides/rate_limits/#example-2-using-the-backoff-library","title":"Example #2: Using the backoff library","text":"

Backoff is another Python library. It provides function decorators that wrap a function so that it is retried until some condition is met.

Decorators for backoff and retry in python
import llmengine\nimport backoff\n@backoff.on_exception(backoff.expo, llmengine.errors.RateLimitExceededError)\ndef completion_with_backoff(**kwargs):\n    return llmengine.Completion.create(**kwargs)\n\ncompletion_with_backoff(model=\"llama-2-7b\", prompt=\"Why is the sky blue?\")\n
"},{"location":"guides/self_hosting/","title":"Self Hosting [Experimental]","text":"

This guide is currently highly experimental. Instructions are subject to change as we improve support for self-hosting.

We provide a Helm chart that deploys LLM Engine to an Elastic Kubernetes Service (EKS) cluster in AWS. This Helm chart should be configured to connect to dependencies (such as a PostgreSQL database) that you may already have available in your environment.

The only portions of the Helm chart that are production ready are the parts that configure and manage the LLM Engine server itself (not PostgreSQL, IAM, etc.).

We first go over the AWS dependencies that must exist before we can run helm install in your EKS cluster.

"},{"location":"guides/self_hosting/#aws-dependencies","title":"AWS Dependencies","text":"

This section describes assumptions about the existing AWS resources required to run the LLM Engine server.

"},{"location":"guides/self_hosting/#eks","title":"EKS","text":"

The LLM Engine server must be deployed in an EKS cluster environment. Currently only versions 1.23+ are supported. Below are the assumed requirements for the EKS cluster:

You will need to provision EKS node groups with GPUs to schedule model pods. These node groups must have the node-lifecycle: normal label on them. Additionally, they must have the k8s.amazonaws.com/accelerator label set appropriately depending on the instance type:

Instance family | k8s.amazonaws.com/accelerator label
g4dn | nvidia-tesla-t4
g5 | nvidia-tesla-a10
p4d | nvidia-tesla-a100
p4de | nvidia-tesla-a100e
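
If you generate node group definitions from a script, the table above can be encoded as a lookup. This is only a convenience sketch: the names `ACCELERATOR_LABELS` and `accelerator_label` are hypothetical, and the helper assumes EC2 instance type strings of the form family.size.

```python
# Mapping copied from the table above.
ACCELERATOR_LABELS = {
    'g4dn': 'nvidia-tesla-t4',
    'g5': 'nvidia-tesla-a10',
    'p4d': 'nvidia-tesla-a100',
    'p4de': 'nvidia-tesla-a100e',
}


def accelerator_label(instance_type):
    # An EC2 instance type such as 'g5.12xlarge' starts with its family.
    family = instance_type.split('.', 1)[0]
    try:
        return ACCELERATOR_LABELS[family]
    except KeyError:
        raise ValueError(f'no accelerator label documented for {family!r}') from None
```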

We also recommend setting the following taint on your GPU nodes to prevent pods that do not require GPU resources from being scheduled on them: - { key = \"nvidia.com/gpu\", value = \"true\", effect = \"NO_SCHEDULE\" }

"},{"location":"guides/self_hosting/#postgresql","title":"PostgreSQL","text":"

The LLM Engine server requires a PostgreSQL database to back data. LLM Engine currently supports PostgreSQL version 14. Create a PostgreSQL database (e.g. AWS RDS PostgreSQL) if you do not have an existing one you wish to connect LLM Engine to.

To enable LLM Engine to connect to the PostgreSQL engine, we create a Kubernetes secret with the PostgreSQL url. An example YAML is provided below:

apiVersion: v1\nkind: Secret\nmetadata:\n  name: llm-engine-database-credentials  # this name will be an input to our Helm Chart\nstringData:  # stringData accepts plain-text values; the data field would require base64-encoded values\n  database_url: \"postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1&...]\"\n
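
The connection URL stored in the secret above has to be assembled from its parts, and passwords often contain characters that are not URL-safe. Here is a minimal sketch of one way to build it; `build_database_url` is a hypothetical helper, and the host, port, and database name in the usage note are placeholders.

```python
from urllib.parse import quote


def build_database_url(user, password, host, port, dbname):
    # Percent-encode the credentials so characters such as '@' or '/'
    # cannot be mistaken for URL delimiters.
    safe_user = quote(user, safe='')
    safe_password = quote(password, safe='')
    return f'postgresql://{safe_user}:{safe_password}@{host}:{port}/{dbname}'
```

For example, `build_database_url('llm', 'p@ss/word', 'db.example.com', 5432, 'llmengine')` yields a URL in which the password is percent-encoded.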

"},{"location":"guides/self_hosting/#redis","title":"Redis","text":"

The LLM Engine server requires Redis for various caching/queue functionality. LLM Engine currently supports Redis version 6. Create a Redis cluster (e.g. AWS Elasticache for Redis) if you do not have an existing one you wish to connect LLM Engine to.

To enable LLM Engine to connect to Redis, fill out the Helm chart values with the Redis host and URL.

"},{"location":"guides/self_hosting/#amazon-s3","title":"Amazon S3","text":"

You will need to have an S3 bucket for LLM Engine to store various assets (e.g. model weights, prediction results). The ARN of this bucket should be provided in the Helm chart values.

"},{"location":"guides/self_hosting/#amazon-ecr","title":"Amazon ECR","text":"

You will need to provide an ECR repository for the deployment to store model containers. The ARN of this repository should be provided in the Helm chart values.

"},{"location":"guides/self_hosting/#amazon-sqs","title":"Amazon SQS","text":"

LLM Engine utilizes Amazon SQS to keep track of jobs. LLM Engine will create and use SQS queues as needed.

"},{"location":"guides/self_hosting/#identity-and-access-management-iam","title":"Identity and Access Management (IAM)","text":"

The LLM Engine server will need an IAM role to perform various AWS operations. This role will be assumed by the serviceaccount llm-engine in the launch namespace in the EKS cluster. The ARN of this role needs to be provided to the Helm chart, and the role needs to be granted the following permissions:

Action | Resource
s3:Get*, s3:Put* | ${s3_bucket_arn}/*
s3:List* | ${s3_bucket_arn}
sqs:* | arn:aws:sqs:${region}:${account_id}:llm-engine-endpoint-id-*
sqs:ListQueues | *
ecr:BatchGetImage, ecr:DescribeImages, ecr:GetDownloadUrlForLayer, ecr:ListImages | ${ecr_repository_arn}"},{"location":"guides/self_hosting/#helm-chart","title":"Helm Chart","text":"

Now that all dependencies have been installed and configured, we can run the provided Helm chart. The values in the Helm chart will need to correspond with the resources described in the Dependencies section.

Ensure that Helm V3 is installed (see the Helm installation instructions) and can connect to the EKS cluster. Users should be able to install the chart with helm install llm-engine llm-engine -f llm-engine/values_sample.yaml -n <DESIRED_NAMESPACE>. Below are the configurations to specify in the values_sample.yaml file.

Parameter | Description | Required
tag | The LLM Engine docker image tag | Yes
context | A user-specified deployment tag | No
image.gatewayRepository | The docker repository to pull the LLM Engine gateway image from | Yes
image.builderRepository | The docker repository to pull the LLM Engine endpoint builder image from | Yes
image.cacherRepository | The docker repository to pull the LLM Engine cacher image from | Yes
image.forwarderRepository | The docker repository to pull the LLM Engine forwarder image from | Yes
image.pullPolicy | The docker image pull policy | No
secrets.kubernetesDatabaseSecretName | The name of the secret that contains the database credentials | Yes
serviceAccount.annotations.eks.amazonaws.com/role-arn | The ARN of the IAM role that the service account will assume | Yes
service.type | The service configuration for the main LLM Engine server | No
service.port | The service configuration for the main LLM Engine server | No
replicaCount | The amount of replica pods for each deployment | No
autoscaling | The autoscaling configuration for LLM Engine server deployments | No
resources.requests.cpu | The k8s resources for LLM Engine server deployments | No
nodeSelector | The node selector for LLM Engine server deployments | No
tolerations | The tolerations for LLM Engine server deployments | No
affinity | The affinity for LLM Engine server deployments | No
aws.configMap.name | The AWS configurations (by configMap) for LLM Engine server deployments | No
aws.configMap.create | The AWS configurations (by configMap) for LLM Engine server deployments | No
aws.profileName | The AWS configurations (by configMap) for LLM Engine server deployments | No
serviceTemplate.securityContext.capabilities.drop | Additional flags for model endpoints | No
serviceTemplate.mountInfraConfig | Additional flags for model endpoints | No
config.values.infra.k8s_cluster_name | The name of the k8s cluster | Yes
config.values.infra.dns_host_domain | The domain name of the k8s cluster | Yes
config.values.infra.default_region | The default AWS region for various resources | Yes
config.values.infra.ml_account_id | The AWS account ID for various resources | Yes
config.values.infra.docker_repo_prefix | The prefix for AWS ECR repositories | Yes
config.values.infra.redis_host | The hostname of the redis cluster you wish to connect | Yes
config.values.infra.s3_bucket | The S3 bucket you wish to connect | Yes
config.values.llm_engine.endpoint_namespace | K8s namespace the endpoints will be created in | Yes
config.values.llm_engine.cache_redis_url | The full url for the redis cluster you wish to connect | Yes
config.values.llm_engine.s3_file_llm_fine_tuning_job_repository | The S3 URI for the S3 bucket/key that you wish to save fine-tuned assets | Yes
config.values.dd_trace_enabled | Whether to enable datadog tracing, datadog must be installed in the cluster | No"},{"location":"guides/self_hosting/#play-with-it","title":"Play With It","text":"

Once helm install succeeds, you can forward port 5000 from an llm-engine pod and test sending requests to it.

First, see a list of pods in the namespace that you performed helm install in:

$ kubectl get pods -n <NAMESPACE_WHERE_LLM_ENGINE_IS_INSTALLED>\nNAME                                           READY   STATUS             RESTARTS      AGE\nllm-engine-668679554-9q4wj                     1/1     Running            0             18m\nllm-engine-668679554-xfhxx                     1/1     Running            0             18m\nllm-engine-cacher-5f8b794585-fq7dj             1/1     Running            0             18m\nllm-engine-endpoint-builder-5cd6bf5bbc-sm254   1/1     Running            0             18m\nllm-engine-image-cache-a10-sw4pg               1/1     Running            0             18m \n
Note the pod names you see may be different.

Forward a port from an llm-engine pod:

$ kubectl port-forward pod/llm-engine-<REST_OF_POD_NAME> 5000:5000 -n <NAMESPACE_WHERE_LLM_ENGINE_IS_INSTALLED>\n

Then, try sending a request to get LLM model endpoints for test-user-id:

$ curl -X GET -H \"Content-Type: application/json\" -u \"test-user-id:\" \"http://localhost:5000/v1/llm/model-endpoints\"\n

You should get the following response:

{\"model_endpoints\":[]}\n

Next, let's create an LLM endpoint using llama-7b:

$ curl -X POST 'http://localhost:5000/v1/llm/model-endpoints' \\\n    -H 'Content-Type: application/json' \\\n    -d '{\n        \"name\": \"llama-7b\",\n        \"model_name\": \"llama-7b\",\n        \"source\": \"hugging_face\",\n        \"inference_framework\": \"text_generation_inference\",\n        \"inference_framework_image_tag\": \"0.9.3\",\n        \"num_shards\": 4,\n        \"endpoint_type\": \"streaming\",\n        \"cpus\": 32,\n        \"gpus\": 4,\n        \"memory\": \"40Gi\",\n        \"storage\": \"40Gi\",\n        \"gpu_type\": \"nvidia-ampere-a10\",\n        \"min_workers\": 1,\n        \"max_workers\": 12,\n        \"per_worker\": 1,\n        \"labels\": {},\n        \"metadata\": {}\n    }' \\\n    -u test-user-id:\n

It should output something like:

{\"endpoint_creation_task_id\":\"8d323344-b1b5-497d-a851-6d6284d2f8e4\"}\n

Wait a few minutes for the endpoint to be ready. You can tell that it's ready by listing pods and checking that all containers in the llm endpoint pod are ready:

$ kubectl get pods -n <endpoint_namespace specified in values_sample.yaml>\nNAME                                                              READY   STATUS    RESTARTS        AGE\nllm-engine-endpoint-id-end-cismpd08agn003rr2kc0-7f86ff64f9qj9xp   2/2     Running   1 (4m41s ago)   7m26s\n
Note the endpoint name could be different.

Then, you can send an inference request to the endpoint:

$ curl -X POST 'http://localhost:5000/v1/llm/completions-sync?model_endpoint_name=llama-7b' \\\n    -H 'Content-Type: application/json' \\\n    -d '{\n        \"prompts\": [\"Tell me a joke about AI\"],\n        \"max_new_tokens\": 30,\n        \"temperature\": 0.1\n    }' \\\n    -u test-user-id:\n

You should get a response similar to:

{\"status\":\"SUCCESS\",\"outputs\":[{\"text\":\". Tell me a joke about AI. Tell me a joke about AI. Tell me a joke about AI. Tell me\",\"num_completion_tokens\":30}],\"traceback\":null}\n

"},{"location":"guides/self_hosting/#pointing-llm-engine-client-to-use-self-hosted-infrastructure","title":"Pointing LLM Engine client to use self-hosted infrastructure","text":"

The llmengine client makes requests to Scale AI's hosted infrastructure by default. You can have the llmengine client make requests to your own self-hosted infrastructure by setting the LLM_ENGINE_BASE_PATH environment variable to the URL of the llm-engine service.

The exact URL of llm-engine service depends on your Kubernetes cluster networking setup. The domain is specified at config.values.infra.dns_host_domain in the helm chart values config file. Using charts/llm-engine/values_sample.yaml as an example, you would do:

export LLM_ENGINE_BASE_PATH=https://llm-engine.domain.com\n

"},{"location":"guides/token_streaming/","title":"Token streaming","text":"

The Completions APIs support a stream boolean parameter that, when True, will return a streamed response of token-by-token server-sent events (SSEs) rather than waiting to receive the full response when model generation has finished. This reduces the time until you receive the first part of the response.

The response will consist of SSEs of the form {\"token\": dict, \"generated_text\": str | null, \"details\": dict | null}, where the dictionary for each token will contain log probability information in addition to the generated string; the generated_text field will be null for all but the last SSE, for which it will contain the full generated response.
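
To make the event shape above concrete, here is a small sketch that consumes such a stream from a list of already-received lines. It is illustrative only: the helper name `collect_stream` is hypothetical, the `data:` line prefix follows the general SSE convention, and the `text` and `logprob` keys inside each token dictionary are assumptions, since the exact field names are not listed here.

```python
import json


def collect_stream(sse_lines):
    # Concatenate per-token text and return it with the final generated_text.
    pieces = []
    final_text = None
    for line in sse_lines:
        if not line.startswith('data:'):
            continue  # Skip blank separators and other SSE fields.
        event = json.loads(line[len('data:'):])
        # Assumption: the generated string lives under token['text'].
        pieces.append(event['token']['text'])
        if event['generated_text'] is not None:
            final_text = event['generated_text']
    return ''.join(pieces), final_text


# Hand-made events shaped like the form described above.
events = [
    {'token': {'text': 'Hello', 'logprob': -0.1}, 'generated_text': None, 'details': None},
    {'token': {'text': ' world', 'logprob': -0.2}, 'generated_text': 'Hello world', 'details': {}},
]
sse_lines = ['data: ' + json.dumps(event) for event in events]
print(collect_stream(sse_lines))
```

Only the last event carries a non-null generated_text, so a consumer can either display tokens as they arrive or wait for that final field.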

"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 0ac94adc..11233cbb 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ