
Large Language Model

Large Language Models (LLMs) run on top of the HolmesAI infrastructure.

We provide a selection of popular LLMs. You can create an automatic load-balancing service to run them and serve your apps or customers through our OpenAI-compatible APIs.

Before you start

Before you start, you need three parameters: SERVICE_ID, API_KEY, and MODEL. You can find them on our dashboard.
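The snippets on this page substitute these values as $SERVICE_ID, $API_KEY, and $MODEL. One convenient way to keep them out of your source code is to export them as environment variables and read them at runtime. This is just a convention of this page, not a requirement of the service; a minimal sketch:

import os

# Assumed convention: SERVICE_ID, API_KEY, and MODEL were exported in the
# shell with the values shown on the dashboard.
SERVICE_ID = os.environ["SERVICE_ID"]
API_KEY = os.environ["API_KEY"]
MODEL = os.environ["MODEL"]

# Each service gets its own base URL, derived from SERVICE_ID.
BASE_URL = f"https://modelapi.holmesai.xyz/{SERVICE_ID}/api/v1"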

API reference

/v1/chat/completions

Request Parameters

| Parameter name | Type | Description | Required |
| --- | --- | --- | --- |
| model | String | The model to use (the MODEL value from the dashboard). | Yes |
| messages | Array | A list of messages comprising the conversation so far. | Yes |
| messages[].role | String | The role of the message's author: system, user, assistant, or tool. | Yes |
| messages[].content | String | The contents of the message. | Yes |
| temperature | Float | What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p, but not both. | No |
| top_p | Float | An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. | No |
| presence_penalty | Float | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. | No |
| stop | Array | Up to 4 sequences where the API will stop generating further tokens. | No |
| max_tokens | Int | The maximum number of tokens that can be generated in the chat completion. This value can be used to control costs for text generated via the API. | No |
| stream | Bool | If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message. | No |
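To see how these parameters combine, here is a minimal non-streaming sketch, assuming the environment variables from "Before you start"; the prompt and parameter values are illustrative, not defaults the service enforces:

import os
import openai

client = openai.OpenAI(
    base_url=f"https://modelapi.holmesai.xyz/{os.environ['SERVICE_ID']}/api/v1",
    api_key=os.environ["API_KEY"],
)

completion = client.chat.completions.create(
    model=os.environ["MODEL"],
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Name three uses for a paperclip."},
    ],
    temperature=0.2,   # lower values make output more focused and deterministic
    max_tokens=128,    # cap on generated tokens, useful for cost control
    stop=["\n\n"],     # stop at the first blank line (up to 4 sequences allowed)
)
print(completion.choices[0].message.content)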

Response Elements

| Parameter name | Type | Description |
| --- | --- | --- |
| id | String | Session id. |
| created | Int64 | The Unix timestamp (in seconds) of when the chat completion was created. |
| choices | [Object] | A list of chat completion choices. Can be more than one if n is greater than 1. |
| choices[].index | Int | The index of the choice in the list of choices. |
| choices[].message | Object | A chat completion message generated by the model. |
| choices[].message.role | String | The role of the author of this message. |
| choices[].message.content | String | The contents of the message. |
| choices[].finish_reason | String | The reason the model stopped generating tokens: stop if the model hit a natural stop point or a provided stop sequence, length if the maximum number of tokens specified in the request was reached, content_filter if content was omitted due to a flag from our content filters, tool_calls if the model called a tool, or function_call (deprecated) if the model called a function. |
| usage | Object | Usage statistics for the completion request. |
| usage.prompt_tokens | Int64 | Number of tokens in the prompt. |
| usage.completion_tokens | Int64 | Number of tokens in the generated completion. |
| usage.total_tokens | Int64 | Total number of tokens used in the request (prompt + completion). |
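Continuing the sketch above (the completion variable from the previous example), these response elements map onto the client objects like this:

choice = completion.choices[0]
print("id:", completion.id)                    # session id
print("finish_reason:", choice.finish_reason)  # e.g. "stop" or "length"
print("reply:", choice.message.content)

usage = completion.usage
print(f"tokens: {usage.prompt_tokens} prompt "
      f"+ {usage.completion_tokens} completion "
      f"= {usage.total_tokens} total")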

Usage

python

Install the OpenAI Python client first:

pip install -U openai

import os
import openai

# SERVICE_ID and API_KEY come from the dashboard; reading them from the
# environment (see "Before you start") keeps the snippet copy-paste runnable.
client = openai.OpenAI(
    base_url=f"https://modelapi.holmesai.xyz/{os.environ['SERVICE_ID']}/api/v1",
    api_key=os.environ["API_KEY"],
)
 
completion = client.chat.completions.create(
    model="$MODEL",
    messages=[
        {"role": "user", "content": "say hello"},
    ],
    max_tokens=128,
    stream=True,
)
 
for chunk in completion:
    # Skip chunks that carry no choices.
    if not chunk.choices:
        continue
    # Streaming responses deliver incremental text in delta, not message.
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="")

curl

curl https://modelapi.holmesai.xyz/$SERVICE_ID/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
     "model": "$MODEL",
     "messages": [{"role": "user", "content": "say hello"}],
     "max_tokens": 128
   }'
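Streaming works the same way over raw HTTP: adding "stream": true to the JSON body above should return data-only server-sent events, terminated by a data: [DONE] message, as described in the request parameters table.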