If you are using Waii 1.27.x or below, please refer to Configure LLM Endpoints (<= 1.27.x).
LLM Configuration
The endpoint configuration file specifies how to connect to different LLM providers, such as OpenAI, Bedrock, or Fireworks.
The model file lets users give each model a name and description, which are exposed via the API (so you don't have to expose the actual LLM model name).
LLM Model Configuration
- name: string               # Internal model identifier (e.g., "gpt-4o-2024-08-06")
  description: string        # (Optional) Human-readable description; for readability only, no impact on model behavior
  vendor: string             # (Optional) Provider name (e.g., "openai", "bedrock"); for readability only, no impact on model behavior
  external_name: string      # (Optional) User-facing name (e.g., "GPT-4o"), this will be displayed in the UI, and allow API clients to refer to the model by this name
  default: boolean           # (Required if you have multiple models) Marks as default model
  model_type: string         # (Required) "chat" or "embedding"
  reasoning: boolean         # (Optional) For a "chat" model, marks it as a reasoning model
                             # (typical reasoning models: DeepSeek R1, o3-mini, etc.).
                             # A reasoning model is required when users need to enable
                             # "DEEPTHINK" mode for query / chat generation.
                             # A reasoning model cannot replace the regular chat model
                             # (e.g. if you only configure a reasoning model, it will fail).
  endpoints:                 # (Required) List of endpoint configurations (one or more)
    - <endpoint-config>
Endpoint Configuration
# Common Fields
api_key: string              # Provider API key
access_policy:               # (Optional) Authorization configuration (one or more entries)
  <See Access Policy below>
# Provider-Specific Fields
## Azure OpenAI / OpenAI compatible endpoint
api_base: string             # API base URL for Azure OpenAI or OpenAI compatible endpoint
deployment_name: string      # Azure deployment name
## AWS Bedrock
aws_region_name: string
anthropic_version: string    # For Claude models
guardrail_identifier: string
guardrail_version: integer
Access Policy Configuration
access_policy:
  - user_id: <string|wildcard>  # "*" for any user
    tenant_id: <string|wildcard> # "*" for any tenant
    org_id: <string|wildcard>   # "*" for any organization
What are the supported LLMs and embeddings?
For Waii, we support the following LLMs and embeddings:
- Any OpenAI or OpenAI compatible endpoint (e.g. via vLLM/Fireworks/Ollama/Azure)
- Any AWS Bedrock LLM (e.g. Anthropic Sonnet, Google Gemini, etc.)

For embeddings, we support the following:
- Any AWS Bedrock embedding model (e.g. Cohere, OpenAI, etc.)
- Any OpenAI compatible embedding model (e.g. via vLLM/Fireworks/Ollama/Azure)
At this moment, Waii only supports embedding models with embedding dimensions up to 1536. Please make sure the chosen embedding model meets this requirement.
Internally, Waii uses the litellm client to connect to LLM/embedding providers, so any LLM or embedding model supported by litellm is also supported by Waii.
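For example, a chat model hosted on Fireworks can be configured through the same schema. This is a sketch; the model name and API key are placeholders, and the exact litellm model identifier depends on the provider:
- name: fireworks_ai/<model-name>   # e.g. a Llama or DeepSeek model hosted on Fireworks
  model_type: chat
  endpoints:
    - api_key: ...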
Example policies:
Allow all users to access the model
access_policy:
  - user_id: "*"
    tenant_id: "*"
    org_id: "*"
Allow all users in the "analytics-team" tenant of the "acme-corp" organization to access the model
access_policy:
  - user_id: "*"
    tenant_id: "analytics-team"
    org_id: "acme-corp"
Allow all users in the "east-region" tenant of the "retail-division" organization to access the model
access_policy:
  - user_id: "*"
    tenant_id: "east-region"
    org_id: "retail-division"
  - user_id: "*"
    tenant_id: "west-region" 
    org_id: "wholesale-division"
Example model and endpoint configuration
Azure OpenAI (with embedding)
- name: azure-gpt4
  model_type: chat
  endpoints:
    - api_key: sk-...
      api_base: https://your-resource.openai.azure.com/
      deployment_name: gpt-4-turbo # this is your own deployment name
- name: text-embedding-ada-002
  model_type: embedding
  endpoints:
    - api_key: <azure_api_key>
      api_base: https://<your-domain>.openai.azure.com/
      deployment_name: ada2
Google Gemini (with embedding)
- name: gemini/gemini-2.5-flash-preview-05-20
  model_type: chat
  default: true
  endpoints:
    - api_key: ...
- name: gemini/gemini-2.5-pro-preview-05-06
  model_type: chat
  # note: this is required for the 2.5 Pro model, see the note below
  reasoning: true
  endpoints:
    - api_key: ...
- name: gemini/text-embedding-004
  model_type: embedding
  endpoints:
    - api_key: ...
Notes:
- You should always include the 2.5 Flash model, as it has a hybrid mode that allows Waii to automatically turn the thinking mode on or off based on the task. Waii relies on it to handle low-latency tasks.
- This means that if you want to use the 2.5 Pro model, you must always pair it with the 2.5 Flash model (and set reasoning to true). The Pro model is thinking-only, and the number of thinking tokens cannot be controlled (at least for the preview-05-06 version). If you only have the Pro model, it will be very slow.
AWS Bedrock Anthropic Sonnet (with embedding)
- name: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
  description: "Claude 3.5 Sonnet via Bedrock"
  external_name: "Claude 3.5 Sonnet" # specify the name of the model as it will be displayed in the UI, which looks nicer than the model name with long version numbers
  vendor: anthropic
  endpoints:
    - model_type: chat
      aws_region_name: us-west-2
      anthropic_version: "2023-06-01"
- name: bedrock/cohere.embed-english-v3
  model_type: embedding
  endpoints:
    - aws_access_key_id: <...>
      aws_secret_access_key: <...>
      aws_region_name: <...>
If you’re using AWS IAM roles, here is the IAM policy that needs to be attached to the Waii service to access Bedrock.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:Subscribe",
                "aws-marketplace:Unsubscribe",
                "aws-marketplace:ViewSubscriptions"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:Subscribe"
            ],
            "Resource": "*",
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "aws-marketplace:ProductId": [
                        "prod-m5ilt4siql27k",
                        "prod-cx7ovbu5wex7g",
                        "prod-ozonys2hmmpeu",
                        "prod-fm3feywmwrog",
                        "a61c46fe-1747-41aa-9af0-2e0ae8a9ce05",
                        "216b69fd-07d5-4c7b-866b-936456d68311",
                        "prod-tukx4z3hrewle",
                        "prod-nb4wqmplze2pm",
                        "b7568428-a1ab-46d8-bab3-37def50f6f6a",
                        "38e55671-c3fe-4a44-9783-3584906e7cad"
                    ]
                }
            }
        }
    ]
}
The above list of ProductIds is copied from https://docs.aws.amazon.com/bedrock/latest/userguide/security_iam_id-based-policy-examples.html
For the product ID mappings of all Bedrock models, you can refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-access-permissions.html
We recommend using Sonnet 3.5 (or Sonnet 3.5 V2) as the LLM, and Cohere Embed (English) as the embedding model.
You can also include guardrail_identifier and guardrail_version if applicable.
For example, guardrail_identifier: arn:aws:bedrock:us-east-1:855284387825:guardrail/h4ykjllg2tl6 and guardrail_version: 3 as below:
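A sketch of a Bedrock endpoint with these guardrail fields; the model name follows the earlier Bedrock example, and the region is chosen to match the guardrail ARN:
- name: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
  model_type: chat
  endpoints:
    - aws_region_name: us-east-1
      anthropic_version: "2023-06-01"
      guardrail_identifier: arn:aws:bedrock:us-east-1:855284387825:guardrail/h4ykjllg2tl6
      guardrail_version: 3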
OpenAI compatible endpoint (e.g. via vLLM)
- name: openai/my-llm
  description: "My custom OpenAI-compatible LLM"
  external_name: "My LLM"
  model_type: chat
  endpoints:
    - api_key: sk-...
      api_base: https://my-llm-api.com
- name: openai/my-emb
  model_type: embedding
  endpoints:
    - api_base: # such as https://hosted-vllm-api.co
Please note that the model name must be prefixed with openai/; Waii uses this prefix to determine which LLM API to use.
api_base must be the base URL of the endpoint, and model_type must be either chat or embedding.
After you have configured the endpoint configuration file, you can use the model name in the model file (see below).
When you start the server, you should see logs on stdout that look like:
Initializing chat with args {'api_base': '...', 'model': 'openai/my-llm'}
Initializing embedding with args {'api_base': '...', 'model': 'openai/my-emb'}
Checking health of chat model={'api_base': '...', 'model': 'openai/my-llm'}
Successfully health checked chat model={'api_base': '...', 'model': 'openai/my-llm'}
Successfully health checked embedding model={'api_base': '...', 'model': 'openai/my-emb'}
This indicates that the LLM endpoints are connected successfully.
Specify multiple endpoints for a single model (one OpenAI, one Azure)
- name: gpt-4o
  model_type: chat
  endpoints:
    - api_key: sk-...
      api_base: https://your-resource.openai.azure.com/
      deployment_name: gpt-azure-4o # this is your own deployment name
    - api_key: sk-...
      model: gpt-4o-2024-08-06
Specify multiple endpoints which can be used by different tenants (e.g. using different Bedrock application inference profiles)
- name: sonnet-3.5
  model_type: chat
  endpoints:
    - model: bedrock/<inference-profile-name-1>
      aws_access_key_id: ...
      aws_secret_access_key: xxxxxxxx
      aws_region_name: <...>
      anthropic_version: <...>
      access_policy:
        - user_id: "*"
          tenant_id: "*"
          org_id: "acme-corp"
    - model: bedrock/<inference-profile-name-2>
      aws_access_key_id: ...
      aws_secret_access_key: xxxxxxxx
      aws_region_name: <...>
      anthropic_version: <...>
      access_policy:
        - user_id: "*"
          tenant_id: "*"
          org_id: "acme-corp"
Note: the above example allows different tenants to use the same model name (sonnet-3.5) with different inference profiles.
A similar example allows different users to use the same model name (gpt-4o) with different Azure regions:
- name: gpt-4o
  model_type: chat
  endpoints:
    - api_key: sk-...
      api_base: https://your-resource-eu.openai.azure.com/
      deployment_name: gpt-azure-4o-eu
      access_policy:
        - user_id: "*"
          tenant_id: "*"
          org_id: "acme-corp-eu"
    - api_key: sk-...
      api_base: https://your-resource-us.openai.azure.com/
      deployment_name: gpt-azure-4o-us
      access_policy:
        - user_id: "*"
          tenant_id: "*"
          org_id: "acme-corp-us"
Specify different models for different tenants
- name: gpt-4o
  model_type: chat
  endpoints:
    - api_key: sk-...
      api_base: https://your-resource-eu.openai.azure.com/
      deployment_name: gpt-azure-4o-eu
      access_policy:
        - user_id: "*"
          tenant_id: "*"
          org_id: "paid-customer"
- name: gpt-4o-mini
  model_type: chat
  endpoints:
    - api_key: sk-...
      api_base: https://your-resource-us.openai.azure.com/
      deployment_name: gpt-azure-4o-mini-us
      access_policy:
        - user_id: "*"
          tenant_id: "*"
          org_id: "free-customer"
The above example allows different customers to use different models: free customers use gpt-4o-mini, while paid customers use gpt-4o.
Ollama endpoint
Ollama endpoints are OpenAI compatible as well and can be specified using the following patterns. Note that the model name must be prefixed with ollama/. Otherwise, follow the instructions above.
- name: ollama/<model> # such as llama3.3
  endpoints:
    - api_base: # such as http://localhost:11434
      model_type: chat
- name: ollama/<model> # such as nomic-embed-text
  endpoints:
    - api_base: # such as http://localhost:11434
      model_type: embedding
Reasoning model
A reasoning model config looks like this:
- name: fireworks_ai/deepseek-r1
  reasoning: true
  endpoints:
    - api_key: ...
      model_type: chat
- name: ... (other "regular" chat/embedding models)
  endpoints:
    - api_base: # such as http://localhost:11434
      model_type: embedding
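Since a reasoning model cannot replace the regular chat model, a complete configuration pairs the two. A minimal sketch, with placeholder keys and illustrative model names:
- name: fireworks_ai/deepseek-r1   # reasoning model, used when DEEPTHINK is enabled
  reasoning: true
  model_type: chat
  endpoints:
    - api_key: ...
- name: gpt-4o-2024-08-06          # regular chat model, marked as default
  model_type: chat
  default: true
  endpoints:
    - api_key: sk-...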
Migrate from old model file to new model file
If you are using Waii 1.27.x or below, you can refer to the following section to migrate to the new model file.
For 1.27, Waii needs two files:
- llm-config.yaml: to specify the LLM endpoints
- model.yaml: to specify the model name and which endpoints to use
For 1.28, Waii needs only one file:
- llm-config.yaml: to specify the LLM endpoints
The model.yaml contains the following fields for each model:
- name: Claude 3.5 Sonnet
  models:
  - bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
- name: GPT-4o
  models:
  - gpt-4o-2024-08-06
- name: Automatic
  description: (Pointed to GPT-4o)
  models:
  - gpt-4o-2024-08-06
The name field is the name of the model as it will be displayed in the UI, and can be specified as model when you generate query (QueryGenerationRequest).
The models field lists the underlying model names this entry maps to.
The description field is the description of the model, which will be displayed in the UI.
Steps for migration:
Step 1. When you are using 1.28, you can remove the models field, because the endpoint configuration file already contains the list of endpoints for each model.
Step 2. The description field can be added to the endpoint configuration file for each model:
- name: gpt-4o
  description: "GPT-4o (a model from OpenAI)"
  ...
Step 3. Migrate Automatic model:
You can add the default field to the endpoint configuration file; only one model can be marked as default in the entire file.
- name: gpt-4o
  # this is same as 'Automatic' model in the old model file
  default: true
  ...
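Putting the steps together, the old model.yaml shown above could be migrated into a single endpoint configuration file along these lines; the endpoint details (keys, region, anthropic_version) are placeholders:
- name: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
  external_name: "Claude 3.5 Sonnet"
  description: "Claude 3.5 Sonnet via Bedrock"
  model_type: chat
  endpoints:
    - aws_region_name: us-west-2
      anthropic_version: "2023-06-01"
- name: gpt-4o-2024-08-06
  external_name: "GPT-4o"
  description: "GPT-4o (a model from OpenAI)"
  # replaces the old 'Automatic' entry, which pointed to GPT-4o
  default: true
  model_type: chat
  endpoints:
    - api_key: sk-...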
Backward compatibility: we will continue to support the old model file, but no new features will be added to it, and it will be removed in the future.
If you want to use new features, such as specifying multiple endpoints for a single model or different models for different tenants, you have to use the new model file.
LLM rate limit setting
In order to support multiple users or to run query generation in parallel, you should make sure the LLM endpoint used by Waii has a sufficient rate limit.
Recommendation:
- For both the embedding endpoints (such as text-embedding-ada-002) and the LLM endpoints (such as OpenAI, Sonnet 3.5, etc.), make sure the tokens-per-minute (TPM) limit is at least 100K (300K is better), and the requests-per-minute (RPM) limit is at least 500 (2,000 is better).
Please note that the average token usage will be much lower than the TPM limit, but it is good to have a buffer for peak usage (especially when you run multiple queries in parallel, or when you add a database, because the initial knowledge graph build consumes more tokens for the first few minutes).