Inference Gateway for FIRST toolkit

FIRST (Federated Inference Resource Scheduling Toolkit) is a system that enables LLM (Large Language Model) inference as a service, allowing secure, remote execution of LLMs through an OpenAI-compatible API. FIRST's Inference Gateway is a RESTful API that validates and authorizes inference requests to scientific computing clusters using Globus Auth and Globus Compute.

Table of Contents

  • System Architecture
  • Prerequisites
  • Setup Overview
  • Gateway Setup
  • Inference Backend Setup (Remote/Local)
  • Connecting Gateway and Backend
  • Starting the Services
  • Verifying the Setup
  • Benchmarking
  • Production Considerations (Nginx)
  • Monitoring
  • Example Usage Documentation
  • Troubleshooting

System Architecture

The Inference Gateway consists of several components:

  • API Gateway: Django-based REST/Ninja API that handles authorization and request routing.
  • Globus Auth: Authentication and authorization service.
  • Globus Compute Endpoints: Remote execution framework on HPC clusters (or local machines).
  • Inference Server Backend: High-performance inference service for LLMs (e.g., vLLM), running alongside the Globus Compute Endpoint.
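
Because the Gateway exposes an OpenAI-compatible API, clients can talk to it with standard OpenAI tooling once they hold a Globus access token. The following is a minimal client-side sketch, not the definitive interface: the base URL path, model name, and token are placeholders, and concrete working examples appear in the Verifying the Setup section below.

import openai

# All values below are placeholders; see "Verifying the Setup" for real ones.
client = openai.OpenAI(
    base_url="https://<your-gateway-domain>/resource_server/v1",  # OpenAI-compatible route on the Gateway
    api_key="<globus-access-token>",  # a Globus Auth access token, not an OpenAI key
)

response = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Hello from FIRST"}],
)
print(response.choices[0].message.content)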

Prerequisites

  • Python 3.11+
  • PostgreSQL Server (included in the Docker deployment)
  • Poetry
  • Docker and Docker Compose (Recommended for Gateway deployment)
  • Globus Account
  • Access to a compute resource (HPC cluster or a local machine with sufficient resources for the chosen inference server and models)

Setup Overview

The setup involves two main parts:

  1. Gateway Setup: Installing and configuring the central API gateway service.
  2. Inference Backend Setup: Setting up the inference server (like vLLM) and the Globus Compute components on the machine(s) where models will run.

These parts can be done in parallel, but configuration details from each are needed to link them.

Gateway Setup

This section covers setting up the central Django application.

Installation (Docker or Bare Metal)

Clone the repository first:

git clone https://github.com/auroraGPT-ANL/inference-gateway.git
cd inference-gateway

Option 1: Docker Deployment (Recommended)

# Create the directories needed by docker-compose.yml
mkdir -p logs prometheus
# Create a basic prometheus config if you don't have one
# printf "global:\n  scrape_interval: 15s\nscrape_configs:\n  - job_name: 'prometheus'\n    static_configs:\n      - targets: ['localhost:9090']\n" > prometheus/prometheus.yml

# Configuration is done via the .env file (see next steps)

See Starting the Services for how to run this after configuration. See docker-compose.yml for details on included services (Postgres, Redis, optional monitoring).

Option 2: Bare Metal Setup / Local Development (requires a PostgreSQL server)

# Set up Python environment with Poetry
poetry config virtualenvs.in-project true
poetry env use python3.11
poetry install

# Activate the environment
poetry shell
# Can also use 'source .venv/bin/activate'

# Ensure PostgreSQL server is running and accessible.
# Configuration is done via the .env file (see next steps)

Register Two Globus Applications

Service API Application

To handle authorization within the API, the Gateway needs to be registered as a Globus Service API application:

  1. Visit developers.globus.org and sign in.
  2. Under Register an..., click on Register a service API ....
  3. Select "none of the above - create a new project" or select one of your existing projects.
  4. Complete the new project form (not needed if you selected an existing project).
  5. Complete the registration form:
    • Set App Name (e.g., "My Inference Gateway").
    • Add Redirect URIs. For local development with the default Django server (runserver), use http://localhost:8000/complete/globus/. For production, use https://<your-gateway-domain>/complete/globus/.
    • You can leave the checkboxes at their default settings.
    • Set Privacy Policy and Terms & Conditions URLs if applicable.
  6. After registration, a Client UUID will be assigned to your Globus application. Generate a Client Secret by clicking on the Add Client Secret button on the right-hand side. You will need both for the .env configuration. The UUID will be for GLOBUS_APPLICATION_ID, and the secret will be for GLOBUS_APPLICATION_SECRET.

Add a Globus Scope to your Service API Application

A scope is needed for users to generate an access token to access the inference service. First export the API client credentials in a terminal:

export CLIENT_ID="<Your-Gateway-Service-API-Globus-App-Client-UUID>"
export CLIENT_SECRET="<Your-Gateway-Service-API-Globus-App-Client-Secret>"

Make a request to Globus Auth to attach an action_all scope to your API client. The curl command below also embeds a dependent Globus Groups scope in the main scope, which the API needs in order to query the user's Group memberships during the authorization process.

curl -X POST -s --user $CLIENT_ID:$CLIENT_SECRET \
    https://auth.globus.org/v2/api/clients/$CLIENT_ID/scopes \
     -H "Content-Type: application/json" \
     -d '{
            "scope": {
                "name": "Action Provider - all",
                "description": "Access to inference service.",
                "scope_suffix": "action_all",
                "dependent_scopes": [
                    {
                        "scope": "73320ffe-4cb4-4b25-a0a3-83d53d59ce4f",
                        "optional": false,
                        "requires_refresh_token": true
                    }
                ]
            }
         }'

To verify that the scope was successfully created, query the details of your API client, and look for the UUID in the scopes field:

curl -s --user $CLIENT_ID:$CLIENT_SECRET https://auth.globus.org/v2/api/clients/$CLIENT_ID

Query the details of your newly created scope (you should see 73320ffe-4cb4-4b25-a0a3-83d53d59ce4f in the dependent_scopes field):

export SCOPE_ID="<copy-paste-your-scope-uuid-here>"
curl -s --user $CLIENT_ID:$CLIENT_SECRET https://auth.globus.org/v2/api/clients/$CLIENT_ID/scopes/$SCOPE_ID

Service Account Application

To handle the communication between the Gateway API and the compute resources (the Inference Backend), you need to create a Globus Service Account application. This application represents the Globus identity that will own the Globus Compute endpoints.

  1. Visit developers.globus.org and sign in.
  2. Under Projects, click on the project used to register your Service API application from the previous step.
  3. Click on Add an App.
  4. Select Register a service account ....
  5. Complete the registration form:
    • Set App Name (e.g., "My Inference Endpoints").
    • Set Privacy Policy and Terms & Conditions URLs if applicable.
  6. After registration, a Client UUID will be assigned to your Globus application. Generate a Client Secret by clicking on the Add Client Secret button on the right-hand side. You will need both for the .env configuration. The UUID will be for POLARIS_ENDPOINT_ID, and the secret will be for POLARIS_ENDPOINT_SECRET.

Configure Environment (.env)

Create a .env file in the project root (inference-gateway/). This file is used by both Docker and bare-metal setups (if using python-dotenv).

# --- Core Django Settings ---
SECRET_KEY="<generate-a-strong-random-key>" # Can be generated with Django, e.g. python -c 'from django.core.management.utils import get_random_secret_key; print(get_random_secret_key())'
DEBUG=True # Set to False for production
ALLOWED_HOSTS="localhost,127.0.0.1" # Add your gateway domain/IP for production

# --- Globus Credentials (from the "Register Globus Application" step) ---
# Client ID and Secret of the Globus Service API application
GLOBUS_APPLICATION_ID="<Your-Gateway-Service-API-Globus-App-Client-UUID>"
GLOBUS_APPLICATION_SECRET="<Your-Gateway-Service-API-Globus-App-Client-Secret>"
# Optional: Restrict access to specific Globus Groups (space-separated UUIDs)
# GLOBUS_GROUPS="<group-uuid-1> <group-uuid-2>"
# Optional: Enforce specific Identity Provider usage (JSON string)
# AUTHORIZED_IDPS='{"Your Institution Domain": "your-institution-uuid"}'
# AUTHORIZED_GROUPS_PER_IDP='{"Institution Domain": "comma-separated list of group uuids"}'
# Optional: Enforce Globus high assurance policies (space-separated UUIDs)
# GLOBUS_POLICIES="<policy-uuid-1>"
# Client ID and Secret of the Globus Service Account application
POLARIS_ENDPOINT_ID="<Your-Service-Account-Globus-App-Client-UUID>"
POLARIS_ENDPOINT_SECRET="<Your-Service-Account-Globus-App-Client-Secret>"

# --- CLI Authentication Helper ---
# Public Client ID used by the inference-auth-token.py script for user authentication
CLI_AUTH_CLIENT_ID="58fdd3bc-e1c3-4ce5-80ea-8d6b87cfb944" # Default public client, replace if needed
# Optional: Comma-separated list of allowed domains for CLI login (e.g., "anl.gov,alcf.anl.gov")
# CLI_ALLOWED_DOMAINS="anl.gov,alcf.anl.gov"
# Optional: Override token storage directory for CLI script
# CLI_TOKEN_DIR="~/.globus/my_custom_token_dir"
# Optional: Override App Name used by CLI script
# CLI_APP_NAME="my_custom_cli_app"


# --- Database Credentials ---
# Used by Django Gateway, Postgres container, postgres-exporter
POSTGRES_DB="inferencegateway"
POSTGRES_USER="inferencedev"
POSTGRES_PASSWORD="inferencedevpwd" # CHANGE THIS for production
# Hostname: Use "postgres" for Docker-compose networking.
# Use "localhost" for bare-metal if DB is local.
# Use "host.docker.internal" if Gateway runs in Docker but DB runs on the host machine.
PGHOST="postgres" # Important: Use "postgres" for Docker
PGPORT=5432
PGUSER="inferencedev" # Should match POSTGRES_USER above
PGPASSWORD="inferencedevpwd" # CHANGE THIS for production
PGDATABASE="inferencegateway"

# --- Redis --- Used for caching, async tasks
# Use "redis" for Docker-compose networking.
# Use "localhost" (or relevant hostname) for bare-metal.
REDIS_URL="redis://redis:6379/0"

# --- Gateway Specific Settings ---
MAX_BATCHES_PER_USER=2 # Max concurrent batch jobs allowed per user

# --- Streaming Configuration ---
# Internal streaming server configuration (required for streaming functionality)
STREAMING_SERVER_HOST="localhost:8080" # Host and port of your internal streaming server
INTERNAL_STREAMING_SECRET="your-internal-streaming-secret-key" # Secret key for internal streaming authentication
# Both /chat/completions and /completions endpoints support streaming via the 'stream' parameter
# However, 'stream_options' parameter is only supported in /completions endpoint

# --- Optional: Grafana Admin Credentials (for Docker setup) ---
# GF_SECURITY_ADMIN_USER=admin
# GF_SECURITY_ADMIN_PASSWORD=admin

# QSTAT ENDPOINTS
# SOPHIA_QSTAT_ENDPOINT_UUID=""
# SOPHIA_QSTAT_FUNCTION_UUID=""

Important: Securely store all of your credentials and secrets, especially in production. The CLI_AUTH_CLIENT_ID is typically a public client and doesn't need to be kept secret.
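
For a bare-metal setup, a quick way to confirm that the .env file is being picked up is to load it with python-dotenv and print a few non-secret values. This is only a sketch (the Docker Compose deployment injects these variables into the containers for you):

# Assumes python-dotenv is installed and you run this from the project root.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ("GLOBUS_APPLICATION_ID", "PGHOST", "PGDATABASE", "REDIS_URL"):
    print(f"{key} = {os.environ.get(key, '<missing>')}")  # secrets intentionally omitted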

Initialize Gateway Database

Once you have configured the .env file, initialize the Gateway's database schema.

Option 1: Docker (also works with Podman)

First, build and run the Gateway's containers (there should be 7 in total):

docker-compose -f docker-compose.yml up -d

Initialize the database:

docker-compose -f docker-compose.yml exec inference-gateway python manage.py makemigrations
docker-compose -f docker-compose.yml exec inference-gateway python manage.py migrate

Option 2: Bare Metal

If you are not using Docker, you must have a PostgreSQL service running.

# Ensure DB connection vars are exported in your shell or use python-dotenv
# (e.g., export PGHOST=localhost POSTGRES_USER=...)
python manage.py makemigrations
python manage.py migrate
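
To sanity-check that the Gateway can actually reach the database after migrating (bare metal), you can open a connection through Django itself. A minimal sketch, assuming the settings module is inference_gateway.settings and the database variables are available in your environment (adjust if your project layout differs):

import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "inference_gateway.settings")  # assumed module path

import django
django.setup()

from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SELECT 1")
    print("Database reachable:", cursor.fetchone() == (1,))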

Inference Backend Setup (Remote/Local)

This section covers setting up the components on the machine where the AI models will actually run (e.g., an HPC compute node, a powerful workstation).

Virtual Python Environment

All of the instructions below must be run within a Python virtual environment. Make sure to use a virtual environment with the same Python version as the one used to deploy the Gateway API (Python 3.11 in this example). This avoids version mismatch errors when using Globus Compute.

# Example 1 with conda
conda create -n vllm-env python=3.11 -y
conda activate vllm-env

# Example 2 with python venv
python3.11 -m venv vllm-env
source vllm-env/bin/activate  # On Windows use `vllm-env\Scripts\activate`

Install Inference Server (e.g., vLLM) and Globus Compute

Choose and install an inference serving framework. vLLM is recommended for performance with many transformer models:

# Make sure you have activated your virtual environment

# Basic vLLM installation from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
# For specific hardware acceleration (CUDA, ROCm), follow official docs:
# https://docs.vllm.ai/en/latest/getting_started/installation.html

Install the Globus Compute Endpoint software and the Globus Compute SDK:

pip install globus-compute-sdk globus-compute-endpoint

Register Globus Compute Functions

The Gateway interacts with the inference server via functions registered with Globus Compute. You need to register:

  1. Inference Function: Wraps the call to your inference server (e.g., vLLM OpenAI-compatible endpoint).
  2. Status Function (Optional but Recommended): Queries the cluster scheduler (e.g., PBS qstat) and node status to aid federated routing.

Important: When registering Globus Compute functions and endpoints, you need to explicitly tie the function/endpoint identity back to your Globus Service Account application (not the Service API application). Do this by exporting the client ID (value of POLARIS_ENDPOINT_ID in .env) and Secret (value of POLARIS_ENDPOINT_SECRET in .env) as environment variables, before running the registration script or configuring the endpoint:

export GLOBUS_COMPUTE_CLIENT_ID="<Value-of-POLARIS_ENDPOINT_ID-from-.env>"
export GLOBUS_COMPUTE_CLIENT_SECRET="<Value-of-POLARIS_ENDPOINT_SECRET-from-.env>"

Now, register the necessary functions:

# Ensure you are in the correct Python environment

# Navigate to the `compute-functions` directory in your local clone of the repository.
cd path/to/inference-gateway/compute-functions

# Register the vLLM inference function (modify the script if needed)
# See compute-functions/vllm_register_function.py
python vllm_register_function.py
# Note the output Function UUID (e.g., <vllm-function-uuid>)
# Output example:
#   Function registered with UUID - ....
#   The UUID is stored in vllm_register_function_sophia_multiple_models.txt.

# Register the qstat/status function (modify script for your scheduler if needed)
python qstat_register_function.py
# Note the output Function UUID (e.g., <qstat-function-uuid>)

# (Register other functions like batch processing if needed)

Important: Keep track of the Function UUIDs generated.
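
For reference, the registration scripts boil down to registering a plain Python function with the Globus Compute SDK. The sketch below is a simplified stand-in for vllm_register_function.py, not a copy of it: the function body and payload format are hypothetical, and the actual script in compute-functions/ is authoritative.

from globus_compute_sdk import Client

def vllm_chat_completion(payload, api_port=8000):
    """Hypothetical wrapper: forward an OpenAI-style request to a local vLLM server."""
    import requests  # imports live inside the function so they resolve on the endpoint
    url = f"http://localhost:{api_port}/v1/chat/completions"
    response = requests.post(url, json=payload, timeout=600)
    return response.json()

gcc = Client()  # picks up GLOBUS_COMPUTE_CLIENT_ID/SECRET from the environment
function_uuid = gcc.register_function(vllm_chat_completion)
print(f"Function registered with UUID - {function_uuid}")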

Configure a Globus Compute Endpoint

This endpoint runs on the backend machine, listens for tasks from Globus Compute, and executes the registered functions.

# Ensure you are in the correct Python environment

# Configure a new endpoint (follow prompts)
# Here we use "my-compute-endpoint" for the name, but you can choose another (e.g., local-vllm)
globus-compute-endpoint configure my-compute-endpoint

# This creates a configuration directory, e.g., ~/.globus_compute/my-compute-endpoint/
# Edit the config.yaml inside that directory.

Note on HPC Configuration Examples:

See local-vllm-endpoint.yaml for a configuration example that submits tasks on local hardware. For an example of submitting inference tasks through a cluster scheduler like PBS (specifically on ALCF's Sophia cluster), see sophia-vllm-config-template-v2.0.yaml. Adapting cluster examples requires understanding the specific HPC environment (scheduler, modules, file paths, etc.). Refer to Globus Compute Endpoint Docs for all options.

If you adapt one of the configuration examples above, make sure to edit the following:

  • allowed_functions: Make sure the function UUIDs (one per line) are the ones you registered in the Register Globus Compute Functions section.
  • worker_init: For local setups, you'll typically activate your environment directly (e.g., source vllm-env/bin/activate). For cluster setups like the Sophia example, you might use a shared setup script (e.g., source <my-path>inference-gateway/compute-endpoints/common_setup.sh). Ensure you point to the correct setup script or activation command for your environment.

After configuring config.yaml:

# Start the endpoint
globus-compute-endpoint start my-compute-endpoint

# Note the Endpoint UUID displayed after starting.
# You can always recover the UUID by typing `globus-compute-endpoint list`

Keep track of the Endpoint UUID.
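
Before wiring the endpoint into the Gateway, you can optionally smoke-test it with the Globus Compute SDK Executor. The sketch below assumes the service-account credentials (GLOBUS_COMPUTE_CLIENT_ID / GLOBUS_COMPUTE_CLIENT_SECRET) are still exported; the keyword arguments are hypothetical and must match your registered function's actual signature.

from globus_compute_sdk import Executor

ENDPOINT_UUID = "<endpoint-uuid-from-globus-compute-endpoint-list>"
FUNCTION_UUID = "<vllm-function-uuid-from-the-registration-step>"

with Executor(endpoint_id=ENDPOINT_UUID) as executor:
    # kwargs here are hypothetical; pass whatever your registered function expects.
    future = executor.submit_to_registered_function(
        FUNCTION_UUID,
        kwargs={"payload": {"model": "facebook/opt-125m",
                            "messages": [{"role": "user", "content": "ping"}]}},
    )
    print(future.result(timeout=300))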

Connecting Gateway and Backend

Update Fixtures

Edit the relevant fixtures file in the Gateway project directory (inference-gateway/fixtures/).

For federated access, edit federated_endpoints.json.

Example: fixtures/federated_endpoints.json

[
    {
        "model": "resource_server.federatedendpoint",
        "pk": 1,
        "fields": {
            "name": "OPT 125M (Federated)",
            "slug": "federated-opt-125m",
            "target_model_name": "facebook/opt-125m", // Model users request
            "description": "Federated access point for the facebook/opt-125m model.",
            "targets": [
                {
                    "cluster": "local",
                    "framework": "vllm",
                    "model": "facebook/opt-125m", // Model served by this specific target
                    "endpoint_slug": "local-vllm-facebook-opt-125m", // Unique ID for this target
                    "endpoint_uuid": "<local-endpoint-UUID-from-previous-step>",
                    "function_uuid": "<local-vllm-function-uuid-from-previous-step>",
                    "api_port": 8001 // Port your local vLLM server runs on
                }
                // Add more targets here for the same model on different clusters/frameworks
                // Example: A target on an HPC cluster like Sophia
                // {
                //    "cluster": "sophia",
                //    "framework": "vllm",
                //    "model": "facebook/opt-125m", // Assuming OPT-125m is also deployed there
                //    "endpoint_slug": "sophia-vllm-facebook-opt-125m",
                //    "endpoint_uuid": "<sophia-endpoint-UUID>",
                //    "function_uuid": "<sophia-vllm-function-UUID>",
                //    "api_port": 8000 // Port on Sophia's compute node
                // }
            ]
        }
    }
    // Add more FederatedEndpoint entries for other models
]

For standard, non-federated access, edit endpoints.json.

Example: fixtures/endpoints.json

[
    {
        "model": "resource_server.endpoint",
        "pk": 1, // Or next available primary key
        "fields": {
            "endpoint_slug": "local-vllm-facebook-opt-125m", // Example for local OPT-125m
            "cluster": "local", // Or wherever your compute endpoint is hosted
            "framework": "vllm",
            "model": "facebook/opt-125m", // Model served by this endpoint
            "api_port": 8001, // Example port for local vLLM server
            "endpoint_uuid": "<local-endpoint-uuid-from-previous-step>",
            "function_uuid": "<local-vllm-function-uuid-from-previous-step>",
            "batch_endpoint_uuid": "<optional-endpoint-for-batch>",
            "batch_function_uuid": "<optional-function-for-batch>",
            "allowed_globus_groups": "" // Optional: Restrict this target further
        }
    }
    // Add more Endpoint entries for other models/clusters
]

Replace placeholders (<...>) with the UUIDs and details from the previous steps, and remove the explanatory comments (// ...) before loading, since JSON files consumed by loaddata cannot contain comments.

Load Fixtures

Load the updated fixture file into the Gateway database.

Option 1: Docker:

Make sure the fixture json files are updated within the running container. You can do this by editing the files directly within the container:

docker-compose -f docker-compose.yml exec inference-gateway /bin/bash
# Then edit the /app/fixtures/endpoints.json (or federated_endpoints.json) file

or by transferring the edited local file directly into the container:

docker cp fixtures/endpoints.json inference-gateway_inference-gateway_1:/app/fixtures/
# Or: docker cp fixtures/federated_endpoints.json inference-gateway_inference-gateway_1:/app/fixtures/

Load the fixtures into the database:

docker-compose -f docker-compose.yml exec inference-gateway python manage.py loaddata fixtures/endpoints.json
# Or: docker-compose ... exec inference-gateway python manage.py loaddata fixtures/federated_endpoints.json

Option 2: Bare Metal:

python manage.py loaddata fixtures/endpoints.json
# Or: python manage.py loaddata fixtures/federated_endpoints.json

Starting the Services

Gateway (Docker or Bare Metal)

Docker:

# Start all services defined in docker-compose.yml (if not already running)
docker-compose -f docker-compose.yml up --build -d

Bare Metal (Development):

# Ensure DB connection vars are set
poetry shell
python manage.py runserver 0.0.0.0:8000 # Or your preferred port

Bare Metal (Production with Gunicorn):

# Ensure environment variables are set
poetry shell
poetry run gunicorn \
    inference_gateway.asgi:application \
    -k uvicorn.workers.UvicornWorker \
    -b 0.0.0.0:8000 \
    --workers 5 \
    --log-level info
    # Add other Gunicorn flags as needed (--threads, --timeout, log files etc.)

Inference Backend (Globus Compute Endpoint)

Ensure the Globus Compute endpoint is running on the backend machine:

# On the HPC login node / backend machine
globus-compute-endpoint start <my-endpoint-name>

Verify the associated inference server (e.g., vLLM) is started by the endpoint's worker initialization or job submission process.
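
You can also check the inference server directly on the backend machine before going through the Gateway. A minimal sketch, assuming a vLLM OpenAI-compatible server listening on port 8001 as in the fixture examples (adjust the port to your deployment):

import json
import urllib.request

# List the models served by the local vLLM OpenAI-compatible server.
with urllib.request.urlopen("http://localhost:8001/v1/models", timeout=10) as response:
    print(json.dumps(json.load(response), indent=2))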

Verifying the Setup

Once both the Gateway and at least one Backend Compute Endpoint (with its inference server) are running, you can send a test request. You'll need a valid Globus authentication token obtained for the gateway's scope.

Setting up ngrok for Testing Streaming (Optional)

For testing streaming functionality, especially when working with remote endpoints, you can use ngrok to create a secure tunnel to your local gateway:

  1. Install ngrok: Visit ngrok.com and follow the installation instructions for your platform.

  2. Start an ngrok tunnel: Create a tunnel to your local gateway (assuming it's running on port 8000):

    ngrok http 8000

    This will display a public URL (e.g., https://abcd1234.ngrok.io) that tunnels to your local gateway.

  3. Test streaming with Python script: Here's an example Python script for testing streaming functionality:

    import sys
    import json
    import time
    import openai
    
    if len(sys.argv) != 2:
        print("Usage: python test_streaming.py <access_token>")
        sys.exit(1)
    
    access_token = sys.argv[1]
    
    # Configure the OpenAI client to connect to your gateway via ngrok
    client = openai.OpenAI(
        # Replace with your ngrok URL or direct gateway URL
        base_url="https://your-ngrok-url.ngrok.io/resource_server/sophia/vllm/v1",
        # Use your Globus access token
        api_key=access_token
    )
    
    # Define your prompt
    messages = [
        {"role": "user", "content": "Explain what is the best way to learn python"}
    ]
    
    start_time = time.time()
    total_chars = 0
    
    print("🤖 AI Response: ", end="")
    
    # Call the chat completions endpoint with stream=True
    try:
        stream = client.chat.completions.create(
            model="openai/gpt-oss-20b",  # Replace with your deployed model
            messages=messages,
            stream=True
        )
        
        # Iterate over the stream to get the response chunks in real-time
        for chunk in stream:
            # Extract the content from the chunk (standard OpenAI format)
            if chunk.choices[0].delta.content is not None:
                content = chunk.choices[0].delta.content
                # Print the content without a newline and flush the buffer
                print(content, end="", flush=True)
                total_chars += len(content)
        
        # Print a final newline character after the stream is complete
        print()
        
    except Exception as e:
        print(f"\n❌ Error: {e}")
    
    end_time = time.time()
    latency = end_time - start_time
    print(f"\n⏱️ Streaming Latency: {latency:.2f} seconds")
    print(f"📦 Throughput: {total_chars/latency:.2f} chars/sec")

    Save this as test_streaming.py and run it with your access token:

    python test_streaming.py $MY_TOKEN
  4. Get a Token using the Helper Script:

    • Ensure your .env file has GLOBUS_APPLICATION_ID (for the gateway) and CLI_AUTH_CLIENT_ID (for the helper script; a default public one is provided) set. If your identity provider is not one of the defaults, also set CLI_ALLOWED_DOMAINS to target it (e.g., your institution's SSO); without CLI_ALLOWED_DOMAINS, only anl.gov and alcf.anl.gov identities are allowed.

    • Authenticate (First time or if tokens expire): Run the authentication script. This will open a browser window for Globus login.

      python inference-auth-token.py authenticate

      This stores refresh and access tokens locally (typically in ~/.globus/app/...).

    • Force Re-authentication: If you need to change Globus accounts or encounter permission errors possibly related to expired sessions or policy changes, log out via https://app.globus.org/logout and force re-authentication:

      python inference-auth-token.py authenticate --force
    • Get Access Token: Retrieve your current, valid access token:

      export MY_TOKEN=$(python inference-auth-token.py get_access_token)
      echo "Token stored in MY_TOKEN environment variable."
      # echo $MY_TOKEN # Uncomment to view the token directly

      This command automatically uses the stored refresh token to get a new access token if the current one is expired.

    • Token Validity: Access tokens are valid for 48 hours. Refresh tokens allow getting new access tokens without re-logging in via browser, but they expire after 6 months of inactivity. Some institutions or policies might enforce re-authentication more frequently (e.g., weekly).

  5. Send Request using cURL: You can adjust the model name and payload as appropriate.

    Example with a federated Globus Compute endpoint (assuming federated_endpoints.json was used):

    # Example using the federated endpoint for OPT-125m
    curl -X POST http://127.0.0.1:8000/resource_server/v1/chat/completions \
          -H "Authorization: Bearer $MY_TOKEN" \
          -H "Content-Type: application/json" \
          -d '{
            "model": "facebook/opt-125m",
            "messages": [
              {"role": "user", "content": "Explain the concept of Globus Compute in simple terms."}
            ],
            "max_tokens": 150
          }'

    Example with a standard, non-federated Globus Compute endpoint (assuming endpoints.json was used):

    # Example targeting a specific vLLM endpoint on the 'local' cluster for OPT-125m
    curl -X POST http://127.0.0.1:8000/resource_server/local/vllm/v1/chat/completions \
          -H "Authorization: Bearer $MY_TOKEN" \
          -H "Content-Type: application/json" \
          -d '{
            "model": "facebook/opt-125m",
            "messages": [
              {"role": "user", "content": "Explain the concept of Globus Compute in simple terms."}
            ],
            "max_tokens": 150
          }'

A successful response will be a JSON object containing the model's completion.

Benchmarking

A benchmark script is provided in examples/load-testing/benchmark-serving.py (adapted from vLLM's benchmarks) to test the performance of your deployed endpoints.

  1. Download Dataset: The script expects the ShareGPT dataset when run with --dataset-name sharegpt. You can download it manually or let the script attempt to download it. A common location to place it is examples/load-testing/ShareGPT_V3_unfiltered_cleaned_split.json.

    # Example: Download ShareGPT dataset (check the script for the exact URL if needed)
    wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -P examples/load-testing/
  2. Run Benchmark: Execute the script, pointing it to your gateway's API endpoint.

    # Example: Benchmark local federated endpoint for opt-125m
    python examples/load-testing/benchmark-serving.py \
        --backend vllm \
        --model facebook/opt-125m \
        --base-url http://127.0.0.1:8000/resource_server/v1/chat/completions \
        --dataset-name sharegpt \
        --dataset-path examples/load-testing/ShareGPT_V3_unfiltered_cleaned_split.json \
        --output-file benchmark_results_local_opt125m.jsonl \
        --num-prompts 100 # Adjust as needed
        # Add --request-rate or --max-concurrency for more controlled tests
        # Add --disable-ssl-verification if using self-signed certs locally
        # Add --disable-stream if benchmarking a non-streaming endpoint

    You can adapt the --base-url to point to other OpenAI-compatible endpoints (like the OpenAI API itself, Anthropic, etc., potentially requiring environment variables like OPENAI_API_KEY). See the script's arguments (--help) for more options.

Production Considerations (Nginx)

For production deployments (especially bare-metal), running Django/Gunicorn behind a reverse proxy like Nginx is highly recommended for:

  • HTTPS/SSL Termination: Securely handle TLS encryption.
  • Load Balancing: Distribute requests across multiple Gunicorn workers/instances.
  • Serving Static Files: Efficiently serve CSS, JS, images.
  • Security: Add rate limiting, header checks, etc.
  • Hostname Routing: Direct traffic based on domain names.

Example Nginx Configuration Snippet (/etc/nginx/sites-available/inference_gateway):

upstream app_server {
    # fail_timeout=0 means we always retry an upstream even if it failed
    # to return a good HTTP response (in case the Gunicorn worker recovers).
    server 127.0.0.1:8000 fail_timeout=0;
    # Add more servers here if running multiple Gunicorn instances
}

server {
    listen 80;
    # listen 443 ssl;
    server_name your-gateway-domain.org;

    # ssl_certificate /path/to/your/cert.pem;
    # ssl_certificate_key /path/to/your/key.pem;

    client_max_body_size 4G;

    location /static/ {
        alias /path/to/inference-gateway/static/;
    }

    location / {
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host $http_host;
        proxy_redirect off;
        proxy_pass http://app_server;
    }
}

Remember to collect static files (python manage.py collectstatic) and configure Nginx to serve them from the specified alias path. Consult Nginx documentation for details on SSL setup and other options.

Monitoring

If you deployed via Docker Compose with monitoring enabled, you can access the Grafana monitoring dashboard in your browser (see docker-compose.yml for the exposed port).

The Grafana dashboard includes:

  • Application metrics (request rates, latency, error rates)
  • System metrics (CPU, memory, disk I/O via Node Exporter)
  • Database metrics (connection counts, query performance via PostgreSQL Exporter)
  • Custom Gateway metrics (inference rates, token counts - requires metrics endpoint in Gateway)

Example Usage Documentation

For an example of how user-facing documentation might look for a deployed instance of this gateway, see the ALCF Inference Endpoints documentation.

Troubleshooting

  • Docker Nginx 404/502/504 Errors: Verify nginx_app.conf mount and upstream definition, check inference-gateway logs.
  • Database Connection Errors: Check .env variables (PGHOST, etc.) match context (Docker vs. Host vs. Bare-metal) and firewall/pg_hba.conf rules.
  • Globus Auth Errors: Ensure Redirect URIs match in Globus Developer portal and .env credentials are correct.
  • Compute Endpoint Issues: Check endpoint logs (~/.globus_compute/<endpoint_name>/endpoint.log) for function execution errors, environment problems, or connection issues.
  • 500 Server Errors on Gateway: Check gateway logs (docker-compose logs inference-gateway or Gunicorn log files) for Python tracebacks.
