Usage Guide

This guide covers training models and running inference with the llm project.

For installation instructions, see the README. For development setup (testing, linting, Docker), see the Development Guide.

Training

Using the CLI

The recommended way to train models is using the llm-train CLI:

llm-train --task <task_name> [options]

Use --help to see all available options:

llm-train --help

Examples:

# Regression task (uses synthetic data, works out of the box)
llm-train --task regression --epochs 10 --batch-size 32

# Language modeling task (requires dataset configuration)
# Note: The lm task uses TextDataModule which needs a configured dataset.
# Use a config file or the standalone script below for LM training.
llm-train --task lm --config-path configs/example.yaml --epochs 5

[!NOTE] The --task lm option requires dataset configuration via a YAML config file. For quick experimentation, use the standalone decoder training script instead.

Standalone Simple Decoder Training

For a simple example of training a decoder-only model on a text file:

uv run scripts/train_simple_decoder.py --file-path data/dummy_corpus.txt --epochs 5

Common Options:

  • --file-path: Path to the training text file (Required)
  • --val-file-path: Path to the validation text file
  • --device: cpu or cuda (auto-detected by default)
  • --epochs, --batch-size, --lr: Training hyperparameters
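
For example, to train with a separate validation file on a GPU (the validation file path below is a placeholder):

uv run scripts/train_simple_decoder.py \
    --file-path data/dummy_corpus.txt \
    --val-file-path data/dummy_val.txt \
    --device cuda \
    --epochs 10 --batch-size 32 --lr 3e-4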

Inference

To generate text using a trained model, you can use the generate function from src/llm/inference.py.

Python API Example (Simple)

Here's a basic example using the built-in SimpleCharacterTokenizer:

import torch
from llm.models.decoder import DecoderModel
from llm.tokenization.simple_tokenizer import SimpleCharacterTokenizer
from llm.inference import generate

# 1. Create a simple tokenizer
corpus = ["Hello world", "This is a test"]
tokenizer = SimpleCharacterTokenizer(corpus)

# 2. Initialize Model
# Configuration should match the tokenizer's vocab size
model = DecoderModel(
    vocab_size=tokenizer.vocab_size,
    hidden_size=64,
    num_layers=2,
    num_heads=4,
    max_seq_len=128,
)
# Load weights here if available
# model.load_state_dict(torch.load("path/to/model.pt"))

# 3. Generate Text
generated_text = generate(
    model=model,
    tokenizer=tokenizer,
    prompt="Hello",
    max_new_tokens=20,
    temperature=0.8
)
print(generated_text)

Python API Example (HuggingFace)

For production use with pre-trained tokenizers, use HFTokenizer:

import torch
from llm.models.decoder import DecoderModel
from llm.tokenization.tokenizer import HFTokenizer
from llm.inference import generate

# 1. Load Tokenizer (e.g., GPT-2 from HuggingFace)
# Note: This requires transformers library: pip install transformers
tokenizer = HFTokenizer.from_pretrained("gpt2")

# 2. Initialize Model
# IMPORTANT: vocab_size must match the tokenizer
model = DecoderModel(
    vocab_size=tokenizer.vocab_size,  # GPT-2: 50257
    hidden_size=768,
    num_layers=12,
    num_heads=12,
    max_seq_len=1024,
)

# 3. Load trained weights (if available)
# checkpoint = torch.load("path/to/checkpoint.pt")
# model.load_state_dict(checkpoint["model_state_dict"])

# 4. Generate Text
generated_text = generate(
    model=model,
    tokenizer=tokenizer,
    prompt="Once upon a time",
    max_new_tokens=50,
    temperature=0.9,
    top_p=0.95
)
print(generated_text)

Inference Serving

This project includes a production-ready REST API for serving inference, built with FastAPI.

Features

  • Streaming Support: Server-Sent Events (SSE) for real-time token generation.
  • Advanced Sampling: Support for top_p (Nucleus Sampling) and repetition_penalty.
  • Production Ready: Structured logging, Prometheus metrics, and API Key authentication.

Starting the Server

Using CLI (Recommended):

llm-serve

Using Docker:

make image
make compose-up

API Usage

POST /generate

Generate text from a prompt.

Request Body:

{
  "prompt": "Hello, world",
  "max_new_tokens": 50,
  "temperature": 0.8,
  "top_k": 5,
  "top_p": 0.9,
  "repetition_penalty": 1.1,
  "stream": false
}

Streaming Request:

Set "stream": true to receive a stream of tokens.

curl -X POST "http://127.0.0.1:8000/generate" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "Tell me a story", "stream": true}'

Response (Non-streaming):

{
  "generated_text": "Hello, world! This is a generated text...",
  "token_count": 12
}
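
The endpoint can also be called from Python with any HTTP client. A minimal sketch using requests, assuming the default host and port and no API key configured:

import requests

# Build a non-streaming request using the documented fields.
payload = {
    "prompt": "Hello, world",
    "max_new_tokens": 50,
    "temperature": 0.8,
    "top_p": 0.9,
}

response = requests.post("http://127.0.0.1:8000/generate", json=payload)
response.raise_for_status()

result = response.json()
print(result["generated_text"])
print("Tokens generated:", result["token_count"])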

Authentication

If LLM_SERVING_API_KEY is set, you must provide the key in the X-API-Key header.

export LLM_SERVING_API_KEY="my-secret-key"
# Start server...

curl -X POST "http://127.0.0.1:8000/generate" \
     -H "X-API-Key: my-secret-key" \
     ...

Configuration

You can configure the serving engine using environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| LLM_SERVING_MODEL_PATH | Path to model checkpoint file | None (Dummy Model) |
| LLM_SERVING_DEVICE | Computation device (cpu, cuda, auto) | auto |
| LLM_SERVING_API_KEY | API Key for authentication | None (Disabled) |
| LLM_SERVING_LOG_LEVEL | Logging level (INFO, DEBUG, etc.) | INFO |
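
For example, to serve a trained checkpoint on a GPU with authentication enabled (the checkpoint path is a placeholder):

export LLM_SERVING_MODEL_PATH="checkpoints/model.pt"
export LLM_SERVING_DEVICE="cuda"
export LLM_SERVING_API_KEY="my-secret-key"
llm-serve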

Metrics

Prometheus metrics are available at /metrics.

curl http://127.0.0.1:8000/metrics

Performance Benchmarking

A benchmark script is provided to measure inference performance (latency and tokens per second, TPS).

# Run benchmark with torch.compile enabled
uv run scripts/benchmark_inference.py --compile --runs 5

Arguments:

  • --runs: Number of benchmark iterations.
  • --compile: Enable torch.compile optimization.
  • --device: Target device (e.g., cuda).
  • --max_new_tokens: Number of tokens to generate per run.
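
For example, a longer GPU run (the argument values are illustrative):

uv run scripts/benchmark_inference.py --device cuda --runs 10 --max_new_tokens 128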

OpenAI-Compatible API

The serving module provides an OpenAI-compatible endpoint, allowing you to use the official openai Python SDK.

Endpoint

POST /v1/chat/completions

Request Body

{
  "model": "llm",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 50,
  "temperature": 0.8,
  "stream": false
}

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | "llm" | Model identifier (ignored, for compatibility) |
| messages | array | required | List of chat messages |
| max_tokens | int | 50 | Maximum tokens to generate |
| temperature | float | 1.0 | Sampling temperature (0-2) |
| top_p | float | null | Nucleus sampling parameter |
| stream | bool | false | Enable streaming response |
| presence_penalty | float | 0.0 | Mapped to repetition_penalty |

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1704556800,
  "model": "llm",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 10,
    "total_tokens": 30
  }
}

Authentication

The endpoint supports both the X-API-Key header and Bearer token authentication:

# Using X-API-Key
curl -X POST "http://localhost:8000/v1/chat/completions" \
     -H "X-API-Key: your-key" \
     -H "Content-Type: application/json" \
     -d '{"messages": [{"role": "user", "content": "hi"}]}'

# Using Bearer token (OpenAI SDK style)
curl -X POST "http://localhost:8000/v1/chat/completions" \
     -H "Authorization: Bearer your-key" \
     -H "Content-Type: application/json" \
     -d '{"messages": [{"role": "user", "content": "hi"}]}'

Using OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-key"  # or any string if auth is disabled
)

response = client.chat.completions.create(
    model="llm",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=50,
    temperature=0.8
)

print(response.choices[0].message.content)

Streaming with OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="test")

stream = client.chat.completions.create(
    model="llm",
    messages=[{"role": "user", "content": "Tell me a story"}],
    max_tokens=100,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)