Lemonade Server Integration

Set up local LLM inference with Lemonade Server for VS Code extensions

Lemonade Server provides a local OpenAI-compatible API endpoint that VS Code extensions (Kilo Code, Continue, Cline) can use for LLM inference. It runs as a systemd service and supports GPU acceleration via ROCm, CUDA, or Vulkan.

Overview

┌─────────────────────────────────────────────────────────────┐
│                        Host Machine                         │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │               Lemonade Server (:13305)                │  │
│  │                                                       │  │
│  │   /v1/chat/completions  → OpenAI-compatible API       │  │
│  │   /v1/embeddings        → Embedding API               │  │
│  │   /api/v1/load          → Model management            │  │
│  │   /api/v1/models        → List available models       │  │
│  │                                                       │  │
│  │   ┌─────────────────────────────────────────────┐     │  │
│  │   │              llama.cpp backend              │     │  │
│  │   │         (ROCm / CUDA / Vulkan / CPU)        │     │  │
│  │   └─────────────────────────────────────────────┘     │  │
│  └───────────────────────────────────────────────────────┘  │
│                              ▲                              │
│                              │ HTTP                         │
│  ┌───────────────────────────┴───────────────────────────┐  │
│  │                  VS Code Sandboxes                    │  │
│  │                                                       │  │
│  │   Kilo Code ──┐                                       │  │
│  │   Continue  ──┼──→ http://host:13305/v1/...           │  │
│  │   Cline     ──┘                                       │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
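
Any OpenAI-compatible client can point at this endpoint. For example, a minimal curl request from the host (a sketch; it assumes an API key is configured as described under API Keys and the default chat model is loaded):

curl -s http://localhost:13305/v1/chat/completions \
    -H "Authorization: Bearer your_key" \
    -H "Content-Type: application/json" \
    -d '{"model": "user.gemma-4-31b-it", "messages": [{"role": "user", "content": "Hello"}]}'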

Quick Setup

One-Command Setup

bash ./scripts/setup-lemonade.sh \
    --groups groups.yaml \
    --generate-keys \
    --external-ip YOUR_IP

This command:

  1. Installs Lemonade Server from PPA
  2. Configures GPU backend (auto-detected)
  3. Generates API keys
  4. Downloads the default chat model (gemma-4-31b-it)
  5. Downloads the default embedding model (harrier-oss-v1-0.6b)
  6. Creates kilo.json for Kilo Code (with indexing config)
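
To verify the result, a quick check using the service and CLI commands documented below:

# Service should be active
sudo systemctl status lemonade-server

# Both models should be registered
LEMONADE_API_KEY=your_key lemonade list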

Without Embedding Model

bash ./scripts/setup-lemonade.sh \
    --groups groups.yaml \
    --generate-keys \
    --external-ip YOUR_IP \
    --no-embedding

Manual Setup

# 1. Install from PPA
sudo add-apt-repository -y ppa:lemonade-team/stable
sudo apt-get update
sudo apt-get install -y lemonade-server

# 2. Configure
lemonade config set host=0.0.0.0 port=13305

# 3. Generate API keys
API_KEY=$(openssl rand -base64 32 | tr -d '/+=\n' | head -c 32)
ADMIN_KEY=$(openssl rand -base64 32 | tr -d '/+=\n' | head -c 32)

sudo mkdir -p /etc/systemd/system/lemonade-server.service.d
sudo tee /etc/systemd/system/lemonade-server.service.d/override.conf <<EOF
[Service]
Environment="LEMONADE_API_KEY=${API_KEY}"
Environment="LEMONADE_ADMIN_API_KEY=${ADMIN_KEY}"
EOF
sudo systemctl daemon-reload

# 4. Start server
sudo systemctl restart lemonade-server

# 5. Pull chat model
lemonade pull unsloth/gemma-4-31B-it-GGUF:Q8_K_XL

# 6. Pull embedding model (optional, for semantic indexing)
lemonade pull SuperPauly/harrier-oss-v1-0.6b-gguf:harrier-oss-v1-0.6B-BF16
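
As an optional final step, a quick smoke test (assuming the API key from step 3 is still in $API_KEY):

# 7. Verify the pulled models are registered
curl -s -H "Authorization: Bearer ${API_KEY}" http://localhost:13305/api/v1/models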

Service Management

# Check status
sudo systemctl status lemonade-server

# View logs
sudo journalctl -u lemonade-server -f

# Restart
sudo systemctl restart lemonade-server

# Stop
sudo systemctl stop lemonade-server

Configuration

Config File Location

/var/lib/lemonade/.cache/lemonade/
├── config.json           # Server settings
├── user_models.json      # User-registered models
├── server_models.json    # Server-suggested models
└── recipe_options.json   # Per-model runtime settings

Server Configuration (config.json)

{
  "port": 13305,
  "host": "0.0.0.0",
  "log_level": "info",
  "max_loaded_models": 2,
  "ctx_size": 262144,
  "llamacpp": {
    "backend": "auto",
    "prefer_system": true,
    "rocm_bin": "/usr/local/bin/llama-server",
    "vulkan_bin": "/usr/local/bin/llama-server",
    "cpu_bin": "/usr/local/bin/llama-server"
  }
}

When embedding is enabled, max_loaded_models is automatically set to 2 to allow both the chat model and embedding model to be loaded simultaneously.
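
If you need to adjust this by hand, the same config set syntax used elsewhere in this page should apply (an assumption; ctx_size is set this way in the Troubleshooting section, and max_loaded_models is a top-level config.json key):

lemonade config set max_loaded_models=2
sudo systemctl restart lemonade-server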

Backend Options

| Backend | Description | Use Case |
|---------|-------------|----------|
| auto | Auto-detect GPU | Recommended (default) |
| vulkan | Cross-platform GPU | AMD/NVIDIA without ROCm/CUDA |
| cpu | CPU-only | No GPU available |

::: warning
rocm is not a valid llamacpp_backend value. Use auto to enable ROCm auto-detection.
:::

Model Configuration (user_models.json)

With both chat and embedding models:

{
  "gemma-4-31b-it": {
    "model_name": "gemma-4-31b-it",
    "checkpoint": "unsloth/gemma-4-31B-it-GGUF:Q8_K_XL",
    "recipe": "llamacpp",
    "suggested": true,
    "labels": ["custom", "vision"],
    "mmproj": "mmproj-BF16.gguf"
  },
  "harrier-oss-v1-0.6b": {
    "model_name": "harrier-oss-v1-0.6b",
    "checkpoint": "SuperPauly/harrier-oss-v1-0.6b-gguf:harrier-oss-v1-0.6B-BF16",
    "recipe": "llamacpp",
    "suggested": true,
    "labels": ["custom", "embedding"]
  }
}

Runtime Options (recipe_options.json)

{
  "user.gemma-4-31b-it": {
    "ctx_size": 1572864,
    "llamacpp_backend": "auto",
    "llamacpp_args": "-b 8192 -ub 8192 -to 3600 -ctk q8_0 -ctv q8_0 --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --no-webui --threads-http -1 --threads -1 -np 6"
  },
  "user.harrier-oss-v1-0.6b": {
    "ctx_size": 196608,
    "llamacpp_backend": "auto",
    "llamacpp_args": "-b 8192 -ub 8192 -to 3600 -ctk q8_0 -ctv q8_0 --no-webui --threads-http -1 --threads -1 -np 6"
  }
}

Embedding Model

The embedding model enables Kilo Code's semantic code search feature. It runs alongside the chat model in the same Lemonade server instance.

Default Embedding Model

| Property | Value |
|----------|-------|
| Checkpoint | SuperPauly/harrier-oss-v1-0.6b-gguf:harrier-oss-v1-0.6B-BF16 |
| Short name | harrier-oss-v1-0.6b (API: user.harrier-oss-v1-0.6b) |
| Recipe | llamacpp |
| Labels | ["custom", "embedding"] |
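
The endpoint can be exercised directly with a standard OpenAI-style embeddings request (a minimal example; note the user. prefix on the API model name):

curl -s http://localhost:13305/v1/embeddings \
    -H "Authorization: Bearer your_key" \
    -H "Content-Type: application/json" \
    -d '{"model": "user.harrier-oss-v1-0.6b", "input": "def hello(): pass"}'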

Enabling / Disabling

Embedding is enabled by default. To disable:

# Shell script
bash setup-lemonade.sh --no-embedding --generate-keys --external-ip 1.2.3.4

# Python CLI
python lemonade_server.py run --no-embedding --generate-keys --external-ip 1.2.3.4

When disabled:

  • No embedding model entry in user_models.json or recipe_options.json
  • Embedding model is not pulled or loaded
  • max_loaded_models stays at 1
  • kilo.json omits the indexing section

Custom Embedding Model

bash setup-lemonade.sh \
    --embedding-model some-org/embedding-model-GGUF:Q8_0 \
    --embedding-model-name my-embedding \
    --generate-keys \
    --external-ip 1.2.3.4

Per-User Scaling

When using --groups groups.yaml, context and parallelism scale automatically for both chat and embedding models:

| Parameter | Chat Model | Embedding Model |
|-----------|------------|-----------------|
| ctx_size | 262144 per user | 32768 per user |
| -np | num_users | num_users |
| Per-slot ctx_size | 262144 | 32768 |

Each user gets a full 262,144-token context window for chat and a 32,768-token window for embedding.
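
For example, with 6 parallel users the chat model is launched with ctx_size = 6 × 262,144 = 1,572,864 and -np 6, and the embedding model with ctx_size = 6 × 32,768 = 196,608, exactly the values shown in the recipe_options.json example above.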

Reserved Arguments

These arguments are managed by Lemonade and must not appear in llamacpp_args:

--ctx-size, -c, -ngl, --gpu-layers, --n-gpu-layers, --jinja, --no-jinja,
--model, -m, --port, --embedding, --embeddings, --mmproj*, --rerank*
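
In practice, sampling, batching, and parallelism flags are fine to pass through, while anything on the list above must be set via Lemonade's own options (an illustrative sketch, not literal tool behavior):

# OK: sampling and batching flags
"llamacpp_args": "-b 8192 --temp 0.7 --top-k 40 -np 4"

# Not OK: -c / --ctx-size is reserved; use the ctx_size option instead
"llamacpp_args": "--temp 0.7 -c 65536"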

API Keys

| Variable | Scope | Location |
|----------|-------|----------|
| LEMONADE_API_KEY | Regular endpoints (/api/*, /v1/*) | systemd override |
| LEMONADE_ADMIN_API_KEY | All endpoints including /internal/* | systemd override |

Using API Keys

# CLI
LEMONADE_API_KEY=your_key lemonade pull model-name

# HTTP
curl -H "Authorization: Bearer your_key" http://localhost:13305/api/v1/models

Custom llama.cpp Build

AMD MI300X (gfx942)

bash ./build-amd-mi300x-llama-server.sh

This builds llama.cpp with:

  • ROCm/HIP backend
  • gfx942 target (MI300X)
  • Installs to /usr/local/bin/llama-server

The default Lemonade config uses prefer_system: true with this binary.
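
For reference, the build boils down to something like the following (a sketch only; it assumes a working ROCm toolchain and current llama.cpp CMake flags, where GGML_HIP enables the HIP backend — the actual script may differ):

# Clone and configure llama.cpp for ROCm/HIP on gfx942
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx942 \
    -DCMAKE_BUILD_TYPE=Release

# Build only the server binary and install it where Lemonade expects it
cmake --build llama.cpp/build --target llama-server -j
sudo install llama.cpp/build/bin/llama-server /usr/local/bin/llama-server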

Other GPUs

Modify build-amd-mi300x-llama-server.sh for your architecture:

# Change AMDGPU_TARGETS for your GPU
-DAMDGPU_TARGETS=gfx1100  # RX 7900 series
-DAMDGPU_TARGETS=gfx1030  # RX 6800/6900 series (RDNA2)

Kilo Code Integration

Generate Config

bash ./scripts/setup-lemonade.sh \
    --groups groups.yaml \
    --generate-keys \
    --external-ip YOUR_IP

Creates kilo.json with chat model, experimental flags, and semantic indexing:

{
  "provider": {
    "lemonade": {
      "models": {
        "user.gemma-4-31b-it": {
          "name": "unsloth/gemma-4-31B-it-GGUF:Q8_K_XL",
          "limit": {
            "context": 262144,
            "output": 4096
          }
        }
      },
      "options": {
        "apiKey": "your-api-key",
        "baseURL": "http://YOUR_IP:13305/v1"
      }
    }
  },
  "model": "lemonade/user.gemma-4-31b-it",
  "experimental": {
    "batch_tool": false,
    "codebase_search": true,
    "openTelemetry": false,
    "continue_loop_on_deny": true,
    "semantic_indexing": true,
    "agent_manager_tool": true
  },
  "indexing": {
    "enabled": true,
    "provider": "openai-compatible",
    "vectorStore": "lancedb",
    "openai-compatible": {
      "baseUrl": "http://YOUR_IP:13305/v1",
      "apiKey": "your-api-key",
      "model": "user.harrier-oss-v1-0.6b"
    }
  }
}

Generate Config Without Embedding

python lemonade_server.py generate-kilo-config \
    --admin-api-key YOUR_KEY \
    --external-ip 1.2.3.4 \
    --no-embedding

Inject into Sandboxes

python ./scripts/main.py \
    --groups groups.yaml \
    --external-ip YOUR_IP \
    --lemonade kilo.json

Injects the config into /home/vscode/.config/kilo/config.json in each sandbox.

CLI Reference

setup-lemonade.sh Options

| Option | Default | Description |
|--------|---------|-------------|
| --groups FILE | (none) | groups.yaml for user count |
| --group GROUP | (all) | Filter to single group |
| --num-users N | 1 | Override parallel user count |
| --port PORT | 13305 | Server port |
| --host HOST | 0.0.0.0 | Bind address |
| --backend BACKEND | auto | llama.cpp backend |
| --ctx-size SIZE | 262144 | Per-user context size |
| --model MODEL | unsloth/gemma-4-31B-it-GGUF:Q8_K_XL | HuggingFace checkpoint |
| --model-name NAME | gemma-4-31b-it | Short model name |
| --mmproj FILE | mmproj-BF16.gguf | Vision mmproj filename |
| --external-ip IP | (auto) | External IP for kilo.json |
| --generate-keys | false | Generate API keys |
| --no-prefer-system | (system) | Use bundled llama.cpp |
| --llamacpp-bin PATH | /usr/local/bin/llama-server | System binary path |
| --kilo-config PATH | ./kilo.json | Output path for kilo.json |
| --embedding | true | Enable embedding model |
| --no-embedding | false | Disable embedding model |
| --embedding-model MODEL | SuperPauly/harrier-oss-v1-0.6b-gguf:harrier-oss-v1-0.6B-BF16 | Embedding model checkpoint |
| --embedding-model-name NAME | harrier-oss-v1-0.6b | Short name for embedding model |

lemonade CLI Commands

# Pull model
lemonade pull org/repo:variant

# List models
lemonade list

# Configure
lemonade config set key=value

# Run inference
lemonade run user.model-name

Troubleshooting

Model Not Found (404)

Ensure the model name includes the user. prefix in API requests:

  • user.gemma-4-31b-it (correct)
  • gemma-4-31b-it (incorrect)
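
To double-check which names are registered:

LEMONADE_API_KEY=your_key lemonade list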

Reserved Argument Error

Remove reserved args from llamacpp_args. Lemonade manages:

  • GPU layers (-ngl)
  • Context size (--ctx-size)
  • Jinja formatting (--jinja)
  • Model path (--model)
  • Embedding flag (--embedding)

Backend Not Detected

# Check GPU
rocminfo  # AMD
nvidia-smi  # NVIDIA

# Force backend
lemonade config set llamacpp.backend=vulkan
sudo systemctl restart lemonade-server

Memory Issues

Reduce context size or use smaller quantization:

# Smaller context
lemonade config set ctx_size=131072

# Smaller model
lemonade pull unsloth/gemma-4-31B-it-GGUF:Q4_K_M

Embedding Model Not Loading

If the embedding model fails to load:

  1. Check that max_loaded_models is at least 2 in config.json
  2. Verify the GPU has enough memory for both models (see the commands below)
  3. Check the logs: sudo journalctl -u lemonade-server -f
  4. Try disabling embedding with the --no-embedding flag
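
For step 2, the vendor tools can report memory headroom (assuming an AMD or NVIDIA GPU):

rocm-smi --showmeminfo vram   # AMD
nvidia-smi                    # NVIDIA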
