# Lemonade Server Integration
Set up local LLM inference with Lemonade Server for VS Code extensions
Lemonade Server provides a local OpenAI-compatible API endpoint that VS Code extensions (Kilo Code, Continue, Cline) can use for LLM inference. It runs as a systemd service and supports GPU acceleration via ROCm, CUDA, or Vulkan.
## Overview
```
┌─────────────────────────────────────────────────────────────┐
│                        Host Machine                         │
│                                                             │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │                Lemonade Server (:13305)                 │ │
│ │                                                         │ │
│ │  /v1/chat/completions  →  OpenAI-compatible API         │ │
│ │  /v1/embeddings        →  Embedding API                 │ │
│ │  /api/v1/load          →  Model management              │ │
│ │  /api/v1/models        →  List available models         │ │
│ │                                                         │ │
│ │  ┌─────────────────────────────────────────────┐        │ │
│ │  │              llama.cpp backend              │        │ │
│ │  │         (ROCm / CUDA / Vulkan / CPU)        │        │ │
│ │  └─────────────────────────────────────────────┘        │ │
│ └─────────────────────────────────────────────────────────┘ │
│                              ▲                              │
│                              │ HTTP                         │
│ ┌────────────────────────────┴────────────────────────────┐ │
│ │                    VS Code Sandboxes                     │ │
│ │                                                          │ │
│ │   Kilo Code ──┐                                          │ │
│ │   Continue  ──┼──→  http://host:13305/v1/...             │ │
│ │   Cline     ──┘                                          │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
```

## Quick Setup
### One-Command Setup
```bash
bash ./scripts/setup-lemonade.sh \
  --groups groups.yaml \
  --generate-keys \
  --external-ip YOUR_IP
```

This command:
- Installs Lemonade Server from PPA
- Configures GPU backend (auto-detected)
- Generates API keys
- Downloads the default chat model (gemma-4-31b-it)
- Downloads the default embedding model (harrier-oss-v1-0.6b)
- Creates `kilo.json` for Kilo Code (with indexing config)
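
After setup, one way to confirm the server is reachable is to hit the models endpoint shown in the Overview diagram. This is a minimal sketch; it assumes the server is local and the generated key is exported as `LEMONADE_API_KEY`:

```bash
# List available models; expects the generated API key in LEMONADE_API_KEY
curl -s -H "Authorization: Bearer $LEMONADE_API_KEY" \
  http://localhost:13305/api/v1/models
```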
### Without Embedding Model
```bash
bash ./scripts/setup-lemonade.sh \
  --groups groups.yaml \
  --generate-keys \
  --external-ip YOUR_IP \
  --no-embedding
```

## Manual Setup
::: details Click to expand manual setup steps
```bash
# 1. Install from PPA
sudo add-apt-repository -y ppa:lemonade-team/stable
sudo apt-get update
sudo apt-get install -y lemonade-server

# 2. Configure
lemonade config set host=0.0.0.0 port=13305

# 3. Generate API keys
API_KEY=$(openssl rand -base64 32 | tr -d '/+=\n' | head -c 32)
ADMIN_KEY=$(openssl rand -base64 32 | tr -d '/+=\n' | head -c 32)
sudo mkdir -p /etc/systemd/system/lemonade-server.service.d
sudo tee /etc/systemd/system/lemonade-server.service.d/override.conf <<EOF
[Service]
Environment="LEMONADE_API_KEY=${API_KEY}"
Environment="LEMONADE_ADMIN_API_KEY=${ADMIN_KEY}"
EOF
sudo systemctl daemon-reload

# 4. Start server
sudo systemctl restart lemonade-server

# 5. Pull chat model
lemonade pull unsloth/gemma-4-31B-it-GGUF:Q8_K_XL

# 6. Pull embedding model (optional, for semantic indexing)
lemonade pull SuperPauly/harrier-oss-v1-0.6b-gguf:harrier-oss-v1-0.6B-BF16
```

:::

## Service Management
```bash
# Check status
sudo systemctl status lemonade-server

# View logs
sudo journalctl -u lemonade-server -f

# Restart
sudo systemctl restart lemonade-server

# Stop
sudo systemctl stop lemonade-server
```

## Configuration
### Config File Location
```
/var/lib/lemonade/.cache/lemonade/
├── config.json          # Server settings
├── user_models.json     # User-registered models
├── server_models.json   # Server-suggested models
└── recipe_options.json  # Per-model runtime settings
```

### Server Configuration (`config.json`)
```json
{
  "port": 13305,
  "host": "0.0.0.0",
  "log_level": "info",
  "max_loaded_models": 2,
  "ctx_size": 262144,
  "llamacpp": {
    "backend": "auto",
    "prefer_system": true,
    "rocm_bin": "/usr/local/bin/llama-server",
    "vulkan_bin": "/usr/local/bin/llama-server",
    "cpu_bin": "/usr/local/bin/llama-server"
  }
}
```

When embedding is enabled, `max_loaded_models` is automatically set to `2` to allow both the chat model and the embedding model to be loaded simultaneously.
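
To verify the value on a running install, one option is to read it straight from the config file listed above (a sketch; requires `jq`):

```bash
# Should print 2 when the embedding model is enabled
jq '.max_loaded_models' /var/lib/lemonade/.cache/lemonade/config.json
```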
### Backend Options
| Backend | Description | Use Case |
|---|---|---|
| `auto` | Auto-detect GPU | Recommended (default) |
| `vulkan` | Cross-platform GPU | AMD/NVIDIA without ROCm/CUDA |
| `cpu` | CPU-only | No GPU available |
::: warning
`rocm` is not a valid `llamacpp_backend` value. Use `auto` to enable ROCm auto-detection.
:::
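
To see which backend auto-detection actually picked, the service logs are the most direct source. This is a rough sketch; the exact log wording is an assumption:

```bash
# Grep recent service logs for backend selection hints
sudo journalctl -u lemonade-server --no-pager | grep -iE 'rocm|vulkan|cuda|backend'
```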
### Model Configuration (`user_models.json`)
With both chat and embedding models:
```json
{
  "gemma-4-31b-it": {
    "model_name": "gemma-4-31b-it",
    "checkpoint": "unsloth/gemma-4-31B-it-GGUF:Q8_K_XL",
    "recipe": "llamacpp",
    "suggested": true,
    "labels": ["custom", "vision"],
    "mmproj": "mmproj-BF16.gguf"
  },
  "harrier-oss-v1-0.6b": {
    "model_name": "harrier-oss-v1-0.6b",
    "checkpoint": "SuperPauly/harrier-oss-v1-0.6b-gguf:harrier-oss-v1-0.6B-BF16",
    "recipe": "llamacpp",
    "suggested": true,
    "labels": ["custom", "embedding"]
  }
}
```

### Runtime Options (`recipe_options.json`)
```json
{
  "user.gemma-4-31b-it": {
    "ctx_size": 1572864,
    "llamacpp_backend": "auto",
    "llamacpp_args": "-b 8192 -ub 8192 -to 3600 -ctk q8_0 -ctv q8_0 --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --no-webui --threads-http -1 --threads -1 -np 6"
  },
  "user.harrier-oss-v1-0.6b": {
    "ctx_size": 196608,
    "llamacpp_backend": "auto",
    "llamacpp_args": "-b 8192 -ub 8192 -to 3600 -ctk q8_0 -ctv q8_0 --no-webui --threads-http -1 --threads -1 -np 6"
  }
}
```

## Embedding Model
The embedding model enables Kilo Code's semantic code search feature. It runs alongside the chat model in the same Lemonade server instance.
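
As a quick functional check, the embedding model can be queried directly through the OpenAI-compatible endpoint from the Overview diagram. A minimal sketch; the host, key, and sample input are assumptions:

```bash
# Request a single embedding vector from the local server
curl -s http://localhost:13305/v1/embeddings \
  -H "Authorization: Bearer $LEMONADE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "user.harrier-oss-v1-0.6b", "input": "def hello(): pass"}'
```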
### Default Embedding Model
| Property | Value |
|---|---|
| Checkpoint | `SuperPauly/harrier-oss-v1-0.6b-gguf:harrier-oss-v1-0.6B-BF16` |
| Short name | `harrier-oss-v1-0.6b` (API: `user.harrier-oss-v1-0.6b`) |
| Recipe | `llamacpp` |
| Labels | `["custom", "embedding"]` |
### Enabling / Disabling
Embedding is enabled by default. To disable:
```bash
# Shell script
bash setup-lemonade.sh --no-embedding --generate-keys --external-ip 1.2.3.4

# Python CLI
python lemonade_server.py run --no-embedding --generate-keys --external-ip 1.2.3.4
```

When disabled:
- No embedding model entry in `user_models.json` or `recipe_options.json`
- Embedding model is not pulled or loaded
- `max_loaded_models` stays at `1`
- `kilo.json` omits the `indexing` section
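
One way to confirm the result is to inspect the model registry file from the Config File Location section above (a sketch; requires `jq`):

```bash
# With --no-embedding, only the chat model should be listed
jq 'keys' /var/lib/lemonade/.cache/lemonade/user_models.json
```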
### Custom Embedding Model
```bash
bash setup-lemonade.sh \
  --embedding-model some-org/embedding-model-GGUF:Q8_0 \
  --embedding-model-name my-embedding \
  --generate-keys \
  --external-ip 1.2.3.4
```

### Per-User Scaling
When using `--groups groups.yaml`, context and parallelism scale automatically for both chat and embedding models:
| Parameter | Chat Model | Embedding Model |
|---|---|---|
| Total `ctx_size` | 262144 × num_users | 32768 × num_users |
| `-np` (parallel slots) | num_users | num_users |
| Per-slot `ctx_size` | 262144 | 32768 |
Each user gets a full 262,144 token context window for chat and 32,768 for embedding.
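For example, with 6 users the chat model is launched with `-np 6` and a total `ctx_size` of 6 × 262144 = 1572864, and the embedding model with 6 × 32768 = 196608, matching the `recipe_options.json` example above.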
### Reserved Arguments
These arguments are managed by Lemonade and cannot appear in `llamacpp_args`:
```
--ctx-size, -c, -ngl, --gpu-layers, --n-gpu-layers, --jinja, --no-jinja,
--model, -m, --port, --embedding, --embeddings, --mmproj*, --rerank*
```

## API Keys
| Variable | Scope | Location |
|---|---|---|
| `LEMONADE_API_KEY` | Regular endpoints (`/api/*`, `/v1/*`) | systemd override |
| `LEMONADE_ADMIN_API_KEY` | All endpoints including `/internal/*` | systemd override |
### Using API Keys
```bash
# CLI
LEMONADE_API_KEY=your_key lemonade pull model-name

# HTTP
curl -H "Authorization: Bearer your_key" http://localhost:13305/v1/chat/completions
```

## Custom llama.cpp Build
### AMD MI300X (gfx942)
```bash
bash ./build-amd-mi300x-llama-server.sh
```

This builds llama.cpp with:
- ROCm/HIP backend
- `gfx942` target (MI300X)
- Installs to `/usr/local/bin/llama-server`
The default Lemonade config uses `prefer_system: true` with this binary.
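
A quick sanity check that the binary is in place and runnable (a sketch; `--version` is the standard flag on llama.cpp builds):

```bash
# Print build info for the system llama-server that Lemonade will prefer
/usr/local/bin/llama-server --version
```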
### Other GPUs
Modify `build-amd-mi300x-llama-server.sh` for your architecture:
```bash
# Change AMDGPU_TARGETS for your GPU
-DAMDGPU_TARGETS=gfx1100   # RX 7900 series
-DAMDGPU_TARGETS=gfx1030   # RX 6800/6900 series
```

## Kilo Code Integration
### Generate Config
```bash
bash ./scripts/setup-lemonade.sh \
  --groups groups.yaml \
  --generate-keys \
  --external-ip YOUR_IP
```

Creates `kilo.json` with the chat model, experimental flags, and semantic indexing:
```json
{
  "provider": {
    "lemonade": {
      "models": {
        "user.gemma-4-31b-it": {
          "name": "unsloth/gemma-4-31B-it-GGUF:Q8_K_XL",
          "limit": {
            "context": 262144,
            "output": 4096
          }
        }
      },
      "options": {
        "apiKey": "your-api-key",
        "baseURL": "http://YOUR_IP:13305/v1"
      }
    }
  },
  "model": "lemonade/user.gemma-4-31b-it",
  "experimental": {
    "batch_tool": false,
    "codebase_search": true,
    "openTelemetry": false,
    "continue_loop_on_deny": true,
    "semantic_indexing": true,
    "agent_manager_tool": true
  },
  "indexing": {
    "enabled": true,
    "provider": "openai-compatible",
    "vectorStore": "lancedb",
    "openai-compatible": {
      "baseUrl": "http://YOUR_IP:13305/v1",
      "apiKey": "your-api-key",
      "model": "user.harrier-oss-v1-0.6b"
    }
  }
}
```

### Generate Config Without Embedding
```bash
python lemonade_server.py generate-kilo-config \
  --admin-api-key YOUR_KEY \
  --external-ip 1.2.3.4 \
  --no-embedding
```

### Inject into Sandboxes
```bash
python ./scripts/main.py \
  --groups groups.yaml \
  --external-ip YOUR_IP \
  --lemonade kilo.json
```

Injects the config to `/home/vscode/.config/kilo/config.json` in each sandbox.
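
To spot-check a sandbox after injection, reading the injected file back is enough. A sketch run from inside a sandbox; requires `jq`:

```bash
# Confirm the active model the injected config points at
jq '.model' /home/vscode/.config/kilo/config.json
```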
## CLI Reference
### `setup-lemonade.sh` Options
| Option | Default | Description |
|---|---|---|
| `--groups FILE` | (none) | `groups.yaml` for user count |
| `--group GROUP` | (all) | Filter to single group |
| `--num-users N` | `1` | Override parallel user count |
| `--port PORT` | `13305` | Server port |
| `--host HOST` | `0.0.0.0` | Bind address |
| `--backend BACKEND` | `auto` | llama.cpp backend |
| `--ctx-size SIZE` | `262144` | Per-user context size |
| `--model MODEL` | `unsloth/gemma-4-31B-it-GGUF:Q8_K_XL` | HuggingFace checkpoint |
| `--model-name NAME` | `gemma-4-31b-it` | Short model name |
| `--mmproj FILE` | `mmproj-BF16.gguf` | Vision mmproj filename |
| `--external-ip IP` | (auto) | External IP for `kilo.json` |
| `--generate-keys` | `false` | Generate API keys |
| `--no-prefer-system` | (system) | Use bundled llama.cpp |
| `--llamacpp-bin PATH` | `/usr/local/bin/llama-server` | System binary path |
| `--kilo-config PATH` | `./kilo.json` | Output path for `kilo.json` |
| `--embedding` | `true` | Enable embedding model |
| `--no-embedding` | `false` | Disable embedding model |
| `--embedding-model MODEL` | `SuperPauly/harrier-oss-v1-0.6b-gguf:harrier-oss-v1-0.6B-BF16` | Embedding model checkpoint |
| `--embedding-model-name NAME` | `harrier-oss-v1-0.6b` | Short name for embedding model |
### `lemonade` CLI Commands
```bash
# Pull model
lemonade pull org/repo:variant

# List models
lemonade list

# Configure
lemonade config set key=value

# Run inference
lemonade run user.model-name
```

## Troubleshooting
### Model Not Found (404)
Ensure the model name includes the `user.` prefix in API requests:
- ✅ `user.gemma-4-31b-it`
- ❌ `gemma-4-31b-it`
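
For reference, a request with the prefixed name looks like this (a minimal sketch; host and key are assumptions):

```bash
# Minimal chat completion against the prefixed model name
curl -s http://localhost:13305/v1/chat/completions \
  -H "Authorization: Bearer $LEMONADE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "user.gemma-4-31b-it", "messages": [{"role": "user", "content": "hello"}]}'
```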
### Reserved Argument Error
Remove reserved args from `llamacpp_args`. Lemonade manages:
- GPU layers (`-ngl`)
- Context size (`--ctx-size`)
- Jinja formatting (`--jinja`)
- Model path (`--model`)
- Embedding flag (`--embedding`)
### Backend Not Detected
```bash
# Check GPU
rocminfo    # AMD
nvidia-smi  # NVIDIA

# Force backend
lemonade config set llamacpp.backend=vulkan
sudo systemctl restart lemonade-server
```

### Memory Issues
Reduce context size or use smaller quantization:
```bash
# Smaller context
lemonade config set ctx_size=131072

# Smaller model
lemonade pull unsloth/gemma-4-31B-it-GGUF:Q4_K_M
```

### Embedding Model Not Loading
If the embedding model fails to load:
- Check `max_loaded_models` is at least `2` in `config.json`
- Verify GPU memory can support both models
- Check logs: `sudo journalctl -u lemonade-server -f`
- Try disabling embedding: `--no-embedding` flag
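
For the GPU memory check in particular, the vendor tools give a quick read while both models are loaded (a sketch; pick the line matching your hardware):

```bash
# Rough VRAM usage check
rocm-smi --showmeminfo vram                                    # AMD
nvidia-smi --query-gpu=memory.used,memory.total --format=csv   # NVIDIA
```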