← back to dooner.tech

⚙ AI Benchmarks

homelab benchmark notes · PVE03

PVE03 · Xeon Gold 6240L · 192GB RAM
2
RTX PRO 6000 Blackwell
~192 GB
VRAM total
1,048,576
Max context tokens
6
Inference containers

Chat / LLM

DeepSeek-V4 (DSparK Speculation)
active · 2x GPU · :8000
GPU0 96.9 GB
used of 97.9 GB
GPU1 96.8 GB
used of 97.9 GB
~99%
VRAM utilization
vLLM launch args:
vllm serve /models/DeepSeek-V4-Flash-DSpark \
--served-model-name DeepSeek-V4 \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.94 \
--max-model-len 1048576 \
--kv-cache-dtype fp8 \
--block-size 256 \
--trust-remote-code \
--max-num-seqs 32 \
--enable-chunked-prefill \
--enable-flashinfer-autotune \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--reasoning-parser deepseek_v4 \
--enable-auto-tool-choice \
--attention-backend FLASHINFER_MLA_SPARSE_DSV4 \
--speculative-config '{"model":"...DeepSeek-V4-Flash-DSpark","method":"dspark","num_speculative_tokens":5,"draft_sample_method":"probabilistic"}'
Key env vars:
VLLM_ENABLE_PCIE_ALLREDUCE=1
VLLM_USE_FLASHINFER_SAMPLER=1
VLLM_USE_AOT_COMPILE=1
VLLM_USE_MEGA_AOT_ARTIFACT=1
VLLM_CACHE_DIR=/cache/vllm
VLLM_NCCL_SO_PATH=/opt/libnccl-local-inference.so.2.30.4

Speech

Speaches (TTS / STT)
active · CPU · :8012
docker run -d --name speaches-cpu \
-p 8012:8000 \
ghcr.io/speaches-ai/speaches:latest-cpu \
uvicorn --factory speaches.main:create_app
Whisper speech-to-text + Kokoro text-to-speech, all CPU-based.

Embeddings

Qwen3 Embedding (0.6B)
active · CPU · :8010
docker run -d --name tei-embed-fast \
-p 8010:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id Qwen/Qwen3-Embedding-0.6B \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8
BGE-M3 (Multilingual)
active · CPU · :8013
docker run -d --name tei-bge-m3 \
-p 8013:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id BAAI/bge-m3 \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8

Rerankers

BGE Reranker Base (Fast)
active · CPU · :8014
docker run -d --name tei-rerank-fast \
-p 8014:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id BAAI/bge-reranker-base \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8
BGE Reranker Large (Better)
active · CPU · :8016
docker run -d --name tei-rerank-better \
-p 8016:80 \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id BAAI/bge-reranker-large \
--max-batch-tokens 2048 \
--max-client-batch-size 8 \
--tokenization-workers 8

Architecture

┌────────────────────────────────────────────────━┐
Hermes Agent / Caddy
(calls DeepSeek-V4 via API on PVE03)
└───────────────────┬────────────────────────────────┘
Tailscale / LAN
┌───────────────────┴────────────────────────────────┐
PVE03 — Inference Host

DeepSeek-V4 Speaches Embeddings
:8000 (GPU x2) :8012 (CPU) :8010 Qwen3
:8013 BGE-M3
Rerankers
:8014 Base
:8016 Large
└─────────────────────────────────────────────────────────┘
↗ full write-up on the blog →
← back to dooner.tech