
zai-org/GLM-4.7

GLM-4.7 MoE language model (~358B total / 32B active parameters) with MTP speculative decoding, an updated tool call parser, and reasoning support

MoE · 358B total / 32B active params · 202,752-token context · vLLM 0.11.0+ · text

Overview

GLM-4.7 is the latest MoE release in Z-AI's GLM-4.x line. It introduces the glm47 tool call parser while retaining the GLM-4.5 reasoning parser (glm45). Its built-in Multi-Token Prediction (MTP) layers enable speculative decoding for throughput gains on decode-heavy workloads.

A smaller zai-org/GLM-4.7-Flash variant is also available for lower-latency scenarios.

Prerequisites

  • vLLM version: nightly recommended for GLM-4.7 (until packaged in a stable release)
  • Hardware: NVIDIA 4x H200 for FP8 or 8x H200 for BF16; AMD MI300X / MI325X / MI355X for ROCm
  • Python: 3.10 - 3.13 (3.12 required for ROCm wheels)

Install vLLM (NVIDIA, nightly)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
uv pip install git+https://github.com/huggingface/transformers.git
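
To verify the install picked up a nightly build, print the version; a dev-suffixed version string (rather than a plain release number) indicates a nightly wheel:

python -c "import vllm; print(vllm.__version__)"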

Install vLLM (AMD ROCm)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm

Launching the Server

Tensor Parallel + MTP (FP8 on 4x H200)

vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice
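
The same speculative decoding settings can also be passed as a single JSON value instead of dotted keys; this is the form most commonly shown in vLLM's documentation:

vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice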

AMD ROCm

SAFETENSORS_FAST_GPU=1 \
vllm serve zai-org/GLM-4.7 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --disable-log-requests \
    --no-enable-prefix-caching \
    --trust-remote-code
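
Once either server reports it is ready, a quick health check against the OpenAI-compatible API (assuming the default port 8000) confirms the model is loaded:

curl http://localhost:8000/v1/models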

Tuning Tips

  • --max-model-len=65536 is a sensible default; the model supports up to 202,752 tokens (~200K).
  • --max-num-batched-tokens=32768 suits prompt-heavy workloads; reduce to 8192-16384 for latency-sensitive serving.
  • Use --gpu-memory-utilization=0.95 to maximize KV cache headroom; a combined example follows this list.
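
Putting the tips together, a sketch of a prompt-heavy serving configuration; the flags simply combine the values above and should be adjusted to your hardware:

vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --max-model-len 65536 \
    --max-num-batched-tokens 32768 \
    --gpu-memory-utilization 0.95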

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
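
Because the server was started with --enable-auto-tool-choice and the glm47 parser, tool calls come back as structured tool_calls. A minimal sketch using curl; get_weather is a hypothetical tool defined only for illustration:

# get_weather below is a hypothetical example tool, not part of the model or server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.7-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'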

Benchmarking

vllm bench serve \
  --model zai-org/GLM-4.7-FP8 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos
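
Since MTP speculative decoding helps most on decode-heavy traffic, a variant weighted toward output tokens (input and output lengths swapped relative to the run above) is useful for measuring the speculative gain:

vllm bench serve \
  --model zai-org/GLM-4.7-FP8 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --request-rate inf \
  --num-prompts 16 \
  --ignore-eos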

Troubleshooting

  • Parser mismatch: GLM-4.7 uses --tool-call-parser glm47 (not glm45).
  • MTP acceptance: with 1 speculative token, acceptance is typically ~90%+ and throughput is best; the metrics check below shows how to verify.
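
To inspect the observed acceptance rate, query the server's Prometheus metrics endpoint; exact metric names vary across vLLM versions, so grep broadly:

# metric names differ between vLLM versions; look for speculative decoding counters
curl -s http://localhost:8000/metrics | grep -i spec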


Configuration Matrix
Variant   Precision   Min VRAM   Notes
Default   BF16        859 GB     Full-precision BF16 on 8x H200 or equivalent
FP8       FP8         430 GB     Native FP8 checkpoint with minimal accuracy loss
NVFP4     NVFP4       215 GB     NVIDIA NVFP4 quantized weights for Blackwell GPUs