
zai-org/GLM-4.7

GLM-4.7 MoE language model (~358B total / 32B active parameters) with MTP speculative decoding, an updated tool call parser, and reasoning support

MoE · 358B total / 32B active params · 202,752-token context · vLLM 0.11.0+ · text

Overview

GLM-4.7 is the latest MoE release in Z-AI's GLM-4.x line. It introduces the glm47 tool call parser while retaining the GLM-4.5 reasoning parser (glm45). Its built-in Multi-Token Prediction (MTP) layers enable speculative decoding for throughput gains on decode-heavy workloads.

A smaller zai-org/GLM-4.7-Flash variant is also available for lower-latency scenarios.

Prerequisites

  • vLLM version: nightly recommended for GLM-4.7 (until packaged in a stable release)
  • Hardware: NVIDIA 4x H200 for FP8 or 8x H200 for BF16; AMD MI300X / MI325X / MI355X for ROCm
  • Python: 3.10 - 3.13 (3.12 required for ROCm wheels)

Install vLLM (NVIDIA, nightly)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
uv pip install git+https://github.com/huggingface/transformers.git
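
To verify the install picked up a nightly build, print the version; a dev-suffixed version string (rather than a plain release number) indicates a nightly wheel:

python -c "import vllm; print(vllm.__version__)"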

Install vLLM (AMD ROCm)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm

Launching the Server

Tensor Parallel + MTP (FP8 on 4x H200)

vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice
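
The same speculative decoding settings can also be passed as a single JSON value instead of dotted keys; this is the form most commonly shown in vLLM's documentation:

vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice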

AMD ROCm

SAFETENSORS_FAST_GPU=1 \
vllm serve zai-org/GLM-4.7 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --disable-log-requests \
    --no-enable-prefix-caching \
    --trust-remote-code
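
Once either server reports it is ready, a quick health check against the OpenAI-compatible API (assuming the default port 8000) confirms the model is loaded:

curl http://localhost:8000/v1/models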

Tuning Tips

  • --max-model-len=65536 is a sensible default; the model supports up to 202,752 tokens (~200K).
  • --max-num-batched-tokens=32768 suits prompt-heavy workloads; reduce to 8192-16384 for latency-sensitive serving.
  • Use --gpu-memory-utilization=0.95 to maximize KV cache headroom; a combined example follows this list.
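
Putting the tips together, a sketch of a prompt-heavy serving configuration; the flags simply combine the values above and should be adjusted to your hardware:

vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --max-model-len 65536 \
    --max-num-batched-tokens 32768 \
    --gpu-memory-utilization 0.95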

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
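
Because the server was started with --enable-auto-tool-choice and the glm47 parser, tool calls come back as structured tool_calls. A minimal sketch using curl; get_weather is a hypothetical tool defined only for illustration:

# get_weather below is a hypothetical example tool, not part of the model or server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.7-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'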

Benchmarking

vllm bench serve \
  --model zai-org/GLM-4.7-FP8 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos
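
Since MTP speculative decoding helps most on decode-heavy traffic, a variant weighted toward output tokens (input and output lengths swapped relative to the run above) is useful for measuring the speculative gain:

vllm bench serve \
  --model zai-org/GLM-4.7-FP8 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --request-rate inf \
  --num-prompts 16 \
  --ignore-eos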

Troubleshooting

  • Parser mismatch: GLM-4.7 uses --tool-call-parser glm47 (not glm45).
  • MTP acceptance: with 1 speculative token, acceptance is typically ~90%+ and throughput is best; the metrics check below shows how to verify.
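
To inspect the observed acceptance rate, query the server's Prometheus metrics endpoint; exact metric names vary across vLLM versions, so grep broadly:

# metric names differ between vLLM versions; look for speculative decoding counters
curl -s http://localhost:8000/metrics | grep -i spec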


Configuration Matrix
Variant   Precision   Min VRAM   Notes
Default   BF16        859 GB     Full-precision BF16 on 8x H200 or equivalent
FP8       FP8         430 GB     Native FP8 checkpoint with minimal accuracy loss
NVFP4     NVFP4       215 GB     NVIDIA NVFP4 quantized weights for Blackwell GPUs