Running DeepSeek-V4-Flash at 700 tokens/s on 2x RTX Pro 6000
Run DeepSeek-V4-Flash on a workstation with 2x RTX Pro 6000 GPUs (96 GB each) using the voipmonitor/vllm:lucifer Docker image, a Blackwell-targeted vLLM fork with sm_120 kernels, FP8 KV cache, and MTP speculative decoding.