NVIDIA H200 DeepSeek-V3 Benchmark Report 

Overview 

This NVIDIA H200 DeepSeek-V3 benchmark report evaluates the performance of the model served by SGLang on 8x NVIDIA H200 GPUs and 2x AMD EPYC 9654 CPUs (192 cores total). The goal was to analyze latency, throughput, and scalability under different request loads. 

System Configuration 

  • Hardware: 
  • 8x NVIDIA H200 GPUs (141 GB HBM3e each) 
  • 2x AMD EPYC 9654 CPUs (192 cores, 384 threads) 
  • 1.65 TB RAM 

  • Software: 
  • OS: Ubuntu 20.04.6 LTS (Focal Fossa) 
  • Kernel: 5.15.0-131-generic #141-Ubuntu SMP 
  • NVIDIA Driver Version: 565.57.01, CUDA Version: 12.7 
  • SGLang Version: v0.4.2  

Test Setup

The SGLang server was launched with: 

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph --enable-dp-attention 

Parameters: 

  • enable-torch-compile: compiles the model with torch.compile, reducing Python overhead and fusing operations to improve speed and efficiency 
  • mem-fraction-static: controls the fraction of GPU memory that is statically allocated (model weights and KV cache), which helps avoid fragmentation 
  • disable-cuda-graph: disables CUDA Graph capture for compatibility purposes 
  • enable-dp-attention: enables data-parallel attention, which can improve throughput at high request rates for DeepSeek models 
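
Before starting the benchmark runs, a quick smoke test can confirm that the server launched above is up and serving requests. The snippet below is a minimal sketch, assuming the server's default port (30000) and its native /generate endpoint; adjust the host, port, and payload to match your deployment.

import requests

BASE_URL = "http://localhost:30000"  # assumption: SGLang's default port

# Health check: returns HTTP 200 once the model weights are loaded.
print("health:", requests.get(f"{BASE_URL}/health", timeout=10).status_code)

# Single short generation to confirm end-to-end inference works.
resp = requests.post(
    f"{BASE_URL}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
    timeout=120,
)
print(resp.json())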

Benchmark was launched with: 

python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024 

python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024 

python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024 

python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024 

Benchmark Results 

Each benchmark run used fixed input and output lengths of 1024 tokens (random-range-ratio 1), simulating real-world workloads at varying request rates. The tests measured the following key metrics related to token generation speed and efficiency; a short sketch after this list shows how they are derived from per-token timestamps: 

  • TTFT (Time To First Token): The time elapsed from the moment an inference request is sent until the model produces its first output token. A lower TTFT indicates faster response times, critical for interactive applications. 
  • ITL (Inter-Token Latency): The time taken between the generation of consecutive tokens in an auto-regressive decoding sequence. ITL measures the step-by-step processing speed, tracking the interval from the completion of token i to the completion of token i+1. 
  • TPOT (Time Per Output Token): The average time required to generate each subsequent token after the first token is produced. This metric reflects the overall efficiency of the model’s token-generation process. 
  • Output Token Throughput: The number of tokens generated per second, indicating how efficiently the model processes requests. Higher throughput means the model can handle larger batch sizes and higher concurrency.  
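
To make these definitions concrete, the sketch below shows how TTFT, ITL, and TPOT fall out of per-token completion timestamps. The timestamps are hypothetical, chosen only to illustrate the formulas; in practice, sglang.bench_serving collects them for every request.

import statistics

request_sent = 0.0                                # request issued at t = 0 s
token_done = [0.942, 1.130, 1.318, 1.505, 1.694]  # hypothetical completion time of each output token (s)

ttft = token_done[0] - request_sent                              # Time To First Token
itl = [b - a for a, b in zip(token_done, token_done[1:])]        # Inter-Token Latency samples
tpot = (token_done[-1] - token_done[0]) / (len(token_done) - 1)  # Time Per Output Token

print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Median ITL: {statistics.median(itl) * 1000:.0f} ms")
print(f"TPOT: {tpot * 1000:.0f} ms")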

Request Rate (per second) | Requests Completed | Median TTFT (ms) | Median ITL (ms) | Median TPOT (ms) | Output Token Throughput (tokens per second)
1 | 300 | 942 | 185 | 229 | 608
2 | 600 | 928 | 189 | 246 | 1,244
4 | 1,200 | 1,148 | 224 | 302 | 2,398
8 | 2,400 | 1,944 | 248 | 511 | 2,288

Analysis and Insights 

Latency Analysis

The Time To First Token (TTFT) decreases slightly between a request rate of 1 and 2 but increases at higher loads (4 and 8 RPS). This suggests that while the system scales well, latency starts to increase due to resource contention at higher loads. 

Inter-Token Latency (ITL) follows a similar trend, increasing as the request rate grows, indicating that the per-token processing time is affected by concurrency overhead. 

Throughput Trends

Output token throughput increases significantly with higher request rates, peaking at 2,398 tokens/sec at 4 RPS before dropping slightly to 2,288 tokens/sec at 8 RPS. This suggests the system reaches peak efficiency around 4 RPS, beyond which memory bandwidth or compute limitations impact performance. 

The drop at 8 RPS could be a result of increased scheduling overhead or memory contention across multiple GPUs. 

Scalability Observations

While the system scales well from 1 to 4 RPS, performance gains diminish beyond 4 RPS, highlighting a potential optimization opportunity for parallelism or memory management. 
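
As a rough way to quantify this, the sketch below computes scaling efficiency from the measured throughput in the table above: the throughput gain relative to the 1 RPS baseline, divided by the ideal (linear) factor implied by the request rate.

baseline_rps, baseline_tps = 1, 608  # 1 RPS baseline from the table above

results = {  # request rate -> measured output token throughput (tokens/s)
    2: 1244,
    4: 2398,
    8: 2288,
}

for rps, tps in results.items():
    ideal_factor = rps / baseline_rps
    measured_factor = tps / baseline_tps
    efficiency = measured_factor / ideal_factor
    print(f"{rps} RPS: {measured_factor:.2f}x throughput for {ideal_factor:.0f}x load "
          f"({efficiency:.0%} scaling efficiency)")

Efficiency stays near 100% up to 4 RPS and drops below 50% at 8 RPS, consistent with the observation above. 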

Conclusion 

The NVIDIA H200 GPUs demonstrate strong performance in running DeepSeek-V3 inference workloads with excellent token throughput and competitive latency metrics. The setup efficiently scales up to 4 RPS, beyond which diminishing returns are observed due to system constraints.

Further optimizations could enhance throughput at higher concurrency levels, particularly in reducing TTFT and ITL at peak loads. 

This benchmark underscores the NVIDIA H200’s capabilities for large-scale AI workloads, inference tasks, and high-throughput applications in production environments.

Looking forward, refining kernel tuning and optimizing inference engines such as SGLang will likely unlock even greater scalability and throughput. These advancements have the potential to redefine efficiency in high-performance LLM deployment. 
