This NVIDIA H200 DeepSeek-V3 benchmark report evaluates the performance of the model running on an SGLang server using 8x NVIDIA H200 GPUs and 2x AMD EPYC 9654 CPUs (192 cores total). The goal was to analyze latency, throughput, and scalability under different request loads.
System Configuration
The SGLang server was launched with:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --enable-torch-compile --mem-fraction-static 0.8 --disable-cuda-graph --enable-dp-attention
Parameters:
- --tp 8: shards the model tensor-parallel across all eight GPUs
- --trust-remote-code: permits loading DeepSeek-V3's custom model code
- --enable-torch-compile: compiles the model with torch.compile for faster execution
- --mem-fraction-static 0.8: reserves 80% of GPU memory for model weights and the KV cache pool
- --disable-cuda-graph: disables CUDA graph capture
- --enable-dp-attention: enables data-parallel attention, which benefits DeepSeek's multi-head latent attention (MLA)
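Once the server is up, a quick smoke test confirms it is serving before any benchmarking begins. The snippet below is a minimal sketch that assumes SGLang's default address of 127.0.0.1:30000 and its native /generate endpoint; it was not part of the original benchmark.

```python
import requests

# Smoke test against the SGLang server's native /generate endpoint
# (assumes the default host/port used by sglang.launch_server).
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["text"])  # the generated continuation
```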
The benchmark was launched at four request rates:
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 300 --request-rate 1 --random-input 1024 --random-output 1024
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 600 --request-rate 2 --random-input 1024 --random-output 1024
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1200 --request-rate 4 --random-input 1024 --random-output 1024
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 2400 --request-rate 8 --random-input 1024 --random-output 1024
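The four invocations differ only in request rate and prompt count (300 prompts per unit of request rate, so each run offers roughly 300 seconds of load), so they can be scripted as a single sweep. The wrapper below is an illustrative sketch, not part of the original benchmark; the flag names are copied verbatim from the commands above.

```python
import subprocess

# Sweep the four request rates benchmarked in this post.
for rate in (1, 2, 4, 8):
    subprocess.run(
        [
            "python3", "-m", "sglang.bench_serving",
            "--backend", "sglang",
            "--dataset-name", "random",
            "--random-range-ratio", "1",
            "--num-prompt", str(300 * rate),  # prompt count scales with rate
            "--request-rate", str(rate),
            "--random-input", "1024",
            "--random-output", "1024",
        ],
        check=True,  # abort the sweep if any run fails
    )
```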
Each benchmark run used the random dataset with input and output lengths fixed at 1,024 tokens (a --random-range-ratio of 1 pins both lengths), simulating a uniform workload at varying request rates. The tests measured key performance metrics for token generation speed and efficiency:
| Request Rate (req/s) | Requests Completed | Median TTFT (ms) | Median ITL (ms) | Median TPOT (ms) | Output Token Throughput (tokens/s) |
|---|---|---|---|---|---|
| 1 | 300 | 942 | 185 | 229 | 608 |
| 2 | 600 | 928 | 189 | 246 | 1,244 |
| 4 | 1,200 | 1,148 | 224 | 302 | 2,398 |
| 8 | 2,400 | 1,944 | 248 | 511 | 2,288 |
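For reference, TTFT, ITL, and TPOT are the standard serving-latency metrics. The sketch below shows their conventional definitions computed from per-token arrival timestamps for a single request; it is a generic illustration, not SGLang's internal implementation.

```python
# Conventional serving-latency metrics for one request, computed from the
# wall-clock send time and the arrival timestamps of each output token.

def latency_metrics(send_time: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - send_time  # Time To First Token
    # Inter-Token Latency: the gap between each pair of consecutive tokens.
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    # Time Per Output Token: decode time averaged over tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return {"ttft_s": ttft, "itl_s": itls, "tpot_s": tpot}

# Example: a request sent at t=0 whose first token arrives after 0.9 s,
# with each subsequent token 0.2 s apart.
token_times = [0.9 + 0.2 * i for i in range(5)]
print(latency_metrics(0.0, token_times))  # TTFT 0.9 s, ITLs 0.2 s, TPOT 0.2 s
```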
Latency Analysis:
Median Time To First Token (TTFT) dips slightly between 1 and 2 RPS (942 ms vs. 928 ms) but rises at 4 and 8 RPS, suggesting that while the system scales well, prefill queuing and resource contention begin to push up latency under heavier load.
Median Inter-Token Latency (ITL) follows the same trend, rising as the request rate grows, which indicates that per-token decode time is affected by concurrency overhead.
Throughput Trends:
Output token throughput increases significantly with request rate, peaking at 2,398 tokens/sec at 4 RPS before slipping to 2,288 tokens/sec at 8 RPS. This suggests the system reaches peak efficiency around 4 RPS, beyond which memory bandwidth or compute limits cap performance.
The drop at 8 RPS could be the result of increased scheduling overhead or memory contention across the eight GPUs.
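A quick scaling check, computed purely from the numbers in the table above, makes the saturation point explicit: throughput roughly doubles each time the request rate doubles, up to 4 RPS, and then flatlines.

```python
# Scaling check using only the numbers from the results table above:
# how much does throughput grow each time the request rate doubles?
measured = {1: 608, 2: 1244, 4: 2398, 8: 2288}  # request rate -> output tokens/s

rates = sorted(measured)
for lo, hi in zip(rates, rates[1:]):
    gain = measured[hi] / measured[lo]
    print(f"{lo} -> {hi} RPS: throughput x{gain:.2f}")
# 1 -> 2 RPS: x2.05  (near-linear scaling)
# 2 -> 4 RPS: x1.93  (still close to linear)
# 4 -> 8 RPS: x0.95  (saturated: double the load, slightly less throughput)
```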
Scalability Observations:
While the system scales well from 1 to 4 RPS, performance gains diminish beyond 4 RPS, highlighting a potential optimization opportunity for parallelism or memory management.
The NVIDIA H200 GPUs demonstrate strong performance in running DeepSeek-V3 inference workloads with excellent token throughput and competitive latency metrics. The setup efficiently scales up to 4 RPS, beyond which diminishing returns are observed due to system constraints.
Further optimizations could enhance throughput at higher concurrency levels, particularly in reducing TTFT and ITL at peak loads.
This benchmark underscores the NVIDIA H200’s capabilities for large-scale AI workloads, inference tasks, and high-throughput applications in production environments.
Looking forward, refining kernel tuning and optimizing inference engines such as SGLang will likely unlock even greater scalability and throughput. These advancements have the potential to redefine efficiency in high-performance LLM deployment.
Contact us today to discuss how we can save you time, money and stress!