
DeepSeek Redefines Throughput and Latency with Cross-Node Expert Parallelism


As large language models continue to scale, traditional parallelism methods increasingly fall short—leading to slower inference speeds and rising computational costs. DeepSeek-V3/R1 addresses this challenge with a novel approach: Cross-Node Expert Parallelism (EP). By distributing workloads intelligently across multiple GPUs—without overburdening any single node—this system delivers substantial improvements in both throughput and latency compared to conventional frameworks like vLLM.

Performance at Scale

DeepSeek-V3/R1 achieves an average input throughput of 73.7K tokens per second per H800 node during prefilling (including cache hits) and an output throughput of 14.8K tokens per second per H800 node during decoding, significantly outperforming standard vLLM configurations. This is made possible through a combination of large-scale cross-node EP, a prefill-decode disaggregation architecture, and advanced optimization strategies, including communication-computation overlap and intelligent load balancing.

Design Objective: Maximum Throughput, Minimal Latency

At the heart of DeepSeek’s inference system lies a clear goal: higher throughput, lower latency. Cross-Node Expert Parallelism is the core technology enabling this, as it:

  • Scales batch sizes to improve GPU matrix computation efficiency;
  • Distributes expert layers across GPUs to reduce memory access and latency.

However, implementing EP at scale introduces complexities—especially around cross-node communication and load balancing. DeepSeek tackles these with a carefully engineered system that:

  1. Scales batch sizes to match expert sparsity,
  2. Overlaps communication and computation to eliminate bottlenecks,
  3. Balances workloads dynamically across GPUs to prevent underutilization.
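
To make the idea concrete, here is a minimal sketch (not DeepSeek's actual implementation) of how expert-parallel dispatch spreads work: each token picks its top-k experts, and the resulting work lands on whichever GPU hosts each expert. The sizes and the random router below are illustrative assumptions.

import random
from collections import defaultdict

# Illustrative sizes; DeepSeek-V3/R1 routes each token to 8 of 256 experts.
NUM_EXPERTS = 256
TOP_K = 8
NUM_GPUS = 32                        # e.g. the EP32 prefill unit
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS

def owner_gpu(expert_id):
    # Simple contiguous sharding: experts 0-7 on GPU 0, 8-15 on GPU 1, and so on.
    return expert_id // EXPERTS_PER_GPU

def dispatch(num_tokens):
    # Count how many (token, expert) pairs each GPU must process.
    work_per_gpu = defaultdict(int)
    for _ in range(num_tokens):
        # A real router uses learned gating scores; random choice stands in here.
        for expert in random.sample(range(NUM_EXPERTS), TOP_K):
            work_per_gpu[owner_gpu(expert)] += 1
    return work_per_gpu

work = dispatch(num_tokens=4096)
print("average (token, expert) pairs per GPU:", sum(work.values()) / NUM_GPUS)

Because only a small fraction of experts fire per token, each GPU only stays busy when the global batch is large, which is why scaling batch size and scaling EP go hand in hand.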

Large-Scale Cross-Node Expert Parallelism (EP)

DeepSeek-V3/R1 activates only 8 out of 256 experts per layer, requiring extremely large batch sizes to ensure each expert processes enough tokens for optimal efficiency. To manage this, DeepSeek employs different parallelism strategies for the prefill and decode phases:

  • Prefill Phase: Routed Expert EP32 + Shared Expert DP32 across 4 nodes, with 32 redundant routed experts. Each GPU handles 9 routed experts and 1 shared expert.
  • Decode Phase: Routed Expert EP144 + Shared Expert DP144 across 18 nodes, with 32 redundant routed experts. Each GPU manages 2 routed experts and 1 shared expert.

This dual-stage design maximizes parallelism while minimizing computational overhead per GPU.
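
A quick back-of-the-envelope check of these two configurations (a sketch assuming 8 GPUs per H800 node and counting the 32 redundant routed experts noted above):

# Expert-placement arithmetic for the two deployment units (8 GPUs per H800 node assumed).
ROUTED_EXPERTS = 256
REDUNDANT_EXPERTS = 32              # extra replicas of hot experts

for phase, nodes in (("prefill (EP32)", 4), ("decode (EP144)", 18)):
    gpus = nodes * 8
    routed_per_gpu = (ROUTED_EXPERTS + REDUNDANT_EXPERTS) // gpus
    print(f"{phase}: {gpus} GPUs, {routed_per_gpu} routed + 1 shared expert per GPU")

# prefill (EP32): 32 GPUs, 9 routed + 1 shared expert per GPU
# decode (EP144): 144 GPUs, 2 routed + 1 shared expert per GPU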


Communication-Computation Overlap

Cross-node EP naturally introduces significant communication overhead. To mitigate this, DeepSeek employs a dual-microbatch strategy:

  • During prefilling, microbatches are executed alternately, hiding communication latency behind computation.
  • During decoding, attention layers are subdivided into two steps and executed in a 5-stage pipeline, enabling seamless overlap of communication and computation.
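
The toy schedule below illustrates the dual-microbatch idea (a conceptual sketch, not DeepSeek's actual pipeline): the two microbatches are offset by one phase, so whenever one is waiting on an all-to-all transfer, the other occupies the GPU's compute units.

# Two microbatches stepping through alternating compute and communication phases.
PHASES = ["attention (compute)", "dispatch (all-to-all)",
          "expert MLP (compute)", "combine (all-to-all)"]

for t in range(8):
    a = PHASES[t % 4]
    # B is offset by one phase, so its all-to-all transfers always coincide
    # with A's compute phases, and vice versa.
    b = PHASES[(t + 1) % 4]
    print(f"step {t}: microbatch A -> {a:24s} | microbatch B -> {b}")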

For more technical details, DeepSeek has published further documentation on GitHub.


Load Balancing Across Nodes

Uneven loads across GPUs can create bottlenecks, reducing overall system performance. DeepSeek implements multi-layered load balancing to ensure smooth operations:

  1. Prefill Load Balancer
    • Equalizes input token distribution and core-attention computation across DP instances.
  2. Decode Load Balancer
    • Balances KVCache memory usage and request loads across GPUs during decoding.
  3. Expert-Parallel Load Balancer
    • Distributes high-load experts across GPUs to avoid computation hotspots.
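
As a simplified picture of the expert-parallel balancer (an illustration of the idea, not DeepSeek's algorithm), the sketch below treats placement as greedy bin-packing: assign the heaviest experts first, each to the currently least-loaded GPU.

import heapq

def place_experts(expert_loads, num_gpus):
    # Greedy bin-packing: the heaviest expert goes to the currently least-loaded GPU.
    heap = [(0.0, g) for g in range(num_gpus)]       # (accumulated load, gpu id)
    assignment = [[] for _ in range(num_gpus)]
    for expert in sorted(range(len(expert_loads)),
                         key=lambda e: expert_loads[e], reverse=True):
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(expert)
        heapq.heappush(heap, (load + expert_loads[expert], gpu))
    return assignment

# Hypothetical per-expert loads (e.g. tokens routed to each expert), deliberately skewed.
loads = [10, 9, 8, 7, 6, 5, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1]
for gpu, experts in enumerate(place_experts(loads, num_gpus=4)):
    print(f"GPU {gpu}: experts {experts}, total load {sum(loads[e] for e in experts)}")

In practice, DeepSeek also replicates the hottest experts (the redundant routed experts mentioned earlier), which this simple sketch does not model.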

Real-World Deployment and Cost Efficiency

DeepSeek-V3/R1 is deployed on H800 GPUs, using precision formats aligned with training: FP8 for matrix multiplications and dispatch transmissions, and BF16 for core MLA computations and combine transmissions. The system scales dynamically to match daily usage patterns:

  • Daytime peak: Full deployment across all nodes.
  • Nighttime low load: Resources reallocated to training and research.

24-Hour Operational Metrics (UTC+8, 02/27/2025–02/28/2025)

  • Peak Node Occupancy: 278 nodes
  • Average Node Occupancy: 226.75 nodes
  • Total Input Tokens: 608B (56.3% cache hit rate)
  • Total Output Tokens: 168B
  • Cost (H800 leased at $2 per GPU-hour): $87,072/day
  • Theoretical Revenue (all tokens billed at R1 pricing): $562,027/day
  • Profit Margin: ~545%

Note: Actual revenue is lower due to:

  • Lower pricing for V3 vs. R1,
  • Free access via web and app,
  • Nighttime usage discounts.
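
The headline numbers can be roughly reproduced with a few lines of arithmetic. The token prices below are DeepSeek-R1's list prices at the time of the report and should be read as assumptions of this sketch rather than figures quoted above:

# Rough reconstruction of the daily cost/revenue figures above.
# Assumed R1 list prices: $0.14 / 1M input tokens on cache hit,
# $0.55 / 1M on cache miss, $2.19 / 1M output tokens.
AVG_NODES, GPUS_PER_NODE, GPU_HOURLY_COST = 226.75, 8, 2.0
INPUT_TOKENS, CACHE_HIT_RATE, OUTPUT_TOKENS = 608e9, 0.563, 168e9

cost = AVG_NODES * GPUS_PER_NODE * GPU_HOURLY_COST * 24
revenue = (INPUT_TOKENS * CACHE_HIT_RATE * 0.14
           + INPUT_TOKENS * (1 - CACHE_HIT_RATE) * 0.55
           + OUTPUT_TOKENS * 2.19) / 1e6

print(f"daily cost:    ${cost:,.0f}")                   # ~$87,072
print(f"daily revenue: ${revenue:,.0f}")                # ~$562,000 if billed at R1 rates
print(f"margin:        {(revenue - cost) / cost:.0%}")  # ~545%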

Conclusion

DeepSeek-V3/R1 demonstrates how Cross-Node Expert Parallelism combined with prefill-decode disaggregation, communication overlap, and smart load balancing can unlock new levels of efficiency. This system lays a strong foundation for serving trillion-parameter models at scale, offering a compelling vision for the next generation of high-performance AI infrastructure.