DeepSeek-V2: An Efficient and Economical Mixture-of-Experts LLM

Large Language Models (LLMs) have revolutionized various fields, showcasing remarkable capabilities in understanding and generating human-like text. However, the increasing size of these models often comes with a significant computational cost, hindering their accessibility and deployment.

DeepSeek-V2, a cutting-edge, open-source language model built on the Mixture-of-Experts (MoE) architecture, addresses this challenge by incorporating innovative architectural designs and training methodologies to balance performance and efficiency. With 236 billion total parameters, of which only 21 billion are activated for each input token, DeepSeek-V2 can handle context lengths of up to 128K tokens, allowing it to process and understand extensive text sequences. More details in the following sections:

  1. Architectural Innovations
  2. Multi-head Latent Attention (MLA) and how it Boosts Inference Efficiency
  3. DeepSeekMoE and how to Train Strong Models Economically
  4. Alignment with Human Preferences
  5. Evaluation Results
  6. Conclusion

Architectural Innovations

The overall architecture, explained in DeepSeek-V2’s paper, is shown in the following image.

DeepSeek-V2 architecture

As you can see, DeepSeek-V2 builds on the powerful Transformer foundation (introduced by Vaswani et al. in 2017), which stacks blocks that combine attention mechanisms with feed-forward processing layers. However, to be both effective and efficient, DeepSeek-V2 takes this foundation further by introducing two key architectural innovations:

1. Multi-head Latent Attention (MLA)

MLA tackles the bottleneck of traditional Multi-Head Attention (MHA) mechanisms, which suffer from large Key-Value (KV) cache requirements during inference. This large cache limits the efficiency of generating text, especially for long sequences.

MLA utilizes a low-rank joint compression technique to significantly reduce the KV cache size without compromising performance. Instead of storing the full KV matrices, MLA compresses them into a much smaller latent vector. This compression significantly reduces the memory footprint and computation required during text generation.

To illustrate this, imagine a library with thousands of books (representing keys and values). MHA requires storing all these books for reference. MLA, on the other hand, creates a concise summary (latent vector) of each book, significantly reducing storage space while retaining essential information.

Comparison between Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), and Multi-head Latent Attention (MLA)

2. DeepSeekMoE

DeepSeekMoE is a specialized MoE architecture designed for efficient expert utilization. In MoE models, different “experts” specialize in different aspects of the data, allowing for greater model capacity and expressiveness. However, traditional MoE architectures often suffer from knowledge redundancy and limited expert specialization.

DeepSeekMoE addresses these limitations by dividing experts into smaller, more specialized units, which allows for more precise knowledge acquisition and reduces redundancy, and by isolating a subset of shared experts to prevent knowledge duplication among the routed experts.

Think of this as a team of specialized doctors: DeepSeekMoE divides each medical specialty into sub-specialties, ensuring each expert has a deep understanding of their specific area. Shared experts, like general practitioners, provide common knowledge, preventing unnecessary duplication among specialists.

Multi-head Latent Attention (MLA) and how it Boosts Inference Efficiency

Traditional Transformer models rely on Multi-Head Attention (MHA) mechanisms, which, while powerful, suffer from large Key-Value (KV) caches during inference. This large KV cache slows down the model’s ability to process and generate text.

As mentioned above, MLA tackles this issue by compressing keys and values into a smaller latent vector, significantly reducing the KV cache size without sacrificing performance.

Here’s how MLA works:

  • Joint Compression: MLA compresses keys and values into a single, much smaller latent vector using learned projection matrices (a minimal sketch of this step follows the list).
  • Inference Optimization: During inference, MLA eliminates the need to explicitly compute keys and values by absorbing the projection matrices into other parts of the model, further enhancing efficiency.
  • Decoupled Rotary Position Embedding: MLA uses Rotary Position Embedding (RoPE) in a particular way to get both efficient compression and strong performance. Ok, but what’s RoPE? RoPE is a technique that encodes the order of tokens in a sequence by rotating their query and key vectors as a function of position; this rotation helps the model capture relationships between words even when they are far apart. However, applying RoPE directly to the compressed keys would make them position-dependent and interfere with the matrix-absorption trick that makes MLA efficient at inference. To avoid this, MLA employs a “decoupled” approach: a small set of extra query and key components carries the RoPE information separately. This ensures that the key-value compression remains effective while the model still benefits from RoPE’s ability to capture word order.
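To make the joint-compression idea concrete, here is a minimal PyTorch sketch of the core projections. The dimensions, module names, and the simplified decoupled-RoPE handling are illustrative assumptions, not DeepSeek-V2’s actual implementation.

```python
import torch
import torch.nn as nn


class SimplifiedMLA(nn.Module):
    """Minimal sketch of MLA's low-rank joint KV compression.
    All dimensions and names here are illustrative assumptions."""

    def __init__(self, d_model=1024, n_heads=8, d_head=128,
                 d_latent=128, d_rope=32):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        # Down-projection: hidden state -> one small shared KV latent.
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: latent -> per-head keys and values. At inference
        # time these can be "absorbed" into the query/output projections,
        # so full K and V never need to be materialized.
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)
        # Decoupled RoPE path: a small extra key carries position info and
        # is cached alongside the latent (the RoPE rotation itself is omitted).
        self.w_k_rope = nn.Linear(d_model, d_rope, bias=False)

    def compress(self, h):
        """h: (batch, seq, d_model) -> the only tensors put in the KV cache."""
        c_kv = self.w_down_kv(h)        # (batch, seq, d_latent)
        k_rope = self.w_k_rope(h)       # (batch, seq, d_rope)
        return c_kv, k_rope

    def expand(self, c_kv):
        """Reconstruct per-head keys/values from the cached latent."""
        b, s, _ = c_kv.shape
        k = self.w_up_k(c_kv).view(b, s, self.n_heads, self.d_head)
        v = self.w_up_v(c_kv).view(b, s, self.n_heads, self.d_head)
        return k, v


# Usage: only c_kv and k_rope are cached per token, not full K/V.
mla = SimplifiedMLA()
h = torch.randn(1, 16, 1024)          # batch of 1, 16 tokens
c_kv, k_rope = mla.compress(h)
k, v = mla.expand(c_kv)               # reconstructed on the fly when needed
```

The key point is that only `c_kv` and `k_rope` are cached; the per-head keys and values are reconstructed when needed or, at inference time, avoided entirely through the absorption trick described in the bullet above.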

MLA’s impact on efficiency is significant: DeepSeek-V2’s KV cache is 93.3% smaller than that of its predecessor, DeepSeek 67B. This allows for faster inference and the ability to handle larger batch sizes and longer text sequences.
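To see why the saving is so large, here is a rough back-of-the-envelope comparison of what gets cached per token, per layer. The dimensions below are assumptions chosen for illustration, not DeepSeek-V2’s published configuration, so the resulting percentage only roughly echoes the paper’s 93.3% figure.

```python
# Illustrative dimensions only; DeepSeek-V2's real configuration differs.
n_heads, d_head = 32, 128        # assumed standard MHA setup
d_latent, d_rope = 512, 64       # assumed MLA latent and decoupled-RoPE sizes

mha_elems_per_token = 2 * n_heads * d_head   # full keys + values, every head
mla_elems_per_token = d_latent + d_rope      # one shared latent + small RoPE key

print(mha_elems_per_token, mla_elems_per_token)        # 8192 vs 576
print(f"cache reduction: {1 - mla_elems_per_token / mha_elems_per_token:.1%}")
```

The mechanism is the same regardless of the exact numbers: the cache stores one small latent per token instead of full per-head keys and values.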

DeepSeekMoE and how to Train Strong Models Economically

DeepSeekMoE is an innovative architecture used for Feed-Forward Networks (FFNs) within DeepSeek-V2. It enables the training of powerful models at a fraction of the cost by selectively activating only relevant “experts” for each token.

To use another analogy, think of DeepSeekMoE as a team of specialized experts: when a task arrives, only the experts relevant to that specific task are called upon, ensuring efficient resource utilization.

As partly introduced above, DeepSeekMoE trains strong models at an economical cost by employing several key techniques:

  • Finer-grained expert segmentation: DeepSeekMoE divides the experts into smaller, more specialized units than conventional MoE architectures, which allows for higher expert specialization and more accurate knowledge acquisition (a minimal sketch of the overall routing scheme follows this list).
  • Shared experts: Some experts are designated as shared experts that are always activated, while the remaining experts are routed experts that are selectively activated for each token. Having some shared experts helps mitigate redundancy in the knowledge captured by the routed experts.
  • Device-limited routing: To control MoE-related communication costs, DeepSeekMoE ensures that the activated experts for each token are spread across at most M devices. This bounds the communication cost per token while still allowing a large pool of experts to be used.
  • Auxiliary losses for load balancing: DeepSeekMoE incorporates three auxiliary losses to keep computation balanced and prevent individual experts from being overloaded (avoiding bottlenecks and ensuring efficient use of resources): an Expert-Level Balance Loss, which promotes balanced activation of individual experts; a Device-Level Balance Loss, which encourages balanced computation across devices; and a Communication Balance Loss, which ensures balanced communication loads between devices.
  • Token-dropping strategy: To further handle any remaining load imbalance, a device-level token-dropping approach is used during training. The lowest-affinity tokens on each device are dropped until the computation budget is met. This improves computational efficiency.
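To make the shared-plus-routed design concrete, here is a minimal PyTorch sketch of a DeepSeekMoE-style FFN layer. Expert counts, sizes, top-k, and the simplified balance loss are illustrative assumptions, not DeepSeek-V2’s real configuration; device-limited routing and token dropping are only noted in comments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedDeepSeekMoE(nn.Module):
    """Sketch of a DeepSeekMoE-style FFN layer: a few always-on shared
    experts plus many fine-grained routed experts. Sizes are assumptions."""

    def __init__(self, d_model=1024, d_ff=256,
                 n_shared=2, n_routed=64, top_k=6):
        super().__init__()
        self.top_k = top_k

        def make_expert():
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x):                         # x: (n_tokens, d_model)
        # Shared experts are always active and capture common knowledge.
        out = sum(expert(x) for expert in self.shared)

        # Routed experts: each token is sent only to its top-k experts by
        # router affinity. (Device-limited routing would further restrict
        # the chosen experts to at most M devices, and token dropping would
        # skip the lowest-affinity tokens on an overloaded device.)
        scores = F.softmax(self.router(x), dim=-1)        # (n_tokens, n_routed)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(out)
        for i, expert in enumerate(self.routed):
            mask = (topk_idx == i).any(dim=-1)            # tokens that chose expert i
            if mask.any():
                gate = topk_scores[mask][topk_idx[mask] == i].unsqueeze(-1)
                routed_out[mask] += gate * expert(x[mask])
        out = out + routed_out

        # Very simplified expert-level balance loss: pushes the average
        # routing probability toward a uniform distribution over experts.
        mean_prob = scores.mean(dim=0)
        balance_loss = len(self.routed) * (mean_prob * mean_prob).sum()
        return out, balance_loss
```

In DeepSeek-V2 this kind of layer replaces the dense FFN in most Transformer blocks; the sketch ignores the device-level and communication balance losses and everything related to multi-device execution, which matter a great deal at real scale.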

DeepSeek-V2 also benefits from high-quality data: it is pre-trained on a massive corpus of 8.1 trillion tokens, curated with a focus on quality and diversity, particularly for Chinese-language content.

These optimizations result in significant cost savings during training. Compared to its predecessor, DeepSeek 67B, DeepSeek-V2 achieves a 42.5% reduction in training costs, demonstrating the effectiveness of its sparse architecture and efficient training methodologies.

The model’s inference efficiency is further enhanced by converting parameters to FP8 precision and applying KV cache quantization. This results in a significantly smaller KV cache than DeepSeek 67B’s, allowing DeepSeek-V2 to handle larger batch sizes and achieve a generation throughput exceeding 50,000 tokens per second, 5.76 times that of DeepSeek 67B.
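As a rough illustration of what cache quantization means in general, here is a minimal sketch of per-token int8 quantization of a cached tensor: store low-precision integers plus a scale, and dequantize on read. This is not DeepSeek-V2’s actual scheme, which differs in its details; it only shows the basic idea.

```python
import torch


def quantize_cache(kv: torch.Tensor):
    """Per-token symmetric int8 quantization of a cached tensor.
    Illustrative only; DeepSeek-V2's actual scheme differs."""
    scale = kv.abs().amax(dim=-1, keepdim=True) / 127.0  # one scale per token
    scale = scale.clamp(min=1e-8)                        # avoid division by zero
    q = torch.round(kv / scale).to(torch.int8)           # 1 byte per element
    return q, scale


def dequantize_cache(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale


# A float16 cache entry takes 2 bytes per element; int8 takes 1 byte
# (plus a small per-token scale), roughly halving cache memory.
kv = torch.randn(4, 576)                   # 4 tokens, assumed latent size
q, scale = quantize_cache(kv)
print((dequantize_cache(q, scale) - kv).abs().max())  # small reconstruction error
```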

Alignment with Human Preferences

To ensure DeepSeek-V2 aligns with human expectations and preferences, it undergoes two stages of alignment:

1. Supervised Fine-Tuning (SFT)

DeepSeek-V2 is fine-tuned on a dataset of 1.5 million carefully curated conversational instances, focusing on helpfulness and safety. This dataset is meticulously refined to minimize incorrect or nonsensical outputs (hallucinations) and improve the quality of generated text. This stage produces the DeepSeek-V2 (SFT) model.

2. Reinforcement Learning (RL)

DeepSeek-V2’s alignment is further enhanced through Reinforcement Learning using Group Relative Policy Optimization (GRPO), which allows the model to learn from feedback and improve its responses based on human preferences.
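GRPO’s full details are beyond the scope of this post, but its core idea can be sketched in a few lines: instead of training a separate value network (as in PPO), the advantage of each sampled response is computed relative to the other responses in its group. The snippet below shows only that group-relative advantage step, with made-up reward values; it is a simplified view, not the complete training objective.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (n_groups, group_size) reward-model scores for several
    sampled responses to the same prompt. Returns advantages normalized
    within each group (a simplified view of GRPO's baseline)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True).clamp(min=1e-6)
    return (rewards - mean) / std


# Example: one prompt, four sampled responses scored by a reward model.
rewards = torch.tensor([[0.2, 0.9, 0.4, 0.7]])
print(group_relative_advantages(rewards))
# Responses scoring above their group's mean get positive advantages and
# are reinforced; those below the mean are pushed down.
```

These advantages are then used in a PPO-style clipped policy-gradient objective with a penalty that keeps the model close to a reference policy.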

The RL process employs a two-stage approach:

  • Reasoning Alignment: Focuses on improving the model’s reasoning abilities in specific domains like code and math, training a reward model based on compiler feedback and ground-truth labels.
  • Human Preference Alignment: Uses a multi-reward framework incorporating helpfulness, safety, and rule-based rewards to align the model with general human preferences across different conversational aspects.

This stage produces the DeepSeek-V2 (RL) model.

Evaluation Results

DeepSeek-V2 is evaluated on a comprehensive suite of benchmarks, showcasing its strong performance across various domains and languages:

1. Standard Benchmarks:

  • MMLU: A multi-task benchmark measuring a model’s understanding and reasoning across various subjects.
  • BBH: BigBench Hard, a suite of challenging tasks designed to test a model’s reasoning and problem-solving abilities.
  • C-Eval & CMMLU: Chinese language benchmarks evaluating a model’s understanding and reasoning in Chinese.
  • HumanEval & MBPP: Code generation benchmarks measuring a model’s ability to generate correct and functional code.
  • GSM8K & MATH: Math word problem benchmarks evaluating a model’s ability to solve mathematical problems presented in natural language.

Comparison of open source models

Notice how the final model (RL) is superior to the SFT model when it comes to code and math.

2. Open-Ended Generation:

  • AlpacaEval 2.0 & MT-Bench: English language benchmarks assessing a model’s performance on various conversational tasks.
  • AlignBench: A Chinese language benchmark evaluating a model’s ability to understand and generate human-like responses in Chinese.

Comparison with GPT-3.5 and GPT-4

DeepSeek-V2’s strong performance across these diverse benchmarks solidifies its position as a leading open-source language model, showcasing its capabilities in both English and Chinese, as well as specialized domains like code and math.

Conclusion

DeepSeek-V2 represents a significant advancement in open-source language modeling, offering a powerful and efficient solution for various applications. Its innovative architecture, efficient training methodologies, and rigorous alignment process enable it to achieve top-tier performance while remaining computationally cost-effective.

While DeepSeek-V2 shares some limitations with other LLMs, such as the potential for generating inaccurate information and limited proficiency in languages beyond English and Chinese, its strengths and ongoing development make it a valuable resource for researchers and developers working with large language models. Also, it’s open-source, so it’s good even if not perfect!

