
Matt Willsmore
As I write this at the start of February 2025, the technology share market is just beginning to recover from an enormous plunge that wiped roughly a trillion dollars off tech valuations a few days ago. Nvidia, the most prominent GPU manufacturer and a leader in AI, lost 17% of its value in a single day of trading. The trigger was the unveiling of a new model developed in China called DeepSeek. It remains to be seen whether this has caused a permanent adjustment in the AI market. Below is a detailed overview of why DeepSeek matters, how it differs from massive proprietary models, and some of the technical underpinnings that made the market sit up, pay attention, and sell off.

Why All the Fuss?
DeepSeek’s release has generated a buzz in AI communities thanks to its open-source availability and resource-efficient design. It offers a compelling alternative to the established, heavily funded, large-scale models by showing that strong performance doesn’t demand massive investment.
Our DeepSeek overview analysis identified these key factors driving all the excitement:
- It’s open source and free. This allows researchers, businesses, and independent developers to experiment, customize, and deploy DeepSeek without gatekeeping, metered APIs, or restrictive licenses. It also fosters broader collaboration and community-driven improvements.
- It’s lightweight yet powerful enough to compete with larger models. Unlike many AI systems that rely on massive infrastructure to function, DeepSeek’s design requires fewer resources for both training and deployment. This makes it appealing to those who don’t have access to supercomputers or huge server clusters.
- It’s competitive with OpenAI models. Comparisons show that DeepSeek achieves comparable or better results than OpenAI’s models on benchmark tests, challenging the notion that performance always scales directly with model size. This opens the door for more efficient AI development that doesn’t break the bank.
- It’s fine-tuneable with reasonable amounts of hardware. Because of its efficient architecture, users can adapt DeepSeek to their specific tasks without a multi-million-dollar infrastructure. This enables startups, smaller organizations, and individual researchers to train and refine their own iterations of the model.
- It’s customizable and even personalizable for individual needs. Because fine-tuning is feasible, developers can tailor the model’s behaviour for unique use cases, whether that involves domain-specific knowledge, personalized models, or specialized industry applications.
- You can run it inside corporate firewalls. Running a state-of-the-art model behind the firewall will most likely be cheaper, gives companies their own dedicated AI resource, and, importantly, allows it to be used with datasets and workloads that simply can’t be sent to cloud-hosted services such as OpenAI. A brief sketch of what running and fine-tuning the model locally can look like follows this list.
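To make the last two points concrete, here is a minimal sketch of what running and adapting the model on your own hardware can look like. It is an illustration, not official DeepSeek tooling: the distilled checkpoint name, LoRA settings, and prompt are assumptions, and it presumes the transformers, peft, and accelerate libraries plus a single reasonably sized GPU.

```python
# A minimal sketch: load a distilled DeepSeek-R1 checkpoint locally and attach
# LoRA adapters for lightweight fine-tuning. Model name and hyperparameters are
# illustrative assumptions, not recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed distilled checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # half precision keeps memory within a single GPU
    device_map="auto",            # spreads layers across available local devices
)

# Attach low-rank adapters so only a small fraction of weights would be trained.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# Quick local inference check -- nothing leaves the corporate network.
prompt = "Summarise our internal incident report policy in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

An actual fine-tuning run would add a training loop (for example with the Hugging Face Trainer) over your own domain data, but even this small sketch shows the point: everything stays on hardware you control.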
The fact that it is advanced in its capabilities while also being more widely accessible does, however, raise some interesting safety concerns which need to be sorted out.
What Makes DeepSeek Different?
DeepSeek has been open about the hardware costs and techniques used. This article (How did DeepSeek train its AI model on a lot less and crippled hardware?) provides an excellent and more in-depth round-up of the details, some of which I have summarised here. These details demonstrate that the trend of using ever more brute force to create ever bigger and better models rests on a false assumption. DeepSeek is different because it demonstrates that you can do more with less through clever optimisations. Here are the details:
- DeepSeek’s V3 base model (on which R1 builds) was trained for roughly 2.79 million GPU-hours on a 2,048-GPU cluster at an estimated cost of around $6 million, whereas training OpenAI’s GPT-4 reportedly cost around $100 million. This suggests that state-of-the-art models do not necessarily require massive spending.
- It shows that brute force, meaning more GPUs and bigger models, doesn’t always translate into better AI performance. DeepSeek challenges the long-held assumption that size alone guarantees superior results, encouraging more strategic, innovative training practices.
- DeepSeek shows it’s possible to achieve high performance through sophisticated optimization methods rather than sheer magnitude of parameters and hardware.
- Although there is suspicion in the US about the project’s Chinese origins, the team has been transparent about its methods and breakthroughs. This means similar models may soon appear in other regions.
DeepSeek’s Technical Details
Beyond the headlines about low cost and high performance, our DeepSeek overview analysis reveals that its underlying structure is equally noteworthy. The team focuses on overlapping communication and computation through careful scheduling, together with memory optimization, quantization, and load-balancing strategies, to use the cluster efficiently. These principles not only lower training expenses but also improve training speed:
- DeepSeek offers two versions, R1 and V3. R1 includes additional reinforcement learning and supervised fine-tuning steps, while V3 has a smaller output window. Both are Mixture of Experts models with 671B parameters. A direct comparison can be found at this link, which shows how those variations affect performance and context handling. A full technical report is available here.
- The team reports 2.66 million GPU-hours on H800 graphics chips for pretraining, followed by 119,000 GPU-hours for context extension and 5,000 GPU-hours for supervised fine-tuning and reinforcement learning. This totals roughly 2.79 million GPU-hours, which works out to about $5.58 million at a rate of $2 per GPU-hour (a quick back-of-the-envelope check follows this list).
- Training involved 256 server nodes, each equipped with eight H800 GPU accelerators, for a total of 2,048 GPUs. While substantial, this setup is still far smaller and lower-spec than the huge clusters often associated with big-name AI labs, underscoring DeepSeek’s emphasis on efficiency.
- On 2,048 H800 GPUs, it would take under two months to train DeepSeek-V3. This timeframe shows that a well-optimized system with a reasonable GPU cluster can develop a top-tier model in a manageable period, suggesting faster iteration cycles are possible.
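As a quick back-of-the-envelope check, the published figures fit together. The numbers below come from the bullets above (the technical report’s exact pretraining figure is 2,664K GPU-hours, rounded to 2.66 million above), and the $2-per-GPU-hour rental rate is the assumption used in that report.

```python
# Sanity-check the published DeepSeek-V3 GPU-hour figures.
pretraining = 2_664_000   # GPU-hours on H800s (2.66M as rounded above)
context_ext =   119_000   # long-context extension
sft_and_rl  =     5_000   # supervised fine-tuning + reinforcement learning

total_hours = pretraining + context_ext + sft_and_rl
print(f"{total_hours:,} GPU-hours")                          # 2,788,000 (~2.79M)
print(f"${total_hours * 2:,} at $2/GPU-hour")                # ~$5.58M
print(f"{total_hours / 2048 / 24:.0f} days on 2,048 GPUs")   # ~57 days, i.e. under two months
```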
DeepSeek’s Technical Innovations
DeepSeek’s team prioritizes novel training and optimization methods over sheer size. They’ve introduced a scheduling mechanism called DualPipe that overlaps computation with the all-to-all communication between cluster nodes, so GPUs spend less time waiting on the network. These innovations demonstrate that clever engineering can unlock high-impact improvements (simplified code sketches of several of these ideas follow the list):
- Training efficiency is boosted by accelerating communication across GPUs through better scheduling and coordination between different nodes. This streamlined data exchange keeps GPUs busy and reduces idle time, thereby lowering overall training periods.
- DualPipe is the specialized mechanism that schedules the all-to-all communication within the cluster alongside computation, maximizing how much of the two can overlap. It also trims the memory footprint, making the system leaner during training.
- Mixture of Experts (MoE) architectures are made more efficient through novel load balancing algorithms, ensuring that no single expert module becomes a bottleneck. This leads to more efficient, performant training.
- DeepSeek employs multiple quantization techniques, fine-tuning numerical precision according to the specific needs of each operation. This allows the model to reduce computational overhead without noticeably compromising accuracy, striking a balance between speed and fidelity.
- Both R1 and V3 incorporate these optimizations, but apparently R1 additionally leverages outputs from other AI models during two supervised fine-tuning stages and two reinforcement learning stages. This extra input strengthens R1’s reasoning capabilities and contextual understanding.
- The refined “chain of thought” reasoning in R1 is eventually distilled back into V3, enabling the base model to benefit from advanced reasoning skills. Distillation lets V3 gain the sophisticated analytical abilities pioneered in R1 without repeating R1’s extra fine-tuning and reinforcement learning stages.
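The sketches below are simplified illustrations of the ideas in this list, written in PyTorch; they are my own reading of the published descriptions, not DeepSeek’s actual code. First, the communication/computation overlap: the core trick is to launch the expensive all-to-all exchange asynchronously and do useful work while the transfer is in flight. This assumes a distributed process group (for example NCCL) has already been initialised.

```python
# Simplified overlap of all-to-all communication with computation
# (an illustration of the scheduling idea, not DualPipe itself).
import torch
import torch.distributed as dist

def overlapped_step(expert_block, local_batch, send_buf, recv_buf):
    # Start the token exchange between nodes without blocking.
    handle = dist.all_to_all_single(recv_buf, send_buf, async_op=True)

    # While the network transfer runs, keep the GPU busy on data that is
    # already local -- this is time that would otherwise be spent idle.
    local_out = expert_block(local_batch)

    # Synchronise only once the overlapping compute is finished.
    handle.wait()
    return local_out, recv_buf
```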
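Next, the load-balancing idea for Mixture of Experts routing. DeepSeek describes an “auxiliary-loss-free” strategy in which a per-expert bias steers tokens away from overloaded experts; the sketch below is a simplified reading of that idea, and the expert count, update step size, and use of softmax scores are all illustrative assumptions.

```python
# Simplified bias-based load balancing for MoE routing (illustrative only).
import torch

def route_tokens(router_logits, expert_bias, top_k=2):
    # The bias is added only when choosing *which* experts handle a token...
    biased_scores = router_logits + expert_bias
    topk_idx = biased_scores.topk(top_k, dim=-1).indices
    # ...while the mixing weights still come from the unbiased scores.
    weights = torch.gather(router_logits.softmax(dim=-1), -1, topk_idx)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return topk_idx, weights

def update_bias(expert_bias, topk_idx, num_experts, step=1e-3):
    # Count how many tokens each expert received, then nudge the bias
    # down for overloaded experts and up for under-used ones.
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    return expert_bias - step * torch.sign(load - load.mean())

# Toy usage: 16 tokens routed across 8 experts.
logits = torch.randn(16, 8)
bias = torch.zeros(8)
idx, w = route_tokens(logits, bias)
bias = update_bias(bias, idx, num_experts=8)
```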
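Third, low-precision arithmetic. DeepSeek’s exact FP8 recipe is beyond a blog sketch, but the underlying idea of scaling values per block, so a few outliers don’t ruin precision everywhere, can be shown with a simple int8 example (block size and bit width here are arbitrary choices).

```python
# Block-wise quantize/dequantize: a rough illustration of low-precision storage.
import torch

def quantize_blockwise(x, block_size=128, bits=8):
    # One scale per block: outliers only affect their own block's precision.
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp((blocks / scale).round(), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize_blockwise(q, scale, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(1024, 128)                 # a toy weight matrix
q, s = quantize_blockwise(w.flatten())
w_hat = dequantize_blockwise(q, s, w.shape)
print((w - w_hat).abs().max())             # small error at a fraction of the memory
```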
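Finally, distillation. DeepSeek’s reported approach involves fine-tuning on reasoning outputs generated by the stronger model, but the textbook formulation of the same idea is to train a student to match the teacher’s softened output distribution alongside the usual ground-truth loss. The snippet below shows that classic version; the temperature and mixing weight are illustrative.

```python
# Classic knowledge-distillation loss (generic technique, not DeepSeek's pipeline).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: batch of 4 positions over a 10-symbol vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```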
DeepSeek Overview: Conclusion
DeepSeek represents a significant shift in how we think about AI scaling and performance, challenging the notion that bigger models, larger GPU clusters, and greater spending automatically equate to superior results. Its open-source status empowers developers to take advantage of the model without necessarily relying on the tech giants’ infrastructure. This challenge to their dominance is the main reason for the sudden shift in market thinking a couple of days ago.
If you’re finding it hard to keep up with the fast pace of all this, but want some advice on how it might benefit your company, as always, feel free to CONTACT US with questions or to request a complimentary consultation to discuss your ongoing search and AI projects.
Cheers,
Matt
Additional Resources
- DeepSeek R1 Explained by a Retired Microsoft Engineer – Dave’s Garage on YouTube
- Try DeepSeek without security risk on Perplexity – but it’s still “censored.”
- Apple researchers reveal the secret sauce behind DeepSeek AI | ZDNET
- 7 Tech Trends in AI and Search for 2025 – Pureinsights – see #1 on this list!
- 1-Bit LLMs: The Future of Efficient AI? – Pureinsights