AI Search explores the technical architecture behind DeepSeek V4, detailing how a compact team achieved massive scale despite limited computational resources.
The analysis breaks down innovations in hybrid attention systems, manifold-constrained hyper-connections (mHC), and optimized training pipelines that allow this 1.6-trillion-parameter model to manage a 1-million-token context window efficiently.
00:00 - DeepSeek V4 intro
01:00 - DeepSeek V4 specs
02:06 - The challenge of 1M context
04:16 - Hybrid attention
05:11 - CSA & sparse selection
06:50 - HCA
08:22 - Sliding window attention
10:44 - Insane efficiency gains
12:02 - Signal explosion
13:00 - Residual connections
13:52 - mHC
14:17 - ChatLLM
15:24 - mHC continued
17:54 - Muon
19:26 - Infra challenges
22:31 - Training challenges
24:09 - Anticipatory routing
25:24 - SOTA results