#01 Qwen3-Next-80B-A3B-Instruct is the first model in the Qwen3-Next series, featuring groundbreaking architectural innovations: Hybrid Attention combining Gated DeltaNet and Gated Attention for efficient ultra-long-context modeling, High-Sparsity MoE with 512 experts (10 activated + 1 shared) for an extremely low activation ratio, and Multi-Token Prediction for improved performance and faster inference. With 80B total parameters and only 3B activated, it outperforms Qwen3-32B-Base at roughly 10% of the training cost while delivering about 10x the throughput for contexts beyond 32K tokens. The model performs on par with Qwen3-235B-A22B-Instruct-2507 while excelling at ultra-long-context tasks up to 256K tokens (extensible to 1M with YaRN). Architecture: 48 layers, trained on 15T tokens, hybrid layout of 12*(3*(Gated DeltaNet->MoE)->(Gated Attention->MoE)). Released: Sep 10, 2025. Reported benchmark score: 49.8%.
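The 48-layer figure follows directly from the stated layout: 12 repetitions of a block with three (Gated DeltaNet -> MoE) layers followed by one (Gated Attention -> MoE) layer. A minimal sketch of that composition, using purely illustrative layer labels rather than real module names:

```python
# Illustrative sketch of the stated hybrid layout:
# 12 * (3 * (Gated DeltaNet -> MoE) -> (Gated Attention -> MoE))
# The string labels are descriptive only, not actual module classes.

def build_hybrid_layout(num_blocks: int = 12) -> list[str]:
    layers = []
    for _ in range(num_blocks):
        layers += ["gated_deltanet+moe"] * 3   # linear-attention layers
        layers += ["gated_attention+moe"]      # full-attention layer
    return layers

layout = build_hybrid_layout()
assert len(layout) == 48                          # 48 layers total
assert layout.count("gated_deltanet+moe") == 36   # 75% Gated DeltaNet layers
assert layout.count("gated_attention+moe") == 12  # 25% Gated Attention layers
```

The 75%/25% split matches the layer proportions quoted for the Base model below.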
#02 Qwen3-Next-80B-A3B-Thinking is the thinking variant of the Qwen3-Next series, built on the same groundbreaking architecture as the instruct model. Leveraging GSPO, it addresses the stability and efficiency challenges that hybrid attention plus high-sparsity MoE poses for RL training. It uses Hybrid Attention combining Gated DeltaNet and Gated Attention for efficient ultra-long-context modeling, High-Sparsity MoE with 512 experts (10 activated + 1 shared), and Multi-Token Prediction. With 80B total parameters and only 3B activated, it demonstrates outstanding performance on complex reasoning tasks, outperforming Qwen3-30B-A3B-Thinking-2507, Qwen3-32B-Thinking, and even the proprietary Gemini-2.5-Flash-Thinking across multiple benchmarks. Architecture: 48 layers, trained on 15T tokens, hybrid layout of 12*(3*(Gated DeltaNet->MoE)->(Gated Attention->MoE)). It supports only thinking mode, automatically includes <think> tags, and may generate longer thinking content. Released: Sep 10, 2025.
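Because the model always emits thinking content terminated by a closing </think> tag, downstream code usually separates the reasoning trace from the final answer. A minimal sketch of that post-processing using plain string handling (the official model card parses a special token id instead, so treat this as an approximation):

```python
def split_thinking(completion: str) -> tuple[str, str]:
    """Split a raw completion into (thinking_content, final_answer).

    Assumes the completion contains a closing </think> tag, as produced by
    thinking-only Qwen3 variants; if none is found, the whole text is
    treated as the answer.
    """
    marker = "</think>"
    head, sep, tail = completion.rpartition(marker)
    if not sep:
        return "", completion.strip()
    # Strip an optional opening <think> tag from the reasoning trace.
    thinking = head.replace("<think>", "", 1).strip()
    return thinking, tail.strip()

thinking, answer = split_thinking("<think>Check 2+2.</think>\n4")
print(thinking)  # "Check 2+2."
print(answer)    # "4"
```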
#03 Qwen3-Next-80B-A3B-Base is the foundation model of the Qwen3-Next series, featuring revolutionary architectural innovations for ultimate training and inference efficiency. It introduces Hybrid Attention combining Gated DeltaNet (75% of layers) and Gated Attention (25% of layers) for efficient ultra-long-context modeling, Ultra-Sparse MoE with 512 total experts of which only 10 routed + 1 shared expert are activated (3.7% activation ratio), and native Multi-Token Prediction for faster inference. With 80B total parameters and only ~3B activated per inference step, it achieves performance comparable to Qwen3-32B while using less than 10% of the training cost and delivering over 10x the throughput for contexts beyond 32K tokens. It was trained on 15T tokens with training-stability-friendly designs, including Zero-Centered RMSNorm and normalized MoE router parameters, and supports a 256K context length, extensible to 1M tokens with YaRN scaling. Released: Sep 10, 2025.
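The quoted 3.7% activation ratio is consistent with reading it as activated parameters over total parameters, and extending the native 256K window to 1M tokens corresponds to roughly a 4x YaRN scaling factor. A back-of-the-envelope check; the rounded parameter counts and the exact token counts below are assumptions, not official figures:

```python
# Rough sanity checks; parameter counts are rounded marketing figures,
# so the results are approximations rather than official numbers.

total_params = 80e9        # ~80B total parameters
active_params = 3e9        # ~3B activated per inference step
activation_ratio = active_params / total_params
print(f"activation ratio ≈ {activation_ratio:.1%}")   # ≈ 3.8%, close to the quoted 3.7%

native_ctx = 262_144       # 256K native context length
target_ctx = 1_000_000     # 1M tokens with YaRN
yarn_factor = target_ctx / native_ctx
print(f"required YaRN scaling factor ≈ {yarn_factor:.2f}")  # ≈ 3.81, i.e. about 4x
```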
#04 Qwen3-235B-A22B-Thinking-2507 is a state-of-the-art thinking-enabled Mixture-of-Experts (MoE) model with 235B total parameters (22B activated). It features 94 layers, 128 experts (8 activated), and a native context length of 262K tokens. This version delivers significantly improved reasoning performance, achieving state-of-the-art results among open-source thinking models on logical reasoning, mathematics, science, coding, and academic benchmarks. Key enhancements include markedly better general capabilities (instruction following, tool usage, text generation), enhanced 256K long-context understanding, and increased thinking depth. The model supports only thinking mode and automatically includes <think> tags. Released: Jul 25, 2025.
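The "128 experts (8 activated)" figure describes top-k expert routing: a gate scores every expert for each token and only the 8 highest-scoring experts run, which is what keeps activated parameters far below the total. A schematic sketch of the generic top-k softmax gate (this illustrates the common MoE pattern, not Qwen's actual routing code):

```python
import numpy as np

def topk_gate(router_logits: np.ndarray, k: int = 8) -> tuple[np.ndarray, np.ndarray]:
    """Generic top-k softmax gate for a single token.

    router_logits: shape (num_experts,) scores from the routing layer.
    Returns (expert_ids, weights) for the k selected experts; all other
    experts are skipped for this token.
    """
    expert_ids = np.argsort(router_logits)[-k:]   # indices of the k best experts
    selected = router_logits[expert_ids]
    weights = np.exp(selected - selected.max())
    weights /= weights.sum()                      # softmax over the selected experts
    return expert_ids, weights

rng = np.random.default_rng(0)
ids, w = topk_gate(rng.normal(size=128), k=8)     # 128 experts, 8 activated per token
print(ids, w.round(3))
```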
#05 Qwen3-235B-A22B-Instruct-2507 is the updated instruct version of Qwen3-235B-A22B, featuring significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage. It provides substantial gains in long-tail knowledge coverage across multiple languages and markedly better alignment with user preferences in subjective and open-ended tasks. Released: Jul 22, 2025. Reported benchmark score: 57.3%.
#06 Qwen3-32B is a large language model from Alibaba's Qwen3 series. It features 32.8 billion parameters, a 128K-token context window, support for 119 languages, and hybrid thinking modes that allow switching between deep reasoning and fast responses. It demonstrates strong performance in reasoning, instruction following, and agent capabilities. Released: Apr 29, 2025. Reported benchmark score: 65.7%.
#07 Qwen3-30B-A3B is a smaller Mixture-of-Experts (MoE) model from Alibaba's Qwen3 series, with 30.5 billion total parameters and 3.3 billion activated parameters. It features hybrid thinking/non-thinking modes, support for 119 languages, and enhanced agent capabilities, and it aims to outperform previous models such as QwQ-32B while using significantly fewer activated parameters. Released: Apr 29, 2025. Reported benchmark score: 62.6%.
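The hybrid thinking/non-thinking behaviour mentioned for Qwen3-32B and Qwen3-30B-A3B is exposed through the chat template; the Qwen3 model cards document an enable_thinking switch on apply_chat_template. A hedged sketch of toggling it with Hugging Face transformers (model id and parameter usage assumed from those cards; verify against the current documentation):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")
messages = [{"role": "user", "content": "Give me a short introduction to MoE models."}]

# Thinking mode: the template leaves room for a <think>...</think> block.
prompt_thinking = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,   # Qwen3 switch documented in the model cards
)

# Non-thinking mode: fast responses without the reasoning block.
prompt_direct = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
```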
#08 Qwen3-235B-A22B is a large language model developed by Alibaba, featuring a Mixture-of-Experts (MoE) architecture with 235 billion total parameters and 22 billion activated parameters. It achieves competitive results against other top-tier models in benchmark evaluations of coding, math, general capabilities, and more. Released: Apr 29, 2025. Reported benchmark scores: 70.7% and 81.4%.
#09 Qwen2.5-Omni-7B is the flagship end-to-end multimodal model in the Qwen series. It processes diverse inputs including text, images, audio, and video, and delivers real-time streaming responses through text generation and natural speech synthesis using a novel Thinker-Talker architecture. Released: Mar 27, 2025. Reported benchmark scores: 78.7% and 73.2%.
#10 QwQ-32B is a model focused on advancing AI reasoning capabilities, particularly excelling in mathematics and programming. It features deep introspection and self-questioning abilities, while showing some limitations around language mixing and recursive or endless reasoning patterns. Released: Mar 5, 2025. Reported benchmark score: 63.4%.