From f40c98bf15bf6c281f5f7ad5e20092426c9acf57 Mon Sep 17 00:00:00 2001 From: Disya123 Date: Sun, 8 Mar 2026 00:25:38 +0600 Subject: [PATCH] vault backup: 2026-03-08 00:25:38 --- .../.obsidian/plugins/obsidian-git/data.json | 4 +- Obsidian/Qwen3.5.md | 140 ++++++++++++++++-- 2 files changed, 129 insertions(+), 15 deletions(-) diff --git a/Obsidian/.obsidian/plugins/obsidian-git/data.json b/Obsidian/.obsidian/plugins/obsidian-git/data.json index 1370099..050c7ff 100644 --- a/Obsidian/.obsidian/plugins/obsidian-git/data.json +++ b/Obsidian/.obsidian/plugins/obsidian-git/data.json @@ -7,8 +7,8 @@ "autoPushInterval": 0, "autoPullInterval": 0, "autoPullOnBoot": true, - "autoCommitOnlyStaged": true, - "disablePush": true, + "autoCommitOnlyStaged": false, + "disablePush": false, "pullBeforePush": false, "disablePopups": false, "showErrorNotices": true, diff --git a/Obsidian/Qwen3.5.md b/Obsidian/Qwen3.5.md index ebcaaed..f638868 100644 --- a/Obsidian/Qwen3.5.md +++ b/Obsidian/Qwen3.5.md @@ -25,63 +25,177 @@ Qwen3.5 represents a major advancement over prior models in the Qwen series, suc ### Key features -**Qwen3.5** introduces several defining innovations that distinguish it from prior models in the series. 
Central to its design is a unified vision-language foundation built through early fusion training on trillions of multimodal tokens, enabling seamless integration of text and visual inputs from the earliest stages of processing.[^1][^4] The series expands global linguistic coverage to 201 languages and dialects, an increase from 119 in previous versions, supporting nuanced cross-lingual reasoning and inclusive deployment across diverse cultural and regional contexts.[^1][^3] Qwen3.5 incorporates scalable reinforcement learning across million-agent environments with progressively complex task distributions, promoting robust generalization and adaptability in real-world agentic scenarios.[^1][^4] Training infrastructure achieves near-100% multimodal training efficiency relative to text-only baselines, facilitated by asynchronous RL frameworks and heterogeneous parallelism that maintain high throughput on mixed data.[^1][^3] Native multimodal capabilities encompass vision, video, and STEM data processing, allowing the model to handle image understanding, video analysis, scientific reasoning, and related tasks within a single unified framework.[^1][^4] These advancements are supported by an efficient hybrid architecture that enables high-throughput inference with minimal overhead.[^4] +**Qwen3.5** introduces several defining innovations that distinguish it from prior models in the series. 
Central to its design is a unified vision-language foundation built through early fusion training on trillions of multimodal tokens, enabling seamless integration of text and visual inputs from the earliest stages of processing.[^1][^4] + +The series expands global linguistic coverage to 201 languages and dialects, an increase from 119 in previous versions, supporting nuanced cross-lingual reasoning and inclusive deployment across diverse cultural and regional contexts.[^1][^3] + +Qwen3.5 incorporates scalable reinforcement learning across million-agent environments with progressively complex task distributions, promoting robust generalization and adaptability in real-world agentic scenarios.[^1][^4] + +Training infrastructure achieves near-100% multimodal training efficiency relative to text-only baselines, facilitated by asynchronous RL frameworks and heterogeneous parallelism that maintain high throughput on mixed data.[^1][^3] + +Native multimodal capabilities encompass vision, video, and STEM data processing, allowing the model to handle image understanding, video analysis, scientific reasoning, and related tasks within a single unified framework.[^1][^4] + +These advancements are supported by an efficient hybrid architecture that enables high-throughput inference with minimal overhead.[^4] Architecture ------------ ### Hybrid architecture -Qwen3.5 employs a hybrid architecture that integrates Gated Delta Networks, a form of linear attention mechanism, with a sparse Mixture-of-Experts (MoE) design to achieve efficient processing while maintaining strong performance.[^1][^4] The architecture alternates attention mechanisms across layers in a structured pattern, typically organized as repeated blocks where three consecutive layers use Gated DeltaNet followed by a feed-forward network (FFN), and one layer uses Gated Attention followed by an FFN, creating a 3:1 hybrid ratio of linear to full attention mechanisms. 
Gated DeltaNet provides linear complexity attention for reduced computational demands, particularly beneficial for long sequences, while Gated Attention layers deliver enhanced representational power where needed.[^4] This hybrid approach combines with sparse MoE routing, where only a subset of experts activates per token, yielding significant sparsity in computation. In the flagship model, this results in only 17 billion active parameters out of 397 billion total parameters.[^1] The design natively supports a context length of 262,144 tokens and can extend to over 1 million tokens through techniques such as RoPE scaling.[^4] By leveraging this combination, the architecture enables high-throughput inference with minimal latency and cost overhead compared to traditional dense models.[^1] +Qwen3.5 employs a hybrid architecture that integrates Gated Delta Networks, a form of linear attention mechanism, with a sparse Mixture-of-Experts (MoE) design to achieve efficient processing while maintaining strong performance.[^1][^4] + +The architecture alternates attention mechanisms across layers in a structured pattern, typically organized as repeated blocks where three consecutive layers use Gated DeltaNet followed by a feed-forward network (FFN), and one layer uses Gated Attention followed by an FFN, creating a 3:1 hybrid ratio of linear to full attention mechanisms. Gated DeltaNet provides linear complexity attention for reduced computational demands, particularly beneficial for long sequences, while Gated Attention layers deliver enhanced representational power where needed.[^4] + +This hybrid approach combines with sparse MoE routing, where only a subset of experts activates per token, yielding significant sparsity in computation. 
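The per-token sparsity can be illustrated with a toy routing sketch (illustrative only; the expert count, `k`, and the random scoring below are assumptions, not Qwen3.5's published configuration):

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(scores, k=2):
    """Pick the top-k experts for one token and normalize their weights."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([scores[i] for i in top])
    return list(zip(top, weights))

random.seed(0)
num_experts, k = 64, 2                       # hypothetical sizes
scores = [random.gauss(0, 1) for _ in range(num_experts)]
selected = route(scores, k)                  # only these experts run for this token
print(len(selected))        # → 2
print(k / num_experts)      # → 0.03125 (fraction of expert compute used)
```

Real MoE routers are learned linear layers over the hidden state rather than random scores; the point is that compute scales with `k`, not with the total expert count.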
In the flagship model, this results in only 17 billion active parameters out of 397 billion total parameters.[^1] + +The design natively supports a context length of 262,144 tokens and can extend to over 1 million tokens through techniques such as RoPE scaling.[^4] + +By leveraging this combination, the architecture enables high-throughput inference with minimal latency and cost overhead compared to traditional dense models.[^1] ### Model variants -The Qwen3.5 family comprises open-source multimodal large language models released in phases by the Qwen team at Alibaba Cloud, beginning with the flagship model on February 16, 2026, and the Qwen3.5 Medium Model series on February 25, 2026.[^6][^7][^2] The initial release is Qwen3.5-397B-A17B, featuring 397 billion total parameters and 17 billion active parameters.[^8] Subsequent models include Qwen3.5-122B-A10B with 122 billion total parameters and 10 billion active parameters, Qwen3.5-35B-A3B with 35 billion total parameters and 3 billion active parameters, the dense variant Qwen3.5-27B with 27 billion parameters, and Qwen3.5-9B with 9 billion parameters.[^9][^4][^10] The Qwen3.5-9B is a multimodal model supporting text and image inputs, featuring a 256K context window, strong reasoning and coding performance, and support for 201 languages and dialects. It is available for local execution on Ollama with an approximate download size of 6.6 GB via the command `ollama run qwen3.5:9b`. Quantized versions (e.g., GPTQ-Int4) enable inference on consumer GPUs with 8 GB VRAM, such as the NVIDIA RTX 4060.[^10][^11] The Qwen3.5-2B is a lightweight variant with 2 billion parameters, released on March 2, 2026 as part of the small models series optimized for efficient on-device inference and local deployment on consumer hardware such as mobile devices. Quantized versions enable its use on GPUs with 8 GB VRAM, such as the NVIDIA RTX 4060. 
It supports multimodal inputs (text and images), features a 256K context window, strong reasoning and coding performance, toggleable thinking modes, and compatibility with the MLX framework for Apple Silicon to enable smooth on-device execution. It leverages the Gated DeltaNet hybrid architecture for enhanced efficiency and performance in logic and math tasks.[^12][^1] The Qwen3.5-35B-A3B is a sparse Mixture-of-Experts (MoE) model. While the model excels in efficiency, speed, reasoning, and coding due to its MoE design with 3B active parameters out of 35B total, community observations indicate mixed reviews for roleplay and creative writing, where denser models like Qwen3.5-27B are often preferred for superior character consistency, descriptive prose, and avoiding repetitive or shallow output, as the limited active parameters can reduce depth in nuanced creative tasks.[^13] The larger variants employ a sparse Mixture-of-Experts architecture to achieve efficiency in multimodal processing.[^8][^9] Hosted versions are available through Alibaba Cloud Model Studio, including Qwen3.5-Plus corresponding to the 397B model and the proprietary Qwen3.5-Flash corresponding to the 35B-A3B model.[^8][^3][^2] Additional sizes in the Qwen3.5 family are forthcoming.[^14] +The Qwen3.5 family comprises open-source multimodal large language models released in phases by the Qwen team at Alibaba Cloud, beginning with the flagship model on February 16, 2026, and the Qwen3.5 Medium Model series on February 25, 2026.[^6][^7][^2] + +The initial release is Qwen3.5-397B-A17B, featuring 397 billion total parameters and 17 billion active parameters.[^8] + +Subsequent models include Qwen3.5-122B-A10B with 122 billion total parameters and 10 billion active parameters, Qwen3.5-35B-A3B with 35 billion total parameters and 3 billion active parameters, the dense variant Qwen3.5-27B with 27 billion parameters, and Qwen3.5-9B with 9 billion parameters.[^9][^4][^10] + +The Qwen3.5-9B is a multimodal 
model supporting text and image inputs, featuring a 256K context window, strong reasoning and coding performance, and support for 201 languages and dialects. It is available for local execution on Ollama with an approximate download size of 6.6 GB via the command `ollama run qwen3.5:9b`. Quantized versions (e.g., GPTQ-Int4) enable inference on consumer GPUs with 8 GB VRAM, such as the NVIDIA RTX 4060.[^10][^11] + +The Qwen3.5-2B is a lightweight variant with 2 billion parameters, released on March 2, 2026 as part of the small models series optimized for efficient on-device inference and local deployment on consumer hardware such as mobile devices. Quantized versions enable its use on GPUs with 8 GB VRAM, such as the NVIDIA RTX 4060. It supports multimodal inputs (text and images), features a 256K context window, strong reasoning and coding performance, toggleable thinking modes, and compatibility with the MLX framework for Apple Silicon to enable smooth on-device execution. It leverages the Gated DeltaNet hybrid architecture for enhanced efficiency and performance in logic and math tasks.[^12][^1] + +The Qwen3.5-35B-A3B is a sparse Mixture-of-Experts (MoE) model. 
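As a quick sanity check on these figures, the active-to-total parameter ratios quoted for the sparse variants work out as follows (counts taken from the sizes given above):

```python
# Active vs. total parameter counts (in billions) for the sparse-MoE variants,
# as quoted in the text; "active" means parameters used per forward pass.
variants = {
    "Qwen3.5-397B-A17B": (17, 397),
    "Qwen3.5-122B-A10B": (10, 122),
    "Qwen3.5-35B-A3B": (3, 35),
}
for name, (active_b, total_b) in variants.items():
    share = active_b / total_b
    # e.g. Qwen3.5-35B-A3B activates about 8.6% of its parameters per token
    print(f"{name}: {share:.1%} of parameters active per token")
```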
While the model excels in efficiency, speed, reasoning, and coding due to its MoE design with 3B active parameters out of 35B total, community observations indicate mixed reviews for roleplay and creative writing, where denser models like Qwen3.5-27B are often preferred for superior character consistency, descriptive prose, and avoiding repetitive or shallow output, as the limited active parameters can reduce depth in nuanced creative tasks.[^13] + +The larger variants employ a sparse Mixture-of-Experts architecture to achieve efficiency in multimodal processing.[^8][^9] + +Hosted versions are available through Alibaba Cloud Model Studio, including Qwen3.5-Plus corresponding to the 397B model and the proprietary Qwen3.5-Flash corresponding to the 35B-A3B model.[^8][^3][^2] + +Additional sizes in the Qwen3.5 family are forthcoming.[^14] ### Training methodology -**Training methodology** Qwen3.5 employs an early fusion approach for its native multimodal capabilities, training directly on trillions of multimodal tokens to integrate vision and language modalities from the pretraining stage. This unified vision-language foundation enables cross-modal understanding without the need for separate late-stage alignment, outperforming previous Qwen-VL models at comparable scales across reasoning, coding, agentic, and visual tasks.[^1][^4][^3] Pretraining leverages a significantly larger and higher-quality dataset than prior Qwen models, with enriched Chinese/English, multilingual, STEM, reasoning, and visual/video content subjected to stricter filtering to ensure data cleanliness and relevance. 
This emphasis on data quality allows models such as the 397B-A17B variant to achieve performance parity with much larger (>1T parameter) predecessors.[^3][^15] The training infrastructure adopts a heterogeneous design that decouples parallelism strategies between vision and language components, enabling simultaneous computation without synchronization bottlenecks and delivering near-100% training throughput relative to text-only baselines on mixed multimodal data. This is complemented by native FP8 pipelines and stability optimizations that reduce activation memory and improve scaling stability across tens of trillions of tokens.[^3][^9][^15] Multi-step training incorporates multi-token prediction objectives, which enhance sample efficiency and downstream performance by encouraging the model to predict multiple future tokens simultaneously during pretraining.[^4] Post-training relies on scalable asynchronous reinforcement learning frameworks that support massive-scale agent environments and multi-turn interactions. These frameworks use fully disaggregated training-inference architectures, FP8 end-to-end training, rollout router replay, speculative decoding, and multi-turn rollout locking to achieve 3×–5× end-to-end speedup while maintaining gradient freshness and mitigating data skewness. This enables robust generalization across text, multimodal, and agentic tasks.[^3][^9][^15] +Qwen3.5 employs an early fusion approach for its native multimodal capabilities, training directly on trillions of multimodal tokens to integrate vision and language modalities from the pretraining stage.
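The early-fusion idea can be sketched as packing visual and text tokens into a single sequence before any transformer layer sees them, in contrast to late fusion, which merges separate encoders near the output. This is a toy illustration; the marker tokens and packing scheme are assumptions:

```python
def pack_multimodal(segments):
    """Flatten (modality, tokens) pairs into one interleaved training sequence."""
    seq = []
    for modality, tokens in segments:
        if modality == "image":
            # Image patches enter the same token stream as text, bracketed
            # by marker tokens (names here are illustrative, not Qwen's).
            seq.append("<img_start>")
            seq.extend(tokens)
            seq.append("<img_end>")
        else:
            seq.extend(tokens)
    return seq

sample = [
    ("text", ["Describe", "this", "picture", ":"]),
    ("image", ["patch_0", "patch_1", "patch_2"]),
    ("text", ["A", "red", "bicycle", "."]),
]
seq = pack_multimodal(sample)
print(seq)  # one sequence mixing text tokens and image-patch tokens
```

Because both modalities share one sequence from the first layer, every attention layer can relate words to patches directly, which is what distinguishes early fusion from bolting a vision encoder onto a finished language model.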
This unified vision-language foundation enables cross-modal understanding without the need for separate late-stage alignment, outperforming previous Qwen-VL models at comparable scales across reasoning, coding, agentic, and visual tasks.[^1][^4][^3]
+
+Pretraining leverages a significantly larger and higher-quality dataset than prior Qwen models, with enriched Chinese/English, multilingual, STEM, reasoning, and visual/video content subjected to stricter filtering to ensure data cleanliness and relevance. This emphasis on data quality allows models such as the 397B-A17B variant to achieve performance parity with much larger (>1T parameter) predecessors.[^3][^15]
+
+The training infrastructure adopts a heterogeneous design that decouples parallelism strategies between vision and language components, enabling simultaneous computation without synchronization bottlenecks and delivering near-100% training throughput relative to text-only baselines on mixed multimodal data. This is complemented by native FP8 pipelines and stability optimizations that reduce activation memory and improve scaling stability across tens of trillions of tokens.[^3][^9][^15]
+
+Pretraining additionally incorporates multi-token prediction objectives, which enhance sample efficiency and downstream performance by encouraging the model to predict multiple future tokens simultaneously.[^4]
+
+Post-training relies on scalable asynchronous reinforcement learning frameworks that support massive-scale agent environments and multi-turn interactions. These frameworks use fully disaggregated training-inference architectures, FP8 end-to-end training, rollout router replay, speculative decoding, and multi-turn rollout locking to achieve 3×–5× end-to-end speedup while maintaining gradient freshness and mitigating data skewness.
This enables robust generalization across text, multimodal, and agentic tasks.[^3][^9][^15]

Capabilities
-------------
+---

### Linguistic capabilities

-**Qwen3.5** supports **201 languages and dialects**, marking a substantial expansion from the 119 languages covered in prior Qwen series models.[^3][^1] This broad multilingual coverage enables nuanced understanding of cultural and regional nuances, facilitating inclusive global deployment across diverse linguistic contexts.[^1][^4] The models employ a vocabulary of approximately **250,000 tokens** (248,320 padded in some configurations), an increase from earlier versions that improves encoding and decoding efficiency by 10–60% across most languages.[^3][^4] Qwen3.5 demonstrates strong text-based performance in knowledge-intensive tasks, precise instruction following, long-context processing (with native support for up to 262,144 tokens, extendable further), complex reasoning, coding proficiency, and multilingual applications.[^3][^4] These strengths arise from enriched pretraining data, including expanded multilingual and STEM content, combined with the family's hybrid architecture and scalable reinforcement learning.[^3][^1]
+**Qwen3.5** supports **201 languages and dialects**, marking a substantial expansion from the 119 languages covered in prior Qwen series models.[^3][^1]
+
+This broad multilingual coverage enables a nuanced understanding of cultural and regional variation, facilitating inclusive global deployment across diverse linguistic contexts.[^1][^4]
+
+The models employ a vocabulary of approximately **250,000 tokens** (248,320 padded in some configurations), an increase from earlier versions that improves encoding and decoding efficiency by 10–60% across most languages.[^3][^4]
+
+Qwen3.5 demonstrates strong text-based performance in knowledge-intensive tasks, precise instruction following, long-context processing (with native support for up to 262,144 tokens, extendable further), complex reasoning,
coding proficiency, and multilingual applications.[^3][^4] + +These strengths arise from enriched pretraining data, including expanded multilingual and STEM content, combined with the family's hybrid architecture and scalable reinforcement learning.[^3][^1] ### Multimodal capabilities -Qwen3.5 is a native vision-language model that integrates text and visual inputs through early fusion training on trillions of multimodal tokens, enabling seamless cross-modal understanding.[^4][^1] This design supports processing of images and videos alongside text, delivering strong performance in visual reasoning and perception tasks.[^3] The model excels in comprehensive image understanding, including spatial intelligence (e.g., object counting and spatial reasoning), STEM and puzzle solving (e.g., geometric and mathematical visual problems), text recognition via robust optical character recognition, document processing (e.g., extraction and interpretation from complex layouts), and medical visual question answering (e.g., analysis of medical images).[^3] It demonstrates advanced capabilities in these areas through high scores on specialized benchmarks, such as strong results in spatial tasks (e.g., CountBench at 97.2) and document understanding (e.g., OmniDocBench1.5 at 90.8).[^3] For video understanding, Qwen3.5 supports processing of hour-scale videos, extensible to 1M token inputs via techniques like RoPE scaling, allowing analysis of extended content such as long-form footage with detailed queries (e.g., object counting or event description across frames).[^4] This enables effective handling of dynamic visual sequences, with competitive performance on video benchmarks.[^3] Compared to prior Qwen3-VL models, Qwen3.5 shows marked improvements in visual tasks through its larger scale of visual-text training data, enriched multimodal datasets, and native fusion architecture.[^4][^3] +Qwen3.5 is a native vision-language model that integrates text and visual inputs through early 
fusion training on trillions of multimodal tokens, enabling seamless cross-modal understanding.[^4][^1]
+
+This design supports processing of images and videos alongside text, delivering strong performance in visual reasoning and perception tasks.[^3]
+
+The model excels in comprehensive image understanding, including spatial intelligence (e.g., object counting and spatial reasoning), STEM and puzzle solving (e.g., geometric and mathematical visual problems), text recognition via robust optical character recognition, document processing (e.g., extraction and interpretation from complex layouts), and medical visual question answering (e.g., analysis of medical images).[^3] Specialized benchmarks reflect these capabilities, with scores such as 97.2 on CountBench for spatial tasks and 90.8 on OmniDocBench1.5 for document understanding.[^3]
+
+For video understanding, Qwen3.5 supports processing of hour-scale videos, extensible to 1M token inputs via techniques like RoPE scaling, allowing analysis of extended content such as long-form footage with detailed queries (e.g., object counting or event description across frames).[^4]
+
+This enables effective handling of dynamic visual sequences, with competitive performance on video benchmarks.[^3]
+
+Compared to prior Qwen3-VL models, Qwen3.5 shows marked improvements in visual tasks through its larger scale of visual-text training data, enriched multimodal datasets, and native fusion architecture.[^4][^3]

### Agentic capabilities

-Qwen3.5 exhibits advanced agentic capabilities, positioning it as a foundation for autonomous digital agents through native multimodal reasoning, adaptive tool use, and robust performance across agent-oriented benchmarks.
The model supports seamless multi-turn workflows and autonomous task execution, enabling it to handle complex, long-horizon interactions without framework interruptions.[^16][^3] A key strength lies in its adaptive tool use and integration. Qwen3.5 includes official built-in tools such as web search and a code interpreter, which can be dynamically invoked via parameters like `enable_search` and `enable_thinking`. It integrates with frameworks like Qwen-Agent for custom tool definition and execution, supporting use cases such as filesystem management and real-time project iteration. This enables autonomous code interpretation, multi-step planning, and tool-augmented reasoning, as demonstrated in examples where the model organizes desktops or develops websites from natural-language instructions.[^16][^9] The model demonstrates strong visual agentic abilities through native multimodal fusion. It autonomously interacts with smartphones and computers, executing actions across mobile apps and desktop workflows, such as filling spreadsheets or navigating GUI elements based on natural-language commands. These capabilities extend to embodied interfaces, with proficiency in screen-based tasks, mobile environments, and spatial reasoning for applications like robotic navigation or autonomous driving scene understanding.[^16][^3] Qwen3.5 shows potential for advanced agentic features, including persistent memory for cross-session learning, self-directed improvement mechanisms, and economic awareness for constrained operation. 
Its scalable asynchronous reinforcement learning framework supports general and search agents across multi-turn settings, enhancing generalization in real-world scenarios.[^16][^3] Representative benchmarks highlight these strengths: Qwen3.5-397B-A17B achieves 72.9 on BFCL-V4 for general tool use, 49.7 on VITA-Bench for multimodal agentic interaction, and 34.3 on DeepPlanning for advanced planning.[^16][^3] +Qwen3.5 exhibits advanced agentic capabilities, positioning it as a foundation for autonomous digital agents through native multimodal reasoning, adaptive tool use, and robust performance across agent-oriented benchmarks. The model supports seamless multi-turn workflows and autonomous task execution, enabling it to handle complex, long-horizon interactions without framework interruptions.[^16][^3] + +A key strength lies in its adaptive tool use and integration. Qwen3.5 includes official built-in tools such as web search and a code interpreter, which can be dynamically invoked via parameters like `enable_search` and `enable_thinking`. It integrates with frameworks like Qwen-Agent for custom tool definition and execution, supporting use cases such as filesystem management and real-time project iteration. This enables autonomous code interpretation, multi-step planning, and tool-augmented reasoning, as demonstrated in examples where the model organizes desktops or develops websites from natural-language instructions.[^16][^9] + +The model demonstrates strong visual agentic abilities through native multimodal fusion. It autonomously interacts with smartphones and computers, executing actions across mobile apps and desktop workflows, such as filling spreadsheets or navigating GUI elements based on natural-language commands. 
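The tool toggles mentioned above can be sketched as extra fields on an OpenAI-compatible chat request (a sketch only; the model id and the exact field placement are assumptions about the hosted API, not confirmed usage):

```python
def build_chat_request(prompt, enable_search=True, enable_thinking=True):
    """Assemble a chat-completion request body with Qwen's tool/thinking toggles."""
    return {
        "model": "qwen3.5-plus",             # hypothetical hosted model id
        "messages": [{"role": "user", "content": prompt}],
        # Toggles described in the text; real APIs may nest these differently.
        "enable_search": enable_search,      # allow the built-in web search tool
        "enable_thinking": enable_thinking,  # expose the reasoning mode
    }

body = build_chat_request("Plan a three-step refactor for my repo.")
print(sorted(body))  # → ['enable_search', 'enable_thinking', 'messages', 'model']
```

Frameworks such as Qwen-Agent layer custom tool definitions on top of the same request/response loop, with the model deciding per turn whether to emit a tool call or a final answer.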
These capabilities extend to embodied interfaces, with proficiency in screen-based tasks, mobile environments, and spatial reasoning for applications like robotic navigation or autonomous driving scene understanding.[^16][^3] + +Qwen3.5 shows potential for advanced agentic features, including persistent memory for cross-session learning, self-directed improvement mechanisms, and economic awareness for constrained operation. Its scalable asynchronous reinforcement learning framework supports general and search agents across multi-turn settings, enhancing generalization in real-world scenarios.[^16][^3] + +Representative benchmarks highlight these strengths: Qwen3.5-397B-A17B achieves 72.9 on BFCL-V4 for general tool use, 49.7 on VITA-Bench for multimodal agentic interaction, and 34.3 on DeepPlanning for advanced planning.[^16][^3] ### Creative writing and roleplay capabilities The Qwen3.5-35B-A3B is a Mixture-of-Experts (MoE) model with 35 billion total parameters and 3 billion active parameters, released by Alibaba in February 2026. It excels in efficiency, speed, reasoning, and coding due to its sparse activation design. However, it receives mixed reviews in community discussions for roleplay and creative writing tasks. Community comparisons often favor denser models such as the Qwen3.5-27B for superior character consistency, descriptive prose, and avoidance of repetitive or shallow output in roleplay and creative scenarios. 
These differences are attributed to the MoE architecture's limitation of active parameters, which can reduce depth and nuance in creative work.[^9][^1][^17] Performance ------------ +--- ### Benchmark results -Qwen3.5 models deliver competitive performance across a broad spectrum of benchmarks, frequently matching or exceeding leading proprietary models such as GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro in areas including reasoning, instruction following, coding, multimodal understanding, and agentic tasks.[^5][^15] In knowledge and reasoning benchmarks, the flagship Qwen3.5-397B-A17B achieves 87.8 on MMLU-Pro (comparable to GPT-5.2 at 87.4), 94.9 on MMLU-Redux, 93.0 on C-Eval, and 70.4 on SuperGPQA. It records 88.4 on GPQA and strong results in mathematical reasoning such as 91.3 on AIME26 and 80.9 on IMOAnswerBench.[^5] For instruction following and reliability, Qwen3.5-397B-A17B scores 92.6 on IFEval and leads with 76.5 on IFBench (surpassing GPT-5.2 at 75.4 and far ahead of Claude 4.5 Opus at 58.0). In long-context evaluation, it attains 63.2 on LongBench v2 (ahead of GPT-5.2 at 54.5). Smaller variants also perform robustly, with Qwen3.5-27B reaching 95.0 on IFEval, 76.5 on IFBench, and 86.1 on MMLU-Pro.[^5][^4][^15] In coding and agentic capabilities, Qwen3.5-397B-A17B scores 72.9 on BFCL-V4 (competitive with Gemini-3 Pro at 72.5) and up to 78.6 on BrowseComp (substantially ahead of Gemini-3 Pro at 59.2). It further achieves 72.9 on various tool-use and agent benchmarks. The Qwen3.5-27B variant records 72.4 on SWE-bench Verified and 80.7 on LiveCodeBench v6. Agentic search stands out as a particular strength for the family.[^5][^4][^15] Multimodal performance is notable, with native vision-language integration yielding high marks in visual and video understanding. Qwen3.5-27B scores 82.3 on MMMU, 86.0 on MathVision, 83.7 on RealWorldQA, and 87.0 on VideoMME (with subtitles). 
The family excels in document and embodied reasoning, with strong results in benchmarks such as OmniDocBench and ERQA, often on par with or surpassing frontier models in practical multimodal and agentic workflows.[^4][^15] As of February 2026, Qwen3.5, DeepSeek V3.2, and GLM-5 stand out as leading open-weight large language models from Chinese laboratories. GLM-5 frequently ranks first overall (e.g., Quality Index 49.64), excelling in reasoning and coding with a 200k-203k context window. DeepSeek V3.2 performs strongly in mathematical reasoning (AIME 2025: 92%), coding (LiveCodeBench: 86%), and MMLU-Pro (86%), with specialized variants like V3.2-Speciale topping certain leaderboards. Qwen3.5 models (various sizes, e.g., 27B/35B) are competitive in general knowledge (MMLU-Pro ~84.6% to 87.8% depending on variant), conversational abilities (Arena Elo ~1342), and coding, typically placing in the top 2-6 across leaderboards. Rankings vary by leaderboard (e.g., GLM-5 leads in some overall quality indices; DeepSeek variants excel in specialized coding/math; Qwen3.5 provides balanced performance), and no single model dominates all categories.[^18][^19][^16] As of February 2026, the MoE variant Qwen3.5-35B-A3B (35B total parameters, 3B active parameters) scores 37 on the Artificial Analysis Intelligence Index, achieving a high ranking. It performs competitively with Anthropic's Claude 4 series models (e.g., Claude 4 Sonnet, Claude Opus 4.5), outperforming Claude Sonnet 4.5 in MMMLU (knowledge) and MMMU-Pro (visual reasoning), while trailing Claude Opus 4.5 in data visualization tasks (163/200 vs 173/200). It provides faster output at 172.7 tokens/s with a 262k context window (compared to 1000k for Claude 4 Sonnet). 
As an open-source model, it offers significant advantages in cost-efficiency and suitability for local use.[^20][^21] These outcomes position Qwen3.5 as one of the most capable open-weight multimodal models, particularly in agentic and vision-language domains, while maintaining efficiency advantages over denser alternatives.[^5] +Qwen3.5 models deliver competitive performance across a broad spectrum of benchmarks, frequently matching or exceeding leading proprietary models such as GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro in areas including reasoning, instruction following, coding, multimodal understanding, and agentic tasks.[^5][^15] + +In knowledge and reasoning benchmarks, the flagship Qwen3.5-397B-A17B achieves 87.8 on MMLU-Pro (comparable to GPT-5.2 at 87.4), 94.9 on MMLU-Redux, 93.0 on C-Eval, and 70.4 on SuperGPQA. It records 88.4 on GPQA and strong results in mathematical reasoning such as 91.3 on AIME26 and 80.9 on IMOAnswerBench.[^5] + +For instruction following and reliability, Qwen3.5-397B-A17B scores 92.6 on IFEval and leads with 76.5 on IFBench (surpassing GPT-5.2 at 75.4 and far ahead of Claude 4.5 Opus at 58.0). In long-context evaluation, it attains 63.2 on LongBench v2 (ahead of GPT-5.2 at 54.5). Smaller variants also perform robustly, with Qwen3.5-27B reaching 95.0 on IFEval, 76.5 on IFBench, and 86.1 on MMLU-Pro.[^5][^4][^15] + +In coding and agentic capabilities, Qwen3.5-397B-A17B scores 72.9 on BFCL-V4 (competitive with Gemini-3 Pro at 72.5) and up to 78.6 on BrowseComp (substantially ahead of Gemini-3 Pro at 59.2). It further achieves 72.9 on various tool-use and agent benchmarks. The Qwen3.5-27B variant records 72.4 on SWE-bench Verified and 80.7 on LiveCodeBench v6. Agentic search stands out as a particular strength for the family.[^5][^4][^15] + +Multimodal performance is notable, with native vision-language integration yielding high marks in visual and video understanding. 
Qwen3.5-27B scores 82.3 on MMMU, 86.0 on MathVision, 83.7 on RealWorldQA, and 87.0 on VideoMME (with subtitles). The family excels in document and embodied reasoning, with strong results in benchmarks such as OmniDocBench and ERQA, often on par with or surpassing frontier models in practical multimodal and agentic workflows.[^4][^15] + +As of February 2026, Qwen3.5, DeepSeek V3.2, and GLM-5 stand out as leading open-weight large language models from Chinese laboratories. GLM-5 frequently ranks first overall (e.g., Quality Index 49.64), excelling in reasoning and coding with a 200k-203k context window. DeepSeek V3.2 performs strongly in mathematical reasoning (AIME 2025: 92%), coding (LiveCodeBench: 86%), and MMLU-Pro (86%), with specialized variants like V3.2-Speciale topping certain leaderboards. Qwen3.5 models (various sizes, e.g., 27B/35B) are competitive in general knowledge (MMLU-Pro ~84.6% to 87.8% depending on variant), conversational abilities (Arena Elo ~1342), and coding, typically placing in the top 2-6 across leaderboards. Rankings vary by leaderboard (e.g., GLM-5 leads in some overall quality indices; DeepSeek variants excel in specialized coding/math; Qwen3.5 provides balanced performance), and no single model dominates all categories.[^18][^19][^16] + +As of February 2026, the MoE variant Qwen3.5-35B-A3B (35B total parameters, 3B active parameters) scores 37 on the Artificial Analysis Intelligence Index, achieving a high ranking. It performs competitively with Anthropic's Claude 4 series models (e.g., Claude 4 Sonnet, Claude Opus 4.5), outperforming Claude Sonnet 4.5 in MMMLU (knowledge) and MMMU-Pro (visual reasoning), while trailing Claude Opus 4.5 in data visualization tasks (163/200 vs 173/200). It provides faster output at 172.7 tokens/s with a 262k context window (compared to 1000k for Claude 4 Sonnet). 
As an open-source model, it offers significant advantages in cost-efficiency and suitability for local use.[^20][^21] + +These outcomes position Qwen3.5 as one of the most capable open-weight multimodal models, particularly in agentic and vision-language domains, while maintaining efficiency advantages over denser alternatives.[^5] ### Efficiency and inference -Qwen3.5 models achieve superior inference efficiency through a combination of architectural innovations and targeted optimizations. The hybrid architecture, which integrates linear attention via Gated Delta Networks with a sparse Mixture-of-Experts (MoE) design, enables high decoding throughput while reducing computational and memory demands.[^3] The sparse MoE approach in models such as Qwen3.5-397B-A17B activates only 17 billion parameters per forward pass out of a total 397 billion, significantly lowering the cost of inference without compromising capability. This design supports cost-effective deployment, particularly for large-scale or resource-constrained environments.[^3] Similarly, the Qwen3.5-35B-A3B model (35 billion total parameters, 3 billion active) employs a sparse MoE design to achieve an output speed of 172.7 tokens per second and a 262,000 token context window, enabling high inference efficiency. This performance is competitive with leading proprietary models, offering faster output speeds in many scenarios while supporting substantial context lengths suitable for complex tasks.[^20] Compared to Anthropic's Claude 4 Sonnet, which provides a larger 1,000,000 token context window, the Qwen3.5-35B-A3B model delivers faster output throughput alongside cost advantages. As an open-weights model released under the Apache 2.0 license, it excels in cost-efficient local and on-premise deployments compared to proprietary alternatives.[^22][^20] Community reports highlight the Qwen3.5-27B variant as a particularly efficient and high-performing model. 
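The cost advantage of sparse activation can be sketched with the common ~2·N FLOPs-per-token rule of thumb for a decoder forward pass (a standard approximation used here purely for illustration, not an official Qwen figure):

```python
# Back-of-the-envelope decode cost for a sparse MoE model versus a
# hypothetical dense model of the same total size, using the common
# ~2 * N_params FLOPs-per-token approximation for one forward pass.

def decode_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs for one generated token."""
    return 2.0 * active_params

TOTAL = 397e9   # Qwen3.5-397B-A17B: total parameters
ACTIVE = 17e9   # ...of which only 17B are activated per forward pass

sparse = decode_flops_per_token(ACTIVE)
dense = decode_flops_per_token(TOTAL)
print(f"sparse MoE: {sparse:.2e} FLOPs/token")
print(f"dense 397B: {dense:.2e} FLOPs/token")
print(f"reduction:  {dense / sparse:.1f}x")  # ~23.4x fewer FLOPs per token
```

This active-parameter ratio, rather than total parameter count, is what drives per-token decode cost, and it is part of the efficiency context behind community interest in the Qwen3.5-27B variant.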
It features the hybrid architecture and a 262,000 token context window, delivering strong results in coding, reasoning, and output consistency, often outperforming larger models such as the Qwen3.5-35B-A3B in practical tasks like long-form code generation and complex problem-solving. This model has been adopted for local workflows due to its privacy benefits—no external data transmission is required—and unlimited usage without token limits or service restrictions. Inference efficiency permits practical local execution on consumer hardware, including 5-bit quantization on an NVIDIA RTX 3090 (24 GB VRAM) at usable speeds. Newer GPUs like the RTX 5090 (32 GB VRAM) support models up to approximately 40 billion parameters comfortably. Smaller variants of Qwen3.5 (e.g., Qwen3.5-0.8B, 2B, 4B, 9B) and quantized versions (e.g., GPTQ-Int4) of mid-sized models can run effectively on GPUs with 8 GB VRAM, such as the NVIDIA RTX 4060. In contrast, larger MoE models (e.g., Qwen3.5-397B-A17B) require significantly more VRAM and are unlikely to run effectively on 8 GB hardware. Common inference frameworks for these setups include vLLM, where GPTQ quantization has demonstrated faster performance than AWQ, and TensorRT-LLM.[^23][^24] However, community discussions on Reddit's r/LocalLLaMA subreddit indicate that pure CPU-only inference for the Qwen3.5-27B remains slow, with one user measuring 1.9 tokens per second using Q6 quantization on CPU (likely via llama.cpp). Switching to Q4\_K\_M quantization with 55 GPU layers offloaded improved performance to 7.3 tokens per second for simple prompts. No direct benchmarks for pure CPU (no GPU offload) with Q4\_K\_M quantization were found in the discussions.[^25] Decoding throughput is markedly improved relative to prior generations. For the Qwen3.5-397B-A17B variant, throughput reaches 8.6 times that of Qwen3-Max under a 32k context length and 19.0 times that of Qwen3-Max under a 256k context length, while delivering comparable performance. 
Compared to Qwen3-235B-A22B, the same model offers 3.5 times higher throughput at 32k context and 7.2 times higher at 256k context.[^3] A native FP8 pipeline applies low-precision computations to activations, MoE routing, and key operations, yielding approximately 50% reduction in activation memory and over 10% speedup. Runtime monitoring preserves higher precision in sensitive layers to ensure stability during long scaling runs.[^3] Additionally, a scalable asynchronous reinforcement learning (RL) framework tailored for Qwen3.5 supports text, multimodal, and multi-turn scenarios, delivering 3× to 5× end-to-end speedups through techniques including FP8 end-to-end training, rollout router replay, speculative decoding, and multi-turn rollout locking. This framework further enhances hardware utilization, dynamic load balancing, and training stability.[^3] Developer demonstrations have shown that the Qwen3.5-2B model achieves notable on-device inference efficiency on mobile hardware. It runs smoothly on recent iPhones, such as the iPhone 17 Pro, using Apple's MLX framework with 6-bit quantization to fit within device RAM (e.g., 12 GB) and leveraging the Neural Engine for accelerated inference. MLX provides up to 2x speed gains over generic frameworks. Similar advantages extend to larger models such as the Qwen3.5-9B on Apple Silicon Macs, where MLX implementations generally provide significantly faster inference than llama.cpp-based (GGUF) setups, with reports of up to 2x faster token generation and 21-87% higher throughput due to MLX's optimization for unified memory and Apple hardware. No direct head-to-head speed comparison for Qwen3.5-9B specifically on Mac M1 exists in reliable sources, and specific benchmarks on older M1 hardware are limited, with most focusing on newer M-series chips. 
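The memory arithmetic behind these on-device demonstrations is straightforward: quantized weight storage is roughly parameter count times bits per weight. A minimal sketch (illustrative only; it ignores the KV cache, activations, and framework overhead):

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB: params * bits / 8 bytes,
    ignoring KV cache, activations, and runtime overhead."""
    return params * bits_per_weight / 8 / 1e9

# Qwen3.5-2B at the 6-bit MLX quantization described above:
print(f"2B @ 6-bit: {weight_gb(2e9, 6):.2f} GB")  # ~1.5 GB, well within 12 GB of iPhone RAM
# Qwen3.5-9B at the same precision on an Apple Silicon Mac:
print(f"9B @ 6-bit: {weight_gb(9e9, 6):.2f} GB")  # ~6.75 GB
```

At these sizes the 2B model leaves ample headroom for the KV cache on a 12 GB device, and even the 9B model fits comfortably in unified memory on most Apple Silicon Macs.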
The Qwen3.5-2B model supports toggleable reasoning modes, including a Non-Thinking mode for rapid responses to simple tasks and a Thinking mode for complex problems involving hidden reasoning tokens. According to the official model card for Qwen3.5-397B-A17B, the recommended sampling parameters are temperature 0.6 for thinking mode (with top\_p=0.95, top\_k=20) and temperature 0.7 for instruct (non-thinking) mode (with top\_p=0.8, top\_k=20, presence\_penalty=1.5). These are provided as recommended parameters for optimal generation, with support varying by inference framework.[^8] The 2B model also outperforms larger models in logic and mathematics tasks, a result attributed to its Gated DeltaNet hybrid architecture. Specific quantitative benchmarks, such as tokens per second, are not reported in available sources.[^26][^27] +Qwen3.5 models achieve superior inference efficiency through a combination of architectural innovations and targeted optimizations. The hybrid architecture, which integrates linear attention via Gated Delta Networks with a sparse Mixture-of-Experts (MoE) design, enables high decoding throughput while reducing computational and memory demands.[^3] + +The sparse MoE approach in models such as Qwen3.5-397B-A17B activates only 17 billion parameters per forward pass out of a total 397 billion, significantly lowering the cost of inference without compromising capability. This design supports cost-effective deployment, particularly for large-scale or resource-constrained environments.[^3] + +Similarly, the Qwen3.5-35B-A3B model (35 billion total parameters, 3 billion active) employs a sparse MoE design to achieve an output speed of 172.7 tokens per second and a 262,000 token context window, enabling high inference efficiency.
This performance is competitive with leading proprietary models, offering faster output speeds in many scenarios while supporting substantial context lengths suitable for complex tasks.[^20] + +Compared to Anthropic's Claude 4 Sonnet, which provides a larger 1,000,000 token context window, the Qwen3.5-35B-A3B model delivers faster output throughput alongside cost advantages. As an open-weights model released under the Apache 2.0 license, it excels in cost-efficient local and on-premise deployments compared to proprietary alternatives.[^22][^20] + +Community reports highlight the Qwen3.5-27B variant as a particularly efficient and high-performing model. It features the hybrid architecture and a 262,000 token context window, delivering strong results in coding, reasoning, and output consistency, often outperforming larger models such as the Qwen3.5-35B-A3B in practical tasks like long-form code generation and complex problem-solving. This model has been adopted for local workflows due to its privacy benefits—no external data transmission is required—and unlimited usage without token limits or service restrictions. Inference efficiency permits practical local execution on consumer hardware, including 5-bit quantization on an NVIDIA RTX 3090 (24 GB VRAM) at usable speeds. Newer GPUs like the RTX 5090 (32 GB VRAM) support models up to approximately 40 billion parameters comfortably. Smaller variants of Qwen3.5 (e.g., Qwen3.5-0.8B, 2B, 4B, 9B) and quantized versions (e.g., GPTQ-Int4) of mid-sized models can run effectively on GPUs with 8 GB VRAM, such as the NVIDIA RTX 4060. In contrast, larger MoE models (e.g., Qwen3.5-397B-A17B) require significantly more VRAM and are unlikely to run effectively on 8 GB hardware. 
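These hardware-fit claims can be approximated with a crude check: quantized weight bytes, plus an allowance for the KV cache and buffers, against available VRAM. The 20% overhead factor below is an assumption for illustration, and note that MoE models typically need all expert weights resident in memory even though only a few experts are active per token:

```python
def fits(params_b: float, bits: float, vram_gb: float, overhead: float = 1.2) -> bool:
    """Crude fit check: quantized weights (params in billions * bits/8
    gives GB directly) plus ~20% allowance for KV cache and buffers
    must fit in VRAM. Purely illustrative, not a sizing guarantee."""
    needed_gb = params_b * bits / 8 * overhead
    return needed_gb <= vram_gb

print(fits(27, 5, 24))   # 27B @ 5-bit on an RTX 3090 (24 GB)  -> True
print(fits(9, 4, 8))     # 9B @ 4-bit (e.g., GPTQ-Int4), 8 GB  -> True
print(fits(397, 4, 8))   # 397B MoE weights on an 8 GB GPU     -> False
```

Consistent with the reports above, the 27B model at 5-bit fits a 24 GB card, while the 397B MoE does not come close to fitting in 8 GB.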
Common inference frameworks for these setups include vLLM, where GPTQ quantization has demonstrated faster performance than AWQ, and TensorRT-LLM.[^23][^24] + +However, community discussions on Reddit's r/LocalLLaMA subreddit indicate that pure CPU-only inference for the Qwen3.5-27B remains slow, with one user measuring 1.9 tokens per second using Q6 quantization on CPU (likely via llama.cpp). Switching to Q4\_K\_M quantization with 55 GPU layers offloaded improved performance to 7.3 tokens per second for simple prompts. No direct benchmarks for pure CPU (no GPU offload) with Q4\_K\_M quantization were found in the discussions.[^25] + +Decoding throughput is markedly improved relative to prior generations. For the Qwen3.5-397B-A17B variant, throughput reaches 8.6 times that of Qwen3-Max under a 32k context length and 19.0 times that of Qwen3-Max under a 256k context length, while delivering comparable performance. Compared to Qwen3-235B-A22B, the same model offers 3.5 times higher throughput at 32k context and 7.2 times higher at 256k context.[^3] + +A native FP8 pipeline applies low-precision computations to activations, MoE routing, and key operations, yielding approximately 50% reduction in activation memory and over 10% speedup. Runtime monitoring preserves higher precision in sensitive layers to ensure stability during long scaling runs.[^3] + +Additionally, a scalable asynchronous reinforcement learning (RL) framework tailored for Qwen3.5 supports text, multimodal, and multi-turn scenarios, delivering 3× to 5× end-to-end speedups through techniques including FP8 end-to-end training, rollout router replay, speculative decoding, and multi-turn rollout locking. This framework further enhances hardware utilization, dynamic load balancing, and training stability.[^3] + +Developer demonstrations have shown that the Qwen3.5-2B model achieves notable on-device inference efficiency on mobile hardware. 
It runs smoothly on recent iPhones, such as the iPhone 17 Pro, using Apple's MLX framework with 6-bit quantization to fit within device RAM (e.g., 12 GB) and leveraging the Neural Engine for accelerated inference. MLX provides up to 2× speed gains over generic frameworks. Similar advantages extend to larger models such as the Qwen3.5-9B on Apple Silicon Macs, where MLX implementations generally provide significantly faster inference than llama.cpp-based (GGUF) setups, with reports of up to 2× faster token generation and 21-87% higher throughput due to MLX's optimization for unified memory and Apple hardware. No direct head-to-head speed comparison for Qwen3.5-9B specifically on Mac M1 exists in reliable sources, and specific benchmarks on older M1 hardware are limited, with most focusing on newer M-series chips. The Qwen3.5-2B model supports toggleable reasoning modes, including a Non-Thinking mode for rapid responses to simple tasks and a Thinking mode for complex problems involving hidden reasoning tokens. According to the official model card for Qwen3.5-397B-A17B, the recommended sampling parameters are temperature 0.6 for thinking mode (with top\_p=0.95, top\_k=20) and temperature 0.7 for instruct (non-thinking) mode (with top\_p=0.8, top\_k=20, presence\_penalty=1.5). These are provided as recommended parameters for optimal generation, with support varying by inference framework.[^8] The 2B model also outperforms larger models in logic and mathematics tasks, a result attributed to its Gated DeltaNet hybrid architecture.
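The recommended sampling parameters above can be captured as request presets for an OpenAI-compatible endpoint. A minimal sketch (the preset names and helper function are illustrative, not part of any official SDK; top\_k and presence\_penalty support varies by serving framework):

```python
# Sampling presets per the Qwen3.5-397B-A17B model card, expressed as
# keyword arguments for an OpenAI-compatible chat completion request.
# The preset names and helper below are illustrative conventions.

SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "instruct": {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                 "presence_penalty": 1.5},
}

def request_kwargs(mode: str) -> dict:
    """Return a copy of the sampling kwargs for the given mode."""
    return dict(SAMPLING_PRESETS[mode])

print(request_kwargs("thinking"))
```

With a framework such as vLLM or SGLang, these values would be merged into the generation request; frameworks that do not expose top\_k or presence\_penalty directly simply ignore or reroute those fields.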
Specific quantitative benchmarks, such as tokens per second, are not reported in available sources.[^26][^27] Availability ------------ ### Open-source release -The Qwen3.5 series was released as open-weight models under the Apache 2.0 license, with the initial release of the Qwen3.5-397B-A17B model on February 16, 2026, followed by the Qwen3.5 Medium Model series—including the open-source Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, and Qwen3.5-27B (plus a base variant)—on February 25, 2026, with the proprietary Qwen3.5-Flash available via API.[^1] Model weights and configuration files are available for download from the Hugging Face Hub under the Qwen organization[^28] and from ModelScope, where users in regions with restricted Hugging Face access can obtain them by setting appropriate environment variables or using direct download commands.[^29] In addition, Unsloth Dynamic (UD) GGUF quantized versions of the Qwen3.5-35B-A3B model, such as UD-Q4\_K\_XL and UD-Q3\_K\_XL, are available on Hugging Face under the unsloth organization. 
These quantizations offer optimized trade-offs between accuracy and model size and are designed for efficient local inference using llama.cpp.[^30] The models can be accessed interactively through the Qwen Chat web interface at chat.qwen.ai, which supports multiple response modes, or via the Alibaba Cloud Model Studio API, which offers OpenAI-compatible endpoints.[^16] Additionally, the Qwen3.5-27B model is available for interactive use on HuggingChat.[^31] For local inference and deployment, Qwen3.5 supports a range of frameworks including Hugging Face Transformers for general use, vLLM and SGLang for high-throughput serving with OpenAI-compatible APIs, llama.cpp for efficient CPU/GPU inference, and MLX for Apple Silicon optimization.[^1] +The Qwen3.5 series was released as open-weight models under the Apache 2.0 license, with the initial release of the Qwen3.5-397B-A17B model on February 16, 2026, followed by the Qwen3.5 Medium Model series—including the open-source Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, and Qwen3.5-27B (plus a base variant)—on February 25, 2026, with the proprietary Qwen3.5-Flash available via API.[^1] + +Model weights and configuration files are available for download from the Hugging Face Hub under the Qwen organization[^28] and from ModelScope, where users in regions with restricted Hugging Face access can obtain them by setting appropriate environment variables or using direct download commands.[^29] In addition, Unsloth Dynamic (UD) GGUF quantized versions of the Qwen3.5-35B-A3B model, such as UD-Q4\_K\_XL and UD-Q3\_K\_XL, are available on Hugging Face under the unsloth organization. 
These quantizations offer optimized trade-offs between accuracy and model size and are designed for efficient local inference using llama.cpp.[^30] + +The models can be accessed interactively through the Qwen Chat web interface at chat.qwen.ai, which supports multiple response modes, or via the Alibaba Cloud Model Studio API, which offers OpenAI-compatible endpoints.[^16] Additionally, the Qwen3.5-27B model is available for interactive use on HuggingChat.[^31] + +For local inference and deployment, Qwen3.5 supports a range of frameworks including Hugging Face Transformers for general use, vLLM and SGLang for high-throughput serving with OpenAI-compatible APIs, llama.cpp for efficient CPU/GPU inference, and MLX for Apple Silicon optimization.[^1] ### Platform integrations -Qwen3.5 models support seamless integration across several popular open-source frameworks for local deployment and inference. They are fully compatible with the Hugging Face Transformers library, which enables direct inference, lightweight server deployment, and the creation of OpenAI-compatible API endpoints.[^1][^9] High-performance inference is available through engines such as vLLM and SGLang, both of which provide efficient, memory-optimized serving with OpenAI-compatible APIs and support for advanced features like long context handling up to 262144 tokens (or extended via RoPE scaling).[^1][^9] Compatibility also extends to KTransformers for optimized inference using CPU-GPU heterogeneous computing.[^9] Hosted API access is provided via Alibaba Cloud Model Studio, which offers first-class support for Qwen3.5 models through OpenAI-compatible interfaces, including Chat Completions and Responses APIs, with the hosted Qwen3.5-Flash variant (corresponding to Qwen3.5-35B-A3B) featuring a 1 million token context length by default and built-in production tools.[^1][^9] The models integrate with agent frameworks such as Qwen-Agent, which uses OpenAI-compatible endpoints to enable capabilities like 
tool calling, planning, memory management, and multi-step task execution.[^32][^1] Qwen3.5 models also support on-device inference on Apple hardware via the MLX framework. Demonstrations show the Qwen3.5-2B model, quantized to 6 bits, running efficiently on recent iPhones (such as the iPhone 17 Pro), fitting within device RAM (e.g., 12 GB) and leveraging the Neural Engine for accelerated inference. MLX provides up to 2× speed gains over generic frameworks. The model runs smoothly, supports toggleable reasoning modes (Thinking and Non-Thinking), and exhibits strong performance in logic and mathematics tasks attributed to its Gated DeltaNet hybrid architecture.[^1][^26] +Qwen3.5 models support seamless integration across several popular open-source frameworks for local deployment and inference. They are fully compatible with the Hugging Face Transformers library, which enables direct inference, lightweight server deployment, and the creation of OpenAI-compatible API endpoints.[^1][^9] + +High-performance inference is available through engines such as vLLM and SGLang, both of which provide efficient, memory-optimized serving with OpenAI-compatible APIs and support for advanced features like long context handling up to 262144 tokens (or extended via RoPE scaling).[^1][^9] + +Compatibility also extends to KTransformers for optimized inference using CPU-GPU heterogeneous computing.[^9] Hosted API access is provided via Alibaba Cloud Model Studio, which offers first-class support for Qwen3.5 models through OpenAI-compatible interfaces, including Chat Completions and Responses APIs, with the hosted Qwen3.5-Flash variant (corresponding to Qwen3.5-35B-A3B) featuring a 1 million token context length by default and built-in production tools.[^1][^9] + +The models integrate with agent frameworks such as Qwen-Agent, which uses OpenAI-compatible endpoints to enable capabilities like tool calling, planning, memory management, and multi-step task execution.[^32][^1] + +Qwen3.5 
models also support on-device inference on Apple hardware via the MLX framework. Demonstrations show the Qwen3.5-2B model, quantized to 6 bits, running efficiently on recent iPhones (such as the iPhone 17 Pro), fitting within device RAM (e.g., 12 GB) and leveraging the Neural Engine for accelerated inference. MLX provides up to 2× speed gains over generic frameworks. The model runs smoothly, supports toggleable reasoning modes (Thinking and Non-Thinking), and exhibits strong performance in logic and mathematics tasks attributed to its Gated DeltaNet hybrid architecture.[^1][^26] ### Ollama support