On March 26, as the controversy over his departure earlier that month was subsiding (see "Alibaba's Qwen Leadership Shake-up: Junyang Lin Steps Aside as the LLM Race Bids Farewell to Its Hero Era"), Junyang Lin, hailed as Alibaba's youngest P10 and the soul of the Qwen large-model effort, published a long essay on X, "From 'Reasoning' Thinking to 'Agentic' Thinking," systematically laying out his analysis of how AI's technical paradigms are evolving. In it, Lin not only takes stock of the past but also points to the next focus of AI competition: an agentic era that moves beyond head-to-head model comparisons and turns on systems, environments, and coordination.
Lin defines 2024-2025 as the "reasoning thinking" phase, represented by OpenAI's o1 and DeepSeek-R1, whose core achievement was proving that "thinking" can be trained for and delivered as a first-class capability. The essence of this phase was using reinforcement learning (RL) to obtain deterministic feedback in verifiable domains such as math and code, so that models "optimize for correctness rather than plausibility." Behind this, however, lies a massive infrastructure challenge: reasoning RL has evolved from a lightweight fine-tuning add-on into a systems-engineering problem demanding large-scale rollouts and high-throughput verification.
The real difficulty goes further still. The second part of the essay examines the practical dilemma of fusing "thinking mode" with "instruct mode." The analysis also mirrors commercial reality: after Alibaba attempted the fusion in Qwen3, the subsequent 2507 releases shipped Instruct and Thinking as separate versions, because many customers still needed cost-effective, highly controllable instruct behavior for batch workloads.
The essay proposes "agentic thinking" as the core paradigm of next-generation AI. This marks a shift in the center of training from the model itself to the model-environment system. The heart of agentic thinking is "thinking in order to act," and it must handle problems that pure reasoning models never face: deciding when to act, which tools to call, how to process uncertain feedback from the environment, how to revise plans after failure, and how to stay coherent across multi-turn interaction.
Lin argues that in the reasoning era, advantage came from better RL algorithms and feedback signals, while in the agentic era, competitive advantage will rest on better environment design, tighter training-serving integration, and stronger multi-agent orchestration engineering. The environment itself becomes a first-class citizen; its stability, realism, feedback richness, and resistance to overfitting are critical. At the same time, multi-agent organizational architectures, systems composed of planners, domain experts, and executor sub-agents, will become a core source of intelligence.
The essay is not frontier research, yet it drew considerable attention on publication. For most readers, the interest lies in how the former core lead of the Qwen large-model team understands where AI is heading now, and perhaps in what it hints about the startup or research direction he favors next.
The full text follows (Chinese translation by Qwen):
From "Reasoning" Thinking to "Agentic" Thinking
The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.
That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.
1. What the Rise of o1 and R1 Actually Taught Us
The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.
Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.
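The contrast between "correctness" and "plausibility" can be made concrete. Below is a minimal, hypothetical sketch of a verifiable reward for math-style RL post-training: unlike a learned preference model, the signal is deterministic, because a sampled answer either matches the ground truth or it does not. The answer-extraction convention and the rollout format are illustrative assumptions, not any lab's actual pipeline.

```python
# Hypothetical sketch: a deterministic, verifiable reward signal for RL
# post-training in a checkable domain like math.

def extract_final_answer(completion: str) -> str:
    """Take the last non-empty line of a completion as the final answer
    (an illustrative convention; real pipelines use stricter formats)."""
    return completion.strip().splitlines()[-1].strip()

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 for an exact-match correct answer, 0.0 otherwise.
    The signal is deterministic: no preference model, no noise."""
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

# Scoring a rollout batch (the policy update itself is omitted):
rollouts = [
    ("Let x = 17 + 25.\n42", "42"),   # correct final answer
    ("Let x = 17 + 25.\n41", "42"),   # incorrect final answer
]
rewards = [verifiable_reward(c, gt) for c, gt in rollouts]
print(rewards)  # [1.0, 0.0]
```

At scale, the hard part is everything around this function: generating rollouts at high throughput, verifying them fast enough, and keeping policy updates stable, which is why the essay calls reasoning RL a systems problem.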
2. The Real Problem Was Never Just "Merge Thinking and Instruct"
At the beginning of 2025, many of us on the Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.
Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.
But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.
We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.
These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.
Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.
Other labs chose the opposite route. Anthropic publicly argued for an integrated model philosophy: Claude 3.7 Sonnet was introduced as a hybrid reasoning model where users could choose ordinary responses or extended thinking, and API users could set a thinking budget. Anthropic explicitly said they believed reasoning should be an integrated capability rather than a separate model. GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, coding, and agent capabilities; DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid inference.
The key question is whether the merge is organic. If thinking and instruct are merely co-located inside one checkpoint but still behave like two awkwardly stitched personalities, the product experience remains unnatural. A truly successful merge requires a smooth spectrum of reasoning effort. The model should be able to express multiple levels of effort, and ideally choose among them adaptively. GPT-style effort control points toward this: a policy over compute, rather than a binary switch.
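What "a policy over compute, rather than a binary switch" might look like can be sketched in a few lines. The sketch below maps an estimated difficulty score to a reasoning-token budget; the difficulty heuristic, the budget tiers, and all names here are illustrative assumptions, not any vendor's actual API (real systems would use a learned predictor rather than keyword matching).

```python
# Hypothetical sketch: reasoning effort as a policy over compute.
# Instead of a binary thinking on/off switch, the model (or router)
# selects among several token budgets based on estimated difficulty.

EFFORT_BUDGETS = {"low": 0, "medium": 2048, "high": 16384}  # illustrative tiers

def estimate_difficulty(prompt: str) -> float:
    """Toy stand-in for a learned difficulty predictor."""
    hard_markers = ("prove", "optimize", "debug", "derive")
    score = 0.3 if len(prompt) > 200 else 0.0
    score += 0.5 if any(m in prompt.lower() for m in hard_markers) else 0.0
    return min(score, 1.0)

def choose_effort(prompt: str) -> str:
    """Map estimated difficulty to an effort tier."""
    d = estimate_difficulty(prompt)
    if d >= 0.5:
        return "high"
    if d >= 0.3:
        return "medium"
    return "low"

print(choose_effort("Rewrite this sentence politely."))       # low
print(choose_effort("Prove that the algorithm terminates."))  # high
```

An "organic" merge, in the essay's sense, would push this decision inside the model itself, so the spectrum of effort is smooth rather than two stitched-together personalities.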
3. Why Anthropic's Direction Was a Useful Corrective
Anthropic's public framing around Claude 3.7 and Claude 4 was restrained. They emphasized integrated reasoning, user-controlled thinking budgets, real-world tasks, coding quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 extended that by allowing reasoning to interleave with tool use, while Anthropic simultaneously emphasized coding, long-running tasks, and agent workflows as primary goals.
Producing a longer reasoning trace doesn't automatically make a model more intelligent. In many cases, excessive visible reasoning signals weak allocation. If the model is trying to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress, or failing to act. Anthropic's trajectory suggested a more disciplined view: thinking should be shaped by the target workload. If the target is coding, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the target is agent workflows, then thinking should improve execution quality over long horizons rather than producing impressive intermediate prose.
This emphasis on targeted utility points toward something larger: we are moving from the era of training models to the era of training agents. We made this explicit in the Qwen3 blog, writing that "we are transitioning from an era focused on training models to one centered on training agents," and linking future RL advances to environmental feedback for long-horizon reasoning. An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is
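The plan-act-observe-revise cycle described above can be sketched as a loop. Everything here is a hypothetical stand-in (the fixed plan, the tool registry, the string-based failure check), intended only to show the structural difference from a pure reasoning model: the agent interleaves computation with actions and environment feedback.

```python
# Hypothetical sketch of an agentic loop: formulate a plan, act through
# tools, perceive environment feedback, and revise on failure.

def agent_loop(goal: str, tools: dict, max_steps: int = 5) -> list:
    """Run a (toy) plan against a tool registry, recording a trace."""
    trace = []
    plan = ["search", "summarize"]  # a real agent would generate and revise this
    for step, tool_name in enumerate(plan):
        if step >= max_steps:       # keep long horizons bounded
            break
        observation = tools[tool_name](goal)   # act, then perceive feedback
        trace.append((tool_name, observation))
        if "error" in observation:             # revise strategy on failure
            trace.append(("replan", "retrying with a different tool"))
    return trace

# Stand-in tools simulating an environment:
tools = {
    "search": lambda g: f"3 documents found for '{g}'",
    "summarize": lambda g: f"summary of results for '{g}'",
}
trace = agent_loop("agentic thinking", tools)
print(len(trace))  # 2
```

The essay's point is that the interesting training signal now lives in this loop, in the environment's responses and the agent's revisions, rather than solely in the length of a reasoning trace.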