From Reasoning to Innovation: How AI Is Poised to Solve Humanity’s Hardest Problems

On 17 May 2025, following his foundational lecture on AI Agents and Intelligent Automation, Mr. Shek Ka-wai, founder of OMP and generative AI expert, continued to engage PolyU SPEED’s students and alumni with a forward-looking session on the transformative potential of AI as an innovator.

Building on OpenAI’s five-level AI maturity framework, which runs from Chatbots to Organizational AI, Mr. Shek devoted this follow-up session to the fourth level: Innovators. At this level, AI systems evolve beyond tool use and reasoning into genuine scientific and technological discovery. Drawing on insights from Anthropic and DeepMind, he demonstrated how reinforcement learning (RL) enables AI not just to mimic human intelligence, but to surpass it through experience-driven experimentation and autonomous learning.


A Turning Point: AI Innovation in Medicine and Science

Mr. Shek cited interviews with the CEOs of Anthropic and DeepMind to illustrate how the AI field is on the cusp of medical breakthroughs. Anthropic’s CEO predicted that AI may soon extend human lifespans, while DeepMind’s CEO, Demis Hassabis, a 2024 Nobel laureate in Chemistry, has forecast that within the next decade AI could enable cures for cancer and many other diseases.

One of the most promising examples is DeepMind’s AlphaFold, which predicts how proteins fold into their three-dimensional structures. A protein’s folded shape determines its function, so predicting it is essential for understanding biology and for developing new treatments and medicines. The problem had stumped scientists for decades, with a single structure often requiring years of lab work; AI can now produce such predictions within hours, revolutionizing drug discovery and molecular biology.

Mr. Shek connected this breakthrough to a broader transformation within AI research. He referred to AlphaGo, DeepMind’s earlier milestone AI, which defeated Go world champion Lee Sedol 4–1 in 2016. AlphaGo was trained using supervised learning from hundreds of thousands of human game records. But DeepMind didn’t stop there.

They went on to develop AlphaGo Zero, an even more powerful system that started with zero human data. It learned purely through reinforcement learning by playing against itself. Within 21 days, AlphaGo Zero surpassed the original AlphaGo and defeated it 100–0. AlphaGo Zero was groundbreaking in its ability to derive strategic understanding without relying on human examples.

This shift from supervised to reinforcement learning underscored a powerful lesson: when AI learns from the environment and from experience, it can surpass even the best human-designed strategies. AlphaGo Zero proved that an AI system can generate original strategies and move beyond legacy knowledge by learning directly from raw feedback.
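The self-play recipe can be made concrete with a deliberately tiny example. The sketch below is my own illustration, not DeepMind's method (AlphaGo Zero pairs deep neural networks with Monte Carlo tree search): a tabular Q-learning agent teaches itself the game of Nim purely by playing against itself, with no human game records, and should rediscover the well-known optimal strategy of leaving a multiple of four coins.

```python
import random

N_COINS = 21          # starting pile size
ACTIONS = (1, 2, 3)   # a player may remove 1-3 coins; taking the last coin wins

# Q[state][action]: value of removing `action` coins when `state` coins remain,
# from the perspective of the player about to move.
Q = {s: {a: 0.0 for a in ACTIONS if a <= s} for s in range(1, N_COINS + 1)}

def choose(state, epsilon):
    """Epsilon-greedy action selection over the legal moves in `state`."""
    if random.random() < epsilon:
        return random.choice(list(Q[state]))
    return max(Q[state], key=Q[state].get)

def train(episodes=50_000, alpha=0.1, epsilon=0.2):
    """Self-play: the same Q-table plays both sides and learns from game outcomes."""
    for _ in range(episodes):
        state = N_COINS
        while state > 0:
            action = choose(state, epsilon)
            next_state = state - action
            if next_state == 0:
                target = 1.0                        # taking the last coin wins
            else:
                # The opponent moves next with the same policy, so this position is
                # worth the negative of its best value from the opponent's side.
                target = -max(Q[next_state].values())
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state

train()
# Optimal play leaves a multiple of 4 coins; the self-taught policy should find this.
for s in (5, 9, 14, 21):
    best = max(Q[s], key=Q[s].get)
    print(f"{s} coins left -> remove {best} (leaves {s - best})")
```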

Learning Through Experience: Supervised Fine-Tuning vs Reinforcement Learning

To help the audience visualize this, Mr. Shek offered a vivid analogy. Imagine two methods to train a student to become a world-class scientist:

Supervised Fine-Tuning involves training a model on curated, labeled datasets with the correct outputs. This method enables models to closely replicate human-like responses based on known answers. In educational terms, it’s like giving students past exam papers to practice—great for performance, but limited in encouraging creative thinking.

Reinforcement Learning, on the other hand, places an AI in an environment where it must act and receive feedback. The model learns by trial and error, guided by rewards or penalties from its outcomes. It’s like placing a student in a lab to conduct open-ended experiments, learning through experimentation rather than memorization.
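The difference between the two signals can be sketched in a few lines of code. In the supervised case the model is nudged toward a labelled answer; in the reinforcement case it samples its own output and is nudged toward whatever the environment rewarded. This is a minimal, hypothetical PyTorch sketch of the two objectives, not any specific training pipeline:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(8, 4)        # toy "policy": 8 input features, 4 possible outputs
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(1, 8)                # one toy input

# Supervised fine-tuning: imitate a known correct answer.
label = torch.tensor([2])            # a human-provided "correct" output
loss_sft = F.cross_entropy(model(x), label)
opt.zero_grad()
loss_sft.backward()
opt.step()

# Reinforcement learning (REINFORCE): learn from a reward, not a label.
dist = torch.distributions.Categorical(logits=model(x))
action = dist.sample()                               # the model tries something of its own
reward = 1.0 if action.item() == 2 else -0.2         # the environment scores the attempt
loss_rl = -(dist.log_prob(action) * reward).mean()   # reinforce well-rewarded actions
opt.zero_grad()
loss_rl.backward()
opt.step()
```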

Mr. Shek emphasized this by reflecting on his own entrepreneurial journey: “Intelligence may be innate, but wisdom is earned through failure,” he said. He noted that true innovation only emerges through firsthand interaction with the unknown—where feedback isn’t marked by correctness, but by results.

To reinforce this, he spotlighted DeepSeek’s R1-Zero model, trained with reinforcement learning rather than human-labelled examples. Like AlphaGo Zero, it relied on feedback from its environment to self-correct and iterate toward correct solutions, showing that AI can develop effective reasoning strategies without step-by-step human supervision.
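DeepSeek’s published description of R1-Zero-style training replaces human preference labels with simple rule-based rewards, such as whether the final answer checks out and whether the response follows the required format. The function below is a rough sketch of that idea under my own assumptions; the tag names and scoring weights are illustrative, not DeepSeek’s actual implementation:

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Score a model response with automatic rules, no human judgement involved.

    Illustrative only: rewards a correct final answer and a well-formed
    reasoning block, loosely in the spirit of R1-Zero-style training.
    """
    reward = 0.0

    # Format rule: the model should wrap its reasoning in <think> ... </think>.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.2

    # Accuracy rule: the final answer (after the reasoning block) must match exactly.
    answer = response.split("</think>")[-1].strip()
    if answer == ground_truth.strip():
        reward += 1.0

    return reward

print(rule_based_reward("<think>21 = 3 * 7</think> 7", "7"))   # -> 1.2
print(rule_based_reward("the answer is 7", "7"))               # -> 0.0
```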

Why Current LLMs Can’t Truly Innovate

Mr. Shek highlighted a key limitation of current LLMs: their dependency on human preferences for training. While RLHF (Reinforcement Learning from Human Feedback) is commonly used, it’s inherently subjective.

For instance, when training a model to write articles or generate summaries, there's no definitive "correct" output. In these cases, models are shown multiple completions, and human annotators choose the best one. The model then updates itself to align with those preferences. But what happens when those preferences are wrong—or simply average?

That’s why, Mr. Shek argued, LLMs reflect human biases and are optimized for average correctness, not creativity or innovation.
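In practice, those annotator choices are distilled into a reward model trained on pairwise comparisons: it learns to score the preferred completion above the rejected one, and the language model is then optimized against that learned score, so whatever the annotators systematically favour gets baked in. Below is a minimal sketch of the pairwise (Bradley-Terry style) objective; the tiny linear scorer and random embeddings are placeholders, not a real reward model:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A toy reward model: maps a (pretend) completion embedding to a scalar score.
reward_model = torch.nn.Linear(16, 1)
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Stand-ins for embeddings of two completions of the same prompt,
# where human annotators preferred the first one.
chosen = torch.randn(4, 16)     # batch of 4 preferred completions
rejected = torch.randn(4, 16)   # batch of 4 rejected completions

for step in range(100):
    score_chosen = reward_model(chosen)
    score_rejected = reward_model(rejected)
    # Pairwise loss: push the chosen score above the rejected score.
    loss = -F.logsigmoid(score_chosen - score_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The reward model now encodes the annotators' preferences - including their biases.
print((reward_model(chosen) > reward_model(rejected)).float().mean().item())  # -> should approach 1.0
```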

Escaping Path Dependency Through First Principles

Mr. Shek turned to path dependency—how outdated practices persist. He gave an engaging historical example:

Why is a NASA rocket booster the width it is today? The answer, surprisingly, traces back to Roman chariots. The width of U.S. railways was based on English tramways, which were based on horse-drawn carriages, which were based on chariot ruts from the Roman Empire.

This kind of legacy thinking stifles innovation. He compared it to the QWERTY keyboard, whose layout was originally arranged to keep early typewriter mechanisms from jamming. Although that constraint is long obsolete, we still type on it today.

AI, on the other hand, can break free from these constraints. If trained via RL and exposed to diverse experimental outcomes, it can apply first-principles reasoning—solving problems based on core truths, not tradition.

OpenAI’s Hide-and-Seek and the Rise of Emergent Behavior

To show how RL leads to surprising results, Mr. Shek revisited OpenAI’s famous hide-and-seek experiment. In this simulation:

  • Blue agents (hiders) and red agents (seekers) played hide-and-seek in a simulated environment.
  • Over millions of rounds, hiders learned to barricade themselves with boxes and to lock ramps away so seekers could not use them.
  • Seekers, in turn, learned to use ramps as launchpads and even to ride boxes like surfboards, behaviors the developers never coded.

This emergent behavior showed that AI can develop tactical creativity when the environment allows trial, error, and feedback.
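The striking part is how simple the underlying objective is. As reported by OpenAI, the reward is team-based and zero-sum: hiders are rewarded while the whole team stays out of sight and penalized otherwise, with seekers receiving the opposite signal. Here is a toy sketch of that reward shape, with the visibility check reduced to a list of booleans rather than the real physics-based line-of-sight test:

```python
def hide_and_seek_reward(seen: list[bool]) -> tuple[float, float]:
    """Per-timestep team rewards, given whether each hider is currently seen.

    Zero-sum shaping in the spirit of the OpenAI hide-and-seek experiment:
    hiders are rewarded only while the whole team stays hidden.
    """
    hiders_hidden = not any(seen)
    hider_reward = 1.0 if hiders_hidden else -1.0
    seeker_reward = -hider_reward
    return hider_reward, seeker_reward

print(hide_and_seek_reward([False, False]))  # all hiders hidden -> (1.0, -1.0)
print(hide_and_seek_reward([True, False]))   # one hider spotted -> (-1.0, 1.0)
```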

When AI Starts to Reinvent Learning Itself

Perhaps the most forward-looking concept Mr. Shek introduced was meta-RL: AI learning how to improve its own learning.

He cited work by David Silver and Richard Sutton on AI systems that design their own reinforcement learning algorithms. In reported experiments, such machine-discovered algorithms have outperformed strong human-designed RL algorithms, a landmark step toward self-optimizing AI.

This means AI isn’t just learning faster—it’s learning how to learn better than humans ever could.
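As a caricature of what “learning to learn” means, the sketch below wraps an ordinary learner in an outer loop that tunes a parameter of the learning rule itself, judging each candidate only by the reward the inner learner achieves. The cited research goes much further, discovering entire update rules rather than a single hyperparameter; this toy is only meant to show the two nested loops:

```python
import random

def run_inner_learner(lr: float, steps: int = 200, seed: int = 0) -> float:
    """Train a simple two-armed bandit learner with the given learning rate
    and return its total reward (arm 1 pays off more often than arm 0)."""
    rng = random.Random(seed)
    probs = (0.3, 0.7)                  # true payoff probabilities of the two arms
    values = [0.0, 0.0]                 # the learner's value estimates
    total = 0.0
    for _ in range(steps):
        greedy = max(range(2), key=lambda a: values[a])
        arm = greedy if rng.random() > 0.1 else rng.randrange(2)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        values[arm] += lr * (reward - values[arm])   # the inner learning rule
        total += reward
    return total

# Outer loop: the "meta-learner" searches over the learning rule's parameter,
# judging each candidate purely by the inner learner's achieved reward.
candidates = [0.001, 0.01, 0.1, 0.5, 0.9]
scores = {lr: sum(run_inner_learner(lr, seed=s) for s in range(20)) for lr in candidates}
best_lr = max(scores, key=scores.get)
print("meta-chosen learning rate:", best_lr)
```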

Conclusion: The Road to AI-Driven Discovery

Mr. Shek concluded by asking the audience a fundamental question: What limits human innovation?

He identified two barriers:

  • Time – Human lives are finite; AI can test millions of ideas in parallel.
  • Bias – Humans carry legacy thinking, emotional bias, and rigid rules.

In contrast, AI models trained via RL are free from ego, tireless, and open to failure. With the right environment and feedback mechanisms, they can discover new medical treatments, reimagine infrastructure, and even rewrite scientific theory.

He left the students with a clear vision: the future of innovation lies not in data alone, but in experience-rich learning systems that challenge assumptions and learn from the world itself.

— Dr. Ken FONG

Summary

On 17 May 2025, the School of Professional Education and Executive Development of The Hong Kong Polytechnic University (PolyU SPEED) once again invited Mr. Shek Ka-wai, founder of OMP and generative AI expert, to speak in its AI-Driven Digital and Social Media Marketing professional certificate programme. Continuing the first session’s theme, “From Gen AI to AI Agent”, this article records the second half of the lecture, which explored AI’s potential, and its practice, as an innovator.

In the first session, Mr. Shek used OpenAI’s five stages of AI maturity as his framework and argued that we have now entered the fourth stage, the Innovator. AI at this stage can not only use tools and reason, but also pursue research and new scientific ideas on its own. This lecture examined how AI, combined with reinforcement learning (RL), can reach heights of innovation that humans struggle to match.

Citing interviews with the CEOs of Anthropic and DeepMind, Mr. Shek noted that AI may, within a decade, help extend human lifespans and even cure cancer and many other diseases. This is not science fiction but a breakthrough already under way. DeepMind’s AlphaFold is the prime example: it predicts protein folding with high accuracy, so biological research that once took years can now be completed by AI within hours, greatly accelerating drug development and molecular medicine.

AlphaGo learned from human game records through supervised training, but AlphaGo Zero relied on no human data at all: it learned purely by playing against itself with reinforcement learning, surpassed its predecessor within just 21 days, and defeated AlphaGo 100 to 0.

This shows that by learning from experience through interaction with its environment, AI can not only replicate human intelligence but also surpass it and create new knowledge.

SFT vs RL: Past Papers versus the Laboratory

To help students grasp the difference between SFT and RL, Mr. Shek used an everyday analogy:

Supervised Fine-Tuning: like drilling students on past exam papers, it builds fluency and correct answers but does not encourage innovation.

Reinforcement Learning: like giving students free run of a laboratory and letting them learn from trial and error, it is far better at cultivating genuine creativity and problem-solving ability.

Mr. Shek also recalled a line from his own start-up days: “Intelligence is innate; wisdom is learned from mistakes.” For AI to help humanity innovate, it must keep drawing feedback from its environment and adjusting itself.

DeepSeek R1 Zero: A Reasoning Model That Does Not Rely on Human Annotation

He introduced the DeepSeek R1 Zero model released in 2025 which, like AlphaGo Zero, was trained without human-annotated data: the AI tries out and refines strategies on its own, demonstrating the potential of reinforcement learning inside language models. This approach shows that AI can generate solutions autonomously, without human guidance.

The Limits of Today’s LLMs: Bound by Human Preferences

In contrast with RL, today’s large language models (such as ChatGPT) are typically trained with RLHF (reinforcement learning from human feedback), which ends up reinforcing subjective human preferences. A model can produce many candidate answers, yet it has no way to judge which one is genuinely innovative; it simply caters to the annotators’ preferences.

Mr. Shek put it bluntly: “Today’s LLMs are optimized for the average, not for innovation.”

Breaking Path Dependency: AI’s First-Principles Thinking

“Why is a rocket booster exactly this wide?” Mr. Shek asked, then walked the audience back through history, from the design of NASA’s rockets to American railways, and further back to English carriages and Roman war chariots, pointing out that people habitually design by following existing patterns. This is what is known as path dependency.

AI, however, has no emotions and no fear of breaking convention. Trained with reinforcement learning, it can acquire first-principles thinking: solving problems from physical or logical fundamentals rather than from tradition.

Emergent Behavior: Lessons from OpenAI’s Hide-and-Seek Experiment

Mr. Shek described OpenAI’s famous hide-and-seek experiment, in which, over millions of training rounds, the AI discovered for itself inventive behaviors such as locking objects away to stay hidden and even surfing on props, going well beyond what the developers anticipated.

This shows that, given a clear objective and feedback from the environment, AI can naturally develop novel strategies and behaviors.

Meta-Learning: AI Teaching Itself How to Learn

Even more striking, AI can now design its own learning algorithms. Research by David Silver and Richard Sutton suggests that, through RL, AI can learn to build learning methods superior to those designed by humans, a step toward genuine self-optimization.

Conclusion: New Hope for AI-Enabled Innovation

Mr. Shek concluded that human innovation faces two major limits:

Finite lifespans: we cannot run millions of experiments.

Constrained thinking: we are easily swayed by past experience and emotion.

AI, by contrast, can try and fail without limit and without bias. Given the right environment and reinforcement learning, it has the potential to surpass humans and help us tackle scientific and medical challenges that have so far resisted solution.

Future innovation will no longer rest on human intuition and data alone, but on the wisdom AI gains from interacting with the world itself.
