Manus AI and DeepSeek: How Do These Chinese AIs Stack Up Against Grok 3 and ChatGPT?
- Mag Shum
- Mar 13
- 6 min read
Let's compare Manus AI, Grok 3, DeepSeek R1, and ChatGPT (including o3-mini and GPT-4o) on their capabilities. Each model was evaluated across six key categories: Reasoning and Problem-Solving, Real-Time Data Access, Coding and Execution, Versatility and Creativity, Accessibility and Cost, and Speed. The analysis draws on recent benchmarks, public documentation, and industry reports, and is written for both technical and non-technical audiences.
Background and Context
Manus AI, launched on March 6, 2025, by the Chinese startup Monica, is a fully autonomous AI agent designed to execute real-world tasks end-to-end, such as travel planning and stock analysis (What is Manus? China's World-First Fully Autonomous AI Agent Explained). It has gained attention for its performance on the GAIA benchmark, with scores of 86.5% (Level 1), 70.1% (Level 2), and 57.7% (Level 3) (Manus AI Statistics and Facts).
Grok 3, released by xAI in February 2025, is a reasoning-focused model with advanced real-time data access via DeepSearch, scoring 93.3% on the AIME math benchmark (Grok 3 Beta — The Age of Reasoning Agents | xAI). Access is tied to the X Premium+ plan ($40/month) or the rumored SuperGrok plan ($30/month).
DeepSeek R1, from DeepSeek AI, is an open-source reasoning model launched in January 2025, known for its efficiency and cost-effectiveness, offering a free tier and scoring 71.0% on AIME 2024 (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning).
ChatGPT, developed by OpenAI, includes o3-mini (a cost-efficient reasoning model scoring 87.3% on AIME in its high-reasoning setting) and GPT-4o (a versatile multimodal model), with access ranging from free to $200/month for the Pro plan (OpenAI o3-mini: Performance, How to Access, and More).

Category-by-Category Analysis
Reasoning and Problem-Solving
This category evaluates models on their ability to handle complex reasoning tasks, primarily using the AIME math benchmark for consistency, with GAIA as a secondary measure for real-world problem-solving.
Grok 3: Achieves 93.3% on AIME, indicating strong mathematical reasoning, and is designed for versatile problem-solving (Grok 3 Beta — The Age of Reasoning Agents | xAI).
Manus AI: No specific AIME score, but excels on GAIA with 86.5% (Level 1), 70.1% (Level 2), and 57.7% (Level 3), suggesting robust real-world reasoning (Manus AI Statistics and Facts).
DeepSeek R1: Scores 71.0% Pass@1 on AIME 2024, showing solid technical reasoning but lagging behind the top models (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning); see the Pass@1 sketch after this list.
ChatGPT o3-mini: Its high-reasoning setting scores 87.3% on AIME, competitive for reasoning tasks, with a focus on STEM (OpenAI o3-mini: Performance, How to Access, and More).
Winner: Grok 3, due to its highest AIME score, reflecting superior reasoning capabilities.
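For context on what "Pass@1" means: it is the fraction of problems solved by the model's first sampled answer. The standard unbiased pass@k estimator from OpenAI's HumanEval paper generalizes this; here is a minimal Python sketch, where n samples are drawn per problem and c of them are correct:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations is correct, given c correct among the n."""
    if n - c < k:
        return 1.0  # fewer wrong samples than draws: some draw must be correct
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# For k=1 this reduces to c/n: 71 correct out of 100 samples gives 0.71,
# i.e. a 71.0% Pass@1 score.
print(pass_at_k(100, 71, 1))  # 0.71
```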
Real-Time Data Access
This category assesses the models' ability to fetch and integrate current information, which is crucial for dynamic tasks; a sketch of the general retrieve-then-prompt pattern follows this list.
Grok 3: Features DeepSearch mode for real-time web and X searches, pulling fresh info instantly, enhancing its responsiveness (Elon Musk’s xAI releases its latest flagship model, Grok 3 | TechCrunch).
Manus AI: Likely has real-time data access for real-world tasks, given its autonomous execution capabilities, though specifics are unclear (Manus AI: Capabilities, GAIA Benchmark Insights, Use Cases & More).
DeepSeek R1: Offers web browsing, but reports suggest it struggles under high demand, limiting its real-time effectiveness (DeepSeek - R1 Online (Free|Nologin)).
ChatGPT o3-mini: Includes search integration for real-time data, with early prototype support, enhancing its utility (OpenAI O3-Mini: The Cost-Efficient Genius Redefining STEM AI | Medium).
Winner: Grok 3, with its advanced DeepSearch mode providing the most integrated real-time data access.
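None of these vendors document their search pipelines publicly, but the underlying pattern is broadly the same: retrieve fresh content, then place it in the model's context before asking the question. A minimal illustrative sketch of that retrieve-then-prompt pattern (not any vendor's actual implementation; the URL below is a placeholder):

```python
import urllib.request

def build_grounded_prompt(question: str, url: str, max_chars: int = 4000) -> str:
    """Fetch a live page and prepend it to the question so the model
    answers from current data instead of stale training data."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        page = resp.read().decode("utf-8", errors="replace")[:max_chars]
    return (
        "Answer using only the source below, retrieved just now.\n\n"
        f"SOURCE ({url}):\n{page}\n\n"
        f"QUESTION: {question}"
    )

# The resulting string would be sent as the user message to whichever
# chat model you are using (placeholder URL shown).
prompt = build_grounded_prompt("What changed today?", "https://example.com/news")
print(prompt[:200])
```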
Coding and Execution
This category evaluates coding proficiency and the ability to execute tasks autonomously, using benchmarks like LiveCodeBench where available; a sketch of how execution-based benchmarks score code follows this list.
Manus AI: Excels in autonomous execution, building functional outputs like websites and games, with no specific benchmark scores but strong real-world performance (China’s Autonomous Agent, Manus, Changes Everything | Forbes).
Grok 3: Scores 79.4% on LiveCodeBench, beating GPT-4o (72.9%), indicating strong coding capabilities (Grok 3 Beta — The Age of Reasoning Agents | xAI).
DeepSeek R1: Achieves 57.2% on LiveCodeBench, with distilled models performing well in coding tasks, but overall lower than top models (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning).
ChatGPT o3-mini: The high-reasoning variant averages 0.846 on LiveBench's coding category, a different benchmark from LiveCodeBench, so the figure suggests strong coding performance but is not directly comparable to the scores above (o3-mini Early Days — LessWrong).
Winner: Manus AI, due to its superior execution capabilities, surpassing others in practical task completion.
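As a rough illustration of how execution-based benchmarks in the LiveCodeBench family score a submission (the real harnesses add sandboxing and large per-problem test suites), here is a minimal sketch: run the generated program against stdin/stdout test cases and count it as solved only if every case matches.

```python
import subprocess
import sys

def passes_tests(source: str, cases: list[tuple[str, str]], timeout: float = 5.0) -> bool:
    """Run candidate code against (stdin, expected stdout) pairs;
    the problem counts as solved only if every case passes."""
    for stdin, expected in cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # hung or too slow: counted as a failure
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

# Toy problem: print the sum of two integers read from stdin.
candidate = "a, b = map(int, input().split()); print(a + b)"
print(passes_tests(candidate, [("1 2", "3"), ("10 -4", "6")]))  # True
```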
Versatility and Creativity
This category assesses the models' ability to handle diverse tasks, including creative writing and open-ended chats, considering ChatGPT's GPT-4o for its multimodal strengths.
Grok 3: Handles technical tasks and creative writing, with a focus on humor and open-ended chats, making it versatile (Elon Musk debuts Grok 3, an AI model that he says outperforms ChatGPT and DeepSeek | CNN Business).
ChatGPT (GPT-4o): Highly versatile and creative, excelling in multimodal tasks like image and text generation, with polished prose (GPT-4o vs GPT-4o Mini: Choosing the Right AI Model | Amity Solutions).
Manus AI: Focused on practical execution, less on creativity, with limited chat capabilities (Manus AI: Features, Architecture, Access, Early Issues & More | DataCamp).
DeepSeek R1: Weak on creativity, primarily technical, with dry responses (DeepSeek R1 Review: Performance in Benchmarks & Evals | TextCortex).
Winner: Tie between Grok 3 and ChatGPT (GPT-4o), both excelling in versatility and creativity, with GPT-4o slightly ahead in multimodal tasks.
Accessibility and Cost
This category evaluates ease of access and pricing, crucial for user adoption.
DeepSeek R1: Offers a free tier and open-source weights under the MIT license, making it highly accessible, with API pricing at $0.14/million input tokens on a cache hit (DeepSeek R1 is now available on Azure AI Foundry and GitHub | Microsoft Azure Blog); a quick cost estimate follows this list.
ChatGPT: Free base model, Plus plan at $20/month for o3-mini access, Pro at $200/month, offering broad accessibility (Announcing the availability of the o3-mini reasoning model in Microsoft Azure OpenAI Service | Microsoft Azure Blog).
Grok 3: Tied to X Premium+ at $40/month or the rumored SuperGrok at $30/month, and limited to the X ecosystem, making it less accessible (Grok 3 AI is now free to all X users – here's how it works | ZDNET).
Manus AI: Invite-only, with invite codes reselling for up to $7,000 USD; likely premium once public, and the least accessible of the four (Manus AI Statistics and Facts).
Winner: DeepSeek R1, due to its free tier and open-source nature, offering the best cost-effectiveness.
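To make the pricing concrete, here is a back-of-the-envelope estimate at the quoted $0.14 per million cache-hit input tokens (the workload figures below are assumptions for illustration; output-token and cache-miss rates differ and are not included):

```python
# Rough monthly input-token cost at DeepSeek R1's quoted cache-hit rate.
PRICE_PER_M_INPUT = 0.14  # USD per million input tokens (cache hit)

requests_per_day = 1_000      # assumed workload
tokens_per_request = 2_000    # assumed average prompt size
monthly_tokens = requests_per_day * tokens_per_request * 30

cost = monthly_tokens / 1_000_000 * PRICE_PER_M_INPUT
print(f"{monthly_tokens:,} input tokens/month -> ${cost:.2f}")
# 60,000,000 input tokens/month -> $8.40
```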
Speed
This category measures response and processing speed, vital for user experience.
Grok 3: Described as lightning-fast, with scripts and searches in seconds, leveraging xAI’s 100,000+ GPU backbone (Elon Musk’s ‘Scary Smart’ Grok 3 Release—What You Need To Know | Forbes).
Manus AI: Demos suggest it is fast on complex tasks, but no specific metrics are available (Another DeepSeek moment? General AI agent Manus shows ability to handle complex tasks | South China Morning Post).
DeepSeek R1: Achieves 381 tokens/sec, outpacing many rivals, though web browsing may lag (DeepSeek - R1 Online (Free|Nologin)); see the throughput arithmetic after this list.
ChatGPT o3-mini: Faster than o1-mini, reaching its first token about 2.5 s sooner, with lower overall latency (OpenAI launches o3-mini, its latest 'reasoning' model | TechCrunch).
Winner: Grok 3, highlighted for its exceptional speed across tasks.
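To put throughput figures in perspective: the wall-clock time of a streamed response is roughly time-to-first-token plus generated tokens divided by tokens per second. A small sketch using the 381 tokens/sec figure above (the 1.0 s time-to-first-token is an assumed placeholder, not a measured value):

```python
def response_time(n_tokens: int, tokens_per_sec: float, ttft_s: float) -> float:
    """Approximate wall-clock time for a streamed response:
    time to first token plus generation at steady throughput."""
    return ttft_s + n_tokens / tokens_per_sec

# A 1,500-token answer at DeepSeek R1's reported 381 tokens/sec,
# with an assumed 1.0 s time to first token:
print(f"{response_time(1_500, 381.0, 1.0):.1f} s")  # ~4.9 s
```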

Overall Assessment
Grok 3 emerges as the most well-rounded model, winning in Reasoning and Problem-Solving, Real-Time Data Access, and Speed, and tying with ChatGPT (GPT-4o) in Versatility and Creativity. Manus AI excels in Coding and Execution, particularly autonomous task completion, but its invite-only status limits accessibility. DeepSeek R1 offers the best Accessibility and Cost, appealing to budget-conscious users with its open-source nature. ChatGPT, through o3-mini and GPT-4o, provides a balanced suite, with GPT-4o standing out for creativity and versatility. The right choice depends on user needs; Manus AI's rapid market impact, with invite codes reselling at steep prices, highlights strong demand despite its limited access (Manus AI Statistics and Facts).
This analysis draws on benchmarks including AIME (Comparison of AI Models across Intelligence, Performance, Price | Artificial Analysis), GAIA (GAIA: a benchmark for General AI Assistants | arXiv), and LiveCodeBench (LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code | arXiv), among others, to provide a detailed comparison.