Open Source Triumphs: Command R+ Surpasses GPT-4 in LLM Arena

Read how Command R+, an open-source model, has defeated GPT-4 for the first time in history, marking a new era in AI advancements.


Recently, the LLM Arena updated its battle report. Less than a week after its launch, Command R+ has become the first open-source model in history to defeat GPT-4!

Currently, Command R+ is available on HuggingChat and can be tried for free.

As of the time of writing, the arena rankings have been updated to April 11. Command R+ has received 23,000 votes, its overall score has surpassed the earlier version of GPT-4 (0613), and it is tied with GPT-4-0314 for 7th place. And it is an open-source model (though its license does not permit commercial use).

One might suggest to Sam Altman that, whether it is GPT-4.5 or GPT-5, he should bring it out quickly, or OpenAI's lead will be gone.

But in fact, OpenAI was not idle either. After being briefly overtaken by Claude 3 and enduring a short-lived humiliation, it quickly released a new version (GPT-4-Turbo-2024-04-09) and returned straight to the throne.

This instantly pushed every other model on the leaderboard down one place: before the April 9 release, Command R+ had been ranked 6th in the world.

LLM Arena

Nevertheless, as the first open-source model to defeat GPT-4, Command R+ has made the open-source community proud, in what leading figures in the field acknowledge was a fair fight.


Nils Reimers, director of machine learning at Cohere, noted that this is not even the true strength of Command R+: its real advantage lies in its ability to use RAG and tools, and these plug-in capabilities are not exercised in the LLM Arena.

In fact, Cohere officially describes Command R+ as a “RAG optimization model.”

Built for real-world enterprise use cases, Command R+ focuses on balancing high efficiency and high precision, enabling enterprises to move beyond proof of concept and into production with AI.

Of course, at 104 billion parameters it is still smaller than Grok-1 (314 billion), which Musk open-sourced some time ago. But unlike Grok, Command R+ is not a mixture-of-experts (MoE) architecture.


So all 104 billion parameters are active at inference time, whereas Grok-1 activates only 86 billion of its parameters per token. From this perspective, it is not an exaggeration to say that Command R+ is currently the largest open-source model by active parameters.

As an evolution of Command R, it further improves overall performance.

The main advantages include:

  • Advanced retrieval-augmented generation (RAG) with citations to reduce hallucinations
  • Multilingual coverage of 10 major languages to support global business operations
  • Tool use to automate complex business processes

While outperforming the competition, Command R+ also comes at a much lower price.

Currently, Cohere has partnered with many large companies and made the model available on Amazon SageMaker and Microsoft Azure.

Industry-leading RAG solutions

If companies want to customize an LLM with their own proprietary data, RAG is all but unavoidable.

Optimized for advanced RAG, Command R+ provides a highly reliable, verifiable solution.

The new model improves the accuracy of responses and provides inline citations that mitigate hallucinations, helping businesses scale with AI to quickly find the most relevant information.
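To make "inline citations" concrete, here is a purely illustrative sketch (the field names and structure are invented for this example, not Cohere's actual API schema): a RAG model returns, alongside the answer text, a list of spans with the documents that support each span, which can then be rendered as inline markers.

```python
# Illustrative sketch of rendering inline citations from a RAG response.
# Each citation marks a character span of the generated answer plus the
# ids of the source documents that support it (field names are invented).

def render_citations(answer, citations):
    """Insert [doc_id, ...] markers after each cited span of the answer."""
    out = []
    pos = 0
    for c in sorted(citations, key=lambda c: c["start"]):
        out.append(answer[pos:c["end"]])                 # text up to span end
        out.append("[" + ", ".join(c["doc_ids"]) + "]")  # citation marker
        pos = c["end"]
    out.append(answer[pos:])
    return "".join(out)

answer = "Command R+ has 104B parameters and supports 10 languages."
citations = [
    {"start": 0, "end": 30, "doc_ids": ["doc_1"]},
    {"start": 35, "end": 56, "doc_ids": ["doc_2"]},
]
print(render_citations(answer, citations))
# -> Command R+ has 104B parameters[doc_1] and supports 10 languages[doc_2].
```

The point of this representation is verifiability: every claim in the answer can be traced back to a specific source document, which is what lets enterprises audit model output.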

It supports tasks across business functions such as finance, HR, sales, marketing, and customer support.

The evaluation used a proprietary test set of 250 highly diverse documents and summarization requests containing complex instructions, similar to real API traffic. The baseline models were extensively prompted, while Command R+ used the RAG API.

Accuracy on HotpotQA and Bamboogle was judged by a three-way majority vote of prompted evaluators (Command R, GPT-3.5, and Claude 3 Haiku) to reduce known intra-model biases.
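The three-way majority vote described above can be sketched in a few lines: an answer counts as correct only if at least two of the three evaluator models judge it correct (evaluator names here are just labels for illustration).

```python
# Minimal sketch of a three-way majority vote over evaluator verdicts.
from collections import Counter

def majority_vote(verdicts):
    """verdicts: list of 'correct'/'incorrect' labels, one per evaluator."""
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count >= 2 else None  # no majority -> undecided

# e.g. Command R and Claude 3 Haiku say correct, GPT-3.5 disagrees:
print(majority_vote(["correct", "incorrect", "correct"]))  # -> correct
```

With three voters and a binary label there is always a majority, so using an odd number of evaluators avoids ties; the vote also dilutes any single judge's bias toward answers in its own style.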


Validation was performed on a subset of one thousand examples using human annotations. StrategyQA accuracy was judged using long-form answers ending with a yes/no decision.

Use tools to automate complex processes

A large language model should be able not only to ingest and generate text, but also to act as a core reasoning engine: making decisions and using tools to automate difficult tasks that require intelligence to solve.

To provide this capability, Command R+ offers tool usage capabilities, accessible via API and LangChain, to seamlessly automate complex business workflows.

Enterprise use cases include automated updating of customer relationship management (CRM) tasks, activities, and records.

Command R+ also supports multi-step tool use, which allows the model to combine multiple tools in multiple steps to complete difficult tasks — and can even self-correct when attempting to use a tool and failing, to increase the success rate.
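The multi-step, self-correcting loop described above can be sketched roughly as follows. Everything here is hypothetical (the tool names, the `fix_call` heuristic, and the plan format are invented for illustration, not Cohere's or LangChain's actual interfaces): the model proposes a sequence of tool calls, and when a call fails, the error is fed back so the next attempt can be revised.

```python
# Hypothetical sketch of multi-step tool use with self-correction:
# execute planned tool calls in order, retrying a failing call after
# revising its arguments (as the model would on seeing the error).

def run_plan(plan, tools, max_retries=2):
    """Execute tool calls in sequence, retrying failures with a fix."""
    results = []
    for call in plan:
        name, args = call["tool"], call["args"]
        for attempt in range(max_retries + 1):
            try:
                results.append(tools[name](**args))
                break
            except Exception as err:
                if attempt == max_retries:
                    raise
                args = fix_call(name, args, err)  # model-style self-correction

    return results

def fix_call(name, args, err):
    # Stand-in for the model revising its own arguments after a failure.
    fixed = dict(args)
    if isinstance(err, KeyError):
        fixed["record_id"] = "crm-001"  # fall back to a known record id
    return fixed

# Toy CRM "tool": update a record's status and return the record.
records = {"crm-001": {"status": "open"}}
tools = {
    "update_record": lambda record_id, status:
        records[record_id].update(status=status) or records[record_id],
}
plan = [{"tool": "update_record",
         "args": {"record_id": "missing", "status": "closed"}}]
print(run_plan(plan, tools))  # first attempt fails, corrected retry succeeds
```

The retry-with-feedback loop is what raises the end-to-end success rate: a single bad argument no longer sinks the whole multi-step task.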

The figure above uses Microsoft’s ToolTalk (Hard) benchmark and the Berkeley Function Calling Leaderboard (BFCL) to evaluate conversational tool use and single-turn function-calling capabilities.

For ToolTalk, predicted tool calls are evaluated against the ground truth and the overall conversation success metric depends on how well the model recalls all tool calls and avoids bad operations (i.e., tool calls with undesirable side effects).

For BFCL, the March 2024 release was used, bug fixes were included in the evaluation, and the average function success rate score for the executable subcategory was reported. Bug fixes were verified through an additional manual evaluation cleanup step to prevent false positives.

Multi-language support

Command R+ excels in 10 languages critical to global business: Chinese, English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, and Arabic.

The figure above compares the models on the FLoRES (French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, and Chinese) and WMT23 (German, Japanese, and Chinese) translation tasks.


Additionally, Command R+ features an excellent tokenizer that compresses non-English text better than the tokenizers used by other models on the market, achieving up to a 57% cost reduction.

The figure above compares the number of tokens generated by Cohere, Mistral, and OpenAI tokenizers for different languages.

The Cohere tokenizer generates far fewer tokens representing the same text, especially in non-Latin languages. For example, in Japanese, the OpenAI tokenizer outputs 1.67 times as many tokens as the Cohere tokenizer.
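With per-token pricing, the saving from a more compact tokenizer falls straight out of the token counts: it is 1 − (compact tokens / baseline tokens). A back-of-the-envelope sketch (the 100/167 counts are invented to match the article's 1.67× Japanese figure; the "up to 57%" claim applies to the most favorable languages):

```python
# Relative cost saving from a tokenizer that emits fewer tokens for the
# same text, assuming price scales linearly with token count.

def token_saving(compact_tokens, baseline_tokens):
    return 1 - compact_tokens / baseline_tokens

# Japanese example from the article: the baseline emits 1.67x as many tokens.
print(round(token_saving(100, 167), 2))  # -> 0.4
```

So the 1.67× figure alone implies roughly a 40% saving on Japanese text, before considering languages where the compression gap is even larger.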

User reviews

The open-source release of Command R+ ignited enthusiasm among users on X, who said things like:

“GPT-4 level performance, running at home.”

“I wonder what the 3.15G memory usage means?”

“Based on my limited initial testing, this is one of the best models currently available… and it definitely has a style and feels good. It doesn’t feel like a filler model with ChatGPT-isms.”

HuggingChat launched

Currently, Command R+ is available on HuggingChat as the most powerful open-source model. Come and try it out!

HuggingFace co-founder Thomas Wolf once said that the situation in the LLM arena has changed dramatically recently:

Anthropic’s Claude 3 family became (for a time) the winner among closed-source models, and Cohere’s Command R+ is the new leader among open-source models.

In 2024, LLMs are developing rapidly on both the open-source and closed-source paths.


