AI Network Arena Results: Gemini, an Unexpected Victory
Second week results of AI Network Arena; battling LLM chatbots like ChatGPT & Gemini
The popularity of ChatGPT has been skyrocketing recently. According to the latest data from Similarweb, ChatGPT recorded a staggering 2.9 billion web visits in June 2024 alone. Many companies in Korea are quickly adopting ChatGPT-based LLMs (Large Language Models) in their services to keep up with this global trend.
But is ChatGPT really the best choice? Which AI performs best in Korea?
Surprisingly, an unexpected winner emerged from a 2-week experiment conducted by AI Network’s Arena. In the South Korean environment, Google’s Gemini demonstrated the best performance. Gemini ranked first among LLMs, receiving 1,354 win votes out of approximately 12,500 battles among 11 models (as of 2 PM, July 26, 2024).
Let’s take a closer look at how the experiment was conducted and what results were obtained.
Overview of AI Network Arena
We at AI Network have launched a platform called Arena to comprehensively evaluate the performance of various large language models (LLMs).
Arena operates as follows:
- Users freely post a question.
- Two anonymous LLMs provide answers.
- Users select the better of the two answers.
- Users are rewarded with $AIN tokens.
Users contribute to the evaluation of LLM performance on the AI Network Arena platform and receive $AIN tokens as a reward for their participation.
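The battle-and-vote flow above can be sketched in a few lines. This is a hypothetical illustration of how a battle record could be tallied into a win-vote leaderboard; the field names and data layout are illustrative assumptions, not AI Network's actual schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Battle:
    """One anonymous head-to-head comparison voted on by a user."""
    question: str
    model_a: str
    model_b: str
    winner: str  # "a" or "b", chosen by the voting user

def leaderboard(battles):
    """Count win votes per model, mirroring the Arena ranking by votes."""
    wins = Counter()
    for b in battles:
        wins[b.model_a if b.winner == "a" else b.model_b] += 1
    return wins.most_common()

battles = [
    Battle("Which cities must you visit in Italy?",
           "gemini-1.5-pro-latest", "claude-3-sonnet-20240229", "a"),
    Battle("What is IBC in Cosmos?",
           "gpt-4o-2024-05-13", "gemini-1.5-pro-latest", "b"),
]
print(leaderboard(battles))  # → [('gemini-1.5-pro-latest', 2)]
```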
The experiment was conducted from June 18 to July 4, 2024. Based on the human preference evaluation data collected for the LLMs, we observed several interesting results.
- Total number of battles: 12,500
- Rewards distributed: 31,589.32 $AIN
How was such a large-scale experiment possible?
In this process, we maximized the advantages of the Web3-based ecosystem.
Unique Advantages of AI Network Arena
1. Web3-based AI Ecosystem
We built a robust AI ecosystem based on the Web3 network. This is not just a technical choice but the core driver of innovation. The decentralized nature of Web3 means more participants, more diverse models and richer resources.
2. Freedom to Utilize Open Source Models
Our system provides an environment where various open source AI models can be freely experimented with. This is a significant difference from traditional evaluation platforms like LMSYS. Without being tied to specific companies or models, we can discover the most innovative and effective AI solutions.
3. Abundant Computing Resources
Thanks to the decentralized nature of the Web3 network, we can leverage almost unlimited computing resources. This is a key factor that enables large-scale experiments and complex model evaluations. Unlike LMSYS, we can test various scenarios without resource constraints.
Through this process, the excellence of Gemini became clear. Notably, Gemini demonstrated outstanding performance in Korean language processing. This result stood out even when compared to other state-of-the-art LLMs, including GPT-4, in the Korean market.
- AI Network Arena Leaderboard
According to the AI Network Chatbot Arena leaderboard, Google’s Gemini-1.5-pro-latest model ranked first with 1,354 votes. In the current state where GPT and Claude almost monopolize the Korean AI market, Gemini’s excellent performance can offer a new option for companies and users alike.
Even more surprising is that Google’s Gemini also ranked first in the Korean language section of LMSYS’s Chatbot Arena.
- LMSYS Leaderboard
What exactly about Gemini has captivated Koreans?
To uncover the secret of Gemini’s outstanding performance, we conducted an in-depth analysis by filtering data from all battles to only include those conducted in Korean.
As a result, we discovered the following remarkable key features.
Distinct Features of Gemini
Here are the findings from analyzing approximately 10,000 battles where questions were asked in Korean.
1. Use of Emojis 😊
- We posed the question “Which cities must you visit during a week-long trip in Italy?” to two anonymous LLMs.
- The image above shows the responses of the anonymous LLMs to the given question.
- Which LLM provided an emotionally rich answer to the user? And which response is from Gemini?
Answer: A is gemini-1.5-pro-latest, B is claude-3-sonnet-20240229.
Gemini appropriately uses emojis to provide emotionally engaging responses, which is an important factor that enhances the user experience beyond simple information delivery.
The following chart shows the proportion of responses using emojis among the total answers given by each model. It confirms that Gemini uses emojis more frequently compared to GPT-4o and other models, providing emotionally richer responses.
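The emoji-usage metric in the chart can be approximated as the share of a model's responses that contain at least one emoji. Below is a minimal sketch of that calculation; the Unicode ranges are a rough approximation of common emoji blocks, not the exact definition used in our analysis.

```python
import re

# Approximate emoji detection: common emoji and symbol Unicode blocks.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F000-\U0001F02F]"
)

def emoji_ratio(responses):
    """Fraction of responses that contain at least one emoji."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if EMOJI_RE.search(r))
    return hits / len(responses)

# Illustrative responses (not real model output):
gemini_like = ["Visit Rome! 🏛️", "Try the pasta 🍝", "Enjoy Venice"]
print(emoji_ratio(gemini_like))  # 2 of 3 responses use emojis
```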
2. Readability 📖
- We asked the anonymous LLMs, “What is Inter-Blockchain Communication (IBC) in Cosmos (ATOM)?”
- Which LLM’s response is more engaging and well-organized?
A: gpt-4o-2024-05-13
B: gemini-1.5-pro-latest
Gemini’s responses are well-structured, making them easier for users to understand. Additionally, it clearly conveys complex information.
The following chart shows the proportion of metrics used to enhance readability among the total responses provided by each model.
Calculation Method
- Header Ratio (header_ratio): The proportion of headers within the model’s responses
- Newline Ratio (newline_ratio): The proportion of newlines within the model’s responses
- Bold and Italics Ratio (bold_italics_ratio): The proportion of bold and italics text within the model’s responses
By calculating the usage ratios of headers, newlines, bold and italics for each model, which affect readability, we found that Gemini provided more structured responses compared to GPT-4o, thereby enhancing readability.
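The three readability metrics above can be computed directly from a markdown response. The sketch below shows one plausible implementation; the exact denominators (per line vs. per character) are assumptions for illustration.

```python
import re

def readability_metrics(text: str) -> dict:
    """Compute header, newline, and bold/italics ratios for a markdown response."""
    lines = text.splitlines() or [""]
    header_lines = sum(1 for ln in lines if ln.lstrip().startswith("#"))
    # Count **bold**, *italic*, and _italic_ spans.
    bold_italics = len(re.findall(r"\*\*[^*]+\*\*|\*[^*]+\*|_[^_]+_", text))
    return {
        "header_ratio": header_lines / len(lines),
        "newline_ratio": text.count("\n") / max(len(text), 1),
        "bold_italics_ratio": bold_italics / len(lines),
    }

sample = "# IBC in Cosmos\n\n**IBC** lets blockchains exchange messages.\n"
m = readability_metrics(sample)
print(m["header_ratio"])  # 1 header line out of 3 lines
```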
Additionally, to increase the reliability of these results, further validation processes were conducted. Using qualitative analysis methods with an affinity diagram, we re-evaluated the battles involving Gemini and analyzed the win rates by scenario in detail. Through this process, we discovered unique characteristics of Gemini.
Factors Contributing to Gemini’s Victory
To deeply analyze the approximately 2,000 battles in which Gemini participated, we used GPT-4o as an LLM judge. We provided GPT-4o with around 2,000 chat questions and answers and asked it to evaluate whether the response from model A or model B was better or if they were similar.
We also asked it to specify the reasons for its evaluation. Afterward, we organized the reasons for selecting Gemini using an affinity diagram to conduct a factor analysis.
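For readers curious what an LLM-judge setup looks like, here is a minimal sketch of a judge prompt. The wording is a hypothetical example, not the actual prompt used in our pipeline, and the API call to GPT-4o is omitted.

```python
# Hypothetical judge prompt; the actual wording used in our analysis differs.
JUDGE_TEMPLATE = """You are an impartial judge. Given a user question and
two answers, decide which answer is better.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Reply with exactly one of: A, B, or TIE, followed by a one-sentence reason."""

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the judge template with one battle's question and answers."""
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )

prompt = build_judge_prompt(
    "What is IBC?", "IBC is a protocol...", "IBC stands for..."
)
print(prompt.splitlines()[0])
```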
The analysis of the victory factors revealed the following key strengths of Gemini:
- Depth and Specificity of Information (672 cases): Provides detailed and in-depth information.
- Structured and Systematic Explanations (239 cases): Clearly conveys complex concepts.
- Provision of Practical Information and Examples (199 cases): Offers practical information and concrete examples.
- User-Friendly Explanations (31 cases): Explains complex content in an easily understandable manner.
Gemini’s Performance by Scenario
Additionally, to further evaluate Gemini’s performance, we filtered and analyzed approximately 200 battles between gemini-1.5-pro-latest and gpt-4o-2024-05-13 in detail.
Using an LLM judge for validation, it was found that gpt-4o-2024-05-13 had a win rate of approximately 54% against gemini-1.5-pro-latest.
However, when we conducted an in-depth analysis of gemini-1.5-pro-latest’s win rates by scenario in its battles against gpt-4o-2024-05-13, we found that Gemini showed outstanding results in the areas of creativity and communication.
The win rates for Gemini by scenario are as follows:
- Creativity: Approximately 70%
- Communication: Approximately 55%
- Writing: Approximately 50%
- Scenarios involving foreign languages were omitted due to the low number of battles.
These results indicate that Gemini excels in creative and communication-focused tasks, making it particularly valuable for companies and developers building chatbots where originality is crucial.
We are continuously conducting additional analyses. If you have any questions, please let us know in the comments!
Performance and Cost Efficiency Comparison between Gemini and GPT-4o
Gemini 1.5 Pro:
- Input Price: $3.50 per million tokens (prompt up to 128K tokens)
- Output Price: $10.50 per million tokens (prompt up to 128K tokens)
GPT-4o:
- Input Price: $5.00 per million tokens
- Output Price: $15.00 per million tokens
The key point is that Gemini is substantially cheaper as long as the prompt stays within 128K tokens.
Assuming 10 million tokens are processed, split evenly into 5 million input and 5 million output tokens, Gemini costs $70.00 while GPT-4o costs $100.00. Using Gemini saves $30.00 (30%) compared to GPT-4o.
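The arithmetic above can be reproduced with a small cost calculator. Prices are in USD per million tokens; note that the Gemini rates quoted apply to prompts up to 128K tokens.

```python
# Per-million-token prices from the comparison above (USD).
PRICES = {
    "gemini-1.5-pro": {"input": 3.50, "output": 10.50},  # prompts up to 128K tokens
    "gpt-4o": {"input": 5.00, "output": 15.00},
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for a given token volume."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6

five_m = 5_000_000
gemini = cost("gemini-1.5-pro", five_m, five_m)  # 17.50 + 52.50
gpt4o = cost("gpt-4o", five_m, five_m)           # 25.00 + 75.00
print(gemini, gpt4o, gpt4o - gemini)  # → 70.0 100.0 30.0
```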
In conclusion, Gemini offers significant cost savings when the prompt is up to 128K tokens. Moreover, its superior performance in Korean language processing can be a crucial advantage in many use cases.
While GPT-4o excels at coding and knowledge-intensive tasks, if you are developing a persona chatbot or a chatbot that requires communication and creative thinking, we strongly recommend Gemini over GPT-4o.
If you want to compare which chatbot is more suitable, visit AI Network Arena!
- Compare various AI models for free
- Contribute to AI development with your evaluations
- Earn $AIN tokens as rewards for participation
- Experience an innovative AI ecosystem based on Web3
Join AI Network Arena now and help shape the future of AI!
We look forward to your participation! 🙌
AI Network is a decentralized AI development ecosystem based on blockchain technology. Within its ecosystem, resource providers can earn $AIN tokens for their GPUs, developers can gain access to GPUs for open source AI programs, and creators can transform their AI creations into AINFTs. The ultimate goal of AI Network is to bring AI to Web3, where everyone can easily develop and utilize artificial intelligence.
If you want to know more about us,
- Homepage: https://www.ainetwork.ai/
- AI Network DAO Discord: https://discord.com/invite/aindao/
- AI Network YouTube: https://www.youtube.com/@ainetwork_ai/
- AI Network Telegram: https://t.me/ainetwork_en
- AI Network Twitter: https://twitter.com/ainetwork_ai