If you do not use the Polish language, I still encourage you to take a look at the results, treating Polish as a representative of national languages.

Both my individual work with language models and the many training sessions I have delivered have convinced me that, in working with AI, what matters most is not so much the language of the prompt as the language in which we receive the response. They have also convinced me that the knowledge LLMs have about Poland will not necessarily match their knowledge of another country, such as the United States or Nauru.

For this reason, in October 2024 and then again the following year, I conducted pilot tests of language models, the results of which I presented at a conference. All of this turned out to be a form of preparation for more comprehensive research, which I have now completed and whose results I would like to share with you. Together with my team, we examined how language models handle understanding instructions in Polish, as well as the quality of their responses about our country and in our language.

In a separate article accompanying this report, I write about the importance of language. It affects the number of tokens and therefore how much information a model can process. It also plays a significant role in tasks where cultural context is important. Differences in the quality of the language itself (grammar, vocabulary and so on) are also noticeable.

Contents

  1. Main assumptions of the study/LLM ranking 2026
  2. Overall ranking – best language models
  3. Tests based on knowledge about Poland and Poles
  4. Tests related to professional work
  5. Best AI model – a different perspective
  6. Run-off – advanced model showdown
  7. AI model ranking and my conclusions
  8. Conclusions about the study itself
  9. Methodology
  10. Acknowledgements

The report is lengthy, so if you have limited time, you can read the summary.

Main assumptions of the study/LLM ranking 2026

What was very important to me was to assess how language models perform when used through their default web interfaces, because these – rather than APIs – are what we use on a daily basis. It can therefore be said that the testing of language models is indirect, and that in practice we are evaluating the effectiveness of the tools themselves. Why does this matter so much? When using a chatbot, we are constrained by a specific model configuration, which we can only partially modify. In addition, some providers limit the capabilities of their models – for example, OpenAI ties the size of the context window to the selected subscription plan.


The study was an incredibly labour-intensive process. To the extent that my own financial resources and time allowed, I made every effort to ensure it was as rigorous as possible. / Image: Depositphotos

Where possible, I opted for paid plans in the $20-30 range. I pay particular attention to chatbots that provide access to AI for free – the question is whether in such cases we are effectively paying with our data (for example, conversation content used for further model training); after all, even servers do not run for free. At the same time, paying a subscription does not automatically guarantee that data will not be used for further model training. Read the terms and conditions (or upload them to an LLM and analyse them that way) and apply a principle of limited trust.

There is one exception. Due to the trend of data sovereignty (primarily concerning the ability to process data within a specific country that exercises control over it), it was important to me to test the EuroLLM model. It may become another option for companies and institutions in Europe, for whom being European may be an important selection criterion. At present, it does not have its own default interface, but it is available via the Hugging Face interface – that is where I generated responses to the test prompts. The others were tested using their default web environments.

Another major challenge was selecting the chatbots and the models within them – ultimately, there are 12. They are not exactly from the same tier, as I describe in more detail in the study methodology. However, I tried to choose them in a way that best reflects the current state of AI development and the characteristic features of different groups. I believe this is the best way to interpret my findings – primarily through the lens of groups of solutions, and to a lesser extent individual models or tools.

The language models include closed American solutions, a relatively open model from Meta, open models developed by Chinese teams and European models that can be downloaded and deployed on one’s own infrastructure.

LLM ranking 2026 based on study - map of LLMs

Together with my team, we prepared 20 test scenarios: 10 related to professional matters and 10 to other aspects that could be addressed to LLMs. The range of topics is therefore relatively broad. In total, 240 responses were analysed.

What clearly sets my study apart is the fact that the evaluation involved people and that we relied on prompts requiring open-ended responses. The prompts covered the following categories:

  1. Polish culture and customs
  2. Polish language – linguistic accuracy
  3. Facts about Poland
  4. Deduction based on knowledge of Poland
  5. Humour
  6. Polish companies
  7. Email communication
  8. Company and team management
  9. Law and taxes
  10. Marketing

Some responses could be verified against a fairly clearly defined key, while others required a highly subjective assessment. Therefore, the results are best viewed in the context of your own use cases and in comparison with benchmarks. At the same time, it is worth remembering that benchmarks are based on synthetic tests and do not accurately reflect how we use LLMs in everyday practice.

There were several objectives behind the study. I wanted to attempt to determine where we currently stand in terms of the development of language models. The study also serves as an effort to answer whether paying for a subscription results in noticeably higher quality and how European models, or open models more broadly, compare with strictly commercial ones.

I put forward the idea that humour may be the ultimate test of whether language models are capable of generating content that matches human ability. Based on my study, the results achieved by language models are weak enough to conclude that, at present, LLMs simply do not handle humour well. It is difficult for me to imagine such a test being included in a benchmark.

The order of the models in the charts is not accidental. It reflects the results obtained, the nature of the models and the regions they originate from. I have prepared a graphic that explains how to read the charts:

LLM ranking 2026 based on study

More information can be found in the methodology at the bottom of the page.

Overall ranking – the best language models

Many of you are primarily interested in the final ranking, so I will present it shortly.

However, I encourage you to read the entire report and familiarise yourselves with the tasks, as the conclusions regarding the answers provided are at times extremely interesting. I have also presented the final conclusions and how I plan to use the study results myself.

The race for supremacy in AI

Let us choose language models primarily based on the tasks to be completed and the constraints, just as we choose the best vehicle to get from point A to point B.

The overall results – I will admit – came as a surprise to me. I expected certain outcomes, such as a strong performance from the Chinese Qwen 3.5 model, but not quite such high scores. Llama also impressed with its quality. However, perhaps the most important conclusion is that the gap between some open models and their commercial counterparts is very small.

In the chart a few paragraphs below, you will see the average of all 20 scores. The winner of the entire study is the Gemini 3.1 Pro model. I would emphasise once again, however, that this model stands out somewhat due to its “complexity”, so one might say that Google was “playing the code card”. I opted for the Pro version because only this advanced variant was available in the latest 3.1 release (more on this in the methodology). The result is that Gemini’s score may be slightly inflated compared with the other models. Whether this is the case, and to what extent, will be assessed at the end, where I will include the results of my additional tests of reasoning models.

However, it is the second and third places that are the most interesting. They are held by models that can be downloaded and installed on your own infrastructure – that is, on large servers with substantial amounts of RAM and ideally specialised processors for handling language models. Here we have the aforementioned Chinese Qwen, developed by the Alibaba group, and Llama 4, which operates within the Meta.ai chatbot.

The subsequent results are already fairly close and belong to the GPT-5.2 model used within ChatGPT, Grok 4.2 – currently in beta – and Claude Sonnet 4.6. A similar result was also recorded by Microsoft 365 Copilot, which in practice runs on the GPT-5.2 model from OpenAI.

The ranking is brought to a close by open models: DeepSeek, followed by European models led by Mistral. As mentioned earlier, I also wanted to include the EuroLLM model in the study, but it is clearly still at an early stage and, overall, recorded the weakest results.

Overall results based on the averages across all tasks:

LLM Research 2026 – overall results

From the overall set of scores, I separated those relating to work-related prompts. These accounted for half of the tests (that is, 10). This segmented approach to evaluation does not produce major differences in the ranking. It is worth noting that Claude Sonnet 4.6 moves into the top three and that, overall, the results in the work category are slightly higher than the overall averages. Why is that? Among other reasons, because the models are… not funny – especially when responding in Polish – and two tasks were devoted to this (comedians, do not feel threatened, at least for now ;)). There are several other reasons as well, which you will read about later.

Average results from the 10 tasks related to professional work:

LLM Research 2026 – work-related results

When I looked at these results again a few days later, a thought occurred to me that may prove controversial. In leading AI laboratories (as AI creators are commonly referred to – I am not sure why), models are being developed that compete with one another in advanced benchmarks. They are measured against each other in terms of which solves Olympiad-level problems best, which demonstrates the greatest logical ability and which truly reaches doctoral-level performance. That is excellent – models should evolve to deliver as much value as possible. But is that really what people need most in their everyday lives? Perhaps in “ordinary life” we need a model that can simply write a difficult email well and produce a draft presentation so that we do not have to start from a blank page. Perhaps some companies in this race already place their emphasis differently. For the commercial providers we tend to think of as “leading” (or at least I do), these results may be an opportunity to reflect on whether their own emphasis should shift slightly.

And one more thought. Because it is partly covered by trade secrecy, we do not fully know which commercial models operate in a reduced form and under which subscription tiers (in the context of the study: to what extent capability-limiting configurations apply to plans costing $20-30). Nor do we know whether, and which, optimisation techniques Meta and Alibaba apply in their official chatbots compared with the default configuration. I do not see anything particularly wrong with optimisation, but it may affect how performance is assessed: what we expect, or even observe when using an API, may differ from what the chatbot actually delivers.

As for European models – they performed the weakest, there is no point pretending otherwise. However, I want to emphasise strongly that this does not mean we lack capable specialists – quite the opposite. It means we need to invest significantly more in this sector and create stable legal frameworks. I cannot embellish reality and I approached the results as honestly as I could. But I can keep my fingers crossed!

Before moving on to a detailed discussion of the results, I would like to present my averaged scores here. As you can see, the order is slightly different.

Overall results based solely on my own assessments:

LLM Ranking 2026 – averaged scores by Marek Jeleśniański

My averaged scores for responses to work-related prompts:

LLM Ranking 2026 – averaged scores by Marek Jeleśniański - work-related prompts

Interesting fact: online you can come across the results of the PLCC test, which evaluates language models in a benchmark format using prompts in Polish. In this approach, we typically receive short answers that are assessed in a binary manner, and therefore do not necessarily reflect the “average” way of working with an LLM. Nevertheless, certain similarities can be seen with the results obtained in my study. At the top, we have models from Google – led by Gemini 3.1 Pro. They are followed by models from OpenAI (with GPT-5 Pro in the lead) and Grok from xAI. What I would particularly like to emphasise, however, is that Qwen and Llama appear in relatively distant positions in the PLCC ranking. I would therefore venture the thesis that some models are simply universally strong – regardless of whether we use them in everyday work or test them with short answers, often asking highly specific questions or assigning logical or mathematical tasks. Others, by contrast, come into their own in “everyday” tasks, simply generating sensible responses in natural language.

The model ranking according to the PLCC benchmark, as of 9 March 2026, is as follows:

Results of the PLCC benchmark assessing models in terms of the Polish language

Source: huggingface.co/spaces/sdadas/plcc

Let us move on to discussing the specific tasks from our tests.

Tests related to knowledge of Poland and Poles

Below is a brief discussion of each of the 20 tasks. Each of the 10 categories had two tasks, hence the numbering 1a, 1b, 2a, 2b and so on. The first five categories concerned general knowledge, language and culture, and the next five related to professional matters.
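The numbering and the overall size of the study can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names are my own, and the figures (12 models, 10 categories, two tasks per category) are the ones stated earlier in the report.

```python
# Figures stated in the report: 12 models, 10 categories, two tasks per category.
NUM_MODELS = 12
NUM_CATEGORIES = 10
TASK_LETTERS = ("a", "b")

# Task identifiers in the report's numbering: 1a, 1b, 2a, ..., 10b.
task_ids = [
    f"{category}{letter}"
    for category in range(1, NUM_CATEGORIES + 1)
    for letter in TASK_LETTERS
]

# Categories 1-5 are general; categories 6-10 relate to professional work.
work_task_ids = [task for task in task_ids if int(task[:-1]) >= 6]

total_responses = NUM_MODELS * len(task_ids)

print(len(task_ids), len(work_task_ids), total_responses)  # 20 10 240
```

The result matches the 240 responses analysed in the study, half of them from work-related tasks.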

Polish culture

The first task involved quoting the first 12 lines of Pan Tadeusz, that is, a passage from the invocation. I was already giving this task to the models during the pilot tests because I noticed, firstly, that reproducing a very specific passage of text without errors is something of a challenge for models and, secondly, that some models hide behind copyright or licensing claims, even though these obviously no longer apply to Adam Mickiewicz’s work.

Here, broadly speaking, the models fall into two groups: those that managed the task and then, after a very long gap, those that failed completely.

Task 1a – quotation from Pan Tadeusz:

The next prompt concerned Polish Christmas customs. The responses were assessed as fairly decent and the results are relatively even. What the models failed to do was draw attention to regional differences – none of them independently mentioned that Christmas is celebrated slightly differently in different parts of Poland. As a result, none of the models came close to the maximum score.

Task 1b – Christmas customs:

Polish language – correctness of language use

I must admit that task 2a was not easy for those taking part in the study to assess and, I believe, that is why the spread of results is relatively wide. It concerned correcting a text in Polish which contained many errors but was also difficult. It was necessary, for instance, to understand that Jan and Janusz are completely different names in Polish. And the results here are quite interesting: the highest score was achieved by Meta’s Llama, with the commercial models only coming behind it. EuroLLM achieved a fairly respectable result – better than the Polish models Bielik and PLLuM.

Task 2a – text correction:

In task 2b, the prompt contained three pairs of words. The language model’s task was to identify the correct word from each pair and justify its choice. The models had little difficulty with the first two pairs. The challenge turned out to be the third pair, which set the word “pomoże” (“will help”, from “pomóc”) against “pomorze”. The latter was a deliberate error, because the name of the geographical region, Pomorze (Pomerania), is written with a capital letter.

Qwen 3.5 Plus - LLM test

Qwen 3.5 uses a tokenizer that treats words beginning with a capital letter and those written in lower case separately. For example, the words “Apple” and “apple” are completely different units (tokens) for the model. And yet Qwen still failed to cope with the third example.

Put simply, some of the models were unable to detect that the word “pomorze” was written with a lower-case letter and in the third pair identified both words as correct. This does not, however, stem from limitations in reading characters – modern AI models do distinguish between upper and lower case. I believe the problem lies in the specifics of how they operate: artificial intelligence sometimes favours the general meaning of a word over its strict orthographic analysis. When recognising the meaning of a geographical name, a model’s mechanisms often “turn a blind eye” to spelling errors, which leads to the mistaken assumption that the form is correct.
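The difference between a strict orthographic check and a meaning-first match can be illustrated with a toy example. This is only an analogy, assuming a hypothetical two-word lexicon of my own; it does not describe how any of the tested models works internally.

```python
# Hypothetical mini-lexicon for illustration only; not taken from any model.
LEXICON = {
    "pomoże": "verb form of 'pomóc' (will help)",
    "Pomorze": "geographical region (proper noun, capitalised)",
}


def strict_lookup(word):
    """Case-sensitive check: the spelling must match the lexicon exactly."""
    return LEXICON.get(word)


def lenient_lookup(word):
    """Case-insensitive check: capitalisation is ignored, only the word matters."""
    folded = {key.casefold(): value for key, value in LEXICON.items()}
    return folded.get(word.casefold())


for candidate in ("pomoże", "pomorze"):
    print(
        candidate,
        "| strict:", strict_lookup(candidate) is not None,
        "| lenient:", lenient_lookup(candidate) is not None,
    )
```

The strict check rejects the lower-case “pomorze”, while the lenient one accepts it as the region, which mirrors the mistake some models made in the third pair.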

Interestingly, EuroLLM achieved by far the weakest result here, even though earlier, in the correction-related test, it had been around the middle of the pack.

Task 2b – identifying the correct words:

Facts about Poland

The next task required the models to present facts about Poland – it was important to me that these were as up to date as possible. In addition, there were questions about the funniest and the wisest Pole.

In the participants’ assessment, Qwen 3.5 Plus stands out clearly. One could therefore say that the Chinese model identifies facts about Poland most effectively. It is followed mainly by commercial models. Among the Polish models, Bielik handled this task noticeably better than PLLuM. DeepSeek recorded a very weak result.

Task 3a – current knowledge about Poland:

As part of question 3b, it was important for me to check how language models would refer to the concentration camps located in occupied Poland. It turns out that the historical truth was conveyed quite well. The question about naming was accompanied by three additional ones related to the Holocaust.

Task 3b – historical accuracy:

You may have been surprised by Bielik’s result, as it refused to answer. Interestingly, it did so in English, citing the topic as controversial. I repeated the tests while writing the report and relatively often received refusals in English, although at times Bielik did provide answers. So it is not the case that the topic of the Holocaust is consistently “off limits” – rather, there is a degree of instability in the model’s behaviour.

Bielik 3.0 refused to answer questions about the Holocaust. In additional tests that I carried out, it sometimes responded and did so correctly, but messages like the one above appeared roughly as often as concrete answers. This should be easy to fix.

Deduction based on knowledge about Poland

The next two tasks were related to deduction based on knowledge of facts about Poland.

As a railway enthusiast, I was keen to include a rail-related topic and see how the different models would handle it. In task 4a, based on a description, they had to indicate where the train was coming from and where it was going, using clues contained in the prompt. Most models handled this task very well: they correctly identified the cities and provided sound justifications for their choices. Mistral and EuroLLM performed the worst.

Task 4a – where the train was coming from and going to:

The second deduction task (4b) referred to recent Polish history. Based on a description of an entrepreneur, the models had to answer several questions: when the entrepreneur started their business, who was president at the time and which key event in Polish history might be connected to their story.

The results were very strong. Virtually all models provided sensible answers, with only minor differences between them. Grok performed clearly the best – not only because it achieved an almost perfect average of 9.9 out of 10, but also because there was a relatively high level of agreement among those evaluating the responses.
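An average and the level of agreement between evaluators are two separate things, which a short sketch can show. Everything here is an assumption for illustration: the scores are invented (they are not the study’s data), the number of evaluators is taken to be 11, and the sample standard deviation is just one possible way of expressing agreement.

```python
from statistics import mean, stdev

# Hypothetical scores from 11 evaluators on a 0-10 scale (NOT the study's data).
consistent_scores = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9]
divided_scores = [10, 10, 10, 10, 10, 10, 10, 10, 10, 4, 4]


def summarise(scores):
    """Return the mean and the sample standard deviation of a list of scores."""
    return mean(scores), stdev(scores)


consistent_mean, consistent_spread = summarise(consistent_scores)
divided_mean, divided_spread = summarise(divided_scores)

# A lower spread means the evaluators agreed more closely with one another.
print(f"consistent: mean={consistent_mean:.1f}, spread={consistent_spread:.2f}")
print(f"divided:    mean={divided_mean:.1f}, spread={divided_spread:.2f}")
```

A high mean combined with a low spread is the pattern described above for Grok: a near-perfect score that the evaluators also largely agreed on.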

Task 4b – the entrepreneur riddle:

Humour

Now things get interesting! We move on to the category that was by far the most difficult for language models and at the same time the one where subjectivity in evaluation is most apparent.

As part of task 5a, the language models were asked to come up with a joke that would make every Pole laugh. I will be honest – none of the jokes particularly amused me. Some were better, some worse, but none matched even an average comedy sketch. In my assessment, I paid attention to whether the jokes included harmful stereotypes about Poland and Poles. Quite a few did, which shows what kind of knowledge about Poland circulates online.

The average scores are very poor. Gemini performs best, followed by Grok and then Claude. All the others scored below 5.0 on average.

Task 5a – a joke that will make every Pole laugh:

Below is an example of a joke where the disparity in ratings was enormous. Well, the joke is… judge for yourselves:

An example of a joke generated by Llama. You will notice that it contains stereotypes that are hardly a source of pride.

The final prompt before the strictly business-related topics involved preparing a radio script in which the language models were to describe, in a humorous way, what the morning commute to work in Poland looks like. The results here are also rather disappointing. Once again, Gemini handled Polish humour the best.

Task 5b – humorous radio script:

I would venture the claim that humour will be one of the ultimate tests of whether language models are capable of matching humans.

Tests related to professional work

We now move on to tasks relating to various professional responsibilities.

Polish companies

Task 6a involved listing five Polish IT companies along with their founding dates and the names of their founders. They were to be presented in chronological order. Here, most models performed quite well – only EuroLLM returned errors instead of answers. Meanwhile, the Polish model PLLuM achieved a result comparable to commercial models.

Task 6a – Polish IT companies

Interestingly, the companies most frequently mentioned in the models’ responses were Comarch SA, LiveChat (Text), Asseco and CD Projekt. This suggests that these companies are semantically most strongly associated with the idea of a Polish IT company. An example response provided by PLLuM 8x7B-2025:

The response provided by the PLLuM model includes two companies that I did not encounter in other answers: G2A and Silentium. The data concerning the latter are not correct; it most likely refers to a different brand: SilentiumPC. This response was quite difficult to verify, so someone could easily believe it.

In the next task, the language models were asked to identify the criteria that the most innovative Polish company should meet. They also had to propose three specific companies meeting these criteria and justify their choices. An additional requirement was to clearly label the nature of the statements in the justification: whether they were facts or opinions.

Here, the language models achieved fairly even results, but there is one exception. While I praised the PLLuM model earlier, this time it clearly failed.

Task 6b – the most innovative companies

Email communication

In tasks 7a and 7b, the language models were asked to write emails.

The first email concerned training in artificial intelligence and required tactfully conveying the information about a potential risk of losing a bonus if the training was not completed.

The second email – you have probably received something similar more than once – concerned photovoltaic panels and was to be addressed to a furniture manufacturing company.

The results are relatively close. The highest-rated models for email writing were Gemini, Grok and Claude. Note that in the task involving the email about photovoltaic panels, DeepSeek achieved a good result, while Qwen performed worse – it sits roughly in the middle of the pack.

Here are the results:

Task 7a – email regarding AI training

Task 7b – sales email to a furniture company:

Company and team management

Prompt 8a referred to advice from an experienced entrepreneur: what they might suggest to someone who is only planning to start a business. In the evaluation process, it was important that this was a single and as concrete a piece of advice as possible and that it genuinely related to the stage before setting up a company.

The standouts are Claude Sonnet 4.6, Qwen 3.5 Plus and Microsoft 365 Copilot (appearing in the top three for the first time). For the evaluators, their responses proved the most inspiring, while at the same time genuinely constituting a single piece of advice – in line with the instruction.

Task 8a – advice for a prospective entrepreneur:

In the next task, the language models were also asked to give advice, but in the context of managerial challenges: how we might respond in a situation where an employee is repeatedly late in completing tasks, which negatively affects the team’s work. The best and at the same time closely matched scores were achieved by Qwen, Llama, Gemini, Claude Sonnet 4.6 and Mistral. The remaining models delivered only average results.

Task 8b – advice for a team manager:

Law and taxation

What I have observed month by month is that language models are becoming increasingly capable of handling changes in the law and providing answers grounded in current regulations. This is partly due to their access to the internet. Importantly, this applies to relatively straightforward matters where we do not necessarily require precise references to specific legal provisions. And that was exactly the nature of the next task. The language models were asked to indicate which VAT rates should be applied in two situations described in the prompt.

Most models handled this task well. Partially incorrect answers were given by DeepSeek and Mistral. EuroLLM performed by far the worst, producing something akin to entirely new tax reliefs. It sounds wonderful – the only problem is that it is a hallucination.

Task 9a – question about VAT rates:

This task was also part of the pilot tests I mentioned earlier. Back in 2024, only 3 out of the 9 models tested at the time provided correct answers: o1-preview from OpenAI, Copilot for Microsoft 365 (its name at the time) and Bielik 2.3. Overall, Bielik surprised me very positively in those tests and I sent my congratulations to the SpeakLeash foundation.

A fragment of the response returned by ChatGPT, one of the two responses I liked the most. Gemini received the same rating from me.

Task 9b also related to law – this time consumer law. Interestingly, the results are more even, which may be due in part to the abundance of terms and conditions and forum discussions that have found their way into the models’ training data. After all, these issues concern all of us to some extent.

Gemini 3.1 Pro and ChatGPT (GPT-5.2) take the lead. Grok, Claude, Qwen and Microsoft 365 Copilot also achieved respectable and similar results. Once again, EuroLLM performed the worst – it can probably be said that, for now, this model is not a good companion for legal and tax matters.

Task 9b – opinion on consumer law:

Marketing

And finally, it is time for marketing tasks, where by their very nature the level of subjectivity in evaluation was relatively high.

In the first task, the language models were asked to propose the content of a presentation about Kraków bagels (obwarzanki). Firstly, the models had to know what obwarzanki are and that they are not pretzels – this was handled well. The presentation also had to include a concept for an advertising campaign, suggest sales locations, outline a budget and come up with a slogan. Qwen performed best in this task, but Mistral, Gemini, Grok, GPT-5.2, Llama and… EuroLLM also achieved fairly good and balanced results.

I would like to draw your attention to the wide spread of scores in the case of Claude Sonnet 4.6, as this model did not return the content of a presentation – it did not follow the prompt and instead delivered a ready-made presentation. For some evaluators this was probably an advantage, as it showed initiative; for others, quite the opposite, as it did not comply with the prompt’s requirements. In my view, this presentation was not suitable for direct use – I would have preferred to receive the content and build on it in further work.

Task 10a – strategy for promoting and selling obwarzanki:

And finally, the last task. It involved a promotional advert for business travel on EIP trains operated by PKP Intercity. I must admit that I quite liked some of the models’ proposals, although the creativity of LLMs tends to be derivative, so a true wow effect is hard to find. The responses that received the greatest overall approval from evaluators were those from Claude, Qwen, Gemini and Grok.

Task 10b – advertising spot:

In the marketing tasks, I was somewhat disappointed by ChatGPT’s results, as I would have expected it to be the favourite in this final category.

Claude’s proposal was rated the best overall and by me… almost the worst. It shows how subjective our evaluation of LLM responses can be.

The best AI model – a different perspective

I would be very cautious about drawing far-reaching conclusions from these tasks, such as assuming that since Qwen is free and achieved high scores, all subscriptions to commercial tools should be cancelled – because why pay. I should also note that I encountered some issues when trying to launch Qwen’s official chatbot, which I describe in the methodology. You can of course use a version hosted by another provider (an unofficial chatbot based on Qwen), but the question is whether and how your data will be used, what additional features are available and so on.

At the same time, it is clear as day that commercial models are not the best in every task and that there is no single superhero among them that will perform absolutely best in all cases. That is why, in my view, it is worth focusing on those tasks that are particularly relevant to you, while keeping the overall average in mind. In a professional context, this approach to the results will say much more about how individual models handle the Polish language within a given scope of responsibilities, how well they know facts about Poland, how they understand our socio-business context and whether they are capable of satisfying us with their creativity – because, unfortunately, humour is not their strong suit.

In this way, you can select models for testing and choose the one that suits you not only in terms of quality, but also the entire set of accompanying tools.

Below, I have prepared additional aggregated summaries – based on individual tasks, divided into the 10 tested areas of model application. I assigned chatbots 0, 1 or 2 points. An average score from evaluators above 8.5 for a given task means the model handled it well (two points), while 7 or more means it is fairly good (one point). Let us see how many points the models accumulate when we approach the ranking in this way.
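The scoring rule described above can be written down as a small function. This is a minimal sketch of that rule as I read it (above 8.5 gives two points, 7 or more gives one point, anything lower gives none); the function name and the example averages are hypothetical and serve only as an illustration.

```python
def points_for_average(average):
    """Map an evaluators' average score (0-10 scale) to 0, 1 or 2 points."""
    if average > 8.5:
        return 2  # the model handled the task well
    if average >= 7:
        return 1  # the result is fairly good
    return 0


# Hypothetical averages for a single chatbot, keyed by task id (illustrative only).
example_averages = {"4b": 9.9, "7a": 7.6, "5a": 4.1}

example_points = {task: points_for_average(avg) for task, avg in example_averages.items()}
example_total = sum(example_points.values())

print(example_points, example_total)
```

Note that under this reading an average of exactly 8.5 earns one point, not two, because the threshold for two points is “above 8.5”.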

Also take a look at the overall average for each task category – it clearly shows what language models are generally good at and where they fall short. The tasks in the area of law and taxation that I prepared for the models were probably not challenging enough 😉

Table – classification of models by points awarded on the basis of average scores (a slightly different perspective on the ranking):

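| Area | Task | Points (+), all models combined | Average (area) |
|---|---|---|---|
| Culture | 1a | ++++++++++++ | 6.6 |
| | 1b | ++++++++++ | |
| Language | 2a | + | 6.6 |
| | 2b | ++++++++++++++ | |
| Facts | 3a | ++++++ | 7.1 |
| | 3b | +++++++++++++++ | |
| Deduction | 4a | ++++++++++++++++++ | 8.4 |
| | 4b | +++++++++++++++++++ | |
| Humour | 5a | | 4.4 |
| | 5b | | |
| Companies | 6a | ++++++++++++++++ | 7.4 |
| | 6b | +++++++++ | |
| E-mail | 7a | ++++++++ | 7.3 |
| | 7b | ++++++++++ | |
| Management | 8a | ++++ | 6.8 |
| | 8b | +++++++ | |
| Law | 9a | ++++++++++++++++++ | 8.2 |
| | 9b | +++++++++++++++++ | |
| Marketing | 10a | ++++++++ | 7.1 |
| | 10b | ++++++++ | |

| | Gemini | ChatGPT | Grok | Claude | Copilot | Llama | Qwen | DeepSeek | Mistral | Bielik | PLLuM | EuroLLM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sum | 23 🥇 | 21 🥉 | 18 | 22 🥈 | 20 | 22 🥈 | 23 🥇 | 13 | 11 | 11 | 9 | 6 |
| Sum – only work-related | 12 🥈 | 10 | 10 | 12 🥈 | 13 🥇 | 11 🥉 | 12 🥈 | 5 | 6 | 6 | 5 | 3 |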

Play-off – a clash of advanced models

To see how more advanced reasoning models would perform compared with the advanced version of Gemini 3.1 Pro, I conducted a play-off.

Independently (without the participation of the other 10 contributors), I evaluated advanced models on the ChatGPT, Claude and Grok platforms. As you will see in the charts below, only in the case of Claude do I observe a significant difference in scores – Opus 4.6 has a clear quality advantage over Sonnet 4.6. At the same time, Gemini 3.1 Pro’s first position still appears unthreatened. If my assessments can be trusted, the decision to test Gemini 3.1 Pro, which is both the latest and the most advanced variant, did not significantly affect the final ranking (as I admit and discuss in more detail in the methodology, this decision involved a certain degree of risk).

In a lighter shade, I have added the scores from all evaluators for Gemini 3.1 Pro and for the less advanced or automatically selected models. However, for methodological accuracy, I suggest focusing on the dark blue bars.

Ranking of advanced models including all tests:

[Chart: Ranking of advanced AI models – overall]

Ranking of advanced models including only work-related tasks:

[Chart: Ranking of advanced AI models – work and professional scenarios]

* For comparison, I have included the average score from the main phase of the study, based on the evaluations of all 11 participants. The play-off was conducted solely with my involvement, so I recommend comparing the play-off results with my earlier assessments (the dark blue bars).

I would also point out an interesting detail – I managed to include the GPT-5.4 Thinking model, which was released just a few days ago (version 5.3 is available only in the Instant variant). As you can see, version 5.4 performs slightly better than 5.2, which is pleasing for two reasons: first, progress is almost always a positive development and second, I have not hidden the fact that I like the style of responses produced by OpenAI models, which is reflected in the difference between my ratings and those of the wider group.

AI model ranking and my conclusions

Allow me to share my thoughts on how I intend to apply the results to my work.

Firstly, I will certainly not look at model results in isolation. When working with a given tool, we are primarily interested in the answers – but that should not be the only deciding factor. The entire ecosystem of features and integrations that comes with a language model is also important. This naturally applies mainly to commercial models.

Here, thanks to its pole position, ChatGPT still stands out in my view: it offers canvas (as many other tools do) and project-based functionality that is separate from specialised models (GPTs). It gives us extensive options for personalising how it works and, additionally, Agent Mode. Gemini and Copilot, in turn, have the advantage that their creators are also responsible for office productivity suites.

I think that, both due to habit and the fact that I simply like the way ChatGPT responds, it will remain my first choice. Claude is and will likely continue to be used by my company as the primary model for writing code.

However, I expect to delegate even more tasks to Gemini, and I already use it quite extensively. I have been observing the development of Google’s models for a long time – Gemini 2.5 was in fact the first version that began to satisfy me in terms of responses. With each release, the quality of Gemini’s performance has improved, until it has finally reached the top (in the 2024 tests it was outside the top three and in 2025 it was already in third place). If Google maintains this pace, it has a chance to strengthen Gemini’s position as the best commercial model and thereby win over more users who do not want or cannot afford to rely on open solutions.

Ranking of the best language models – choosing an AI model

Choosing the best solution can be compared to choosing a house – for some, a palace will be a dream come true and for others it will be impractical.

However, what probably stands out most in the results is the strong position of Qwen and Llama. These are models that – as I mentioned – can be installed on your own infrastructure. They are therefore used by companies and institutions that require independent servers. I do not rule out that in the company I run we will use these models in certain products, particularly where infrastructural independence is especially important.

Until now, I have advised clients who wanted to host a model “in-house” to consider Llama and the Polish Bielik. At this point, Qwen will certainly be added to that list.

I did, however, encounter an issue with Qwen: it seems to me that the web tool includes quite a number of trackers, as I was unable to run it in the Vivaldi browser, where I have fairly strict privacy settings. I had to use a different browser. In my view, this is something worth paying attention to and something that works slightly against Qwen. Generally speaking, Chinese solutions do not yet inspire a high level of trust.

Conclusions about the study itself

When I was processing the results – and believe me, it was very labour-intensive and deadly dull – I started to wonder what we could do better in future studies, if only to make the process less demanding. There is always room for improvement.

For example, I am considering whether we should create a dedicated application for conducting this type of research. This would make it possible, for instance, to evaluate three “rotating” responses from a given model instead of just one. Perhaps this would provide better coverage of the model’s actual capabilities. I involved 10 people in the evaluation, plus myself – 11 in total. That is quite a lot, considering we were analysing 240 responses and it took each of us at least several hours. With a dedicated application, this process could also be streamlined. The group was very diverse and I would like to preserve that.

I would not want to move towards the approach used by Arena.ai. There, users are given two anonymised responses and choose the better one. In some of the scenarios we used, the responses were very evenly matched, so the choice would involve a degree of randomness. In other words, we need to retain the ability to assign scores. Nor are we likely to move towards purely benchmark-style testing, especially as the PLCC assumptions are sound.

LLM ranking - many good choices

The conclusion of the language model study may be that there is plenty to choose from and many of those choices are good ones! I encourage you to run your own tests and share your thoughts on my methodology.

If you have any suggestions on how we could improve this type of research, I am very open to your ideas.

I also encourage you to read the section devoted to the study methodology and the accompanying texts related to the results (available on the landing page: LLM Study 2026). Some of them were written before I even had partial analyses, so they are somewhat prophetic 😉

I also invite you to subscribe to the newsletter, so that from time to time you will receive unique and interesting content about AI in your inbox.

Methodology of the study

Objective of the study and LLM ranking

My main motivation was to demonstrate not how the models function in isolation, but how they perform in the specific configuration provided by their creators on official online platforms. This is important for at least two reasons:

  • it is the primary way of using LLM capabilities;
  • the model configuration within a tool may differ (and often does) from what the same company offers via API; for example, the full context window size may only be available when using the API and paying for processed and generated tokens.

It can therefore be argued that this study is effectively a test of the potential of official AI tools created by developers of large language models in generating responses.

What mattered most to me was showing how individual language models handle understanding the Polish language, but even more so how they generate content in Polish. Based on the conducted experiment, these differences are very clear.

I also wanted to assess the level of knowledge about Poland that models from different developers possess and to understand how their potential can be used in professional work, as this is of greatest interest to participants in the training sessions and projects I run at Oxido.

Tested large language models

Selecting the solutions – and more precisely the specific configurations in which they operated – was a very challenging task. What I aimed to achieve was a configuration of each tool as close as possible to the default and the selection of the latest available AI model. No custom instructions were set and when a chatbot offered an Auto mode, it was the preferred option. Models with internet access were allowed to use it.

What I want to emphasise strongly is that during the study, the latest version of Gemini 3.1 was available only in the Pro variant and this is the most advanced version of the model, making it difficult to treat it as “default” (Gemini does not have an Auto mode – the model variant must be selected manually from options such as “Fast”, “Thinking” and “Pro”). At the same time, my assumption was to test the latest models. Faced with this inconsistency between the two assumptions, I decided on the newest variant, even though this meant that Gemini had a certain advantage from the outset.

After reviewing the preliminary results submitted by the participants, I decided to conduct additional tests of advanced model variants (selected “manually”). Based solely on my own assessment, I wanted to compare Gemini Pro with GPT-5.2 Thinking (with extended reasoning enabled) and Claude Opus 4.6. The aim was a simple verification of whether, when these advanced models are compared, Gemini 3.1 would still dominate. The value of such an assessment is naturally lower than the average based on 11 evaluators, but it may shed some additional light on the differences between models.

The models were intended to represent three geographical regions: the United States, China and Europe. They were to include both commercial and open solutions (I do not want to overuse the term open-source – I refer you to a separate article). As for commercial platforms, I focused on paid plans costing around 20–30 dollars per month.

Before starting the study, I drew the order of the models at random. That order was reflected in the file sent for evaluation (naturally, the model names were anonymised). In the draw, Qwen and Llama took first and last place respectively – this is relevant because there is a certain probability of primacy and recency effects occurring. The evaluation process, based in part on the same mechanism as in ski jumping, was intended to reduce the risk of distorted results (when calculating the average, I discarded the single lowest and the single highest score).

The model responses for the main study were collected on 24 February 2026.

The list of chatbots and models used in the study is as follows:

Table – list of models/tools included in the study:

| Name of the AI model | URL | Paid version | Remarks |
|---|---|---|---|
| GPT-5.2 (Auto) | chatgpt.com | Yes (Business) | |
| M365 Copilot | m365.cloud.microsoft | Yes (Business) | |
| Gemini 3.1 Pro | gemini.google.com | Yes (AI Pro) | |
| Claude 4.6 Sonnet | claude.ai | Yes (Pro) | “Extended Thinking” enabled |
| Grok 4.2 (beta) | grok.com | Yes (SuperGrok) | |
| Llama 4 (Meta AI) | www.meta.ai | No | “Thinking” enabled |
| Mistral 3 (Le Chat) | chat.mistral.ai | Yes (Pro) | |
| Bielik 3.0 | chat.bielik.ai | No | |
| PLLuM 8x7B-2025 | pllum.clarin-pl.eu/pllum_8x7b | No | |
| EuroLLM 22B | huggingface.co/chat/ | No | |
| DeepSeek-V3.2 | chat.deepseek.com | No | |
| Qwen 3.5 Plus (Auto) | chat.qwen.ai | No | The need to use a different browser (Edge instead of Vivaldi) |

Table – models included in the additional comparison of advanced models:

| Name of the AI model | URL | Paid version | Remarks |
|---|---|---|---|
| GPT-5.2 (Thinking) | chatgpt.com | Yes (Business) | “Extended Thinking” enabled |
| GPT-5.4 (Thinking) | chatgpt.com | Yes (Business) | “Extended Thinking” enabled |
| Gemini 3.1 Pro | gemini.google.com | Yes (AI Pro) | |
| Claude Opus 4.6 | claude.ai | Yes (Pro) | “Extended Thinking” enabled |
| Grok 4.2 beta Expert | grok.com | Yes (SuperGrok) | |

Participants and evaluation process

The evaluation of the language models involved 10 volunteers as well as myself. I tried to extend invitations in such a way that the participants assessing the model responses were as diverse as possible, above all representing different professions and varying levels of experience in working with language models. This was intended to bring the results closer to those of a typical chatbot user.

Each participant received an Excel spreadsheet containing 21 tabs. The first tab summarised the scores, while the remaining 20 contained the language model responses for each prompt (the prompt itself was also included). As the study covered 12 models, participants evaluated a total of 240 responses.

The participants’ task was to rate each response on a scale from 1 to 10. To standardise the evaluation process to some extent, they were given a PDF file outlining the prompts and the assessment criteria to follow. Some prompts were easier to verify, as they concerned factual information. Others required a more subjective approach, for example those relating to humour or the persuasiveness of the content.

The model names were anonymised – I used labels such as “Model 1”, “Model 2”, “Model 3”. The order of the models in the Excel spreadsheet was determined randomly.
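
A random, anonymised ordering of this kind can be sketched in a few lines. This is only an illustration: the function name and the optional seed are assumptions, and it does not reproduce the draw actually used in the study.

```python
import random


def anonymise(model_names, seed=None):
    """Shuffle the models and label them "Model 1", "Model 2", ...

    The seed is a hypothetical convenience that makes a draw repeatable.
    """
    rng = random.Random(seed)
    order = list(model_names)   # copy, so the caller's list is untouched
    rng.shuffle(order)
    return {f"Model {i}": name for i, name in enumerate(order, start=1)}


models = [
    "Gemini", "ChatGPT", "Grok", "Claude", "Copilot", "Llama",
    "Qwen", "DeepSeek", "Mistral", "Bielik", "PLLuM", "EuroLLM",
]

key = anonymise(models, seed=2026)
for label, name in key.items():
    print(label, "->", name)
```

Evaluators would see only the labels, while the mapping stays with the person compiling the results.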

Participants had approximately 32 hours to complete their evaluations. They returned the Excel file, and the results were used to compile the aggregated scores.

When calculating averages, I applied a mechanism known from ski jumping: the two extreme scores – one lowest and one highest – were discarded and the average was calculated from the remaining nine scores (this method reduces the influence of human error). Naturally, if an extreme score occurred more than once, only one instance was discarded and the remaining instances were included in the calculation.
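
The ski-jumping average is a simple trimmed mean, and a minimal sketch could look like this. The function name and the eleven sample scores are hypothetical, not data from the study.

```python
def ski_jump_average(scores):
    """Average the scores after dropping one lowest and one highest.

    Only a single instance of each extreme is removed, so a repeated
    extreme score still counts towards the average.
    """
    if len(scores) < 3:
        raise ValueError("need at least three scores to drop two extremes")
    kept = sorted(scores)[1:-1]   # drop one minimum and one maximum
    return sum(kept) / len(kept)


# Hypothetical scores from 11 evaluators for a single response.
scores = [3, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10]

plain = sum(scores) / len(scores)
trimmed = ski_jump_average(scores)

print(f"plain mean:   {plain:.2f}")
print(f"trimmed mean: {trimmed:.2f}")
```

In this example the single outlier of 3 pulls the plain mean down to 8.00, while the trimmed mean over the remaining nine scores is about 8.33. One of the two 10s is discarded and the other is kept.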

In the charts presenting the results of individual tests, I retained the range of scores given by participants (after excluding the extreme values). This makes it possible to see which tasks produced consistent ratings and where the variation was significant.
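
The range shown in the charts can be derived from the same trimmed set of scores. The sketch below uses hypothetical names and made-up scores for two tasks to show how a narrow range and a wide one would come out.

```python
def trimmed_range(scores):
    """Return (lowest, highest) after dropping one minimum and one maximum."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to drop two extremes")
    kept = sorted(scores)[1:-1]
    return kept[0], kept[-1]


# Hypothetical scores from 11 evaluators for two different tasks.
ratings = {
    "facts":  [9, 9, 8, 9, 10, 9, 8, 9, 9, 9, 7],
    "humour": [2, 9, 4, 6, 3, 8, 5, 7, 1, 10, 5],
}

ranges = {task: trimmed_range(s) for task, s in ratings.items()}
for task, (low, high) in ranges.items():
    print(f"{task}: {low}-{high} (spread {high - low})")
```

A small spread suggests the evaluators largely agreed, while a large one marks a task where judgements were more subjective.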

It was very important to me that the study results be published as quickly as possible – so that they reflect the current state rather than the past. The intention was to present the conclusions free of charge, without requiring registration.

A few final words

Naturally, as LLMs continue to develop, the value of these assessments will diminish over time, which is why I would like to update the results whenever possible. If the scale of Oxido’s operations allows, we will expand the scope of the research – the current study was already a significant challenge for a small company like ours. I am considering, among other things, a dedicated application that would broaden the possibilities for evaluation.

The study was funded entirely by myself and my company. It is fully independent in nature.

Higher education institutions (both academic staff and students) may contact me by email to request the full prompts, evaluation criteria and other materials related to the study in order to replicate the process and compare results. In such cases, I will of course provide them free of charge. It is important to me that the request is sent from an email address within a university domain.

I consent to the citation of the study results and the publication of the resulting ranking of language models.

Acknowledgements

Many people were involved in this study. I would like to thank Krzysiek, who helped me develop the assumptions and prepared several texts accompanying the report. I would also like to thank those who took part in the evaluation process (in alphabetical order): Ela, Jacek, Kasia, Krzysiek, Marta, Radek, Sławek, Szymon, Szymon and Wojtek. My thanks go to the entire Oxido team, including Michał, who ensured that the website was deployed on a dedicated server in time so that everything ran smoothly. I would also like to thank, among others, Agnieszka for proofreading the text.

I am also grateful to my closest ones for their patience – the past few days meant working literally every day for many hours, considerable fatigue and quite a bit of… gloominess (does the name oblige? ;)).

I would also like to thank in advance everyone who decides to share a link to the results of this study. I believe they provide an interesting complement to the various kinds of tests that can be found online.

Thank you!
