I invite you to take a look at the results of months of work by me and the Oxido team. The most important part is a report on a study of language models.

But first: if you do not use the Polish language, I still encourage you to take a look at the LLM ranking, treating Polish as a representative of national languages.

The responses obtained from 12 tools were evaluated by a total of 11 people, which allows us to assess the performance of LLMs from the perspective of a typical user rather than solely through synthetic benchmarks (I do not dismiss their importance; we simply adopted a different methodology). But this is only the beginning. What sets my study apart is the fact that it was based entirely on the Polish language – the prompts were in Polish and we expected answers in the same language.

I also encourage you to read the texts accompanying the study report, which are intended to add an extra dimension and help better define the stage of AI development we are currently at.

Among other things, we will draw on the knowledge gained during the study in e-learning on AI prompting and in the AI projects we deliver at Oxido.

Marek Jeleśniański

Study author / CEO · Oxido

Tested AI models

LLM ranking 2026 based on study - map of LLMs

Summary of the report

The report is based on tests of language models conducted via official chatbots (with the exception of EuroLLM). The aim was to assess how individual models – available through these interfaces – handle open-ended tasks. Both the testing approach and the selection of tasks stem from a simple assumption: this is how a typical user engages with AI. For this reason, the evaluations were carried out by people – 11 in total, including myself, representing a range of professions. When calculating the averages, a mechanism known from ski jumping was used: the single highest and the single lowest score are discarded before the remaining scores are averaged. To make the process easier, participants were given assessment criteria and model answers for strictly factual questions. The models themselves vary in capability.
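The averaging rule borrowed from ski jumping can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function name trimmed_mean, the scoring scale and the sample scores are all hypothetical.

```python
from statistics import mean


def trimmed_mean(scores):
    """Average the scores after discarding one lowest and one highest value."""
    if len(scores) < 3:
        # With fewer than three scores nothing would be left to average.
        raise ValueError("need at least three scores to discard both extremes")
    ordered = sorted(scores)
    # Drop the first (lowest) and last (highest) element, then average the rest.
    return mean(ordered[1:-1])


# Hypothetical scores from 11 evaluators for a single response
# (illustrative values only, not data from the study).
example_scores = [4, 5, 3, 5, 4, 1, 5, 4, 4, 5, 4]

print("plain mean:  ", round(mean(example_scores), 2))
print("trimmed mean:", round(trimmed_mean(example_scores), 2))
```

With 11 evaluators, nine scores remain after trimming, so a single unusually harsh or generous rating has less influence on a model's average.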

A few of the most important conclusions for me:

  1. The biggest surprise was the performance of the Llama 4 and Qwen 3.5 Plus models. This suggests that the gap between open models and their commercial counterparts (such as ChatGPT) in everyday use may not be that large. Credit should also be given to Google for the development of Gemini, which ultimately took first place (in the initial pilot in 2024 it did not reach the podium).
  2. Media attention focused on the low ranking of Polish models. If we look at the results without any assumptions or caveats, it should be noted that European models generally performed worse than their American and Chinese counterparts. At the same time, in the report I explain my perspective on the results, which bears no resemblance to headlines such as “Polish bots are idiots”, and I strongly oppose such use of my report. A model should be chosen to suit the task it is meant to perform and I emphasised this point strongly in the report.
  3. The one criterion on which all models failed was Polish humour. I put forward the idea that this may be the ultimate test of whether AI truly matches human ability.
  4. Overall, tasks related to professional work – which are the most important to me given my role as a trainer and consultant – were handled better by the models than tasks in the broadly defined socio-cultural category. The tests included, for example, writing emails and generating presentation content.
  5. What I hope for is a discussion on the direction of AI development in Poland and Europe, necessarily from a practical and business perspective. In the report and in my media statements, I stress the need to create a regulatory framework that ensures efficient funding, openness to risk and operational flexibility for startups. As regards LLMs, if we want to achieve a position comparable to top-tier models (so-called SOTA), we need to make swift and well-considered decisions. Moving in this direction, we should be more open to the needs of the typical user and the quality of the content we request on a daily basis. Ultimately, these criteria determine who we are willing to pay our hypothetical $20 per month to. To my mind, this direction is reflected in headlines about sovereign AI – without a European SOTA model, there will be no sovereign AI.

For the record, I should add that on 8 February I asked representatives of Bielik for an interview and received no response; the responses for evaluation were generated on 24 February, and the report was published on 9 March.

I encourage you to read the full report and the accompanying articles carefully – links can be found below.

Summary publication date: 19 March 2026.

PS As I already wrote in the report, I am strongly rooting for European models and, naturally, for Polish ones above all.

Language models should be selected according to the tasks we need to carry out and the regulatory conditions we operate under.

The report responds to questions I am often asked during training sessions, including those specifically concerning language models developed in Poland and Europe and the extent to which they are universal.

The fact that something is not universal does not mean it is bad.

The best large language models according to the study

The best models overall

(in the view of all participants)

The best models for work

(in the view of all participants)

The best models in my view

(based on my blind evaluations)

Read the full report

I encourage you to create your own ranking based on the results

LLM study in numbers

12 models tested
5 models in the run-off
320 evaluated responses
11 people evaluating the responses

Full LLM ranking 2026

(based on the main phase of the study)

LLM study - overall ranking of AI models

The USA, China and Europe and their rivalry in the field of AI

Architecture, Silicon and Code – The Technological Race for Supremacy in AI

The USA, China and Europe and their technological and scientific rivalry in the field of artificial intelligence.

Capital, Energy and the Bubble – Who Will Foot the Bill for the AI Revolution?

What role – as both a resource and a constraint – do money, energy and infrastructure play?

AI regulations in China, Europe and the USA

AI Regulations – a Threat to Development or a Necessity?

Text coming soon – I encourage you to subscribe to the newsletter.

More interesting articles on large language models and beyond are coming soon. Subscribe to the newsletter so you don’t miss a thing!

Expert commentary

In practice, the value of an AI model is not determined solely by benchmarks, but by whether it understands the user’s language, culture and working realities, which may differ in Poland compared with, for example, the USA. That is why research based on real-world use cases is so important. It shows how models perform in everyday work within a specific cultural and linguistic context, providing more practical insight than technical test results alone.

Jacek Bąk

AI content creator on YouTube

LLMs can be very helpful in legal research – they can quickly gather arguments, review a document or help organise material. However, they are not a reliable source of legal knowledge. Models, especially open ones, often hallucinate, meaning they may invent laws or case rulings. They are therefore best treated as an untrained legal assistant rather than a replacement for one. In addition, providers do not always guarantee the transfer of copyright to generated content or the security of confidential information. For this reason, great caution should be exercised when inputting and creating sensitive data in such systems.

Jakub Ferek

IT lawyer / AI law trainer · Oxido

Articles accompanying the LLM study

More interesting articles about large language models and beyond are coming soon. Subscribe to the newsletter so you don’t miss anything!

I invite you to sign up for the AI and Management newsletter. This way, you won’t miss any article on my blog.