


I invite you to look at the results of months of work by me and the Oxido team. The most important part is a report on a study of language models.
But first: if you do not use the Polish language, I still encourage you to take a look at the LLM ranking, treating Polish as a representative of national languages.
The responses obtained from 12 tools were evaluated by a total of 11 people, which allows us to assess the performance of LLMs from the perspective of a typical user rather than solely through synthetic benchmarks (I do not dismiss their importance; we simply adopted a different methodology). But this is only the beginning. What sets my study apart is the fact that it was based entirely on the Polish language – the prompts were in Polish and we expected answers in the same language.
I also encourage you to read the texts accompanying the study report, which are intended to add an extra dimension and help better define the stage of AI development we are currently at.
Among other things, we will draw on the knowledge gained during the study in e-learning on AI prompting and in delivering AI projects at Oxido.

The report is based on tests of language models conducted via official chatbots (with the exception of EuroLLM). The aim was to assess how individual models – available through these interfaces – handle open-ended tasks. Both the testing approach and the selection of tasks stem from a simple assumption: this is how a typical user engages with AI. For this reason, the evaluations were carried out by people – 11 in total, including myself, representing a range of professions. When calculating the averages, a mechanism known from ski jumping was used: one extreme score at each end, the highest and the lowest, is discarded. To make the process easier, participants were given assessment criteria and model answers for strictly factual questions. The models themselves vary in capability.
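The ski-jumping averaging described above can be sketched in a few lines. This is only an illustration under assumptions: the function name `trimmed_mean` and the sample ratings are hypothetical, and the report's actual scale and the level at which scores were trimmed (per task or per model) are not specified here.

```python
def trimmed_mean(scores):
    """Average the scores after discarding one lowest and one highest value.

    Assumption: exactly one score is dropped at each end, as in ski jumping.
    """
    if len(scores) < 3:
        raise ValueError("need at least three scores to discard both extremes")
    ordered = sorted(scores)
    kept = ordered[1:-1]  # drop the single lowest and the single highest score
    return sum(kept) / len(kept)


# Hypothetical ratings from 11 evaluators for a single answer (not real data).
ratings = [4, 5, 3, 5, 4, 1, 5, 4, 4, 5, 4]

plain = sum(ratings) / len(ratings)
trimmed = trimmed_mean(ratings)
print(f"plain mean: {plain:.2f}, trimmed mean: {trimmed:.2f}")
```

With these made-up ratings, the single outlying score of 1 no longer pulls the average down, which is the point of discarding the extremes: one unusually harsh or generous evaluator has less influence on the result.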
A few of the most important conclusions for me:
For the record, I should add that on 8 February I asked representatives of Bielik for an interview and received no response. The responses for evaluation were generated on 24 February, and the report was published on 9 March.
I encourage you to read the full report and the accompanying articles carefully – links can be found below.
Summary publication date: 19 March 2026.
PS As I already wrote in the report, I am strongly rooting for European models, including, naturally, Polish ones above all.

Language models should be selected according to the tasks we need to carry out and the regulatory conditions we operate under.
The report responds to questions I am often asked during training sessions, including those specifically concerning language models developed in Poland and Europe and the extent to which they are universal.
The fact that something is not universal does not mean it is bad.

(in the view of all participants)

(in the view of all participants)

(based on my blind evaluations)
I encourage you to create your own ranking based on the results
(based on the main phase of the study)

The USA, China and Europe and their technological and scientific rivalry in the field of artificial intelligence.
What role – as both a resource and a constraint – do money, energy and infrastructure play?

Text coming soon – I encourage you to subscribe to the newsletter.
More interesting articles on large language models and beyond coming soon. Subscribe to the newsletter, so you don’t miss a thing!
In practice, the value of an AI model is not determined solely by benchmarks, but by whether it understands the user’s language, culture and working realities, which may differ in Poland compared with, for example, the USA. That is why research based on real-world use cases is so important. It shows how models perform in everyday work within a specific cultural and linguistic context, providing more practical insight than technical test results alone.
LLMs can be very helpful in legal research – they can quickly gather arguments, review a document or help organise material. However, they are not a reliable source of legal knowledge. Models, especially open ones, often hallucinate, meaning they may invent laws or case rulings. They are therefore best treated as an untrained legal assistant rather than a replacement for one. In addition, providers do not always guarantee the transfer of copyright to generated content or the security of confidential information. For this reason, great caution should be exercised when entering sensitive data into such systems or generating sensitive content with them.
I invite you to sign up for the AI and Management newsletter. This way, you won’t miss any article on my blog.

