{"id":34372,"date":"2025-09-11T07:00:05","date_gmt":"2025-09-11T07:00:05","guid":{"rendered":"https:\/\/www.digife.it\/?p=34372"},"modified":"2025-09-10T09:52:39","modified_gmt":"2025-09-10T09:52:39","slug":"llm-benchmark-but-how-intelligent-is-artificial-intelligence-really","status":"publish","type":"post","link":"https:\/\/www.digife.it\/en\/llm-benchmark-but-how-intelligent-is-artificial-intelligence-really\/","title":{"rendered":"LLM &amp; Benchmark: But how intelligent is an Artificial Intelligence really?"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">THE <\/span><b>Large Language Model (LLM)<\/b><span style=\"font-weight: 400;\"> they are the undisputed protagonists of the new digital era.<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\">They write texts, translate languages, solve complex problems, create code, and interact with users seamlessly. But\u2026 <\/span><b>Are they really intelligent? And if so, how much?<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Measuring the performance of Artificial Intelligence isn&#039;t just a technical challenge, but a cultural, social, and strategic one. In this article, we try to answer a crucial question for those working in the tech world:<\/span><\/p>\n<p><b>How do we evaluate the effectiveness and reliability of generative language models?<\/b><\/p>\n<h3><b>What is an AI benchmark?<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">A <\/span><i><span style=\"font-weight: 400;\">benchmark<\/span><\/i><span style=\"font-weight: 400;\"> it&#039;s a <\/span><b>standardized test<\/b><span style=\"font-weight: 400;\"> which serves to measure an LLM&#039;s abilities in specific areas: logic, linguistic comprehension, encyclopedic knowledge, problem solving, creative writing, programming.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The most famous?<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b><a href=\"https:\/\/en.wikipedia.org\/wiki\/MMLU\" target=\"_blank\" rel=\"noopener\">MMLU<\/a> (Massive Multitask Language Understanding)<\/b><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>ARC (AI2 Reasoning Challenge)<\/b><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Hellaswag<\/b><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>BIG bench<\/b><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>HumanEval (by code)<\/b><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">In practice: we subject the AI to a <\/span><b>battery of questions<\/b><span style=\"font-weight: 400;\"> and we evaluate whether he answers correctly.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">But can we really judge intelligence? <\/span><i><span style=\"font-weight: 400;\">human<\/span><\/i><span style=\"font-weight: 400;\">\u2026with quizzes?<\/span><\/p>\n<h3><b>The Benchmark Paradox: AI Learns from Tests<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">A big problem with benchmarks is that, over time, models <\/span><b>they begin to &quot;study&quot; the questions<\/b><span style=\"font-weight: 400;\"> of the tests.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">How? 
### The Benchmark Paradox: AI Learns from Tests

A big problem with benchmarks is that, over time, models **begin to "study" the test questions**.

How? It's enough for the training datasets to contain (even partially) the benchmark questions: the AI then *remembers* them statistically.

The result?

GPT-4 gets top scores… but on questions it has probably already seen.

Scores keep rising, yet **they no longer measure intelligence, only "algorithmic memory"**.

The model is good at taking tests, but **fails in the real world**, where questions are new, ambiguous, and non-standardized.
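This leakage is known as *data contamination*, and a common first-pass check is to look for long word n-grams shared between benchmark items and the training corpus. A minimal sketch, assuming you can load (or stream) the training text; the 8-gram threshold is a rough conventional choice, not a standard:

```python
# Rough contamination check: flag benchmark questions whose long word
# n-grams also occur verbatim in the training corpus.

def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All lowercase word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(questions: list[str], corpus: str, n: int = 8) -> list[str]:
    """Return the questions sharing at least one n-gram with the corpus."""
    corpus_grams = word_ngrams(corpus, n)
    return [q for q in questions if word_ngrams(q, n) & corpus_grams]

# Example: a question copied verbatim into the corpus gets flagged.
corpus = "... the capital of Australia is Canberra not Sydney as many assume ..."
print(flag_contaminated(
    ["the capital of Australia is Canberra not Sydney as many assume"],
    corpus,
))
```

Exact n-gram matching misses paraphrases, which is one reason contamination is so hard to rule out in practice.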
### When scores are deceiving

Imagine a chatbot that scores 90% on MMLU's logic questions.

Then you ask it:
"Write me a Python script to extract emails from a CSV, but only those with a company domain."
The answer? Broken code that doesn't run.
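For the record, the request is modest. Here is a sketch of what a working answer might look like, under two assumptions of ours: the CSV has an `email` column, and "company domain" means anything outside the common free-mail providers:

```python
import csv

# Assumption: "company domain" = not a common free-mail provider.
FREE_PROVIDERS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}

def company_emails(path: str) -> list[str]:
    """Read a CSV with an 'email' column and keep company addresses."""
    results = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            email = (row.get("email") or "").strip().lower()
            if "@" not in email:
                continue  # skip blank or malformed entries
            domain = email.split("@", 1)[1]
            if domain and domain not in FREE_PROVIDERS:
                results.append(email)
    return results

if __name__ == "__main__":
    for address in company_emails("contacts.csv"):  # hypothetical file name
        print(address)
```

The point isn't that the task is hard; it's that a 90% MMLU score told you nothing about whether the model would get it right.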
But they&#039;re not enough.<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\">In 2025, <\/span><b>the true measurement of AI intelligence<\/b><span style=\"font-weight: 400;\"> must take into account the <\/span><b>value generated<\/b><span style=\"font-weight: 400;\">, <\/span><b>of operational reliability<\/b><span style=\"font-weight: 400;\">, from the <\/span><b>ability to collaborate with humans<\/b><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">At Digife we don&#039;t just look at the numbers. We look at <\/span><b>How AI works with us<\/b><span style=\"font-weight: 400;\">.<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\">Every day we are building a more conscious and concrete way to integrate the potential of artificial intelligence into our projects.<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.digife.it\/en\/contacts\/\">Write to us<\/a> at info@digife.it and we&#039;ll help you evaluate tools, limitations, and real opportunities.<\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>Large Language Models (LLMs) are the undisputed protagonists of the new digital era. They write texts, translate languages, solve complex problems, create code, and interact with users seamlessly. But... they are...<\/p>","protected":false},"author":4,"featured_media":34373,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[59],"tags":[],"class_list":{"0":"post-34372","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-notizie"},"_links":{"self":[{"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/posts\/34372","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/comments?post=34372"}],"version-history":[{"count":1,"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/posts\/34372\/revisions"}],"predecessor-version":[{"id":34374,"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/posts\/34372\/revisions\/34374"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/media\/34373"}],"wp:attachment":[{"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/media?parent=34372"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/categories?post=34372"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.digife.it\/en\/wp-json\/wp\/v2\/tags?post=34372"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}