{"id":15,"date":"2025-01-17T16:45:57","date_gmt":"2025-01-17T16:45:57","guid":{"rendered":"https:\/\/blog.chataignon.org\/joseph\/?p=15"},"modified":"2025-02-19T17:28:18","modified_gmt":"2025-02-19T17:28:18","slug":"why-chatgpt-cant-spell-properly","status":"publish","type":"post","link":"https:\/\/blog.chataignon.org\/joseph\/post-15\/why-chatgpt-cant-spell-properly\/","title":{"rendered":"Why ChatGPT can&rsquo;t spell"},"content":{"rendered":"\n<details class=\"wp-block-details is-layout-flow wp-block-details-is-layout-flow\"><summary>disclaimer<\/summary>\n<p>The experiments presented here were not all carried out at the same time, and the results of a particular test may differ according to ChatGPT updates. They were conducted in French and English with similar results. I used GPT-4o, which did slightly better than Mistral Large and slightly worse than claude-3-haiku.<br>The tests can be repeated with a <a href=\"https:\/\/www.dcode.fr\/caesar-cipher\">Caesar cipher encoder<\/a> and the <a href=\"https:\/\/platform.openai.com\/tokenizer\">OpenAI tokenizer<\/a>.<\/p>\n<\/details>\n\n\n\n<h2 class=\"wp-block-heading\">Caesar cipher<\/h2>\n\n\n\n<p>When Caesar, on campaign in Gaul, wrote to the administrator of his domains in Rome, he used a trick to ensure that his messages could only be read by the intended recipient. Each letter was replaced by the letter three places ahead in the alphabet: A became a D, B became an E, and so on. For example, \u00ab\u00a0alea jacta est\u00a0\u00bb becomes \u00ab\u00a0dohd mdfwd hvw\u00a0\u00bb. This simple encryption method is still known as <strong>Caesar&rsquo;s cipher<\/strong>.<\/p>\n\n\n\n<p>Caesar&rsquo;s cipher, too simple and too well-known, has long since ceased to offer any security. Nevertheless, I thought it would be an interesting exercise to test ChatGPT&rsquo;s ability to recognize ciphertext. 
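<\/p>\n\n\n\n<p>The substitution itself can be sketched in a few lines of Python (a minimal illustration; the encoder linked in the disclaimer gives the same results):<\/p>

```python
def caesar(text, shift=3):
    """Caesar cipher: shift each letter `shift` places, wrapping around
    the 26-letter alphabet; decode with the opposite shift."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(out)

print(caesar("alea jacta est"))      # dohd mdfwd hvw
print(caesar("dohd mdfwd hvw", -3))  # alea jacta est
```

\n\n\n\n<p>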
What would happen if I tried to send an encrypted message, without any context or explanation: would ChatGPT be able to identify the encryption method and decipher it?<\/p>\n\n\n\n<p>And that&rsquo;s where something strange happens. For a human, the difficult part of this task is identifying the code; once it&rsquo;s found, decoding the message is straightforward: simply substitute the letters of the message one by one. ChatGPT, on the other hand, has no problem finding the code but a lot of trouble decoding it correctly. In the tests I carried out, ChatGPT generally recognized Caesar&rsquo;s cipher (with more difficulty in French than in English), was able to explain its principle, and even gave the correspondence between plaintext and cipher letters for the whole alphabet; but its decoding was approximate, if not downright wrong. This is particularly true of long, little-used, lexically isolated words like \u00ab\u00a0gobbledygook\u00a0\u00bb or \u00ab\u00a0rigamarole\u00a0\u00bb.<\/p>\n\n\n\n<p>So, where does this come from? To understand it, we need to look at the first stage in a Transformer model&rsquo;s processing of text: tokenization, i.e. the breakdown of text into tokens and the embedding of these tokens.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Tokenization<\/h2>\n\n\n\n<p>To perform calculations on text, and thus build language models, it is first necessary to transform the text into numbers. The simplest way to do this is to use the way text is usually encoded on computers, i.e. with a binary code (convertible into a number) associated with each letter. This works, but it&rsquo;s not very efficient, because to extract meaning from the text you first have to combine the letters into words. The other simple approach is to assign a number to each word in the dictionary, rather than to each letter. This saves the machine a lot of calculations, but causes other problems:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>a much larger vocabulary is required. 
Instead of the 26 numbers needed to encode the letters, tens of thousands are required for the complete vocabulary of a language.<br><\/li>\n\n\n\n<li>the resulting model is highly sensitive to spelling. When an unusual variant of a word is used, or simply when a typing error occurs, the word may not be found in the vocabulary, and the model isn&rsquo;t able to process it at all.<\/li>\n<\/ul>\n\n\n\n<p>The intermediate solution is to break the text down into \u00ab\u00a0lexemes\u00a0\u00bb or \u00ab\u00a0tokens\u00a0\u00bb, i.e. fragments of words that appear frequently and carry meaning. For example, the suffix -er is a lexeme that carries meaning: it transforms an action into the agent of that action (\u00ab\u00a0teach\u00a0\u00bb -&gt; \u00ab\u00a0teacher\u00a0\u00bb). Common prefixes and suffixes, and word stems, are good lexemes for breaking down text. Lexemes\/tokens are an intermediate unit between letters and words.<\/p>\n\n\n\n<p>But how can we tell which tokens are worth having in our vocabulary? Why record \u00ab\u00a0-er\u00a0\u00bb rather than \u00ab\u00a0tkq\u00a0\u00bb, for example? Fortunately, we don&rsquo;t need to create the list of tokens by hand; instead, we use statistical methods. The most common is Byte-Pair Encoding, which iteratively merges the pair of symbols that appears together most often. In practice, the simple criterion of \u00ab\u00a0letters often appearing together\u00a0\u00bb is sufficient to obtain satisfactory tokens.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Back to the topic<\/h2>\n\n\n\n<p>So, what does this have to do with our subject? Well, since text is always broken down into tokens, the model only \u00ab\u00a0sees\u00a0\u00bb tokens, not individual letters. It doesn&rsquo;t have access to the text itself, but to a sequence of codes, each of which corresponds to a token, and it doesn&rsquo;t have access to the composition of these tokens. 
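<\/p>\n\n\n\n<p>The Byte-Pair Encoding procedure described above fits in a short sketch (a toy version on a hypothetical four-word corpus; real tokenizers operate on bytes and learn tens of thousands of merges):<\/p>

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly fuse the adjacent pair of symbols
    that appears most often across the whole corpus."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as single letters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])  # fuse the pair
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

# the frequent "er" ending is the first pair to fuse into a token
merges = bpe_merges(["teacher", "builder", "painter", "paint"], 3)
print(merges[0])  # ('e', 'r')
```

<p>Once the merges are learned, the model receives only the resulting token IDs; the letters inside a token like \u00ab\u00a0er\u00a0\u00bb are no longer visible to it.<\/p>\n\n\n\n<p>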
Understandably, it has trouble spelling a word.<\/p>\n\n\n\n<p>Seen from this angle, you might in fact be surprised that ChatGPT can spell words and manipulate letters at all.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How many \u00ab\u00a0R\u00a0\u00bbs in \u00ab\u00a0Strawberry\u00a0\u00bb<\/h2>\n\n\n\n<p>Since the release of ChatGPT, users have noticed that it is unable to solve a simple problem: counting the number of \u00ab\u00a0R\u00a0\u00bbs in the word \u00ab\u00a0strawberry\u00a0\u00bb. This problem persists with more recent versions (not necessarily with \u00ab\u00a0strawberry\u00a0\u00bb now that it has been included in the training data, but try with the S in \u00ab\u00a0assassin\u00a0\u00bb, for example).<\/p>\n\n\n\n<p>The explanation is the same as with Caesar&rsquo;s cipher: the lexeme breakdown doesn&rsquo;t match the individual letters. GPT-4o splits \u00ab\u00a0strawberry\u00a0\u00bb into 3 lexemes: st-raw-berry (OpenAI&rsquo;s tokenizer is available <a href=\"https:\/\/platform.openai.com\/tokenizer\">here<\/a> for testing). So it knows there&rsquo;s an R in \u00ab\u00a0raw\u00a0\u00bb and at least one in \u00ab\u00a0berry\u00a0\u00bb, but doesn&rsquo;t know how many.<\/p>\n\n\n\n<p>You can easily find other words that cause the same issue. You need a word in which a letter appears several times, doubled (e.g. the G in \u00ab\u00a0gaggle\u00a0\u00bb) or hidden in a very common digram like \u00ab\u00a0ch\u00a0\u00bb, \u00ab\u00a0qu\u00a0\u00bb or \u00ab\u00a0ou\u00a0\u00bb (e.g. the U in \u00ab\u00a0ubiquitous\u00a0\u00bb).<\/p>\n\n\n\n<p>Note that there&rsquo;s a simple way to avoid this problem: add \u00ab\u00a0proceed step by step\u00a0\u00bb to your question.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">A recent addition: patching<\/h2>\n\n\n\n<p>Last December 12, Meta published a paper outlining a different approach: patching. 
This approach involves grouping letters into lexemes (called \u00ab\u00a0patches\u00a0\u00bb) dynamically, without a fixed list of tokens. The new architecture features a Transformer layer that encodes the letters into patch representations, and Cross Attention layers linked directly to the letters themselves, allowing the model to \u00ab\u00a0see\u00a0\u00bb the composition of the patches.<\/p>\n\n\n\n<p>Tests carried out by Meta show that not only does this method enable better capture of word spelling, but also that its performance is better on a large scale than tokenization with a fixed vocabulary.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Caesar cipher When Caesar, on campaign in Gaul, wrote to the administrator of his domains in Rome, he used a [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[15,17],"tags":[],"class_list":["post-15","post","type-post","status-publish","format-standard","hentry","category-ai","category-llm"],"_links":{"self":[{"href":"https:\/\/blog.chataignon.org\/joseph\/wp-json\/wp\/v2\/posts\/15","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.chataignon.org\/joseph\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.chataignon.org\/joseph\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.chataignon.org\/joseph\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.chataignon.org\/joseph\/wp-json\/wp\/v2\/comments?post=15"}],"version-history":[{"count":2,"href":"https:\/\/blog.chataignon.org\/joseph\/wp-json\/wp\/v2\/posts\/15\/revisions"}],"predecessor-version":[{"id":18,"href":"https:\/\/blog.chataignon.org\/joseph\/wp-json\/wp\/v2\/posts\/15\/revisions\/18"}],"wp:attachment":[{"href":"https:\/\/blog.chataignon.org\/joseph\/wp-json\/wp\/v2\/media?parent=15"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.chataignon.org\/joseph\/wp-json\/wp\/v2\/categories?post=15"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.chataignon.org\/joseph\/wp-json\/wp\/v2\/tags?post=15"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}