Category: LLM

  • Why ChatGPT can’t spell

    Disclaimer

    The experiments presented here were not all carried out at the same time, and the results of a given test may vary as ChatGPT is updated. They were conducted in French and in English, with similar results. I used GPT-4o, which did slightly better than Mistral Large and slightly worse than claude-3-haiku.
    The tests can be repeated with a Caesar cipher encoder and the OpenAI tokenizer.

    Caesar cipher

    When Caesar, on campaign in Gaul, wrote to the administrator of his domains in Rome, he used a trick to ensure that his messages could only be read by the intended recipient. Each letter was replaced by the letter three places ahead in the alphabet: A became a D, B became an E, and so on. For example, “alea jacta est” becomes “dohd mdfwd hvw”. This simple encryption method is still known as Caesar’s cipher.
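
    Such an encoder takes only a few lines to write. Here is a minimal Python sketch (the function name and interface are mine, for illustration) that encodes with a shift of 3 and decodes by shifting back by 3:

    ```python
    def caesar(text, shift=3):
        """Shift each letter `shift` places forward in the alphabet; leave other characters as-is."""
        out = []
        for ch in text:
            if ch.isalpha():
                base = ord("A") if ch.isupper() else ord("a")
                out.append(chr((ord(ch) - base + shift) % 26 + base))
            else:
                out.append(ch)
        return "".join(out)

    print(caesar("alea jacta est"))      # -> dohd mdfwd hvw
    print(caesar("dohd mdfwd hvw", -3))  # -> alea jacta est
    ```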

    Caesar’s cipher, too simple and too well-known, has long since ceased to offer any security. Nevertheless, I thought it would be an interesting exercise to test ChatGPT’s ability to recognize ciphertext. What would happen if I sent it an encrypted message, without any context or explanation: would ChatGPT be able to identify the encryption method and decipher the message?

    And that’s where something strange happens. For a human, the hard part of this task is identifying the code; once it is found, decoding the message is straightforward: you simply swap the letters of the message one by one. ChatGPT, on the other hand, has no trouble finding the code but a lot of trouble decoding it correctly. In the tests I carried out, ChatGPT generally recognized Caesar’s cipher (with more difficulty in French than in English), was able to explain its principle, and even gave the correspondence between plaintext and cipher letters for the whole alphabet; but its decoding was approximate, if not downright wrong. This is particularly true of long, rarely used, lexically isolated words like “gobbledygook” or “rigamarole”.

    So, where does this come from? To understand it, we need to look at the first stage in the processing of text by a Transformer model: tokenization, i.e. the breakdown of text into tokens and the embedding of these tokens.

    Tokenization

    To perform calculations on text, and thus build language models, it is first necessary to transform the text into numbers. The simplest way to do this is to use the way text is usually encoded on computers, i.e. with a binary code (convertible into a number) associated with each letter. This works, but it’s not very efficient, because to extract meaning from the text you first have to combine the letters into words. The other simple approach is to assign a number per word in the dictionary, rather than per letter. This saves the machine a lot of calculations, but causes other problems:

    • a much larger vocabulary is required. Instead of the 26 numbers needed to encode the letters, tens of thousands are needed for the complete vocabulary of a language.
    • the resulting model is highly sensitive to spelling. When an unusual variant of a word is used, or simply when a typo occurs, the word may not be found in the vocabulary at all, and the model cannot process it (see the sketch after this list).
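
    To make the two extremes concrete, here is a small Python sketch (the vocabulary and the sentence are made up for illustration): letter-level encoding always works but produces long sequences of meaningless numbers, while word-level encoding fails as soon as a word is missing from the vocabulary.

    ```python
    # Letter-level encoding: one number per character (here, Unicode code points).
    text = "protect"
    print([ord(c) for c in text])    # [112, 114, 111, 116, 101, 99, 116]

    # Word-level encoding: one number per dictionary word.
    vocab = {"the": 0, "cat": 1, "protects": 2, "its": 3, "kittens": 4}
    sentence = "the cat protects its kitens".split()    # note the typo in "kittens"
    print([vocab.get(word) for word in sentence])       # [0, 1, 2, 3, None]: the typo is simply not representable
    ```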

    The intermediate solution is to break the text down into “lexemes” or “tokens”, i.e. fragments of words that appear frequently and carry meaning. For example, the suffix -er is a lexeme that carries meaning: it turns an action into the agent of that action (“teach” -> “teacher”). Common prefixes and suffixes, together with word roots, make good lexemes for breaking down text. Lexemes/tokens are an intermediate unit between letters and words.

    But how can we effectively decide which tokens are worth having in our vocabulary? Why record “-er” rather than “tkq”, for example? Fortunately, we don’t need to create the list of tokens by hand; instead, we use statistical methods. The most common is Byte-Pair Encoding, which iteratively merges the pair of symbols (initially, individual characters or bytes) that appears together most often. In practice, this simple criterion of “letters that often appear together” is enough to obtain satisfactory tokens.
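
    As an illustration, here is a toy version of that merging loop in Python (the word list and the number of merges are arbitrary; real tokenizers are trained on huge corpora and keep tens of thousands of merges):

    ```python
    from collections import Counter

    def bpe_merges(words, num_merges=10):
        """Toy Byte-Pair Encoding: repeatedly merge the most frequent adjacent pair of symbols."""
        corpus = [list(w) for w in words]   # start from individual characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols in corpus:
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += 1
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]
            merges.append(a + b)
            for symbols in corpus:          # replace every occurrence of the pair with the merged symbol
                i = 0
                while i < len(symbols) - 1:
                    if symbols[i] == a and symbols[i + 1] == b:
                        symbols[i:i + 2] = [a + b]
                    else:
                        i += 1
        return merges, corpus

    words = ["protect", "protects", "protected", "protein", "berry", "blueberry"]
    merges, segmented = bpe_merges(words)
    print(merges)      # fragments such as "pro", "protect" and "berry" emerge from frequency alone
    print(segmented)
    ```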

    Back to the topic

    So, what does this have to do with our subject? Well, since text is always broken down into tokens, the model only “sees” tokens, not individual letters. It doesn’t have access to the text itself, but to a sequence of codes, each of which corresponds to a token, and it has no access to the composition of these tokens. Understandably, it has trouble spelling a word.
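
    You can check this with OpenAI’s open-source tiktoken library, which exposes the same tokenizers the models use (o200k_base is the encoding associated with GPT-4o); the model receives nothing but the list of integers:

    ```python
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")          # the encoding used by GPT-4o
    ids = enc.encode("Why ChatGPT can't spell")
    print(ids)                                         # a short list of integers: this is all the model sees
    print([enc.decode([i]) for i in ids])              # the text fragment hidden behind each integer
    ```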

    When you look at it from this angle, you might in fact be surprised that ChatGPT can spell words and manipulate letters at all.

    How many “R”s in “Strawberry”

    Since the release of ChatGPT, users have noticed that it is unable to solve a simple problem: counting the number of “R”s in the word “strawberry”. This problem persists with more recent versions (not necessarily with “strawberry” itself, now that it has been included in the training data, but try with the S in “assassin”, for example).

    The explanation is the same as with Caesar’s cipher: the lexeme breakdown doesn’t match the individual letters. GPT-4o splits “strawberry” into 3 lexemes: st-raw-berry (OpenAI’s tokenizer is available here for testing). So it knows there’s an R in “raw” and at least one in “berry” but doesn’t know how many.
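
    The same tiktoken snippet makes the problem visible: decode each token separately, and the letters the model would need to count are locked inside opaque fragments (the exact split may vary by encoding; st-raw-berry is what I observed for GPT-4o):

    ```python
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")        # GPT-4o's encoding
    pieces = [enc.decode([i]) for i in enc.encode("strawberry")]
    print(pieces)                                    # e.g. ['st', 'raw', 'berry']
    print({p: p.count("r") for p in pieces})         # the per-token R counts are exactly what the model never sees
    ```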

    You can easily find other words that cause the same issue. Look for a word in which a letter appears several times, including in a doubled pair (e.g. the G in “gaggle”) or inside a very common digraph like “ch”, “qu” or “ou” (e.g. the U in “ubiquitous”).

    Note that there’s a simple way to avoid this problem: add “proceed step by step” to your question. This pushes the model to spell the word out letter by letter before counting, which puts the individual letters back into its context.

    A recent addition: patching

    Last December 12, Meta published a paper outlining a different approach: patching. This approach groups the letters (in fact, the bytes) of the text into fragments called “patches” dynamically, without a fixed list of tokens. The new architecture features a Transformer layer that encodes the bytes into patch representations, and cross-attention layers connected directly to the bytes themselves, allowing the model to “see” the composition of each patch.
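
    In the paper, a patch boundary is opened wherever a small byte-level model finds the next byte hard to predict. The sketch below is only a toy illustration of that idea (bigram statistics instead of a real model, an arbitrary threshold, and nothing resembling Meta’s actual code), but it shows how variable-length patches can be formed without any fixed vocabulary:

    ```python
    import math
    from collections import Counter, defaultdict

    def train_bigram(corpus):
        """Count, for each byte, which bytes tend to follow it."""
        following = defaultdict(Counter)
        for text in corpus:
            data = text.encode("utf-8")
            for a, b in zip(data, data[1:]):
                following[a][b] += 1
        return following

    def next_byte_entropy(following, prev):
        """Shannon entropy (in bits) of the next byte given the previous one."""
        counts = following.get(prev)
        if not counts:
            return 8.0          # unseen context: assume maximum uncertainty
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def patch(text, following, threshold=2.0):
        """Open a new patch wherever the next byte is hard to predict."""
        data = text.encode("utf-8")
        patches, current = [], bytearray([data[0]])
        for prev, nxt in zip(data, data[1:]):
            if next_byte_entropy(following, prev) > threshold:
                patches.append(current.decode("utf-8", errors="replace"))
                current = bytearray()
            current.append(nxt)
        patches.append(current.decode("utf-8", errors="replace"))
        return patches

    model = train_bigram(["the queen quietly questioned the quality of the quest"])
    print(patch("the quiet queen", model))
    # predictable spans (after "q", the next byte is always "u") stay inside a patch;
    # unpredictable positions open a new one
    ```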

    Tests carried out by Meta show that not only does this method capture word spelling better, it also scales better than tokenization with a fixed vocabulary.