Despite all its advances, AI still can't write. Ask an image generator like DALL-E to create a menu for a Mexican restaurant, and it will produce appetizing pictures of food alongside awkward, misspelled captions.
“Image generators tend to be better at artifacts like cars and people's faces, and worse at fine details like fingers and handwriting.”
Asmelash Teka Hadgu, co-founder of Lesan and researcher at the DAIR Institute
The technologies behind image and text generators are different, but both kinds of models trip over fine details for similar reasons. Consider how each one works:
Image generators typically use diffusion models that reconstruct images from noise.
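To make that concrete, here is a deliberately tiny sketch of the diffusion loop: start from pure noise and repeatedly apply a denoising step until an image emerges. Everything here is illustrative; a real diffusion model learns its denoising step from enormous image datasets, while this toy simply blends toward a known target.

```python
# Toy illustration of the diffusion idea: begin with pure noise and
# iteratively denoise. In a real model, a trained network predicts the
# noise to remove at each step; this stand-in step just blends toward
# a known target, purely to show the shape of the loop.
import numpy as np

rng = np.random.default_rng(0)
target = rng.random((8, 8))        # stand-in for a "clean" 8x8 image
x = rng.standard_normal((8, 8))    # starting point: pure Gaussian noise

steps = 50
for t in range(1, steps + 1):
    alpha = t / steps              # how strongly to trust the "denoiser"
    x = (1 - alpha) * x + alpha * target

print(float(np.abs(x - target).max()))  # 0.0: the noise has become the image
```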
Text generators, meanwhile, are large language models (LLMs). They may seem to read and respond to prompts the way a human brain does, but in reality they use sophisticated mathematics to match the pattern of a prompt to a pattern in their latent space, and that match is what produces the answer.
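As a rough picture of that matching, here is a minimal sketch: the prompt and the stored patterns are vectors, and the "answer" corresponds to whichever stored pattern is mathematically closest. The vectors and labels below are invented for illustration; real embeddings are learned and have thousands of dimensions.

```python
# Minimal sketch of "matching a prompt to a pattern in latent space":
# the model's response follows from whichever pattern vector is
# closest to the prompt vector, not from any act of reading.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

prompt = np.array([0.9, 0.1, 0.3])          # made-up prompt embedding
patterns = {                                 # made-up stored patterns
    "greeting": np.array([0.8, 0.2, 0.25]),
    "farewell": np.array([0.1, 0.9, 0.4]),
}
best = max(patterns, key=lambda k: cosine(prompt, patterns[k]))
print(best)  # "greeting": the mathematically closest pattern wins
```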
These models are trained to recreate something resembling what they saw in their training data, but they don't know the rules we take for granted: that "hello" is not spelled "heeelllooo", or that human hands usually have five fingers. In other words, a neural network produces fingers that look human without knowing how many there should be, and it draws letters based on what they look like without understanding how they fit together.
And while ChatGPT can write summaries in seconds, it fails comically when asked to come up with a 10-letter word that contains neither "A" nor "E" (for example, ChatGPT will cheerfully suggest "balaclava").
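The rule is trivial to verify in code, which is what makes the failure telling: the check operates on individual characters, exactly the level of detail an LLM never sees. The example words below are my own:

```python
# Character-level check for the puzzle ChatGPT flubs: a 10-letter
# word containing neither "a" nor "e".
def satisfies(word: str) -> bool:
    return len(word) == 10 and not set("ae") & set(word.lower())

print(satisfies("balaclava"))   # False: nine letters, four of them "a"
print(satisfies("unorthodox"))  # True: ten letters, no "a" or "e"
```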
A video posted on our Telegram channel shows a user asking ChatGPT to draw the word "Honda" in ASCII art. He eventually succeeded, but only after an odyssey of failed attempts. Presumably the problem is that the AI wasn't trained on enough ASCII art.
Engineers can address the finger problem by augmenting their datasets and training models specifically to teach the AI what hands should look like. But experts don't expect the spelling problems to be solved as quickly, especially considering how many different languages the AI has to learn.

But fundamentally, LLMs simply don't know what letters are, even though they can write poetry in seconds. LLMs are built on the transformer architecture, which, notably, never reads text as such. When you type a query, it is converted into tokens: the model has a single encoding for what "hello" means, but no access to its individual letters. Ask it about the history of the alphabet, though, and it will happily recite what Wikipedia says about any letter.
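You can see this for yourself with a tokenizer. The sketch below uses OpenAI's open-source tiktoken library (my choice of tool, not one named in the article): common words typically reach the model as a single opaque token ID, with no letters in sight.

```python
# Assumes OpenAI's tiktoken library is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4-era models
tokens = enc.encode("hello")
print(tokens)                                # a short list of integer IDs,
                                             # often just one for "hello"
print([enc.decode([t]) for t in tokens])     # the text each ID stands for
```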
And if you look closely, AI's mistakes are not limited to fingers and lettering. These models make small, local errors all the time; humans just happen to be especially good at spotting some of them.
To the average person, a generated image of a music store may look believable. But anyone who knows a little about music will notice that some of the guitars have seven strings, or that the black and white keys of a piano are laid out incorrectly.
Although AI models are improving at a tremendous rate, these tools still run into such problems, which for now limits what the technology can do (and protects people from a flood of deepfakes).