It suddenly occurred to me to ask a side question of whether it could recognize this character from these contour points, and it could. It said it "looked" like the letter 'a'.
You only asked it once. It could be any of:
- You got one lucky answer. It could be that chatbots are bad at classifying glyphs and you just got one lucky answer (see: spelling questions. Until byte-based input handlers are a thing, a chatbot doesn't see the ASCII chars forming a string, it only sees tokens -- very vaguely, but a good enough metaphor: words; see the token sketch just after this list) (Same category: all the usual "Chatbot successfully passes bar/med school/Hogwarts final exams better than average students" press releases). But give it any other glyph from your current work and the overall success rate won't be higher than random.
- It actually answers "a" to everything. Give any ROT-13 encrypted text to the older ChatGPT we tested this on and it always answers "Attack at dawn" (see: Cryptonomicon).
- LLMs could be not-too-bad classifiers. It's possible that (having been trained on everything that was scrapable off the internet) the LLM's model has learned to classify at least some glyphs better than random. (Again, you'd be surprised at what was achieved in handwriting recognition with mere HMMs.) (And LLMs have been applied to tons of other pattern-recognition tasks that aren't really human language: they've been used in bioinformatics for processing gene and protein sequences.)
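If you want to see the token point for yourself, here is a minimal sketch using the tiktoken library. The encoding name "cl100k_base" is one of OpenAI's; other chatbots use different tokenisers, so treat this as illustrative only.

```
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)                                              # a short list of integer ids
print([enc.decode_single_token_bytes(t) for t in tokens])  # sub-word chunks, not letters
```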
Just as an exercise: take a completely different non-letter vector object encoded the same way. Replace the vector in your question with the bogus vector and keep the rest of the question as-is, including any formulation that would put the bot on a certain track (e.g.: if the original question was "what letter of the alphabet is encoded in the following glyph", keep that part as-is). And ask it to explain what it saw, using the exact same question.
Repeat with multiple fonts but other letters, and multiple non-letter objects.
Does it consistently answer better than random? Does it somehow recognise the non-letters (calling them emojis or symbols if the prompt forced it to name a letter)?
Or does it call everything "a"? Or does it only successfully recognise "a"s and "o"s but utterly fail to recognise "g"s or "h"s?
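For concreteness, a minimal sketch of that experiment. `ask_chatbot` is a hypothetical placeholder for whatever chatbot interface you actually use, and the prompt template is assumed to be the one from the original question:

```
from collections import Counter

PROMPT = "What letter of the alphabet is encoded in the following glyph?\n{vector}"

def ask_chatbot(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to your chatbot and return its answer."""
    raise NotImplementedError("wire this up to the chatbot you are testing")

def run_trials(glyphs: dict[str, str], n_repeats: int = 3) -> Counter:
    """`glyphs` maps a true label ('a', 'g', 'not-a-letter', ...) to its vector encoding."""
    results = Counter()
    for label, vector in glyphs.items():
        for _ in range(n_repeats):
            answer = ask_chatbot(PROMPT.format(vector=vector)).strip().lower()
            results[(label, answer)] += 1
    return results

def accuracy(results: Counter) -> float:
    correct = sum(n for (label, answer), n in results.items() if label == answer)
    return correct / sum(results.values())

# Chance level for 26 letters is ~1/26, about 0.04. Only a success rate consistently
# above that, across fonts and across letters, would actually mean something.
```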
How in the world can a language model do that?
Again: HMMs, and LLMs trained on bioinformatics sequences.
The data had been normalized in an unusual way and the font was very rare. There is zero chance it had memorized this data from the training set.
"rare" and "unusual" don't necessarily means the same thing to human eyes and to a mathematical model.
It doesn't need to literaly find the exactt same string in the training set (it's not C/C++' strstr() ), it merely needs to have seen enough data to learn some typical properties.
And look at how some very low-power retro tech used to handle handwriting recognition: Palm's Graffiti 1 didn't even rely on any form of machine learning. Just a few very simple heuristics like the total length travelled in each cardinal direction, the relative position of the start and stop quadrants, etc.
So a property could be "the vector description is very long", which is well within what a typical language model could encode.
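To make "very simple heuristics" concrete, here is a rough sketch of that kind of feature extraction: total travel in each direction, where the stroke starts and ends relative to its bounding box, and how long the description is. The feature set is made up for illustration, not Palm's actual one.

```
def stroke_features(points: list[tuple[float, float]]) -> dict[str, float]:
    """Crude, ML-free features over a list of (x, y) stroke points."""
    right = left = up = down = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        right += max(dx, 0.0); left += max(-dx, 0.0)
        up    += max(dy, 0.0); down += max(-dy, 0.0)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    w = (max(xs) - min(xs)) or 1.0
    h = (max(ys) - min(ys)) or 1.0
    return {
        "travel_right": right, "travel_left": left,
        "travel_up": up, "travel_down": down,
        # start/end position normalised to the bounding box (the "quadrant" idea)
        "start_x": (points[0][0] - min(xs)) / w, "start_y": (points[0][1] - min(ys)) / h,
        "end_x": (points[-1][0] - min(xs)) / w,  "end_y": (points[-1][1] - min(ys)) / h,
        # a stand-in for "the vector description is very long"
        "n_points": float(len(points)),
    }
```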
And again, that's assuming the chatbot consistently recognises glyphs better than random.
It absolutely must have visualized the points somehow in its "latent space" and observed that it looked like an 'a'.
Stop anthropomorphising AI, they don't like it. :-D
Jokes aside: LLMs don't process anything visually. They just produce the most likely words given a context and their model. Yes, a very large model could encode relationships between words that correspond to visual properties. And it is plausible that, given enough stuff scraped from the whole internet, it has learned a couple of properties of what makes an "a".
But yet again, that's assuming the chatbot consistently recognises glyphs better than random.
I asked it how it knew it was the letter 'a', and it correctly described the shape of the two curves (hole in the middle, upward and downward part on right side) and its reasoning process.
It's an LLM. It's not "describing what it's seeing". It's giving the most likely answer to what an "a" looks like based on all it has learned from the internet.
Always keep in mind that what a chatbot gives you isn't "What is the answer to my question?" but "What would a convincing answer to this question look like?".
There is absolutely more going on in these Transformer-based neural networks than people understand.
Yes, I totally agree with that. An often completely overlooked aspect is the interpretation that goes on in the mind of the homo sapiens reading the bot's answers.
I could joke about seeing Jesus in toast, but the truth is that we are social animals: we are hardwired to assume there's a mind whenever we see realistic and convincing language. Even if that language is "merely" the output of a large number of dice rolls and a "possible outcomes look-up table" of a size that's incomprehensible to the human mind.
It appears that they have deduced how to think in a human-like way and at human levels of abstraction, from reading millions of books.
"appears" is the operative key word here. It's designed to give realistic sounding answers.
Always. It always answers, and it always sounds convincing by design, no matter how unhinged the question actually is.
The explanation looks "human-like" because the text-generating model has been trained on a bazillion human-generated texts.
In particular they have learned how to visualize like we do from reading natural language text.
Nope. Not like we do, at all.
But they are good at generating text that makes it look as if they did.
Because again, they are good at language, and that's what they are designed to do.
In the very best case, one of the latest-generation multimodal models, which not only does text but is also designed to process images (the kind of chatbot to which you can upload images, or which you can ask to generate images), could be generating some visuals from the prompt and then attempting text recognition on that intermediate output.
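Roughly what such a pipeline could look like, as a sketch: rasterise the contour points into an ordinary image, then hand the pixels to the vision stage. The rendering below uses matplotlib; `classify_image` is a hypothetical stand-in for whatever vision/OCR step the multimodal model actually applies internally.

```
import matplotlib.pyplot as plt

def rasterise(contours: list[list[tuple[float, float]]], path: str = "glyph.png") -> str:
    """Draw each contour as a closed polyline and save it as a plain raster image."""
    fig, ax = plt.subplots(figsize=(2, 2))
    for contour in contours:
        xs, ys = zip(*(contour + [contour[0]]))  # close the loop
        ax.plot(xs, ys, color="black")
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return path

def classify_image(path: str) -> str:
    """Hypothetical stand-in for the vision / OCR stage of a multimodal model."""
    raise NotImplementedError
```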
It wouldn't surprise me at all if they can visualize a chess board position from a sequence of moves.
Given a large enough context window, it could somehow keep track of piece positions.
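For contrast, explicit tracking of piece positions is trivial code, here using the python-chess library. An LLM has no such board structure; it has to approximate this implicitly, token by token, inside its context window.

```
import chess

def board_after(moves_san: list[str]) -> chess.Board:
    """Replay a game move by move; push_san raises if a move is illegal in that position."""
    board = chess.Board()
    for move in moves_san:
        board.push_san(move)
    return board

print(board_after(["e4", "e5", "Nf3", "Nc6", "Bb5"]))  # prints an ASCII diagram of the position
```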
But what pro players seem to report is that current chatbots actually suck at that.