Re:It is just a bit better google search for me... (Score 1)
But where do all of those embeddings go? In a vector database, right?
SIGH.
No.
And not because "embeddings" aren't stored in a database (they're not even stored, they're a transformation of YOUR input text). And not because NO part of the model is stored in a database (the whole thing is loaded into VRAM). It's because neural networks don't have some collection of facts that they just search through.
For simplicity's sake I'll leave out the attention blocks and just focus on the linear networks. How they work is: picture the most insane flow chart your mind could possibly conceive of. Billions or even trillions of nodes and lines. Many thousands of lines leading to and from each node. Each node not being yes/no, but any degree of "maybe", and different input lines having entirely different amounts of significance to the decision. And each node not answering a single question, but a superposition of questions, the individual answers of which only get teased apart later.
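To make that concrete, here's a toy Python sketch of a couple of those "linear" layers - random weights and made-up sizes, nothing from any real model - just to show that every "node" is a weighted mix of ALL its inputs, squashed into a degree of "maybe" rather than a yes/no:

import numpy as np

rng = np.random.default_rng(0)

def layer(x, weights, biases):
    # every output node mixes all of its inputs, each with its own significance,
    # and outputs a continuous "maybe" (tanh) rather than a hard yes/no
    return np.tanh(weights @ x + biases)

x = rng.normal(size=16)                              # hypothetical input vector
w1, b1 = rng.normal(size=(32, 16)), rng.normal(size=32)
w2, b2 = rng.normal(size=(8, 32)), rng.normal(size=8)

hidden = layer(x, w1, b1)                            # first bank of "nodes"
out = layer(hidden, w2, b2)                          # answers teased apart in later layers
print(out)

Real networks stack billions of these, which is where the "insane flow chart" comes from.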
It's perhaps easiest to understand with image recognition nets. Here's a useful primer.
But, TL;DR: all the information is the result of stepping through this insane flow chart. There is no "iterating over some database" stage. It's all in the logic. With heavy overlap between concepts. For example, when Anthropic detected and boosted the cluster of neurons whose collective action fires when the topic of the Golden Gate Bridge comes up, it didn't just boost the bridge, but also the colours of the bridge, Alcatraz Island, the Muir Woods, San Francisco, the Pacific Ocean, tule fog, on and on - everything connected to the concept.
Latent spaces are conceptual spaces. Every hidden state vector represents a point in N-dimensional space (where N depends on the model, but is usually hundreds). The more related two concepts are, the closer together their vectors lie in that space (as measured by cosine distance). LLMs work by many layers of processing of many hidden states. For example, if you had the word "bank" (let's just pretend 1 token = 1 word, though it's not like that), it might mean a financial bank or a river bank. But if there were water-related words in the surrounding context, then the model would shift the position of that "bank" vector in the direction of the water-related vectors. Now it's mapping to some other part of the latent space that no longer maps directly to a word, but contains a much more precise "conceptual position". Atop this, the attention mechanism allows the model to focus on the specific tokens of relevance rather than the entire vector attracting everything evenly at once. This all happens again and again and again, allowing ever-more elaborate operations to chain off each other repeatedly. It's Turing-complete, and fully self-assembled.
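If you want to see what "closer in latent space" means numerically, here's a toy Python example with made-up 3-dimensional vectors (real models use hundreds of learned dimensions; these numbers are purely illustrative):

import numpy as np

def cos_sim(a, b):
    # cosine similarity: 1.0 = same direction, 0.0 = unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

bank    = np.array([0.5, 0.5, 0.0])   # ambiguous "bank"
finance = np.array([0.9, 0.1, 0.0])   # financial-bank direction
water   = np.array([0.0, 0.9, 0.4])   # water/river direction

print(cos_sim(bank, finance), cos_sim(bank, water))   # without context: leans financial

bank_in_context = bank + 0.5 * water                  # water-related context nudges the vector
print(cos_sim(bank_in_context, finance), cos_sim(bank_in_context, water))  # now leans river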
Otherwise, how would Bing's "AI" results give you links to the websites whose content it mainly used to synthesise its answer?
Bing uses RAG, Retrieval Augmented Generation. It's the combination of an LLM with a search engine. Two separate things being invoked together. The simplest forms of RAG use a very lightweight summarization model that knows nothing on its own but only knows how to summarize things from other pieces of text. More complex RAG models have large amounts of information of their own, but are also fed the results of queries (or can even invoke queries on their own), so when they process the input, they have the added external context.
The queries are not a fundamental part of the LLM. That's an external add-on. In many cases the results are literally just appended into the chat history. It has nothing to do with the inner workings of the LLM, which is a distinct and self-contained element.
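To show how bolted-on it is, here's a minimal Python sketch of the simplest RAG wiring - the search_engine and llm functions are hypothetical stand-ins, not any real API - where the retrieved text is literally just pasted into the prompt before the model ever sees it:

def search_engine(query: str) -> list[str]:
    # stand-in for Bing / Qdrant / whatever external retrieval backend
    return ["Document snippet 1 about the query...",
            "Document snippet 2 about the query..."]

def llm(prompt: str) -> str:
    # stand-in for the actual language model call
    return f"(model answer conditioned on: {prompt[:60]}...)"

def rag_answer(question: str) -> str:
    snippets = search_engine(question)          # external retrieval step, not the LLM
    context = "\n".join(snippets)
    prompt = (f"Use the following sources to answer.\n"
              f"Sources:\n{context}\n\n"
              f"Question: {question}\nAnswer:")
    return llm(prompt)                          # the LLM itself never touches a database

print(rag_answer("Why is the sky blue?"))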
My understanding is that this is what you do when you set up your own: you install a vector database like Qdrant,
You never use a database at any point in the training process, neither in generating the foundation nor in doing the finetune.
divide it into overlapping chunks to calculate embeddings for
You do divide it into overlapping chunks (if we're talking about the foundation; you don't do that with the finetune). This is what you store until you're ready to start training. For training, the first step with text is tokenization. You iterate one token at a time. The net result of the training process is that the model tries to predict the next token, and you get an error metric of how far off it was on each dimension. These errors backpropagate through the model, essentially slightly shifting all of the weights and biases in the direction of what would have been closer to the correct answer for that token. These shifts are tiny, veritably homeopathic - ~1e-5 or so. But over time, the model incrementally gets better at predicting what the next token will be.
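For illustration only, here's roughly what one training step looks like in PyTorch on a toy model (nothing like a real LLM's architecture; it just shows the predict-next-token / backpropagate / nudge-the-weights loop):

import torch
import torch.nn as nn

vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim),   # toy model, not a real LLM
                      nn.Linear(dim, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)  # tiny, "homeopathic" steps
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, 17))   # pretend this is a tokenized chunk
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens up to t

logits = model(inputs)                           # scores for every possible next token
loss = loss_fn(logits.view(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # errors flow back through the weights
optimizer.step()                                 # nudge every weight slightly toward "correct"

No database, no search - just repeating this over an enormous amount of text.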
Databases NEVER come into the picture.
Searching NEVER comes into the picture.
Please understand this fact.