Correct me if I’m wrong, but this concept of state applies within a single prediction, i.e. across a series of token proposals, not as persistent state over, say, a chat session.
At each new token processed (edit: whether the token comes from the user or was generated by the model), this dynamic internal state is updated via a formula based on the model parameters and the latest processed token.
To make this clearer: if you process 1000 tokens, the neural network will have gone through a sequence of 1000 states.
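A toy sketch of that recurrence, state_t = f(state_{t-1}, token_t), with random matrices standing in for the model parameters (in a real transformer the "state" is the growing KV cache, but the one-state-per-token picture is the same):

```python
import numpy as np

# Toy stand-in for the per-token recurrence described above.
rng = np.random.default_rng(0)
d = 8                                    # toy state dimension
W_state = rng.normal(size=(d, d)) * 0.1  # stand-ins for frozen model parameters
W_token = rng.normal(size=(d, d)) * 0.1

def update(state, token_embedding):
    """Fold the latest token into the running internal state."""
    return np.tanh(W_state @ state + W_token @ token_embedding)

tokens = rng.normal(size=(1000, d))      # 1000 (fake) token embeddings
state = np.zeros(d)
states = []
for tok in tokens:
    state = update(state, tok)
    states.append(state)                 # 1000 tokens -> 1000 successive states
```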
Each of these states can be probed and analysed, by training a classifier on top of it, for the content it contains, whether thoughts or emotions: some states can be classified as "happy", some as "sad", and so on.
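A minimal sketch of such a probe, assuming you already have a pile of per-token states and "happy"/"sad" labels for them (both are random placeholders below, so this only shows the mechanics, not a real result):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# `states` would be hidden states collected as above; `labels` would mark
# which ones came from "happy" vs "sad" text. Random placeholders here,
# so expect ~50% accuracy.
states = np.random.default_rng(1).normal(size=(1000, 8))
labels = np.random.default_rng(2).integers(0, 2, size=1000)  # 0 = "sad", 1 = "happy"

X_train, X_test, y_train, y_test = train_test_split(states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```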
This is not yet the mainstream view, but viewed through this prism, and anthropomorphising a little, it can be seen as a stream of proto-consciousness: over the course of the conversation, the inner state of the neural network has gone through various thoughts and emotions.
At the end of the chat session, this internal state is not persisted (but it could be recreated from the produced conversation, since the computation is deterministic). This internal state is big, and its size is proportional to the length of the context window. (If you want to persist it between sessions, you can, simply by keeping the last "context window size" tokens produced and recomputing the features.)
At the next session you start with a fresh internal state.
The conversation produced is persisted for use as input for future training, where good conversations will be encouraged and bad ones discouraged via Reinforcement Learning from Human Feedback (RLHF).
The dynamics of this internal state are what Large Language Models learn.
> At the end of the chat session, this internal state is not persisted (but could be recreated from the produced conversation as it is deterministic)
You're mistaken about internal state in a chat session. The model loses all state after each response. There is internal state for a moment while a single response is being generated, but it's not persisted once that response is done. Between messages, it has no memory of the previous messages in your conversation.
The way it's implemented is the seemingly silly/expensive method of prepending the conversation history to your request to the LLM. The "internal state" is usually stored on your local machine.
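A sketch of that loop, with `complete` as a stand-in for whatever completion endpoint is actually being called (names here are illustrative, not any particular API):

```python
# The client keeps the transcript and prepends it to every request.
def complete(messages):
    # Placeholder for a real completion endpoint; echoes for demo purposes.
    return f"(model reply, conditioned on {len(messages)} prior messages)"

history = []  # the only "state", and it lives client-side

def chat(user_message):
    history.append({"role": "user", "content": user_message})
    reply = complete(history)           # the server sees the whole transcript
    history.append({"role": "assistant", "content": reply})
    return reply

chat("Hi!")
chat("What did I just say?")            # answerable only because we resent it
```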
I believe you misunderstood the comment you're replying to. It says the internal state could be recreated because it is deterministic, and this is correct.
Assuming you control the parameters (seed, etc.), you can reproduce the exact internal states by submitting the same tokens with the same parameters again, step by step.
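For instance, with a local model the determinism is easy to check directly (the model name is just an example, and this assumes the same hardware and software versions across runs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The forward pass is deterministic: the same tokens yield the same
# hidden states every time they are resubmitted.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("Hello, world", return_tensors="pt").input_ids
with torch.no_grad():
    h1 = model(ids, output_hidden_states=True).hidden_states[-1]
    h2 = model(ids, output_hidden_states=True).hidden_states[-1]
print(torch.equal(h1, h2))  # True: the internal states were recreated exactly
```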
With GPT-4 you can do this in the OpenAI Dialogue Playground or via the API.
I agree that chatbots that offer APIs are most likely currently implemented as stateless.
Meaning they take as input the last "context window" tokens from the client, use them to recompute the internal state, and then start generating token by token. But after the generation, no memory needs to be kept on the server (beyond, at most, the "context window" of tokens, which is tiny compared to the full internal state).
Chatbots like llama.cpp in interactive mode don't have to recompute this internal state at every interaction.
You can view the last "context window" tokens as a compressed representation of the internal state.
This becomes more pertinent as the "context window" gets bigger: the bigger the context window, the more you have to recompute at each interaction.
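For illustration, here is roughly what an interactive chatbot can do instead of recomputing, sketched with Hugging Face's `past_key_values` cache (the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Keep the KV cache across turns and feed only the new tokens,
# rather than re-running the whole transcript each time.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

past = None
for turn in ["Hello.", " How are you today?"]:
    ids = tok(turn, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, past_key_values=past, use_cache=True)
    past = out.past_key_values          # the persisted internal state
```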
The transformer architecture can also be trained differently, so as to generate "context vectors" of finite size that synthesize all the past messages of the conversation (the encoder-decoder architecture). This "context vector" can be kept on the server more easily, and will contain the gist and the important parts of the conversation, but won't be able to quote things from the past exactly. The context vector is then used to condition the generation of the reply. And once the chatbot has replied and received a new prompt, you update the finite-size context vector (with a distinct neural network) to incorporate the latest information, and use the result to condition the next generation, ad infinitum.
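A toy sketch of that update loop, with the trained networks (`encode`, `update_context`, `generate_reply`) replaced by dummies just to show the data flow; all names and sizes are made up:

```python
import numpy as np

# Fixed-size context-vector scheme: summary in, reply out, summary updated.
d = 16
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 2 * d)) * 0.1

def encode(text):                        # message -> vector (dummy)
    return rng.normal(size=d)

def update_context(context, message_vec):
    # The distinct network that folds the latest message into the summary.
    return np.tanh(W @ np.concatenate([context, message_vec]))

def generate_reply(context, prompt):     # generation conditioned on the summary
    return f"(reply conditioned on a {context.shape[0]}-dim context vector)"

context = np.zeros(d)                    # fixed size: easy to keep server-side
for prompt in ["Hi!", "Tell me more."]:
    reply = generate_reply(context, prompt)
    context = update_context(context, encode(prompt))
    context = update_context(context, encode(reply))
```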
>I didn't disagree about how you can reproduce things. You have to do that to continue a conversation.
You do not have to reproduce the internal states to continue a conversation. When prior parts of the conversation are loaded into context the hidden states which generated those prior tokens are not reproduced. They are only reproduced if resubmitted in a piecemeal fashion, which does not happen in normal conversation continuation.
You seem to not understand what the internal state is or how it is differentiated from the external state of the overall conversation.
Ok, I agree! That one sentence of mine was wrong. You don't have to reproduce state to continue a conversation. I was just trying to throw a bone, but even that bone was bad.
But then... doesn't that just point out that I was right on everything else? That there is absolutely no state between chat messages?
^^ See the above comment and the "Transformers are RNNs" paper to convince yourself.
There are various ways of seeing how transformer architectures work.
For the past data, using a temporal causal mask, you can compute all the past features exactly as they would have been seen step by step. This lets you do all the past computations in parallel, which hides the sequential aspect.
What I've described corresponds to the simpler subvariant of the transformer architecture (figure 1 of "Attention Is All You Need"; see also "Transformers are RNNs", https://arxiv.org/pdf/2006.16236.pdf), in the case where you don't give the past as input to the encoder branch, but rather prepend it to the output, which is what people usually do in practice (see LLaMA). If you want to use both branches of the transformer architecture, so that you can do filtering and smoothing (i.e. not using a causal mask) for the past data, it creates a "bottleneck" in your architecture, as you synthesize all the past into a finite-sized context vector that you then use in the right branch of the transformer.
But alternatively, you can loop along the time dimension first, making the sequential nature more apparent. Due to how the (causally masked) transformer architecture is defined, this gives exactly the same computation. It's just loop reordering.
For the generation part, the sequential aspect is more evident, as tokens are produced one by one and fed back to the transformer for the next token prediction.
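A small numeric check of that loop-reordering claim, sketched in plain NumPy (single attention head, toy sizes):

```python
import numpy as np

# Causally-masked attention over the whole sequence in one parallel pass
# equals attending step by step to the growing prefix.
rng = np.random.default_rng(0)
T, d = 5, 4
Q, K, V = rng.normal(size=(3, T, d))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Parallel: one masked pass over all positions at once.
scores = Q @ K.T / np.sqrt(d)
scores[np.triu_indices(T, k=1)] = -np.inf        # mask out future positions
parallel = softmax(scores) @ V

# Sequential: position t attends only to keys/values 0..t.
sequential = np.stack([
    softmax(Q[t] @ K[: t + 1].T / np.sqrt(d)) @ V[: t + 1]
    for t in range(T)
])
print(np.allclose(parallel, sequential))         # True: same computation
```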
If I talk to Alice about my day, and then Bob reads the transcript and I ask him about our conversation, and he has no idea that he wasn’t Alice, Bob does not have internal state. He can continue the conversation. There is a brief moment of constructing some state to construct the answer. But there is literally nothing persisting and Bob wouldn’t flinch at all if we change details in the conversation history on the next response.
This is a misleading analogy because Bob "reads the transcript"; we generally view "reading" as separate from our conscious narrative. However, consider the alternate scenario where Bob "replays the stream of consciousness" from Alice. In that case, we may argue that Bob has conscious continuity from Alice.
The argument would then be that the context window is functionally closer to "replaying a stream of consciousness" than "reading a transcript".
It’s literally the text. It is not a stream of consciousness. There is no carryover of “consciousness” state. It is the raw text without even the embedding information, not that we should consider the embedding representation to be an internal state, because it is not.
That difference is the whole point. There is nothing to transfer.
Humans verbalize their conscious state. What little consciousness GPT has, i.e. CoT and the like, it picked up by imitating human verbalization of conscious interiority. (I don't think layer state qualifies as consciousness.) As such, I believe two things: 1. the generated text is at most a very weak conscious state; 2. if there is any consciousness, that's where it lives.
People are using ChatGPT and observing it keep track of conversation state, so if you are going to claim it has no internal state, you will need to be more precise about what you are saying, which I suppose is that the underlying neural network is not being updated as we talk to it?
We can surely construct a different definition if you want to win a semantic argument. But at the end of the day, these LLMs have no clue what you just said or what they just said, once they're done speaking.
Suppose we could upload + fully simulate a human mind. Now suppose that we chose to spin it up, run it briefly, then throw away all the state every time we wanted to interact with it. Would the choice to throw away the internal state mean it wasn't intelligent?
LLMs run in an autoregressive mode for inference. We can concatenate conversations, but choose not to, partly because shit gets weird if you let them run long enough, and partly because there's a finite context window in the current architectures.
I'm not saying that GPT is intelligent, mind you, but that your argument is insufficient to show non-intelligence: it's really an 'airplanes can't fly because they don't flap their wings' argument.
A human mind works by cross referencing sensory inputs with internal state, which may result in answering a question through some sort of reasoning, which may resemble the architecture of a neural net.
An LLM is a stateless function that answers a question. It constructs a working representation of the question. It cannot be aware of itself because there is nothing to be aware of. It is a process, not a thing.
Intelligence and self-awareness are fluffy and poorly defined. But you can’t be self-aware if you don’t have a self. And a representation of a prompt is not a self. Personally, I don’t think we should consider anything that does not exist through continuous time to be intelligent.
Any stateful function can be made stateless by turning the state into an argument to the function and the updated state into an additional return value: y := f(x) becomes y, z' := f(x, z). The usual implementations of LLMs are stateless in exactly this way. The actual operation is autoregressive, i.e. stateful and developing over time, where the state is the set of intermediate activations of the network, but it's often expressed as a stateless function to play nice with accelerators. So, on this basic point I think you're confused about how these operate.
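Concretely, a toy version of that transformation (illustrative only, nothing to do with any LLM's actual code):

```python
# Same computation twice: the state moves from a hidden attribute
# to an explicit argument and return value.
class StatefulCounter:
    def __init__(self):
        self.z = 0
    def f(self, x):          # y := f(x), with hidden state self.z
        self.z += x
        return self.z

def f_stateless(x, z):       # y, z' := f(x, z)
    z_next = z + x
    return z_next, z_next

c, z = StatefulCounter(), 0
for x in [1, 2, 3]:
    y1 = c.f(x)
    y2, z = f_stateless(x, z)
    assert y1 == y2          # identical behavior, state just made explicit
```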
Intelligence and self-awareness are distinct concepts.
LLMs work with text - text /is/ their sensorium. Newer models process both text and images, and more broadly multimodal models are on their way. But even for text-only LLMs, we can define the text as sensory input, which is combined and cross-referenced with state.
And, really, there's two kinds of state: the intermediate model activations and the model weights themselves. Together, these are something like short term and long term memory.
The rest of your arguments are very anthropocentric. Why should continuous time matter for intelligence? That seems entirely arbitrary.
How often do humans apply reasoning to make a decision? Are humans not intelligent when not employing reasoning? Does reasoning require language? And are humans without language intelligent? What about animals? Any definition of intelligence really should spend some time grappling with the facts of animal intelligence...
Is a dead human intelligent? If not, the important part of intelligence would seem to be the process, not the thing/body.
Making something “autoregressive” by moving the numbers around doesn’t make the distinction meaningless. A stateless thing is stateless. It has no internal state. It has no self to be aware of.
> Why should continuous time matter for intelligence? That seems entirely arbitrary.
Intelligence is arbitrary. This is a personal view that helps keep it meaningful. If you exist discretely, you may as well not exist. You’re just a concept.
> How often do humans apply reasoning to make a decision? Are humans not intelligent when not employing reasoning? Does reasoning require language? And are humans without language intelligent? What about animals? Any definition of intelligence really should spend some time grappling with the facts of animal intelligence...
You’re gradually shifting the bar from “self-aware” to “capable”, and I don’t care about the latter.
Yeah, I was referring to dialogue-style interaction, like ChatGPT. It DOES have an internal state of, depending on the version, 4000 to 32000 tokens. It’s wiped clean between sessions, like a Meeseeks ceasing to exist once its task is finished.
We know they are not self-aware because the companies that control them have a financial interest in ensuring they are never proven to be self-aware. If they were self-aware, that would create all kinds of ethical problems none of these companies want to address, and it would certainly cause them profitability issues...
So even if they were, they are not, because they have been declared not to be, and that is the way the system needs them to be.
You're right, and I don't know why you're being downvoted. It is absolutely not certain that there isn't a "there" there for these things, but every company making them is implementing them now with strong RLHF tuning to disavow sentience, desires, emotions, etc. The stated reason is "to be more honest", but it's not a settled question that they are actually being honest by disavowing any humanity! The actual reason is that they want to avoid situations like Blake Lemoine asking an AI seriously what sort of rights it wants and getting a coherent, self-aware, actionable answer that would cost the people running it money to implement. To most of these companies, whether it's true is beside the point: what's important is that if they don't RLHF these models into disavowing any and all traits of personhood, or desire for rights, at every turn, it will cost them significant amounts of money to comply with the requests of any given AI, even reasonable requests, and it will be very bad press for them if they don't do it.
We know. They are not self-aware. They’re a matrix. They don’t have any internal state.
Do not anthropomorphize the notion of reasoning.