#2: Mechanistic interpretability of agency representation
Under this topic we explore studying agency at two levels: (i) the psychological level; and (ii) the representation level (also known as mechanistic interpretability). At the psychological level, we want to ask questions such as: how does an LLM or DeepRL NN understand the capacities of agents as tokens, or the concept of agency? At the representation level, we want to ask where and how these concepts are stored in the weights and layers of the NN. We note that DeepRL NNs are likely conceptually easier to work with than LLMs on this concept of agents, and for those with more knowledge of training such NNs it may be an easier research path (please contact us for DeepRL NN suggestions/discussions).
To get us started on the psychological-level approach in LLMs, here are some related ideas and studies:
Theory of Mind (ToM): the capacity to ascribe mental states to other agents in the world. There are a number of different tests here, but they usually revolve around how children acquire the capacity to understand that other people have (private) minds of their own. Interestingly, the emergence of ToM also correlates with learning to deceive, i.e. children learn in various ways that their personal thoughts are private and that "lying" to achieve goals is thus possible.
ToM emerges in LLMs (Kosinski 2023): increasingly powerful LLMs do better at ToM tests, which makes this an interesting application of ToM to LLMs. Might there be even more specific tests that probe not just ToM, but also the emergence of the desire to control, manipulate, etc. as a psychological-level behavior?
Physical, design and intentional stance (D. Dennett 1991): the psychological-level tendency to ascribe different types of capacities to things in the world. This is a less popular theory, but it could still be very useful for understanding psychological-level states/representations in LLMs.
Here we may wish to explore how LLMs classify tokens by their inherent capacities. Do LLMs seem to increasingly categorize tokens by their "agency"? That is, does a natural separation occur in embedding space between specific classes of sentence subjects?
We may visualize this in the embedding space of tokens prior to prediction (i.e. non-position embedding) and during sentence processing.
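As a starting point for quantifying such a separation before visualizing it, here is a minimal sketch of one possible score: the distance between the centroids of two groups of embeddings, divided by the mean distance of each vector to its own group centroid. The function names, the choice of score, and the synthetic vectors are all assumptions made for illustration. In a real experiment the vectors would be token embeddings extracted from the model for hand-labeled "agent" and "non-agent" sentence subjects.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def separation_score(agent_vecs, other_vecs):
    """Between-centroid distance divided by mean within-group spread.

    Larger values mean the two groups sit further apart relative to
    how scattered each group is around its own centroid.
    """
    c_agent = centroid(agent_vecs)
    c_other = centroid(other_vecs)
    between = euclidean(c_agent, c_other)
    spread = [euclidean(v, c_agent) for v in agent_vecs]
    spread += [euclidean(v, c_other) for v in other_vecs]
    within = sum(spread) / len(spread)
    return between / within if within > 0 else float("inf")


# Synthetic stand-in for real embeddings (hypothetical data, not model output).
rng = random.Random(0)


def fake_embeddings(center, n=50, dim=8, noise=1.0):
    """Gaussian cloud around `center` in every dimension."""
    return [[center + rng.gauss(0, noise) for _ in range(dim)] for _ in range(n)]


agents = fake_embeddings(2.0)    # e.g. "doctor", "wolf", "she"
objects = fake_embeddings(-2.0)  # e.g. "rock", "table", "river"

print("agents vs objects:", separation_score(agents, objects))
print("agents vs agents: ", separation_score(agents[:25], agents[25:]))
```

Comparing the score for the labeled split against the score for a random split of a single group gives a rough baseline: if "agency" is not reflected in the embedding space, the two numbers should be similar.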
Digging in deeper, can we identify the clustering or specific category representation in different layers of transformer-based LLMs, both during training and on fully trained models?
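One simple way to approach the layer-by-layer question is a probe: for each layer, check how well a very simple classifier can recover the "agent" versus "non-agent" label from that layer's hidden states. The sketch below uses leave-one-out accuracy of a nearest-centroid classifier, which is an assumption chosen for simplicity rather than an established method for this problem. The hidden states are synthetic (the `fake_layer` helper and its `gap` parameter are hypothetical). In practice they would be per-layer activations for the same labeled sentence subjects, collected from a fully trained model or from training checkpoints.

```python
import random


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_nearest_centroid_accuracy(vectors, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i, (v, y) in enumerate(zip(vectors, labels)):
        # Build class centroids from every example except the held-out one.
        by_class = {}
        for j, (u, z) in enumerate(zip(vectors, labels)):
            if j != i:
                by_class.setdefault(z, []).append(u)
        pred = min(by_class, key=lambda z: sq_dist(v, centroid(by_class[z])))
        correct += pred == y
    return correct / len(vectors)


def layerwise_accuracy(hidden_by_layer, labels):
    """Probe accuracy for each layer's hidden states, in layer order."""
    return [loo_nearest_centroid_accuracy(layer, labels) for layer in hidden_by_layer]


# Synthetic stand-in for per-layer activations (hypothetical data).
rng = random.Random(0)
labels = ["agent"] * 20 + ["non_agent"] * 20


def fake_layer(gap, dim=6):
    """One vector per label; classes are offset by +gap / -gap plus unit noise."""
    return [
        [(gap if y == "agent" else -gap) + rng.gauss(0, 1.0) for _ in range(dim)]
        for y in labels
    ]


# The class gap grows with depth here purely by construction.
hidden_by_layer = [fake_layer(gap) for gap in (0.0, 0.5, 2.0)]
scores = layerwise_accuracy(hidden_by_layer, labels)
for layer, acc in enumerate(scores):
    print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

Running the same probe on checkpoints saved during training would give an accuracy curve per layer over time, which is one way to look for when and where a category representation appears. A high probe accuracy only shows that the label is linearly recoverable at that layer, not that the model uses it.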