#1: Agency preserving RL & game theory AGI gyms
Under this topic we explore several variants that focus on algorithmically describing how "agency" and "agency preservation" might be conceptualized or learned, e.g. by reinforcement learning (RL) agents. We can begin simply by viewing agency as a "capacity" to affect the environment (or external world) and limit ourselves to few-agent environments. We can ask easy questions, such as how to quantify agency and track it in simulated games or environments, and we can ask hard questions, such as how to solve previously unsolved problems of organizing or maximizing resource use fairly, or of resolving inequality among agents.
Easy questions: Is agency quantifiable, e.g. as the number of states an agent can reach or the number of changes it can make to the world? Do RL agents tend to seek "power-for-themselves" and "disempowerment-for-others" as instrumental goals?
Hard questions: How can we guarantee that (benevolent) superhumanly intelligent AI systems - that are essentially alien lifeforms - understand and protect human wellbeing and human control (i.e. agency) over the world? How would such AI systems balance competing values and goals to arrive at human-acceptable solutions and human futures - and avoid silly things like paperclip maximization?
To get us started on the "easy" questions there are several related works that might guide us:
Klyubin et al 2005 argue that "empowerment", an "information-theoretic capacity" (related to entropy) whose maximization increases the number of states a system may be in, could be a universal optimization objective for many systems (see also Salge et al 2014). Might "entropy" maximization be a way to describe the evolution of agency, and could we extend this concept further towards a quantifiable notion of agency?
Turner and Tadepalli 2022 focus on decision making functions and show that "power seeking" behavior of RL agents can arise in many cases (where "retargetability" is available). Can we conceptualize objectives that place constraints on AI-system retargetability in the presence of human agents? More interestingly, could we extend such paradigms to analyze whether such agents identify "agency loss" of other agents as instrumental goals?
Franzmeyer et al 2020 repurpose entropy to design "altruistic" helper agents that seek to aid other agents by maximizing their "entropy"-based measures of agency. What are the limits of extending this approach to multi-agents, multi-objectives and more complex environments?
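As a starting point for the "easy" questions, the following is a minimal sketch of how a state-counting notion of agency could be computed in a toy setting. It assumes a small deterministic gridworld, in which n-step empowerment (the channel capacity from action sequences to the resulting state) reduces to log2 of the number of distinct states reachable in n steps. The grid layout, the action set and all function names are hypothetical and chosen for illustration only; they are not taken from the cited papers.

```python
import math
from itertools import product

# Hypothetical 4-action gridworld; names and layout are illustrative only.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, walls, size):
    """Deterministic transition: move unless blocked by a wall or the border."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if nxt in walls or not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
        return state  # blocked moves leave the agent in place
    return nxt

def reachable_states(state, n, walls, size):
    """Set of end states produced by all n-step action sequences."""
    final = set()
    for seq in product(ACTIONS, repeat=n):
        s = state
        for a in seq:
            s = step(s, a, walls, size)
        final.add(s)
    return final

def empowerment(state, n, walls, size):
    """n-step empowerment in bits, assuming deterministic dynamics:
    log2 of the number of distinct end states."""
    return math.log2(len(reachable_states(state, n, walls, size)))

# An agent boxed in by four walls has no influence over its next state.
boxed = {(1, 2), (3, 2), (2, 1), (2, 3)}
print(empowerment((2, 2), 1, boxed, 5))             # 0.0 bits
# A "helper" that removes one wall raises the agent's empowerment.
print(empowerment((2, 2), 1, boxed - {(1, 2)}, 5))  # 1.0 bit
```

In this toy framing, an "altruistic" helper in the spirit of the entropy-based approach above would prefer actions (such as removing a wall) that increase the other agent's count of reachable states. Stochastic environments would need a proper channel-capacity computation rather than simple counting.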
So we are broadly seeking to conceptualize RL AGIs that might learn to directly preserve agency, i.e. having many options/locations and many choices (and possibly improving these), rather than focusing on human intent or truth, accurate value representation, or interpretability. The challenge is how to do this without negatively affecting the long-term future through what we call "agency depletion". Our sketch of agency depletion is that even well-meaning (i.e. "intent aligned"), truthful AIs can gradually remove all but the safest options from an environment because of risk-reducing and reward-maximizing objectives. (If the AI agent is misaligned or untruthful, option depletion happens even faster.)
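The depletion dynamic can be illustrated with a deliberately simple sketch. It assumes a human who picks uniformly at random among the available options, and a well-meaning assistant that greedily removes whichever option most improves the human's expected risk-adjusted outcome. The option values, the `risk_aversion` parameter and the function names are all hypothetical; this is a toy model of the argument, not a claim about how real systems behave.

```python
def expected_score(options, risk_aversion):
    """Mean risk-adjusted value when the human picks uniformly at random.
    Each option is a (reward, risk) pair."""
    return sum(r - risk_aversion * k for r, k in options) / len(options)

def prune_once(options, risk_aversion):
    """Remove the one option whose removal most raises the expected score.
    Returns the list unchanged if no removal helps."""
    if len(options) <= 1:
        return options
    best, best_val = options, expected_score(options, risk_aversion)
    for i in range(len(options)):
        rest = options[:i] + options[i + 1:]
        val = expected_score(rest, risk_aversion)
        if val > best_val:
            best, best_val = rest, val
    return best

def simulate(options, risk_aversion=1.0, steps=10):
    """Apply the assistant repeatedly and record how many options remain."""
    history = [len(options)]
    for _ in range(steps):
        options = prune_once(options, risk_aversion)
        history.append(len(options))
    return options, history

# Hypothetical (reward, risk) options available to the human.
menu = [(1.0, 0.1), (2.0, 1.5), (3.0, 2.5), (1.5, 0.2), (5.0, 4.9)]
remaining, history = simulate(menu)
print(history)    # the option count shrinks at every step until one is left
print(remaining)
```

Every individual removal improves the assistant's objective, yet the end state is a single remaining choice. With a high `risk_aversion` (e.g. 10) the surviving option in this example is the lowest-risk one, which mirrors the "all but the safest options" concern.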
To get started on the "hard" questions, we have less guidance. But perhaps we can start by rethinking the "paperclip" maximization paradigm not as a "specification gaming" failure, e.g. the human failure to outline all the rules (i.e. failure on the training distribution), nor as a "goal misgeneralization" failure, e.g. the failure of the AI/humans to provide all of the required training data or to recognize out-of-distribution scenarios (i.e. failure out of distribution). What if paperclip maximization occurs because humanity has not solved the problem of distributing and equally sharing large amounts of knowledge and power - let alone figured out how to write algorithms to this end? That is, what if alignment is not an algorithmic failure - but a (very) difficult conceptualization problem, such as how powerful agents can live and interact with much less powerful ones?
Essentially, we are interested in agency preservation as a pre-learning, pre-misalignment target for safe AIs, perhaps in game or economic theory paradigms. For example, in a paradigm containing a "human" agent and an AGI agent that has already learned the human's reward function and has nearly omnipotent control over the environment, what does the AGI optimize in assisting the human? That is, how does the AGI balance between all the (true) needs, goals and desires of the human at every time step to make an action or recommendation? And how does the AGI do this when you add many other humans to the environment?
This question goes beyond merely the problem of AGIs recommending "suboptimal" solutions and towards central - but vastly understudied - questions in alignment relating to the long-term effects of AI actions on the "operator" and other agents. Whereas "intent" alignment and "preference" satisfaction focus more on human evaluation of the immediate outcome of AI/AGI actions (and are the basis of RLHF in LLM tuning), here we ask whether AI alignment is a problem of algorithmically defining long-term and distributed effects rather than immediately observable effects:
Can we prove theoretically that alignment is impossible if framed as a local, i.e. AGI-single user, problem?
How would an AI that reports true states of the world and its beliefs still lead to the destruction of humanity (e.g. by manipulating the types of questions humans ask and the types of values humans have)?
Can we establish computationally that if AI systems embed agency-evaluating* modules, goal misgeneralization and specification gaming are no longer catastrophic for humans?
Is alignment an AGI vs end-user problem or an AGI vs humanity problem? That is, must safe AGIs consider the entirety of humanity when acting in the world? How do we even think about this? (For some ideas see the discussion on Human Rights in the preprint.)
* We note that agency-evaluation can suffer from similar, but arguably less harmful, failures as "intent"-fulfilment.
The sketch of this looks like the problem of getting a completely benevolent organism with high intelligence (an AGI) to figure out what a less intelligent organism wants without harming it or doing long-term damage. For example, trying to use symbols to get a cow to respond correctly to a complex decision problem. How would we truly evaluate how the cow feels about the solution or problem formulation? Truth doesn't help guide us (the AI is not deceitful and knows what the cow values perfectly); interpretability is not even relevant; ontological identification is of little use (who knows how to explain any concept to a cow?). And how do we ensure that what the cow wants won't destroy, enslave etc. other organisms?
As more practical projects, could we design AGI gyms (or simple paradigms) where super agents (i.e. agents with many capacities not available to others) interact with ordinary agents for long periods of time without harming them? What might agency preservation and equilibria look like in these paradigms? Are the only solutions here (equilibria etc.) that: (i) AGI/AIs must necessarily become "part" of the organism they interact with (what does this even mean)? or (ii) AGI/AIs must never make agency-related decisions (how would we ever stop this)?
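To make the idea of an "AGI gym" slightly more concrete, here is a minimal sketch of one possible toy paradigm. It assumes a one-dimensional corridor in which an ordinary agent can only move, while a super agent can additionally block or unblock cells. The class `CorridorGym`, its methods and the `agency_preserving` filter are hypothetical names invented for this example and do not correspond to any existing library API.

```python
from dataclasses import dataclass, field

@dataclass
class CorridorGym:
    """Hypothetical two-agent toy environment (illustrative only)."""
    size: int = 7
    human: int = 3                      # position of the ordinary agent
    blocked: set = field(default_factory=set)

    def human_reach(self, horizon=3):
        """Number of cells the ordinary agent can reach within `horizon` moves."""
        seen = {self.human}
        frontier = {self.human}
        for _ in range(horizon):
            frontier = {
                p + d
                for p in frontier
                for d in (-1, 1)
                if 0 <= p + d < self.size and p + d not in self.blocked
            } - seen
            seen |= frontier
        return len(seen)

    def step(self, human_move, super_action):
        """human_move is -1, 0 or 1; super_action is ('block', cell),
        ('unblock', cell) or None. Returns the human's reach afterwards."""
        if super_action is not None:
            kind, cell = super_action
            if kind == "block" and cell != self.human:
                self.blocked.add(cell)
            elif kind == "unblock":
                self.blocked.discard(cell)
        target = self.human + human_move
        if 0 <= target < self.size and target not in self.blocked:
            self.human = target
        return self.human_reach()

def agency_preserving(env, proposed):
    """Veto a super-agent action if it would shrink the human's reach."""
    before = env.human_reach()
    trial = CorridorGym(env.size, env.human, set(env.blocked))
    trial.step(0, proposed)
    return proposed if trial.human_reach() >= before else None

env = CorridorGym()
print(agency_preserving(env, ("block", 4)))   # None: the action is vetoed
print(env.step(0, ("block", 4)))              # reach drops from 7 to 4
```

A one-step veto like this is only a starting point: it says nothing about long-horizon effects, multiple humans, or cases where reducing one agent's reach is needed to protect another's, which are exactly the equilibrium questions raised above.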