#4: Agency preservation as an AI alignment and governance field
Under this topic we are interested in a range of approaches to characterizing or criticizing agency preservation, as well as more traditional approaches (e.g. intent alignment) to AI safety and governance. Suggestions include red-teaming, but also work that improves and expands the meaning of agency preservation at both the technical/algorithmic and the governance level.
Broadly, does agency preservation make sense as an objective for AI optimization: is it a conceptually better algorithmic goal than "truthful" reporting, "intent" alignment, or "value" identification?
Can "guarantees of agency preservation" conceptually replace "guarantees of truth" as a goal of technical AI safety research? That is, if we fundamentally care about human control and survival in the world - can agency preservation (or a similar goal) do better at keeping humans safe and in control of the future that doesn't depend on human being required to interact with AIs and evaluating the "truth" of outputs. As a red-team example, could AI systems figure out ways to always report the truth while removing human being power over the world: manipulating what humans "want" to know the truth about, manipulating what humans value etc.?
Does "agency evaluation" improve/decrease/fail to solve some of the main pathways to x-risk? So, can agency evaluation decrease or mitigate specification gaming failures arising from "bad feedback" (e.g. by providing more universal metrics for good model behavior). And does agency evaluation decrease reward misgeneralization failures such as failing to detect out-of-distribution (OOD) inputs? For example, by providing better safeguards for when OOD occurs.
The figure below sketches several scenarios in which failing to account for agency preservation in our objective functions leads to significant or complete loss of human control over the world.
Agency loss in intent-aligned AI-system optimization functions. (a) Left: sketch of an optimization function for an inconsequential task such as entertainment recommendation, indicating the general space of all solutions vs. the agency-preserving minimum. Right: the optimization function for a more complex decision has a more difficult-to-find agency-preserving minimum, as well as a significantly better well-being outcome than average solutions. (b) Without explicit representation of human agency, intent-aligned AI optimization objectives may completely ignore (or flatten) the space that represents agency-preserving solutions. (c) Human preferences or goals can be simplified (less complex shapes, shallower depth) by repeated AI-human interactions.
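The landscape in panel (a) can be illustrated with a toy objective. In the sketch below (all functions and the weighting parameter are hypothetical stand-ins, not quantities from the preprint), `task_utility` is the intent-aligned term and `agency` is an explicit agency-preservation term; without the latter, the optimizer lands on a solution that ignores agency entirely:

```python
# Toy 1-D illustration: an intent-aligned objective vs. one with an explicit
# agency-preservation term. All functions here are illustrative assumptions.

def task_utility(x: float) -> float:
    """Toy intent-aligned utility: peaks at x = 0."""
    return -x * x

def agency(x: float) -> float:
    """Toy agency measure: options near x = 1 best preserve future choice."""
    return -(x - 1.0) ** 2

def combined_objective(x: float, lam: float = 1.0) -> float:
    """Intent alignment plus an explicit agency-preservation term."""
    return task_utility(x) + lam * agency(x)

def argmax_on_grid(f, lo: float = -2.0, hi: float = 2.0, n: int = 4001) -> float:
    """Brute-force maximizer on an evenly spaced grid."""
    xs = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    return max(xs, key=f)
```

Optimizing `task_utility` alone selects x = 0; adding the agency term shifts the optimum toward the agency-preserving region (x = 0.5 for lam = 1), mirroring how an unrepresented agency term is simply flattened out of the solution space.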
What does agency preservation governance look like? E.g., what kinds of regulations and rules are required? For example, outlawing "dark patterns" about manipulation of intent, i.e. outlawing AI companies from tricking users "to engage in unwanted behaviours".
How would we feasibly compute the agency of many or all humans?
How would we enforce "agency" evaluation - or long-term safety simulations/computations etc.?
What types of practical incentives can we provide to motivate agency evaluation?
If we visualize AI-human interactions as a Markov decision process (MDP), we want to think more about how an AI system could practically evaluate the agency of the main human agent, and possibly of all the other agents that could be affected (see preprint Fig 6 for more details and suggestions). Is this practically possible? And if not, does that have specific implications for the development and deployment of superhumanly intelligent AI systems?
Agency preservation in an MDP model of AI-aided human decision making. (a) Simplified model of the AI-system-aided decision making paradigm using the MDP framework (note: H is the main human agent; AI is the AI-system; G_a1 is H’s goal a at t=1; W_a11 is the goal of another agent (i.e. agent 1) at t=1). At t=1, a human agent H_1 selects a goal G_i from all available goals and shares it with the AI for optimization. The AI makes an action recommendation to H (solid red line) that is intent-aligned to G_i, while the recommendation has indirect effects on H’s other possible future goals and on all other agents’ possible goals (red dashed lines). (b) Expanded diagram from (a) showing that the AI’s actions or recommendations have effects on vast numbers of goals of the main agent and of all other (human) agents in the world. We note that many or most of the goals of other agents are not known or knowable to the AI or to H. (c) As a tractable solution, the AI must verify actions against basic principles, such as essential human rights, prior to recommendation.
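The tractable solution in panel (c) can be sketched as a filter step inside the recommendation loop: since the goals of all other agents are not knowable, the AI screens candidate actions against a basic-principles predicate before optimizing for the stated goal G_i. Everything below (the `Action` type, the `violates_rights` flag, the deferral behavior) is a hypothetical illustration, not the preprint's implementation:

```python
# Minimal sketch of panel (c): screen candidate actions against a
# basic-principles check before optimizing for the human's stated goal.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Action:
    name: str
    task_value: float           # how well the action serves the stated goal G_i
    violates_rights: bool = False  # stand-in for a basic-principles check

def recommend(candidates: List[Action]) -> Optional[Action]:
    """Return the highest-value action that passes the rights check.

    If no candidate passes, return None: the AI makes no recommendation
    and defers the decision back to the human agent H.
    """
    permitted = [a for a in candidates if not a.violates_rights]
    if not permitted:
        return None
    return max(permitted, key=lambda a: a.task_value)
```

The key property is that the rights check runs before, not inside, the value maximization, so a high `task_value` can never buy its way past a violated principle.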