Symmetry between two types of anti-threat measures

Threats by agents to harm other agents, made either as an attempt at extortion or as part of an escalating conflict, are one possible mechanism by which s-risks could come about.

Consider two approaches to reducing the expected harm from threats. First, we could attempt to alter the reaction of potential threat targets so that the instantiation of large amounts of disvalue becomes less likely, e.g. by introducing surrogate goals. Second, we could try to influence potential threateners, e.g. by making them less inclined or less able to use threats in conflicts.1 I will refer to these as type 1 and type 2 anti-threat measures, respectively.

I think there’s a theoretical reason to assume that the two approaches are – at least prima facie – comparably important. The basic idea is that whenever a threat is carried out, two agents are “responsible” for that outcome: the threatener and the threatenee.2

As an analogy, suppose you want to save lives in a war by influencing the behaviour of average soldiers. Now, you could do this either by trying to make it less likely that they are killed (a type 1 intervention), e.g. by using particularly cautious combat tactics; or you could make it less likely that they kill others (type 2) – say, by promoting a “live and let live” mindset.

Now, consider a) the sum over all soldiers of the expected number of kills made by that individual, and b) the sum over all soldiers of the probability that this individual will be killed. Assuming for simplicity that every casualty is a soldier killed by another soldier in the same pool, each kill corresponds to exactly one death, so both sums are actually the same value: the expected number of total casualties.

So, for a randomly sampled soldier, the expected number of kills must be equal to the probability of this soldier being killed, and the prevention of either is (bracketing second-order considerations) equally important. Note that this basic argument does not depend on the distribution of harm caused by different agents (or the distribution of kills). It holds even if the distribution is very lopsided.
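To make the counting step explicit, here is one way to formalize it; the notation (K_i for the number of kills made by soldier i, D_i for the event that soldier i is killed, n soldiers in total) is mine, and the derivation relies on the same simplifying assumption as above, namely that every casualty is a soldier killed by exactly one other soldier in the pool.

```latex
% Sketch of the counting argument under the stated simplifying assumption.
% K_i : number of kills made by soldier i
% D_i : event that soldier i is killed
% C   : total number of casualties
\begin{align*}
  C &= \sum_{i=1}^{n} K_i \;=\; \sum_{i=1}^{n} \mathbf{1}[D_i]
    && \text{(each kill produces exactly one casualty)} \\
  \sum_{i=1}^{n} \mathbb{E}[K_i] &= \mathbb{E}[C] = \sum_{i=1}^{n} \Pr[D_i]
    && \text{(linearity of expectation)} \\
  \mathbb{E}[K_S] &= \tfrac{1}{n}\,\mathbb{E}[C] = \Pr[D_S]
    && \text{(for a uniformly sampled soldier } S \text{)}
\end{align*}
```

The last line is just the previous one divided by n: sampling a soldier uniformly, and independently of how the battle unfolds, averages both quantities over the same population.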

Similarly, for a randomly sampled agent, preventing the harm caused by this agent threatening others is prima facie just as important as preventing the harm from threats against that agent.

The assumption that our intervention influences a randomly sampled agent is critical. For example, if we had reason to believe that our intervention targets elite soldiers who are particularly likely to kill others and less likely to be killed themselves, then type 2 measures would be more important.
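As a toy illustration of both points, here is a small simulation; the model and every name in it (`simulate`, the 5x weights, the soldier counts) are my own arbitrary choices, not anything from the post or from a standard reference. It checks that expected kills and the probability of being killed coincide for a uniformly sampled soldier, but come apart for an “elite” subgroup that kills more and dies less.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def simulate(n=100, n_elite=10, kill_events=30, trials=5000, seed=0):
    """Toy battle model (illustrative assumptions only).

    Each trial contains `kill_events` kills. The killer is drawn from the
    living soldiers with elites (indices < n_elite) weighted 5x more; the
    victim is drawn from the other living soldiers with elites weighted 5x
    less. Every kill hits a soldier in the same pool, so total kills equal
    total deaths by construction.
    """
    rng = random.Random(seed)
    kills = [0] * n   # total kills per soldier, summed over trials
    deaths = [0] * n  # number of trials in which each soldier was killed

    for _ in range(trials):
        alive = set(range(n))
        for _ in range(kill_events):
            killers = list(alive)
            killer = rng.choices(
                killers, weights=[5 if s < n_elite else 1 for s in killers])[0]
            victims = [s for s in alive if s != killer]
            victim = rng.choices(
                victims, weights=[1 if s < n_elite else 5 for s in victims])[0]
            kills[killer] += 1
            deaths[victim] += 1
            alive.remove(victim)

    # Per-soldier expected kills and death probability, estimated per trial.
    exp_kills = [k / trials for k in kills]
    p_killed = [d / trials for d in deaths]
    print(f"random soldier: E[kills] = {mean(exp_kills):.3f}, "
          f"P(killed) = {mean(p_killed):.3f}")
    print(f"elite soldier:  E[kills] = {mean(exp_kills[:n_elite]):.3f}, "
          f"P(killed) = {mean(p_killed[:n_elite]):.3f}")


if __name__ == "__main__":
    simulate()
```

By construction, the two population-wide numbers in the first line match exactly (each kill is one death); the elite numbers in the second line come apart, because the elite subgroup is not a random sample.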

This random sampling assumption holds as long as we’re sufficiently ignorant about the actors involved, which is plausible for at least some relevant scenarios. For instance, we could model the future of our civilization – or a superintelligence built by such a civilization – as randomly sampled from the set of all advanced civilizations (or their superintelligences), given that we don’t seem to be able to reliably predict how our civilization will differ from others. So the argument works for conflicts between such advanced civilizations. (It’s not clear, though, whether there is more than one such civilization or whether they can meaningfully interact with each other.)

Of course, the argument only establishes a prior, so additional knowledge can easily tip the balance. Perhaps preventing threats from being made is more tractable – for instance, you could use tripwires or adversarial architectures to ensure that powerful AI systems won’t threaten others. Or perhaps the problem of figuring out how to react “correctly” to threats can be passed on to superintelligent AI systems in a bootstrapping process. In that case, what matters most is that key actors in the future will be disinclined to use (illegitimate) threats in conflicts and will spend part of their resources on coordinating effectively to install anti-extortion measures.

On the flip side, if most of the disvalue from threats is due to a few “bad actors”, it is possible that our influence on them is smaller than our influence on a random actor. For instance, perhaps unaligned AI systems are more likely to try to extort others than aligned AI systems are, and we have less influence on what unaligned AIs do in threat situations. In that case, it’s more important to improve the reaction of aligned AIs, e.g. by ensuring that they have the correct decision theory or by implementing surrogate goals.

Footnotes

  1. This isn’t a clean distinction because actions in threat situations depend on predictions of what the other party will do. For example, potential threatenees are less likely to give in if it’s common knowledge that a certain agent’s threats are not credible. So there are interventions that (directly or indirectly) combine type 1 and type 2 anti-threat measures. But I think the distinction is still useful as a practical categorization.
  2. Of course, the threatenee isn’t necessarily responsible in the sense of moral blame – that would be victim blaming.
