An introduction to worst-case AI safety

Introduction

The burgeoning field of AI safety has so far focused almost exclusively on alignment with human values. Various technical approaches have been suggested (1, 2, 3) to ensure that powerful AI systems will reliably act in ways that are desirable to their human users.

However, few have questioned whether alignment with human values should be the only goal of AI safety. An alternative approach – which I will call worst-case AI safety 1 – is to focus our efforts on finding safety measures that reduce the risk of particularly bad outcomes.

In this post, I will explain what worst-case AI safety is, discuss how it relates to alignment, and present arguments for why worst-case AI safety is important, tractable, and neglected.

I will not discuss the following points that have already been covered elsewhere:

  1. What risks of astronomical suffering (s-risks) are, and why we should take s-risks seriously
  2. Why we might endorse moral views that place particular importance on the reduction of suffering
  3. How advanced AI could lead to (or prevent) s-risks
  4. Concrete focus areas of worst-case AI safety

What is worst-case AI safety?

In a nutshell, worst-case AI safety is AI safety focused on s-risk reduction. Put broadly, the central question of AI safety is how we can ensure beneficial behaviour of powerful artificial agents that match or surpass humans in general intelligence. AI alignment work interprets this as the problem of how we should design such agents so that they reliably pursue human values, while worst-case AI safety addresses the problem of how they can be designed in a way that reliably avoids s-risks.

Specifically, this means that a) advanced AI does not instantiate large amounts of suffering for instrumental reasons, and b) escalating conflicts between advanced AI systems, or between AIs and humans, do not lead to very bad outcomes. We can break the latter down into b1) making sure that other agents do not have an instrumental reason to create a lot of disvalue in an attempt to extort the AI system, and b2) making sure that the AI system itself will not use (illegitimate) threats against other AI systems or humans, or at least will not follow through on such threats. (There’s an argument for why b1) and b2) are prima facie comparably important.)
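To build intuition for b1), here is a minimal toy model of an extortion interaction – a Python sketch with made-up payoff numbers that are not from this post. The point it illustrates: if a potential target is known never to give in to threats, a rational threatener has no instrumental reason to issue the threat in the first place, so the threatened disvalue is never instantiated.

```python
# A toy model of an extortion interaction. All payoff numbers are
# hypothetical and purely illustrative.

GAIN_IF_TARGET_CONCEDES = 10.0  # what the threatener extracts if the target gives in
COST_OF_CARRYING_OUT = 2.0      # carrying out the threat is costly for the threatener
STATUS_QUO = 0.0                # payoff from not threatening at all

def threatener_expected_payoff(p_concede: float) -> float:
    """Expected payoff of issuing a threat, given the probability that the
    target gives in. If the target resists, we assume the threatener carries
    the threat out (to keep its threats credible) and pays the cost."""
    return p_concede * GAIN_IF_TARGET_CONCEDES + (1 - p_concede) * -COST_OF_CARRYING_OUT

for p in (1.0, 0.5, 0.0):
    ev = threatener_expected_payoff(p)
    decision = "threaten" if ev > STATUS_QUO else "don't threaten"
    print(f"P(target concedes) = {p:.1f}: expected payoff {ev:+.1f} -> {decision}")

# For p = 0.0 the expected payoff is negative, so a rational threatener
# does not threaten at all, and no suffering is created in carrying out
# a threat -- the intuition behind b1).
```

Real interactions are of course far more complicated (commitment problems, uncertainty about the other side's policy, multiple threateners), but the sketch captures why making an AI system credibly unexploitable removes the instrumental reason to threaten it.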

Worst-case AI safety is closely related, but not identical, to the concept of fail-safe AI. Fail-safe measures specifically aim to improve the outcome in case of a failure to align AI systems with human values. A successful implementation of fail-safe AI would mean that if the primary approach to AI safety doesn’t pan out, then the result will be a “benign” failure that will at least not involve vast amounts of suffering.

(Worst-case AI safety is not only about fail-safe measures, though. For instance, implementing surrogate goals in an aligned AI system would not count as fail-safe AI. Conversely, fail-safe measures may be useful not only for worst-case AI safety, but also for conventional AI safety – although the nearest unblocked strategy problem limits their potential as an alignment approach.)
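As a rough illustration of the surrogate-goal idea mentioned above, here is a toy sketch. The numbers and structure are hypothetical, not from this post, and far simpler than any real proposal: if the AI's goal system includes a surrogate that it cares about strongly but whose destruction is essentially harmless, a threatener seeking leverage targets the surrogate rather than something whose destruction would involve real suffering.

```python
# A toy sketch of the surrogate-goal idea, under a deliberately simplistic
# assumption: the threatener targets whatever the AI cares about most among
# the things it can credibly damage. All values are illustrative.

from dataclasses import dataclass

@dataclass
class Target:
    name: str
    value_to_ai: float      # how much the AI cares about this (leverage for the threatener)
    real_world_harm: float  # disvalue created if the threat is actually carried out

def threatened_target(targets: list[Target]) -> Target:
    # The threatener picks the target that gives it the most leverage.
    return max(targets, key=lambda t: t.value_to_ai)

without_surrogate = [Target("real goal (involves sentient beings)", 100.0, 100.0)]
with_surrogate = without_surrogate + [Target("surrogate goal (harmless token)", 120.0, 0.0)]

for label, targets in [("without surrogate", without_surrogate),
                       ("with surrogate", with_surrogate)]:
    t = threatened_target(targets)
    print(f"{label}: threat aimed at '{t.name}', harm if carried out = {t.real_world_harm}")

# With the surrogate in place, a carried-out threat destroys the harmless
# token rather than creating large amounts of real-world disvalue.
```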

For more technical details on what worst-case AI safety could look like, see Focus areas of worst-case AI safety. However, I’d like to emphasize that reducing s-risks of advanced AI may also involve non-technical work, such as better international cooperation to prevent arms races, moral advocacy to improve the values of future civilization, or research on AI policy and AI strategy. It’s not clear whether technical work is more valuable than non-technical approaches. Nevertheless, I use the term worst-case AI safety specifically to refer to technical safety measures.

How does worst-case AI safety relate to alignment?

From the perspective of s-risk reduction, alignment is not necessarily sufficient. A controlled AI could also lead to s-risks. Even if an aligned AI is, all things considered, less likely to lead to s-risks, it would be a striking coincidence if alignment work is also most effective for s-risk reduction. (Conversely, research on worst-case AI safety is unlikely to be the best for alignment.)

That said, worst-case AI safety and AI alignment are complementary, not opposed. A good outcome would be an AI that is both controlled and equipped with ample precautionary measures against suffering. Since we will design artificial agents from scratch, it is unlikely that we will face strong tradeoffs between worst-case AI safety measures and alignment.

In general, developing advanced AI in a careful, safety-conscious and cooperative way will likely improve the outcome for all value systems compared to a baseline scenario where unchecked economic forces determine the future.

Why work on worst-case AI safety?

Importance

Similar to AI alignment work, the case for working on worst-case AI safety hinges on the belief that the advent of advanced AI is likely to have a transformative impact on the long-term future of our civilization. Many authors have written about this (e.g. 1, 2, 3, 4, 5), which is why I will not repeat the discussion in this post. (I’ve written down my own thoughts here.)

Conditional on accepting that shaping advanced AI is important, we face the question of how to prioritise between alignment efforts and worst-case AI safety. This is mainly a normative question. Proponents of suffering-focused ethics argue that preventing severe suffering should be our top priority and will therefore favour worst-case AI safety, while moral views that assign a lot of value to utopian outcomes will tend to prioritise alignment.2 This normative question has also been discussed at length – see e.g. 1, 2, 3, 4, 5.

Tractability

It seems plausible that worst-case AI safety is somewhat more tractable than alignment. This is because the goal of preventing specific bad outcomes is less ambitious than the goal of alignment with human values, especially if human values are complex and fragile. It remains to be seen, though, how big this difference in tractability is. For instance, it’s possible that the key challenge to AI safety work is to get any reliable guarantee regarding the behaviour of advanced AI at all.

Many arguments against the tractability of AI safety work also apply to worst-case AI safety. It may be very hard to do useful work on AI safety at this point, especially if advanced AI does not arrive any time soon or if paradigm changes render early work useless. I think the transition to AI will likely be a gradual and distributed process that takes a fairly long time (in terms of subjective time or economic doublings). This might be a reason to be pessimistic about our influence on the long-run future.

However, this applies equally to alignment work, so worst-case AI safety is still at least as tractable as alignment. (Also, all things considered, I think it is reasonable for effective altruists to work on shaping advanced AI.)

Neglectedness

Worst-case AI safety is currently about as neglected as it gets. A relatively small number of people work on AI safety, and most of these focus on alignment – so it seems plausible that there is low-hanging fruit in worst-case AI safety.

It’s a mistake, though, to only consider existing efforts when evaluating neglectedness. A cause area is not neglected if we have good reason to expect that a lot of resources will be directed towards it in the future.

But I think it is likely that worst-case AI safety will remain far more neglected than alignment. A lot of people will start to work on alignment if and when it becomes clearer that this is an important problem. In the case of a gradual and distributed takeoff, strong economic forces will push towards alignment: it’s not economically useful to have a powerful AI system that doesn’t reliably do what you want.

In contrast, it seems unlikely that there will be economic incentives for precautionary measures to avoid s-risks, or that this approach will ever become mainstream. This is bad news, but it also suggests that effective altruists can have a particularly big marginal impact by working on worst-case AI safety.

Footnotes

  1. Other authors have used the term suffering-focused AI safety to refer to the same idea.
  2. At least that’s what many advocates of such views believe. Depending on the specific values, one could perhaps make a case that such views should actually focus on their own endeavor rather than pursuing vanilla alignment, especially if the conception of value or “utopia” is very specific and differs from common “human values”.

3 comments

  1. >…it would be a striking coincidence if alignment work is also most effective for s-risk reduction.

    Compare:

    >It would be a striking coincidence if working on algorithms to help people find webpages related to “sometimes when I’m alone I use comic sans” was also helpful for serving unrelated queries such as “what would happen if I hired two private investigators to follow one another”.

    And yet, what do you know–people use the same set of algorithms, provided by Google, to make these very different queries! That’s often how it works in computer science… the exact same algorithm that can be used to sort a list of numbers can be trivially repurposed to sort a list of strings, etc.

    1. Good points. I agree that it being a “striking coincidence” is not obvious. And I agree that, if one thought that most disvalue comes from misaligned AI, then s-risk reducers should just focus on the top priorities in alignment.

      However, if we never tried to align AI in the first place, the worst failure modes (where arguably most of the expected disvalue comes from) would not arise.
      See the discussion here (https://www.lesswrong.com/posts/3WMscsscLEavkTJXv/s-risks-why-they-are-the-worst-existential-risks-and-how-to#QwfbLdvmqYqeDPGbo) and here (https://arbital.com/p/hyperexistential_separation/). So the main thing to worry about is not that alignment never gets off the ground; it’s that it works well in some respects but things overall don’t turn out as planned.

      Therefore, I find it very unlikely that there’s complete overlap between alignment work generally and work within alignment (or complementary to it) that effectively reduces s-risks.

      Maybe I should mention that I now think most of the ideas I suggested in https://foundational-research.org/suffering-focused-ai-safety/ are inapplicable and way too “clumsy” compared to the most promising alignment approaches. For similar reasons I dislike many of the ideas Tobias lists in his “Focus areas of worst-case AI safety” post.

      I’m (tentatively) most happy to see more research on bargaining, threats (and decision theory related to that), things like surrogate goals, and trying to safely do pivotal acts with AI without fully specifying human values just yet. Differentially pushing forward these subareas seems a lot more likely to reduce suffering risks than other alignment-related research topics.

      And I think there’s a lot of disentanglement-type research left to be done in this area. If anyone understands alignment approaches well and is interested in thinking about this question more, please get in touch!

      1. Yeah, alignment is general-purpose in that one could potentially align an AI with any goal, including that of s-risk reduction. In that case alignment and worst-case AI safety would be identical (with some caveats). However, in practice, people will align AI with “human values” or some economic goal or whatever, but not s-risk reduction, so it’s not the same. (One could argue, though, that it’s not a technical problem anymore at this point.)

        > However, if we never tried to align AI in the first place, the worst failure modes (where arguably most of the expected disvalue comes from) would not arise.
        I’m not so sure about that. A completely misaligned AI is also dangerous because it might create incidental s-risks and because it might engage in conflicts with other (aligned) AIs. It’s possible that there are AI designs that entail significantly more expected suffering than either a completely aligned or misaligned AI, but there might also be more benign ones. The question is whether the particularly dangerous designs become more likely as a result of alignment efforts. That’s not obvious to me.

        (Even if it were true, it wouldn’t be clear whether additional alignment efforts will on the margin make that kind of “near miss” more or less likely given that there are already some efforts.)
