Imagine a data set of images labeled “suffering” or “no suffering”. For instance, suppose the “suffering” category contains images documenting war atrocities or factory farms, and the “no suffering” category contains innocuous images – say, a library. We could then use a neural network or another machine learning algorithm to learn to detect suffering from that data. In contrast to many AI safety proposals, this is feasible to at least some extent with present-day methods.
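To make the training setup concrete, here is a minimal sketch using plain logistic regression on toy feature vectors. The feature values and labels are invented stand-ins for image embeddings; a real system would train a deep network on raw pixels, but the supervised-learning loop is the same idea.

```python
# Toy sketch: a linear classifier trained on hypothetical feature vectors
# labeled "suffering" (1) vs. "no suffering" (0). The vectors are invented
# stand-ins for learned image embeddings.
import math

# Hypothetical 3-dimensional feature vectors.
data = [
    ([0.9, 0.8, 0.1], 1),  # e.g. war-atrocity imagery
    ([0.8, 0.9, 0.2], 1),  # e.g. factory-farm imagery
    ([0.1, 0.2, 0.9], 0),  # e.g. a library
    ([0.2, 0.1, 0.8], 0),  # e.g. another innocuous scene
]

w = [0.0, 0.0, 0.0]
b = 0.0

def predict(x):
    """Sigmoid of a linear score: estimated probability of 'suffering'."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Plain gradient descent on the cross-entropy loss.
for _ in range(2000):
    for x, y in data:
        grad = predict(x) - y
        for i in range(len(w)):
            w[i] -= 0.1 * grad * x[i]
        b -= 0.1 * grad

print([round(predict(x)) for x, _ in data])  # recovers the training labels
```

The point of the sketch is only that "detect suffering in this domain" reduces to ordinary supervised classification once labeled data exists.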
The neural network could monitor the AI’s output, its predictions of the future, or any other data streams, and possibly shut the system off to make it fail-safe. Additionally, it could scrutinize the AI’s internal reasoning processes, which might help prevent mindcrime.
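A fail-safe monitor of this kind can be sketched as a thin wrapper around the AI's output stream. The classifier below is a hypothetical stub; the wrapper simply halts the system once the suffering score crosses a threshold.

```python
# Hedged sketch of a fail-safe monitor. `suffering_score` is a stub standing
# in for a learned classifier; the wrapper shuts the stream down when the
# score exceeds a threshold instead of letting the output through.
from typing import Callable, Iterable, List

def suffering_score(output: str) -> float:
    """Stub for a learned classifier: P(output involves suffering)."""
    return 0.95 if "atrocity" in output else 0.05

def monitored_run(outputs: Iterable[str],
                  score: Callable[[str], float],
                  threshold: float = 0.5) -> List[str]:
    """Pass outputs through until the monitor trips, then shut off."""
    passed = []
    for out in outputs:
        if score(out) >= threshold:
            break  # fail-safe: halt the system rather than emit the output
        passed.append(out)
    return passed

print(monitored_run(["plan A", "atrocity footage", "plan B"], suffering_score))
```

Here only "plan A" gets through; the same wrapper could in principle score the AI's predictions or internal reasoning traces rather than its final outputs.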
The naive form of this idea is bound to fail. The neural network would merely learn to detect suffering in images, not the abstract concept. It would fail to recognize alien or “invisible” forms of suffering such as digital sentience. The crux of the matter is the definition of suffering, which raises a plethora of philosophical issues.
An ideal formalization would be comprehensive and at the same time easy to implement in machine learning systems. I suspect that reaching that ideal is difficult, if not impossible, which is why we should also look into heuristics and approximations. Crucially, suffering is “simpler” in a certain sense than the entire spectrum of complex human values, which is why training neural networks – or other methods of machine learning – is more promising for suffering-focused AI safety than for the goal of loading human values.
If direct implementations turn out to be infeasible, we could look into approaches based on preference inference. Like any other preference, the preference to avoid suffering can potentially be learned by AI systems via (cooperative) inverse reinforcement learning. Alternatively, we might program AI systems to infer suffering from the actions and expressions of others. That way, if the AI observes an agent struggling to prevent an outcome[1], it should conclude that the realization of that outcome may constitute suffering.[2]
This requires sufficiently accurate models of other minds, which contemporary machine learning systems lack. It is, however, closer to the technical language of real-world AI systems than purely philosophical descriptions such as “a conscious experience with subjectively negative valence”.
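The struggle-based inference can be illustrated with a deliberately simple sketch: if an observed agent repeatedly expends effort to steer away from an outcome, we tentatively flag that outcome as one whose realization may constitute suffering. The trajectory format and the effort measure below are assumptions made purely for illustration.

```python
# Hedged sketch of inferring disvalued outcomes from observed behaviour.
# Each observation pairs an outcome the agent was heading toward with the
# (hypothetical, normalized) effort the agent spent avoiding it.
from collections import defaultdict

observations = [
    ("electric_shock", 0.9),
    ("electric_shock", 0.8),
    ("food_reward", 0.0),
    ("loud_noise", 0.7),
    ("food_reward", 0.1),
]

def infer_disvalued(obs, effort_threshold=0.5):
    """Flag outcomes whose average avoidance effort exceeds the threshold."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for outcome, effort in obs:
        totals[outcome] += effort
        counts[outcome] += 1
    return {o for o in totals if totals[o] / counts[o] >= effort_threshold}

print(infer_disvalued(observations))  # flags electric_shock and loud_noise
```

A real version would need a model of the agent's capabilities, since (as noted in the footnote) the relevant criterion is whether the agent *would* struggle if it could, not whether it actually does.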
Further research on the idea could focus on three areas:
- How difficult is it to detect suffering in various domains using present-day machine learning frameworks? Suitable testbeds include images, movies or fictional writing. The value of information of this empirical work is high because it informs our estimate of how promising such approaches are.
- What are realistic steps to advance towards learning the abstract concept of “suffering” as opposed to domain-specific instantiations? For example, we could equip a machine learning algorithm with an internal model of what suffering is, train it on a data set of images, and check whether it is able to detect suffering in other contexts such as fictional stories. (But this seems infeasible with current machine learning algorithms.)
- How can we “translate” the philosophical concept of suffering into more technical terms? If this is not tractable, what are suitable heuristics? See Caspar Oesterheld’s paper *Formalizing preference utilitarianism in physical world models* for an example of such work.
1. As Brian Tomasik points out in a comment, the AI may circumvent this criterion by removing the agent’s ability to struggle. A better criterion is whether the agent would struggle to prevent the outcome if it could.
2. We might consider the violation of a preference ethically relevant even if it does not constitute hedonic suffering.