Imagine a data set of images labeled “suffering” or “no suffering”. For instance, suppose the “suffering” category contains images documenting war atrocities or factory farms, and the “no suffering” category contains innocuous images – say, a library. We could then use a neural network or another machine learning algorithm to learn to detect suffering from that data. In contrast to many AI safety proposals, this is feasible to at least some extent with present-day methods.
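To make the training setup concrete, here is a minimal sketch using plain logistic regression on toy feature vectors. The feature values and labels are invented stand-ins for image embeddings; a real system would train a deep network on raw pixels, but the supervised-learning loop is the same idea.

```python
# Toy sketch: a linear classifier trained on hypothetical feature vectors
# labeled "suffering" (1) vs. "no suffering" (0). The vectors are invented
# stand-ins for learned image embeddings.
import math

# Hypothetical 3-dimensional feature vectors.
data = [
    ([0.9, 0.8, 0.1], 1),  # e.g. war-atrocity imagery
    ([0.8, 0.9, 0.2], 1),  # e.g. factory-farm imagery
    ([0.1, 0.2, 0.9], 0),  # e.g. a library
    ([0.2, 0.1, 0.8], 0),  # e.g. another innocuous scene
]

w = [0.0, 0.0, 0.0]
b = 0.0

def predict(x):
    """Sigmoid of a linear score: estimated probability of 'suffering'."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Plain gradient descent on the cross-entropy loss.
for _ in range(2000):
    for x, y in data:
        grad = predict(x) - y
        for i in range(len(w)):
            w[i] -= 0.1 * grad * x[i]
        b -= 0.1 * grad

print([round(predict(x)) for x, _ in data])  # recovers the training labels
```

The point of the sketch is only that "detect suffering in this domain" reduces to ordinary supervised classification once labeled data exists.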
The neural network could monitor the AI’s output, its predictions of the future, or any other data streams, and possibly shut the system off to make it fail-safe. Additionally, it could scrutinize the AI’s internal reasoning processes, which might help prevent mindcrime.
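A fail-safe monitor of this kind can be sketched as a thin wrapper around the AI's output stream. The classifier below is a hypothetical stub; the wrapper simply halts the system once the suffering score crosses a threshold.

```python
# Hedged sketch of a fail-safe monitor. `suffering_score` is a stub standing
# in for a learned classifier; the wrapper shuts the stream down when the
# score exceeds a threshold instead of letting the output through.
from typing import Callable, Iterable, List

def suffering_score(output: str) -> float:
    """Stub for a learned classifier: P(output involves suffering)."""
    return 0.95 if "atrocity" in output else 0.05

def monitored_run(outputs: Iterable[str],
                  score: Callable[[str], float],
                  threshold: float = 0.5) -> List[str]:
    """Pass outputs through until the monitor trips, then shut off."""
    passed = []
    for out in outputs:
        if score(out) >= threshold:
            break  # fail-safe: halt the system rather than emit the output
        passed.append(out)
    return passed

print(monitored_run(["plan A", "atrocity footage", "plan B"], suffering_score))
```

Here only "plan A" gets through; the same wrapper could in principle score the AI's predictions or internal reasoning traces rather than its final outputs.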
The naive form of this idea is bound to fail. The neural network would merely learn to detect suffering in images, not the abstract concept. It would fail to recognize alien or “invisible” forms of suffering such as digital sentience. The crux of the matter is the definition of suffering, which raises a plethora of philosophical issues.
An ideal formalization would be comprehensive and at the same time easy to implement in machine learning systems. I suspect that reaching that ideal is difficult, if not impossible, which is why we should also look into heuristics and approximations. Crucially, suffering is “simpler” in a certain sense than the entire spectrum of complex human values, which is why training neural networks – or other methods of machine learning – is more promising for suffering-focused AI safety than for the goal of loading human values.
If direct implementations turn out to be infeasible, we could look into approaches based on preference inference. Like any other preference, the preference to avoid suffering can potentially be learned by AI systems via (cooperative) inverse reinforcement learning. Alternatively, we might program AI systems to infer suffering from the actions and expressions of others. That way, if the AI observes an agent struggling to prevent an outcome[1], it should conclude that the realization of that outcome may constitute suffering.[2]
This requires sufficiently accurate models of other minds, which contemporary machine learning systems lack. It is, however, closer to the technical language of real-world AI systems than purely philosophical descriptions such as “a conscious experience with subjectively negative valence”.
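The struggle-based inference can be illustrated with a deliberately simple sketch: if an observed agent repeatedly expends effort to steer away from an outcome, we tentatively flag that outcome as one whose realization may constitute suffering. The trajectory format and the effort measure below are assumptions made purely for illustration.

```python
# Hedged sketch of inferring disvalued outcomes from observed behaviour.
# Each observation pairs an outcome the agent was heading toward with the
# (hypothetical, normalized) effort the agent spent avoiding it.
from collections import defaultdict

observations = [
    ("electric_shock", 0.9),
    ("electric_shock", 0.8),
    ("food_reward", 0.0),
    ("loud_noise", 0.7),
    ("food_reward", 0.1),
]

def infer_disvalued(obs, effort_threshold=0.5):
    """Flag outcomes whose average avoidance effort exceeds the threshold."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for outcome, effort in obs:
        totals[outcome] += effort
        counts[outcome] += 1
    return {o for o in totals if totals[o] / counts[o] >= effort_threshold}

print(infer_disvalued(observations))  # flags electric_shock and loud_noise
```

A real version would need a model of the agent's capabilities, since (as noted in the footnote) the relevant criterion is whether the agent *would* struggle if it could, not whether it actually does.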
Further research on the idea could focus on three areas:
- How difficult is it to detect suffering in various domains using present-day machine learning frameworks? Suitable testbeds include images, movies or fictional writing. The value of information of this empirical work is high because it informs our estimate of how promising such approaches are.
- What are realistic steps to advance towards learning the abstract concept of “suffering” as opposed to domain-specific instantiations? For example, we could equip a machine learning algorithm with an internal model of what suffering is, train it on a data set of images, and check whether it is able to detect suffering in other contexts such as fictional stories. (But this seems infeasible with current machine learning algorithms.)
- How can we “translate” the philosophical concept of suffering into more technical terms? If this is not tractable, what are suitable heuristics? See Caspar Oesterheld’s paper *Formalizing preference utilitarianism in physical world models* for an example of such work.
1. As Brian Tomasik points out in a comment, the AI may circumvent this criterion by removing the agent’s ability to struggle. A better criterion is whether the agent would struggle to prevent the outcome if it could.
2. We might consider the violation of a preference ethically relevant even if it does not constitute hedonic suffering.