Superintelligence Risk Mitigation

Setting the rules for controlling Superintelligence

We have to make an utmost endeavour to cover  all conceivable risks resulting from development of Superintelligencey. Otherwise, we may deliver the agent that will annihilate Humanity. The good news is that there are already some countermeasures in place, which aim at minimizing the risk of Superintelligence deployment. Until 2016, AI development was broadly guided by Three Laws of Robotics described by the science fiction writer Isaac Asimov in 1942 in a short story “Runaround” and later on repeated in his 1950 book “I Robot”. They are:

  • A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  • A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  • A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws (Assimov, 1950).

These principles have now been replaced by 23 Asilomar Principles agreed at the Beneficial AI Conference at Asilomar, California on 5th January 2017 and signed by over 2,000 AI experts in the first three months. It is intended to be constantly evolving as new AI challenges appear. They have been split into three areas:

Research issues

  1. Research Goal: The goal of AI research should be to create not undirected intelligence, but beneficial intelligence
  2. Research Funding: Investments in AI should be accompanied by funding for research on ensuring its beneficial use, including thorny questions in computer science, economics, law, ethics, and social studies, such as:
  • How can we make future AI systems highly robust, so that they do what we want without malfunctioning or getting hacked?
  • How can we grow our prosperity through automation while maintaining people’s resources and purpose?
  • How can we update our legal systems to be more fair and efficient, to keep pace with AI, and to manage the risks associated with AI?
  • What set of values should AI be aligned with, and what legal and ethical status should it have?
  1. Science-Policy Link: There should be constructive and healthy exchange between AI researchers and policy-makers
  2. Research Culture: A culture of cooperation, trust, and transparency should be fostered among researchers and developers of AI
  3. Race Avoidance: Teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards

Ethics and Values

 Safety: AI systems should be safe and secure throughout their operational lifetime, and verifiably so where applicable and feasible

  1. Failure Transparency: If an AI system causes harm, it should be possible to ascertain why
  2. Judicial Transparency: Any involvement by an autonomous system in judicial decision-making should provide a satisfactory explanation auditable by a competent human authority
  3. Responsibility: Designers and builders of advanced AI systems are stakeholders in the moral implications of their use, misuse, and actions, with a responsibility and opportunity to shape those implications
  4. Value Alignment: Highly autonomous AI systems should be designed so that their goals and behaviours can be assured to align with human values throughout their operation.
  5. Human Values: AI systems should be designed and operated so as to be compatible with ideals of human dignity, rights, freedoms, and cultural diversity.
  6. Personal Privacy: People should have the right to access, manage and control the data they generate, given AI systems’ power to analyse and utilize that data.
  7. Liberty and Privacy: The application of AI to personal data must not unreasonably curtail people’s real or perceived liberty.
  8. Shared Benefit: AI technologies should benefit and empower as many people as possible.
  9. Shared Prosperity: The economic prosperity created by AI should be shared broadly, to benefit all of humanity.
  10. Human Control: Humans should choose how and whether to delegate decisions to AI systems, to accomplish human-chosen objectives.
  11. Non-subversion: The power conferred by control of highly advanced AI systems should respect and improve, rather than subvert, the social and civic processes on which the health of society depends.
  12. AI Arms Race: An arms race in lethal autonomous weapons should be avoided.

Longer-term Issues 

  1. Capability Caution: There being no consensus, we should avoid strong assumptions regarding upper limits on future AI capabilities.
  2. Importance: Advanced AI could represent a profound change in the history of life on Earth, and should be planned for and managed with commensurate care and resources.
  3. Risks: Risks posed by AI systems, especially catastrophic or existential risks, must be subject to planning and mitigation efforts commensurate with their expected impact.
  4. Recursive Self-Improvement: AI systems designed to recursively self-improve or self-replicate in a manner that could lead to rapidly increasing quality or quantity must be subject to strict safety and control measures.
  5. Common Good: Superintelligence should only be developed in the service of widely shared ethical ideals, and for the benefit of all humanity rather than one state or organization.

How to tame Superintelligence?

Stephen Hawking, the renowned physicist, and who was one of the most alarmed people among the scientists regarding the risks posed by Superintelligence said that “if Superintelligence isn’t the best thing to ever happen to us, it will probably be the worst”.

That’s why people like Nick Bostrom, one of the top experts on Superintelligence, think we need to invent some controlling methods to minimize the risk of Artificial General Intelligence (AGI) going terribly wrong. He defines these methods in his book “Superintelligence” (Bostrom, 2014). For our purpose I will try to provide a layman’s description of what it really means and what are the consequences for controlling the risks emerging from Superintelligence. The most important point is that these controlling methods must be in place before Superintelligence arrives, i.e. latest in this decade.

Nick Bostrom identifies the ‘control problem’ as the ‘principal-agent’ problem, a well-known subject in economic and regulatory theory. The problem can be looked from two perspectives:

  • The first ‘principal-agent’ problem: e.g. the problem faced by a client wanting to buy a house and employing an estate agent to fulfill exactly the client’s objective. In this scenario, the client is the principal (the person who wants some task to be performed in accordance with his interests), and an estate agent is the agent (the person carrying out the tasks on my behalf).
  • The second ‘principal-agent’ problem: e.g. the problem where the estate agent thinks primarily about his own interest e.g. to get the best possible agent’s fee

He dedicates a whole chapter to identify potential solutions. Since the publication of the book in 2013, they have been widely discussed in the AI community on how to turn them into practical tools. Bostrom splits them into two groups: Capability Control and Motivation Selection, which I have tried to put in as much as possible in layman’s terms, just for the purpose of this article.

How to Control the Capabilities of Superintelligence?

The ‘Control Problem’ involves human principals (sponsors or financing institutions) and human agents (AI developers). At some stage there will be an AI project to develop Superintelligence (AGI). It may be launched by one of the big IT/AI companies such as Google, Microsoft, IBM or Amazon. But it is also quite likely it will be initiated by some wealthy AI backers, which is already happening. Probably the most prominent among such people deeply involved in various top AI initiatives is Elon Musk. He is the founder of Paypal – a credit transaction payment system, SpaceX – rocket company, Hyperloop – a network of underground trains travelling at speeds of nearly 1,000 km/h, Neuralink a brain-computer interface venture, and several other large scale initiatives such as sending 1 million people to Mars by 2050. The second one is Jeff Bezos, the founder of Amazon and the richest man on the planet with assets of over $150bn, who is deeply involved in AI. His micro AI-product called Alexa Echo was sold to over 20m people by the end of 2017.

Such sponsors will need to ensure that AI developers carry out the project in accordance with their needs. They would also want to ascertain that the developers understand their sponsors’ needs correctly and that the developed AI product, which may turn into Superintelligence, will also understand and obey humans as expected. Failure to address this problem could become an existential risk for Humanity.

Bostrom specifies four possible solutions, which he calls the “Capability Control Method”. Its purpose is to tune the capabilities of superintelligent agent to the requirements of humans in such a way that we stay safe and have the ultimate control on what Superintelligence can do.

Boxing Methods of Control

This is perhaps the simplest and most intuitively compelling method of controlling Superintelligence – putting it into a metaphorical “box” i.e. a set of protocols that constrain the way, in which Superintelligence could interact with the world, always under the control of humans. It is often proposed that as long as Superintelligence is physically isolated and restricted, or “boxed”, it will be harmless.

A typical Superintelligence will be a superbly advanced computer with sophisticated algorithms (procedures how to process information) and will have three components: a sensor (or input channel); a processor; and an actuator (or output channel). Such superintelligent agent will receive inputs from the external world via its sensors e.g. Wi-Fi, radio communication, chemical compounds, etc. It will then process those inputs using its processor (computer) and will then respond (output information or perform some action using its actuators). An example of such an action could be advising on which decision should be made, to switch on or off certain engines, or completing financial transactions. But they could also be potentially significant e.g. whether a chemical compound would be safe for humans at a given dose.

However, it is highly unlikely that a superintelligent agent could be boxed in this way in the long term. Once the agent becomes superintelligent, it could persuade someone (the human liaison, most likely) to free it from its box and thus it would be out of human control. There are a number of ways of achieving this goal, some are included in the Bostrom’s book, such as:

  • Offering enormous wealth, power and intelligence to its liberator
  • Claiming that only it can prevent an existential risk
  • Claiming it needs outside resources to cure all diseases
  • Predicting a real-world disaster (which then occurs), then claiming it could have been prevented had it been let out

To counter such possibilities, there are some solutions that would decrease the chance of superintelligent agent escaping the ‘Box’, such as:

  • Physically isolating Superintelligence and permitting it zero control of any machinery
  • Limiting the Superintelligence’s outputs and inputs with regards to humans
  • Programming the Superintelligence with deliberately complex logic
  • Periodic resets of the Superintelligence’s memory
  • A virtual world between the real world and the AI, where its unfriendly intentions would be first revealed

However, as you yourself maybe aware, physical isolation is a solution that could be extremely difficult to control. It is already being severally thwarted by the rapid spread of Internet of Things (IoT), little gadgets like opening the door, switching on/off ovens, fridges, lights etc., which could be controlled at your home while you are away on the other side of the globe.

Incentive Methods

Bostrom refers to the second capability control method as the “incentive” method. The idea seems to be that if you create the right “incentive environment” Superintelligence wouldn’t be able to act in an existentially threatening manner. This is in some way an analogy to how to bring up a child. A child has its own goals, which may not be good for itself or the people around it right now or in the future. So, a good teacher can motivate his child in such a way that it behaves in morally and socially acceptable ways.


“Stunting”, as the name implies, involves hampering or disabling Superintelligence in some way. A good example would be running Superintelligence on a slow hardware, reducing its memory capacity, or limiting the kind of data it can process. Bostrom argues that the use of stunting poses a dilemma. Either we stunt Superintelligence too much and it just becomes another “dumb” piece of software; or we stunt it too little and it would be capable of overcoming its disabilities. Getting the balance just right could be pretty tricky.


This is a different method of capability control. It involves building into any AI development project a set of “tripwires” which, if crossed, will lead to the project being shut down and destroyed. Bostrom identifies three types of tripwire:

  • Behavioural Tripwires: A network of detectors would be placed around a “boxed” AI in order to detect any attempt to breach the containment, e.g. any attempt to make radio communication.
  • Ability Tripwires: The Superintelligence would be regularly tested for its capabilities. If at any point new abilities are detected or if it gets close to some capability threshold, it would be shut down.
  • Content Tripwires: The Superintelligence’s internal processing could be constantly monitored for any suspicious changes, e.g. changes in how it represents its final values.

Bostrom thinks that tripwires could be useful, particularly during the development phase if used in conjunction with other methods. But, unsurprisingly, he also thinks that they too have shortcomings. An advanced Superintelligence could subvert a tripwire in an unexpected way. He also notes that project developers working on Superintelligence could grow impatient if tripwires repeatedly hamper their progress. They might undermine any safety advantage gained by the tripwire system.

Motivation Selection Method

The second set of tools to control the undesired behaviour of Superintelligence is to try to motivate it to pursue the goals that are in our (human) interest, and that is why this approach is called the “Motivation Selection Method”. John Danaher provides a summary of these methods in his article “Bostrom on Superintelligence: Limiting an AI’s Capabilities”, parts of which I have used to convey below the essence of Motivation Selection in a less technical way.

It is in some way an extension of the ‘Incentive Method’ from the Capability Control set of tools. Bostrom is clear as with the Control Problem approach, that this set of methods would have to be implemented before an AI achieves Superintelligence. Otherwise, the Superintelligence could have a decisive strategic advantage over human beings, and it may be impossible to constrain or limit it in any way.

That is why I have already stressed it before that we have really about one decade, till about 2030, to implement mechanisms of controlling Superintelligence.

Direct Specification

This involves programming the Superintelligence directly with the “right” set of motivations. What could go wrong if a robot always follows Asimov’s first law that I mentioned earlier? Of course, anyone who has read the book will know that a lot can go wrong. Laws and rules of this sort are vague. In specific contexts they could be applied in very odd ways, especially if the robot has a very logical or literalistic mind. Take the first law as an example. It says that a robot may not, through inaction, allow any human to come to harm. This implies that the robot must be at all times seeking to avoid possible ways, in which humans could come to harm. A superintelligent robot, with a decisive advantage over human beings, might decide that the safest thing to do would be to put all humans into artificially induced comas. It wouldn’t be great for them, but it would prevent them from coming to harm.

So, anyone who has studied the development and application of human laws will be familiar with this problem. The drafters of those laws can never fully anticipate every possible future application. The same will be true for AI programmers and coders.


The second suggested method of motivation selection is called “domesticity”. The analogy here might be with the domestication of wild animals. Dogs and cats have been successfully domesticated and tamed from wild animals over many generations. The suggestion is that something similar could be done with superintelligent agents. They could be domesticated. The classic example of a domesticated superintelligence would be the so-called “oracle” device. This functions as a simple question-answering system. Its final goal is to produce correct answers to any questions it is asked. Even a simplistic micro AI gadget like “Alexa”, that I mentioned earlier, can already do that. Superintelligent agents would usually do just that from within a confined environment (a “box”). This would make it domesticated, in a sense, since it would be happy to work in a constrained way within a confined environment.

However, giving Superintelligence the seemingly benign goal of giving correct answers to questions could have startling implications. To answer the question, Superintelligence may require quite a lot of information, as anyone that has tried to talk with Google Home or Amazon Alexa appreciates. Once that information is stored in its memory, it will make the superintelligent agent more knowledgeable and more capable, increasing the risk of its misbehaviour, including a potential ‘runaway’, i.e. a total loss of control by humans.

Indirect Normativity

The third possible method of motivation selection is Indirect Normativity. The idea here is that instead of directly programming ethical or moral standards into Superintelligence, you give it some procedure for determining its own ethical and moral standards. If you get the procedure just right, Superintelligence might turn out to be benevolent and perhaps even supportive of human interests and needs. Superintelligence is to function much like an ideal, hyper-rational human being, which can “achieve that which we would have wished it to achieve if we had thought about the matter long and hard”.

One of the problems with this method of motivation selection is ensuring you’ve got the right norm-picking procedure. Getting it slightly wrong could have devastating implications, particularly if a superintelligent machine has a decisive strategic advantage over us.


This is quite different from the methods discussed thus far. There, the assumption was that Superintelligence would be delivered from scratch through a series of ever more intelligent AI agents. This assumes that we start with a system that has the “right” motivations and we increase its intelligence from there. The obvious candidate for such a system would be a human being (or a group of human beings). We could simply take their brains, with their evolved and learned motivations, and augment their capabilities until we reach a point of Superintelligence. (Ignore, for now, the ethics of doing this.) Such an approach is favoured by Transhumanists, who envisage that at some stage human species will merge with Superintelligence.

As Bostrom notes, augmentation might look pretty attractive if all other methods turn out to be too difficult to implement. Furthermore, it might end up being a “forced choice”. If augmentation is the only route to Superintelligence, then augmentation is, by default, the only available method of motivation selection. Otherwise, if the route to Superintelligence is via the development of AI, augmentation is not on the cards.

But a “solution” to the control problem by augmentation is not perfect either. If the system we augment has some inherent biases or flaws, we may simply end up exaggerating those flaws through a series of augments. It might be wonderful to augment a Florence Nightingale to Superintelligence, but it might be nightmarish to do the same with a Hitler. Furthermore, even if the starter-system is benevolent and non-threatening, the process of augmentation could have a corrupting effect.

Applying redefined Values of Humanity to Superintelligence

It is clear from the last paragraph in the preceding section that the least risky strategy, for delivering Superintelligence would be the process of augmentation (although it may also have some inherent dangers. Apart from enormous technical problems that will emerge, the equally important issue will be the kind of values the new augmented species should have, which would become more than just a digital Superintelligence. That’s why the need to define top values of Humanity, the foundation of human ethics, is so important.

These values will constitute the new Human Values Charter. We will then need to establish certain procedures, perhaps enshrined in laws regarding the transfer of these values into various shapes and types of AI robots and humanoids. This would create a kind of a framework where the Human Values Charter becomes the core of every AI agent’s ‘brain’. Such framework would be a boundary beyond which no AI agent could act and implemented using certain guidelines, such as 23 Asilomar principles mentioned earlier. Only then could the developers define specific goals and targets for such AI agents always referencing the Human Values Charter as constraints to agents’ objectives. In practical terms the best way forward could be to embed the Human Values Charter into a sealed chip that cannot be tampered, perhaps using quantum encryption, and implant it into any intelligent AI agent. Such procedure could be monitored by an independent global organization, which would manufacture and distribute those chips and licence the agents before they can enter public space. But even if such a chip is developed, there could still be a danger that confusion might arise from misinterpretation of what is expected from Superintelligence.

There are a number of proposals on how to ensure that Superintelligence acquires from humans only those values that we want. Nick Bostrom mentions them in his book “Superintelligence: Paths, Dangers, Strategies”, especially in the chapter on ‘Acquiring Values’, where he has developed quite a complex theory on the very process of acquiring values by Superintelligence.

The techniques specified by him aim to ensure the true representation of what we want. They are very helpful indeed, but as Bostrom himself acknowledges, it does not resolve the problem of how we ourselves interpret those values. And I am not talking just about agreeing the Charter of Human Values by Humanity, but rather expressing those values in such a way that they have a unique, unambigous meaning. That is the well-known issue of “Do as I say”, since quite often it is not exactly what we really mean. Humans communicate not just using words but also symbols and quite often additionally re-enforce it with body language to avoid the misinterpretation where double meaning of words is possible. Would it be possible to communicate with Superintelligence using body language in both directions? This is a well-known issue when writing emails. To avoid misinterpretation by relying on the meaning of words alone, we use emoticons.

How then would we further minimize misunderstanding? One possibility would be, as John Rawls, writes in his book “A Theory of Justice” to create algorithms, which would include statements like this:

  • do what we would have told you to do if we knew everything you knew
  • do what we would’ve told you to do if we thought as fast as you did and could consider many more possible lines of moral argument
  • do what we would tell you to do if we had your ability to reflect on and modify ourselves

We may also envisage within the next 20 years a scenario where the Superintelligence is “consulted”, on which values to adapt and why. There could be two options applied here (if we humans have still an ultimate control):

  1. In the first one the Superintelligence would work closely with Humanity and essentially would be under the total control of humans
  2. The second option, and I am afraid more likely, assumes that once Superintelligence achieves the benevolent Technological Singularity stage then it will probably be much cleverer than any human being in any aspect of human thinking or abilities. At such a moment in time, it will increase its intelligence exponentially and in a few weeks it would be millions of times more intelligent than any human being, creating a Technological Singularity event. Even if it is benevolent and has no ulterior motives, it may see that our thinking is constrained or far inferior to what it knows and how it sees what is ‘good’ for humans.

Therefore, it could over-rule humans anyway, for ‘our own benefit’, like a parent which sees that what a child wants is not good for it in the longer term because it simply cannot comprehend all the consequences and implications of agreeing to what a child wants. The question remains how Superintelligence would deal with values that are strongly correlated with our feelings and emotions such as love or sorrow. In the end, emotions make us predominantly human and they are quite often dictating us solutions that are utterly irrational. What would Superintelligence choice would be, if it based its decisions on rational arguments only? And what would it be if it also included, if possible, emotional aspects of human activity, which after all, makes us more human but less efficient and from the evolutionary perspective more vulnerable and less adaptable?

The way Superintelligence behaves and how it treats us will largely depend on whether at the Singularity point it will have at least basic consciousness. My own feeling is that if a digital consciousness is at all possible, it may arrive before the Singularity event. In such case, one of the mitigating solutions might be, assuming all the time that Superintelligence will from the very beginning act benevolently on behalf of Humanity, that decisions it would propose would include an element of uncertainty by taking into account some emotional and value related aspects.

In the long-term, I think there is a high probability that human race as we know it will evolve becoming a different non-biological species. In a sense, this would mean the extinction of a human species. Why should we be the only species not to become extinct? After all, everything in the universe is subject to the law of evolution. We have evolved from apes and we will evolve into a new species, unless some existential risks will annihilate civilization before then. We can speculate whether there will be augmented humans, synthetic humans, or entirely new humanoids, i.e. mainly digital humans with uploaded human minds or even something entirely different that we cannot yet envisage. It is quite likely, that humans will co-exist with two or even three species for some time but ultimately, we humans in a biological form will be gone at some stage.

The next question is should we “allow” the new breed of humanoids to define ethics for themselves or should they be jump-started by our ethics. In my view, we should try as much as possible make a transfer of human ethics into the new species. Therefore, whichever organisation takes over the task of saving Humanity, there is an urgent need to formally agree the renewed set of Universal Values and Universal Rights of Humanity so that the world could reduce the level of existential risks before the AI-based humanoids (Transhumans) adopt them. Beyond that, however, when our human ethics is re-defined at some stage, the values with the corresponding ethics are bound to change. Ethics is not static.

Leave a Reply

Your email address will not be published.

sixteen − 10 =