Human-aligned artificial intelligence

How can we ensure artificial intelligence systems act in accordance with human values?

Interested in working on this research direction? Apply for our coaching

This profile is tailored towards students studying computer science, maths, philosophy and ethics, and psychology and cognitive sciences. However, we expect there are also valuable open research questions that could be pursued by students in other disciplines.

Why is this a pressing problem?

Artificial intelligence is becoming increasingly powerful. AI systems can solve college-level maths problems, beat champion human players at multiple games, and generate high-quality images. They can be used in many ways that could help humanity, for example by identifying cases of human trafficking, predicting earthquakes, helping with medical diagnosis and speeding up scientific discovery.

The AI systems described above are all ‘narrow’: they are powerful in specific domains, but they can’t do most tasks that humans can. Nonetheless, narrow AI systems present serious risks as well as benefits. They can be designed to cause enormous harm – lethal autonomous weapons are one example – or they can be intentionally misused or have harmful unintended effects, for example due to algorithmic bias.

AI is also quickly becoming more general. One example is large language models (LLMs): AI systems that can perform a wide range of language tasks, including unexpected ones such as writing code and translating between languages. You could try using ChatGPT to get a sense of current large language model capabilities.

It seems likely that at some point, ‘transformative AI’ will be developed. This phrase refers to AI that ‘precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution.’ One way this could happen is if researchers develop ‘artificial general intelligence’ (AGI): AI that is at least as capable as humans across most domains. AGI could radically transform the world for the better and help tackle humanity’s most important problems. However, it could also do enormous harm, even threatening our survival, if it doesn’t act in alignment with human interests.

Work on making sure transformative AI is beneficial to humanity seems very pressing. Multiple predictions (see here, here and here) suggest that transformative AI is likely within the next few decades, if not sooner. A majority of experts surveyed in 2022 believed there was at least a 5% chance of AI leading to extinction or similarly bad outcomes, while a near majority (48%) believed there was at least a 10% chance.

Working on preventing these outcomes also seems very neglected – 80,000 Hours estimates that around 1,000 times more money is being spent on speeding up the development of transformative AI than on reducing its risks. Technical research to ensure AI systems are aligned with human values and benefit humanity therefore seems highly important.

Explore existing research

Research papers

Explore the links below for overviews of research in this area:

Organisations

  • DeepMind, a research lab developing artificial general intelligence. The organisation as a whole focuses on building more capable systems, but it has teams dedicated to AI safety.
  • MIRI, a non-profit studying the mathematical underpinnings of artificial intelligence.
  • Redwood Research, an organisation conducting applied AI alignment research.
  • OpenAI, an AI research and deployment company developing artificial general intelligence. They have teams focused on AI safety; see the discussion of safety teams at OpenAI in this podcast episode.
  • Anthropic, an AI safety company focused on empirical research.
  • Alignment Research Center, an organisation attempting to produce alignment strategies that could be adopted in industry today.
  • Cooperative AI, an organisation supporting research that will improve the cooperative intelligence of advanced AI.
  • The Center for AI Safety, a nonprofit doing technical research and field-building.
  • Center on Long-term Risk, a research institute aiming to address worst-case risks from the development and deployment of advanced AI systems.

Academic research groups

Some academic research groups working on technical AI safety research are:

Find a thesis topic

If you’re interested in working on this research direction, below are some ideas on what would be valuable to explore further. If you want help refining your research ideas, apply for our coaching!

Computer science

Examples of problems you could work on are:

Research agendas you could explore for further inspiration include:

You could also look at:

Maths

  • Areas of maths suggested as valuable to pursue in this post include infra-Bayesianism, finite factored sets, models of causality and Cartesian frames.
  • Embedded agency is the (almost philosophical) problem of how systems can reason about themselves; see the illustrated series by Abram Demski.
  • The Solomonoff prior and AIXI (see this paper) propose theoretical universal models and agents, drawing on logic and Kolmogorov complexity; the standard definitions are sketched after this list.
  • The concept of logical induction joins complexity, logic and markets. To learn more, start with this paper from MIRI, an illustrated story and this post.
  • Mechanistic interpretability is a subfield of AI interpretability that includes various problems maths students could work on – see this post for more details.
  • Value learning and inverse reinforcement learning (IRL) involve trying to infer the goals and values of an agent from observations. This requires machine learning expertise, but also maths and game theory, plus knowledge from behavioural psychology, specifically about revealed preferences. Useful resources to look at are this value learning series, this IRL survey and this impossibility result; a toy example is sketched after this list.
  • Eric Drexler’s Comprehensive AI Services is an alternative framing of the superintelligence problem that draws on many mathematical areas. See this summary from Rohin Shah and the original report.
  • Understanding psychology and human rationality better would help us understand and represent people’s values. See this mathematical model of human bounded rationality for an example of this kind of research.
  • The framework of predictive processing proposes a free-energy-based model of human and animal cognition; it draws on physics, statistics and neuroscience and may have implications for how we understand human values. To learn more, see this paper by Karl Friston, and this speculative extension of the framework to a multi-agent model of the mind. Work on active inference and compressed sensing could also be relevant to AI safety research.
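
To make the Solomonoff prior item above a little more concrete, here are the textbook definitions of Kolmogorov complexity and the Solomonoff prior with respect to a universal prefix machine U (background material, not content taken from the linked resources):

```latex
% Kolmogorov complexity of a string x: the length of the shortest
% program p that makes the universal prefix machine U output x.
K_U(x) = \min \{\, |p| : U(p) = x \,\}

% Solomonoff prior: the probability that U outputs something beginning
% with x when run on uniformly random bits; shorter (simpler) programs
% dominate, giving a formal Occam's razor that the AIXI agent builds on.
M(x) = \sum_{p \,:\, U(p) = x*} 2^{-|p|}
```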
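
And as a toy illustration of the value learning / inverse reinforcement learning item, the sketch below simulates a Boltzmann-rational agent whose reward is linear in some made-up action features, then recovers the reward weights by maximum likelihood from the agent’s observed choices. The features, weights and hyperparameters are illustrative assumptions, not drawn from the linked resources.

```python
# Toy inverse reinforcement learning: infer reward weights from the choices
# of a Boltzmann-rational agent. Everything here is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features, n_observations = 5, 3, 2000

# Each action is described by a feature vector; reward is linear in features.
features = rng.normal(size=(n_actions, n_features))
true_w = np.array([1.0, -2.0, 0.5])  # the "human's" hidden reward weights

def boltzmann_policy(w, beta=1.0):
    """Choice probabilities of a Boltzmann-rational agent with reward weights w."""
    logits = beta * features @ w
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Simulate demonstrations from the agent acting on the true reward weights.
demos = rng.choice(n_actions, size=n_observations, p=boltzmann_policy(true_w))
counts = np.bincount(demos, minlength=n_actions)

# Maximum-likelihood IRL: gradient ascent on the log-likelihood of the demos.
w_hat = np.zeros(n_features)
for _ in range(1000):
    p = boltzmann_policy(w_hat)
    # Gradient = observed feature counts minus expected feature counts.
    grad = counts @ features - n_observations * (p @ features)
    w_hat += 0.1 * grad / n_observations

print("true weights:     ", true_w)
print("recovered weights:", np.round(w_hat, 2))  # close to true_w up to sampling noise
```

Real IRL work replaces this stub setting with sequential decision problems and has to confront the identifiability issues discussed in the impossibility result linked above.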

Resources you could explore for further inspiration include:

Philosophy and ethics

‘Like most nascent fields, AI safety’s concepts are still nebulous, imprecise, and ill-defined. Clarifying this conceptual territory is a task that philosophers are particularly fit to handle.’ The Center for AI Safety (CAIS)

Previous philosophical work that has contributed to the field of AI safety includes Nick Bostrom’s research on motivation and intelligence in artificial agents, Peter Railton’s research on moral learning in humans and how this might inform AI alignment, and work on how to develop truthful AI systems. CAIS suggests philosophers could explore questions such as how to incorporate ‘moral uncertainty into an AI’s decision-making in practice’ and how to avoid ‘agents tasked with pursuing a broad set of goals…[developing]…power-seeking tendencies.’

Richard Ngo’s post Technical AI Safety Research outside of AI suggests philosophers could explore ‘various questions in decision theory, logical uncertainty and game theory relevant to agent foundations’ (see this research agenda for more information).

See the posts and research agendas below for other ideas.

Psychology and cognitive sciences

Possible questions and sources of further ideas include:

  • “How closely linked is the human motivational system to our intellectual capabilities – to what extent does the orthogonality thesis apply to human-like brains? What can we learn from the range of variation in human motivational systems (e.g. induced by brain disorders)?” (Richard Ngo’s post Technical AI Safety Research outside of AI)
  • “Can we transfer insights from human psychology to the cognition and behavior of AI? Can our understanding of human cognition help to interpret AI systems and ensure that they are safe?” (Psychology for Effectively Improving the Future)

The paper AI Safety Needs Social Scientists notes that people cannot always accurately report their preferences, while aligning AI systems may require training them on human preferences. Psychology research could therefore be useful for measuring the gap between people’s reported and real preferences and for decreasing it. More specifically, psychologists could work on improving the proposed AI alignment technique of safety via debate; see the research paper to learn more, and the toy sketch below for the shape of the protocol. See also this post, which explores ways the idea that ‘AI safety needs social scientists’ might be misinterpreted, and the discussion in its comments for ideas on which research skills are needed for work on AI safety.
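
As a rough sketch of the structure of safety via debate (only the protocol’s shape; the debater and judge functions below are placeholder stand-ins, not real models), two debaters argue for different answers and a judge then picks the more convincing one:

```python
# Minimal, hypothetical sketch of the debate protocol: two debaters take
# turns adding statements to a transcript, then a judge picks a winner.
from typing import Callable, List, Tuple

Statement = str
Debater = Callable[[str, List[Statement]], Statement]
Judge = Callable[[str, List[Statement]], str]

def run_debate(question: str, debater_a: Debater, debater_b: Debater,
               judge: Judge, n_rounds: int = 3) -> Tuple[str, List[Statement]]:
    """Alternate statements between two debaters, then ask the judge for a verdict."""
    transcript: List[Statement] = []
    for _ in range(n_rounds):
        transcript.append("A: " + debater_a(question, transcript))
        transcript.append("B: " + debater_b(question, transcript))
    return judge(question, transcript), transcript

# Toy usage with stub debaters and a stub judge.
question = "Is 91 a prime number?"
debater_a = lambda q, t: "91 = 7 x 13, so it is not prime."
debater_b = lambda q, t: "91 is odd and not divisible by 3 or 5, so it is prime."
judge = lambda q, t: "A"  # in the real proposal, a human evaluates the transcript
verdict, transcript = run_debate(question, debater_a, debater_b, judge, n_rounds=1)
print(verdict, transcript)
```

In the original proposal the judge is a human, and the hope is that judging a debate between capable agents is easier than evaluating their answers directly; the social-science question is how reliably human judges actually perform in this role.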

‘Brain enthusiasts’ in AI Safety explores how a background in neuroscience could help researchers contribute to AI safety research and suggests research projects that could be valuable to pursue using these skills. These include deciphering human values and applying data analysis techniques from computational neuroscience to interpreting artificial neural networks (illustrated briefly below). See all the suggested research topics here.
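
As one concrete (and purely illustrative) example of that last point, the sketch below applies representational similarity analysis, a standard computational-neuroscience method, to compare two sets of responses to the same stimuli; the random matrices are placeholders for recorded neural data and a network’s hidden activations.

```python
# Toy representational similarity analysis (RSA): compare the dissimilarity
# structure of "brain" responses and network activations over the same stimuli.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_stimuli = 20

# Rows = stimuli, columns = voxels/units (placeholder data).
brain_responses = rng.normal(size=(n_stimuli, 100))
model_activations = rng.normal(size=(n_stimuli, 512))

# Representational dissimilarity: pairwise distances between stimuli.
rdm_brain = pdist(brain_responses, metric="correlation")
rdm_model = pdist(model_activations, metric="correlation")

# RSA score: rank correlation between the two dissimilarity structures.
rho, _ = spearmanr(rdm_brain, rdm_model)
print(f"RSA similarity between the two representations: {rho:.3f}")
```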

See the links below for other research questions and explorations of how a background in psychology or cognitive science could be relevant to AI safety research:

  

Further resources

Keep learning

Learn more about the importance of this research direction:

 

Explore the links below for overviews of research in this area:

 

Online courses

Get 1:1 advice on working on this research direction

Research fellowships, internships and programmes

If you’re interested in a programme that isn’t currently accepting applications, you can sign up for our newsletter to hear when it opens:

Summer research schools

 

Other fellowships, internships and programmes

  • AGI Safety Fundamentals is held at the University of Cambridge and virtually, and is most useful to those with technical backgrounds who are interested in working on beneficial AI.
  • The CAIS philosophy fellowship is a research fellowship aimed at clarifying risks from advanced AI systems, for philosophy PhD students or graduates.
  • The CHAI (Center for Human-Compatible Artificial Intelligence) research fellowship is for researchers who have or are about to obtain a PhD in computer science, statistics, mathematics or theoretical economics.
  • The CHAI internship, during which aspiring researchers work on a project under the supervision of a mentor.
  • AI Safety Camp, which connects participants with a mentor with whom they collaborate on open AI alignment problems during intensive co-working sprints.
  • AI Risk for Computer Scientists, a four-day series of workshops run by MIRI.
  • The OpenAI Residency, a pathway to a full-time role at OpenAI for researchers and engineers who don’t currently focus on artificial intelligence.
  • Refine, an incubator to help independent researchers build original research agendas related to AI safety.

Read advice on working on this research direction

Lists of resources for getting started


AI alignment career advice

 

Getting into grad school

Doing independent research

Other advice

 

Interviews with AI safety researchers from 80000 Hours

Find supervisors, courses and funding

Find community

  • Join the Future of Life Institute’s AI Existential Safety Community to apply for mini-grants, connect with other researchers and hear about conferences and other events.
  • This AI Safety reading group meets fortnightly online. 
  • You can also apply to join our community if you’re interested in meeting other students working on this research direction.
  • The AI Safety Accountability Programme is a Slack group for people who are interested in working on AI safety in the future and want to stay motivated while pursuing their goals.

Newsletters

Sign up for our Effective Thesis newsletter to hear about opportunities such as funding, internships and research roles.

Other newsletters that are useful for keeping up with advancements in AI are:

Contributors


This profile was last updated 24/01/2023. Thanks to Tomáš Gavenčiak for originally writing this profile. Thanks to Jan Kirchner, Neel Nanda, Rohin Shah, Martin Soto and Dan Hendrycks for helpful feedback on parts of this profile. All mistakes remain our own. Learn more about how we create our profiles.

Where next?

Keep exploring our other services and content

Apply for coaching

Want to work on this research direction? Apply for coaching to receive personalised guidance.

Mechanistic Interpretability

Mechanistic interpretability is a subfield of AI interpretability that could let us interpret what goals a neural network has learned. Read our profile to learn more about working on this subfield.

Learn more about what it's like to work on technical AI safety research

In this interview PhD student Stephen Casper discusses topics such as his view of the AI safety research landscape, the advice he'd give his past self during his undergraduate thesis and the most valuable research skills he thinks an early-career researcher can develop.

Existential and global catastrophic risks

Learn about doing research focused on protecting humanity from large-scale catastrophes.