Human-aligned artificial intelligence
How can we ensure artificial intelligence systems act in accordance with human values?
Interested in working on this research direction? Apply for our coaching
This profile is tailored towards students studying computer science, maths, philosophy and ethics, and psychology and cognitive sciences. However, we expect there are valuable open research questions that could be pursued by students in other disciplines.
Why is this a pressing problem?
Artificial intelligence is becoming increasingly powerful. AI systems can solve college-level maths problems, beat champion human players at multiple games and generate high-quality images. They can be used in many ways that could help humanity, for example by identifying cases of human trafficking, predicting earthquakes, helping with medical diagnosis and speeding up scientific discovery.
The AI systems described above are all ‘narrow’: they are powerful in specific domains, but they can’t do most of the tasks humans can. Nonetheless, narrow AI systems present serious risks as well as benefits. They can be designed to cause enormous harm – lethal autonomous weapons are one example – or they can be intentionally misused or have harmful unintended effects, for example due to algorithmic bias.
AI is also quickly becoming more general. One example is large language models (LLMs): AI systems that can perform a wide range of language tasks, including unexpected ones such as writing code or translating between languages. You could try ChatGPT to get a sense of current large language model capabilities.
It seems likely that at some point, ‘transformative AI’ will be developed. This phrase refers to AI that ‘precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution.’ One way this could happen is if researchers develop ‘artificial general intelligence’ (AGI): AI that is at least as capable as humans across most domains. AGI could radically transform the world for the better and help tackle humanity’s most important problems. However, it could also do enormous harm, even threatening our survival, if it doesn’t act in alignment with human interests.
Work on making sure transformative AI is beneficial to humanity seems very pressing. Multiple predictions (see here, here and here) suggest that transformative AI is likely within the next few decades, if not sooner. A majority of experts surveyed in 2022 believed there was at least a 5% chance of AI leading to extinction or similarly bad outcomes, while a near majority (48%) believed there was at least a 10% chance.
Working on preventing these outcomes also seems very neglected – 80,000 Hours estimates that 1,000 times more money is being spent on speeding up the development of transformative AI than on reducing its risks. Technical research to ensure AI systems are aligned with human values and benefit humanity therefore seems highly important.
Explore existing research
Explore the links below for overviews of research in this area:
- An annotated bibliography of recommended reading materials from CHAI.
- This interactive research map from the Future of Life Institute, setting out the technical research threads that could help build safe AI.
- AI Index 2021 Annual Report
- Neel Nanda’s overview of the AI alignment landscape
- Jacob Steinhardt’s AI Alignment research overview
Some organisations conducting technical AI safety research are:
- DeepMind, a research lab developing artificial general intelligence. The organisation as a whole focuses on building more capable systems, but it has teams focused on AI safety.
- MIRI, a non-profit studying the mathematical underpinnings of artificial intelligence.
- Redwood Research, an organisation conducting applied AI alignment research.
- OpenAI, an AI research and deployment company developing artificial general intelligence. They have teams focused on AI safety; see the discussion of safety teams at OpenAI in this podcast episode.
- Anthropic, an AI safety company focused on empirical research.
- Alignment Research Center, an organisation attempting to produce alignment strategies that could be adopted in industry today.
- Cooperative AI, an organisation supporting research that will improve the cooperative intelligence of advanced AI.
- The Center for AI Safety, a nonprofit doing technical research and field-building.
- Center on Long-term Risk, a research institute aiming to address worst-case risks from the development and deployment of advanced AI systems.
Some academic research groups working on technical AI safety research are:
- The Center for Human-Compatible Artificial Intelligence, a research group based at UC Berkeley and led by Stuart Russell.
- Jacob Steinhardt’s research group at UC Berkeley.
- Sam Bowman’s research group at NYU.
- David Krueger’s research group at the University of Cambridge.
- The Algorithmic Alignment Group, led by Dylan Hadfield-Menell at MIT.
- The Future of Humanity Institute, a multidisciplinary research institute at the University of Oxford.
- The Foundations of Cooperative AI Lab at Carnegie Mellon University.
- The Alignment of Complex Systems research group at Charles University, Prague.
- Stanford Center for AI Safety, led by Clark Barrett.
Find a thesis topic
If you’re interested in working on this research direction, below are some ideas on what would be valuable to explore further. If you want help refining your research ideas, apply for our coaching!
Examples of problems you could work on are:
- Emergent behaviour research, which is needed to better understand the problems at hand (see the blog posts More Is Different for AI and Characterizing Emergent Phenomena in Large Language Models to learn more).
- Mechanistic interpretability – see this post for more details, and the toy sketch after this list for a flavour of the kind of analysis involved.
- AI safety via debate – see this OpenAI blog post and Distill article, and this interview with Geoffrey Irving.
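To give a flavour of the kind of analysis mechanistic interpretability involves, here is a minimal sketch (assuming PyTorch ≥ 1.11 is installed) that extracts the per-head attention pattern from a tiny, untrained self-attention layer. Every dimension, token and model component in it is an invented placeholder; real interpretability work inspects trained models, often with dedicated tooling such as TransformerLens.

```python
# A minimal, illustrative sketch (not a real research pipeline): inspect the
# attention pattern of one head in a tiny, untrained self-attention layer.
# All dimensions, tokens and the model itself are hypothetical placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, d_model, n_heads = 6, 16, 2
tokens = ["The", "cat", "sat", "on", "the", "mat"]   # pretend token sequence

embed = nn.Embedding(100, d_model)                   # toy vocabulary of 100 ids
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

token_ids = torch.randint(0, 100, (1, seq_len))      # stand-in token ids
x = embed(token_ids)                                 # shape (1, seq_len, d_model)

# average_attn_weights=False (PyTorch >= 1.11) returns per-head patterns,
# shape (batch, n_heads, seq_len, seq_len).
_, attn_weights = attn(x, x, x, need_weights=True, average_attn_weights=False)

head0 = attn_weights[0, 0]                           # head 0: (seq_len, seq_len)
for i, tok in enumerate(tokens):
    j = head0[i].argmax().item()
    print(f"{tok:>4} attends most to {tokens[j]!r} (weight {head0[i, j].item():.2f})")
```

Inspecting which tokens a head attends to is only a first step; the research direction aims to reverse-engineer the algorithms such components implement inside trained networks.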
Research agendas you could explore for further inspiration include:
- Eliciting Latent Knowledge – Paul Christiano, Ajeya Cotra and Mark Xu
- AI Research Considerations for Human Existential Safety – Andrew Critch & David Krueger (2020)
- Unsolved Problems in ML Safety – Dan Hendrycks, Nicholas Carlini, John Schulman and Jacob Steinhardt
You could also look at:
- This list of research agendas from AI Safety Support
- Ideas for ML & AI Safety Research
Maths
- Valuable areas of maths to pursue suggested in this post include infra-Bayesianism, finite factored sets, models of causality and Cartesian frames.
- Embedded agency is the (almost philosophical) problem of how systems can reason about themselves; see this illustrated series by Abram Demski.
- The Solomonoff prior and AIXI (see this paper) propose theoretical universal models and agents, drawing on logic and Kolmogorov complexity.
- The concept of logical induction joins complexity, logic and markets. To learn more, start with this paper from MIRI, this illustrated story and this post.
- Mechanistic interpretability is a subfield of AI interpretability that includes various problems maths students could work on – see this post for more details.
- Value learning and inverse reinforcement learning involve trying to infer the goals and values of an agent from observations. This requires machine learning expertise, but also knowledge of maths and game theory, and insights from behavioural psychology about revealed preferences. Useful resources to look at are this value learning series, this IRL survey and this impossibility result. A toy reward-inference sketch follows this list.
- Eric Drexler’s Comprehensive AI Services is an alternative framing of the superintelligence problem that draws on many mathematical areas; see this summary from Rohin Shah and the original report.
- Understanding psychology and human rationality better would help us understand and represent people’s values. See this mathematical model of human bounded rationality for an example of this kind of research.
- The framework of Predictive Processing proposes a free-energy-based model of human and animal cognition, with links to physics, statistics and neuroscience, which may have implications for how we model human values. To learn more, see this paper by Karl Friston and this speculative extension of the framework to a multi-agent model of the mind. Work on active inference and compressed sensing could also be relevant to AI safety research.
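As a concrete (and heavily simplified) illustration of the reward-inference idea behind value learning and inverse reinforcement learning mentioned above, the sketch below fits the weights of a linear reward function by maximum likelihood under a Boltzmann-rational choice model. The options, features and ‘demonstrations’ are all invented for illustration, and the method is a bare-bones stand-in for real IRL algorithms rather than any particular published approach.

```python
# A toy sketch of reward inference (not a full IRL algorithm): given a
# demonstrator's observed choices among options with known features, fit
# linear reward weights w under a Boltzmann-rational choice model,
#   P(choose option i) proportional to exp(w . features_i).
# Everything below (features, demonstrations, "true" weights) is invented.
import numpy as np

rng = np.random.default_rng(0)

n_options, n_features, n_demos = 5, 3, 500
features = rng.normal(size=(n_options, n_features))   # feature vector of each option
true_w = np.array([1.5, -0.5, 0.8])                   # hidden "true" reward weights

def choice_probs(w):
    """Boltzmann-rational choice probabilities over the options."""
    logits = features @ w
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Simulate demonstrations from the (hidden) Boltzmann-rational demonstrator.
demos = rng.choice(n_options, size=n_demos, p=choice_probs(true_w))
counts = np.bincount(demos, minlength=n_options)      # how often each option was chosen

# Fit w by gradient ascent on the log-likelihood of the observed choices.
w = np.zeros(n_features)
learning_rate = 0.1
for _ in range(2000):
    p = choice_probs(w)
    # Gradient of the average log-likelihood: observed minus expected features.
    grad = (counts @ features - n_demos * (p @ features)) / n_demos
    w += learning_rate * grad

print("true weights:     ", np.round(true_w, 2))
print("recovered weights:", np.round(w, 2))
```

The observed-minus-expected feature structure in the gradient is the same idea that underlies maximum-entropy IRL; the harder open problems concern demonstrators who are systematically irrational and the resulting ambiguity between their rewards and their biases, as discussed in the impossibility result linked above.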
Resources you could explore for further inspiration include:
- Abstract open problems in AI alignment, v.0.1 — for mathematicians, logicians, and computer scientists with a taste for theory-building – Andrew Critch
- What are the coolest topics in AI safety, to a hopelessly pure mathematician? – EA Forum
Philosophy and ethics
‘Like most nascent fields, AI safety’s concepts are still nebulous, imprecise, and ill-defined. Clarifying this conceptual territory is a task that philosophers are particularly fit to handle.’ – The Center for AI Safety (CAIS)
Previous philosophical work that has contributed to the field of AI safety includes Nick Bostrom’s research on motivation and intelligence in artificial agents, Peter Railton’s research on moral learning in humans and how this might inform AI alignment, and work on how to develop truthful AI systems. CAIS suggests philosophers could explore questions such as how to incorporate ‘moral uncertainty into an AI’s decision-making in practice’ and how to avoid ‘agents tasked with pursuing a broad set of goals…[developing]…power-seeking tendencies.’
Richard Ngo’s post Technical AI Safety Research outside of AI suggests philosophers could explore ‘various questions in decision theory, logical uncertainty and game theory relevant to agent foundations’ (see this research agenda for more information).
See the posts and research agendas below for other ideas.
- ‘Problems in AI alignment that philosophers could potentially contribute to’ and the comment discussion
- Synthesising a human’s preferences into a utility function – Stuart Armstrong
Psychology and cognitive sciences
Possible questions and sources of further ideas include:
- “How closely linked is the human motivational system to our intellectual capabilities – to what extent does the orthogonality thesis apply to human-like brains? What can we learn from the range of variation in human motivational systems (e.g. induced by brain disorders)?” (Richard Ngo’s post Technical AI Safety Research outside of AI)
- “Can we transfer insights from human psychology to the cognition and behavior of AI? Can our understanding of human cognition help to interpret AI systems and ensure that they are safe?” (Psychology for Effectively Improving the Future)
The paper AI Safety Needs Social Scientists notes that aligning AI systems may require training them on human preferences, yet people cannot always accurately report what they actually prefer. Psychology research could therefore help measure, and narrow, the gap between people’s reported and real preferences. More specifically, psychologists could work on improving the proposed AI alignment technique of safety via debate; see the research paper to learn more. See also this post, which explores ways the idea that ‘AI safety needs social scientists’ might be misinterpreted, and see both the post and its comment discussion for ideas on which research skills are needed for work on AI safety.
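For readers unfamiliar with the debate proposal, the sketch below shows only its bare structure: two debaters take turns adding short arguments about a question to a shared transcript, and a judge picks a winner after reading it. The debater and judge functions here are invented placeholders; in the actual proposal the debaters are models trained by self-play and the judge is a human.

```python
# A bare-bones sketch of the structure of the 'AI safety via debate' game.
# The debaters and judge are invented placeholders standing in for trained
# models and a human judge; only the shape of the protocol is illustrated.
import random
from dataclasses import dataclass, field

@dataclass
class Debate:
    question: str
    transcript: list[tuple[str, str]] = field(default_factory=list)

def debater_a(debate: Debate) -> str:
    # Placeholder policy: in the proposal this would be a model trained by
    # self-play to make the argument most persuasive to the judge.
    return f"Argument {len(debate.transcript) + 1} in favour of answer A."

def debater_b(debate: Debate) -> str:
    return f"Argument {len(debate.transcript) + 1} in favour of answer B."

def judge(debate: Debate) -> str:
    # Placeholder judge: in the proposal this is a human who reads the full
    # transcript and decides which debater argued for the better answer.
    return random.choice(["A", "B"])

def run_debate(question: str, n_rounds: int = 3) -> str:
    debate = Debate(question)
    for _ in range(n_rounds):
        debate.transcript.append(("A", debater_a(debate)))
        debate.transcript.append(("B", debater_b(debate)))
    for speaker, argument in debate.transcript:
        print(f"{speaker}: {argument}")
    return judge(debate)

print("Judge declares winner:", run_debate("Is this plan safe to execute?"))
```

The research questions for psychologists and social scientists concern the judge step: under what conditions do human judges reliably favour the truthful debater, and how should the protocol be adjusted when they don’t?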
The post “Brain enthusiasts” in AI Safety explores how a background in neuroscience could help researchers contribute to AI safety research, and suggests research projects that could be valuable to pursue using these skills, including deciphering human values and applying data analysis techniques from computational neuroscience to the interpretation of artificial neural networks. See all the suggested research topics here.
See the links below for other research questions and explorations of how a background in psychology or cognitive science could be relevant to AI safety research:
- Cognitive Science/Psychology As a Neglected Approach to AI Safety – Kaj Sotala
- Integrative Biological Simulation, Neuropsychology, and AI Safety, Sarma, G., et al.
- The case for becoming a black box investigator of language models – Buck Shlegeris
- Intro to Brain-Like AGI Safety – Steve Byrnes
Further resources
Learn more about the importance of this research direction:
- The Case for Taking AI Seriously as a Threat to Humanity – Vox
- AI Safety from First Principles – Richard Ngo
- Why AI alignment could be hard with modern deep learning – Ajeya Cotra
- The alignment problem from a deep learning perspective – Richard Ngo, Lawrence Chan & Sören Mindermann
- AI could defeat all of us combined – Holden Karnofsky gives an argument for why AI ‘only’ as intelligent as humans could pose an existential risk.
- Preventing an AI-related catastrophe – 80,000 Hours
- What could an AI-caused catastrophe actually look like? – 80,000 Hours
Online courses
- AGI Safety Fundamentals online curriculum on technical AI alignment – Richard Ngo
- ML Safety Course – Dan Hendrycks at the Center for AI Safety
Get 1:1 advice on working on this research direction
- Apply for our coaching and we can connect you with researchers already working in this space, who can help you refine your research ideas.
- Apply for 80,000 Hours coaching
- Apply for AI Safety Support career coaching.
Research fellowships, internships and programmes
If you’re interested in a programme that isn’t currently accepting applications, you can sign up for our newsletter to hear when it opens.
Summer research schools
- The CERI AI Fundamentals programme (technical track) is aimed at helping participants with a maths, CS or other mathematical science background gain an introduction to AI alignment research.
- The CHERI summer research program is for students who want to work on the mitigation of global catastrophic risks.
- The SERI Machine Learning Alignment Theory Scholars Program offers an introduction to the field of AI alignment and networking opportunities.
- The Center on Long-Term Risk’s summer fellowship is for researchers who want to work on research questions relevant to reducing suffering in the long-term future.
- The PIBBSS summer research fellowship is for researchers studying complex and intelligent behaviour in natural and social systems, who want to apply their expertise to AI alignment and governance.
- The Human-aligned AI Summer School (EA Prague) is a series of discussions, workshops and talks aimed at current and aspiring researchers working in ML/AI and other disciplines who want to apply their expertise to AI alignment.
Other fellowships, internships and programmes
- AGI Safety Fundamentals is held at the University of Cambridge and virtually, and is most useful to those with technical backgrounds who are interested in working on beneficial AI.
- The CAIS philosophy fellowship is a research fellowship aimed at clarifying risks from advanced AI systems, for philosophy PhD students or graduates.
- The CHAI (Center for Human-Compatible Artificial Intelligence) research fellowship is for researchers who have or are about to obtain a PhD in computer science, statistics, mathematics or theoretical economics.
- The CHAI internship, during which aspiring researchers work on a project under the supervision of a mentor.
- AI Safety Camp, which connects participants with a mentor with whom they collaborate on open AI alignment problems during intensive co-working sprints.
- AI Risk for Computer Scientists, a four-day series of workshops run by MIRI.
- The OpenAI Residency, a pathway to a full-time role at OpenAI for researchers and engineers who don’t currently focus on artificial intelligence.
- Refine, an incubator to help independent researchers build original research agendas related to AI safety.
Read advice on working on this research direction
Lists of resources for getting started
- Resources I send to AI researchers about AI safety – Vael Gates
- AI safety starter pack – Marius Hobbhahn
- Victoria Krakovna’s list of AI alignment resources
AI alignment career advice
- How to pursue a career in technical AI alignment – EA Forum
- FAQ: Advice for AI alignment researchers – Rohin Shah
- Career review of an ML PhD – 80,000 Hours
- Beneficial AI Research Career Advice – Adam Gleave
- From math grad school to AI alignment – Andrew Critch
- General advice for transitioning into Theoretical AI Safety – EA Forum
Getting into grad school
- Applying for Grad School: Q&A Panel from AI Safety Support
- Getting into CS grad school in the USA – Mark Corner
Doing independent research
- Alignment research exercises – Richard Ngo
- How to get into independent research on alignment/agency
- Getting started independently in AI Safety – EA Forum
- Study Guide – LessWrong
Other advice
- How I think students should orient to AI safety – Buck Shlegeris
- 7 traps that (we think) new alignment researchers often fall into – Akash
Interviews with AI safety researchers from 80,000 Hours
- ML engineers Catherine Olsson and Daniel Ziegler on fast paths to becoming a machine learning alignment researcher.
- Dario Amodei on how to become an AI researcher.
- Miles Brundage on how to become an AI strategist.
- Jan Leike on how to become a machine learning alignment researcher.
Find supervisors, courses and funding
- Apply for our database of potential supervisors if you’re looking for formal supervision and take a look at our advice on finding a great supervisor for further ideas.
- Our funding database can help you find potential sources of funding if you’re a PhD student interested in this research direction.
- If you’re considering a PhD, as well as looking at the academic research groups we list above, see the computer science PhD programs listed here.
- Join the Future of Life Institute’s AI Existential Safety Community to apply for mini-grants, connect with other researchers and hear about conferences and other events.
- This AI Safety reading group meets fortnightly online.
- You can also apply to join our community if you’re interested in meeting other students working on this research direction.
- The AI Safety Accountability Programme is a Slack group for people who are interested in working on AI safety in the future and want to stay motivated while pursuing their goals.
Sign up for our Effective Thesis newsletter to hear about opportunities such as funding, internships and research roles.
Other newsletters that are useful for keeping up with advancements in AI are:
- Alignment Newsletter – Rohin Shah
- ChinAI newsletter – Jeff Ding
- AI Safety Support Newsletter
- ML Safety Newsletter – Dan Hendrycks
- Import AI – Jack Clark
Contributors
This profile was last updated 24/01/2023. Thanks to Tomáš Gavenčiak for originally writing this profile. Thanks to Jan Kirchner, Neel Nanda, Rohin Shah, Martin Soto and Dan Hendrycks for helpful feedback on parts of this profile. All mistakes remain our own. Learn more about how we create our profiles.
Where next?
Keep exploring our other services and content
Apply for coaching
Want to work on this research direction? Apply for coaching to receive personalised guidance.
Mechanistic Interpretability
Mechanistic interpretability is a subfield of AI interpretability that could let us interpret what goals a neural network has learned. Read our profile to learn more about working on this subfield.
Learn more about what it's like to work on technical AI safety research
In this interview PhD student Stephen Casper discusses topics such as his view of the AI safety research landscape, the advice he'd give his past self during his undergraduate thesis and the most valuable research skills he thinks an early-career researcher can develop.
Existential and global catastrophic risks
Learn about doing research focused on protecting humanity from large-scale catastrophes.