Research | Constellation Research Center

Publications

Constellation aims to support important research into AI safety. Here are some recent examples of work that was at least partly enabled by Constellation.

View all publications

Evaluating AI capabilities

Information security for safety-critical AI systems

Creating model organisms of misaligned AI

Modeling an explosion in AI capabilities

Identifying concrete paths to AI takeover

Building an AI observatory

Forecasting AI capabilities and impacts

Policies within AI labs

Domestic policy, legal issues, and international governance for safety-critical AI systems

Developing technology for AI governance

Broader communication about AI safety

Evaluating techniques for controlling AI

Relating to AI systems as moral patients or trading partners

Understanding cooperation between AI systems

AI-assisted reasoning

Reliably eliciting AI capabilities

Making AI systems adversarially robust

Supervising safety-critical AI systems

Constellation research focus areas

Below is a list of areas of work that Constellation is excited to support through our Visiting Fellows program, Constellation Residency, and other programs.

This list is updated periodically as the AI landscape (and our understanding of it) evolves, and it includes many ideas from discussions with our collaborators in the broader Constellation network. It is in no particular order and is not exhaustive.

Evaluating AI capabilities

Few benchmarks exist for measuring general, economically important AI abilities (and many such benchmarks have already been saturated). Constellation seeks to support people developing these AI capability benchmarks (such as the GAIA or GPQA datasets) or evaluations (such as those produced by METR, Google DeepMind, and others, or work aimed at satisfying this task implementation bounty or this RFP). We are especially interested in evaluations related to automating AI research and development, cyberoffense, bioweapon development, and persuasion.

Information security for safety-critical AI systems

‍We believe that any AI system that – if it were broadly accessible – would have a substantial potential to cause catastrophic harm must be kept under extremely high security (and both the OpenAI Preparedness Framework and Anthropic Responsible Scaling Policy gesture at this). However, the technology and practices needed to achieve this might not currently exist. Constellation therefore seeks to support the following work:

Developing extremely high-security measures (e.g. designing a secure facility and the associated physical security practices needed to achieve the highest level of security described in this report)
Developing AI-specific security mitigations (e.g. investigating under what circumstances compression and upload limits might prevent model weight exfiltration)
Developing means by which an AI developer’s claims about their level of security can be verified
Prototyping and mitigating novel AI-mediated vulnerabilities (e.g. could data poisoning attacks be used to assume control over agents running inside of an otherwise-secure datacenter?)
Applying frontier AI in ways that might improve with model scale (or creating proposals for using future AI) to improve information security (e.g. using trusted AIs to monitor high-security systems, or developing AI tools for formal verification) or physical security

Creating model organisms of misaligned AI

There are currently few empirical demonstrations of behavior related to AI takeover, and those that exist are fairly limited. Constellation wants to support the development of well-grounded demonstrations – or “model organisms” – so that we can both study risk factors for such failure modes arising and have testbeds in which to more realistically test various techniques for mitigating these hazards. These model organisms should instantiate one of these principal concerns related to AI takeover:

Reward hacking (as described abstractly here)
Scheming (as described abstractly here – with examples of potential future empirical work in Section 6 – and instantiated here).

‍

Modeling an explosion in AI capabilities

‍There are reasons to think that – via the application of AI technology to AI research and development – there may be a period of extremely rapid increase in AI capabilities, perhaps progressing from a point where most AI research and development is not automated to nearly complete automation within only a year or two (see this report for a description of this reasoning). We want to understand how likely various trajectories of AI capability development are, using diverse sources of evidence and theoretical modeling, and investigating questions such as the following:

What are the most plausible bottlenecks to an AI capabilities explosion (such as training run time or hardware manufacturing capacity)?
What are the most likely ways that AI developers might address those bottlenecks (such as research into post-training enhancements, inference efficiency, or low-level optimizations)?
What are potential early indicators of an AI capabilities explosion?
How far might capabilities improve?
What are the most appropriate ways to operationalize key terms in these questions?

Identifying concrete paths to AI takeover

‍We are interested in supporting work on threat modeling, including

Identifying and describing (with detailed discussion of considerations and quantitative analysis) concrete pathways to the disempowerment of humanity by AI systems,
Robust arguments for why some superficially plausible pathways are in fact very unlikely, and
In-depth analyses of the connection between potential catastrophic outcomes from AI and the AI capabilities that are most likely to increase the risk of those outcomes (such as capabilities related to hazardous autonomy, cyberoffense, or bioweapons development)

Building an AI observatory

We expect increasingly advanced AI to be increasingly widely deployed and have increasingly large impacts.

We are looking for people to capture the state of how AI is currently being used, including identifying and understanding the following:
- The applications in which AI is currently most commercially valuable
- Which resources and processes are currently under the control of AI systems
- Incidents of AI systems causing concrete harms
We are interested in other data collection and trend analysis related to transformative AI (along the lines of the work done by Epoch).

Forecasting AI capabilities and impacts

Experts disagree wildly about when various important AI capabilities might be developed and what the consequences of those developments are likely to be. Constellation is interested in supporting work to operationalize interesting claims (potentially using evaluations or benchmarks) and elicit expert predictions about AI capabilities or impacts, especially AI capabilities relevant to improving research and development of frontier AI systems, cyberoffense, bioweapons development, or persuasion.

Policies within AI labs

AI labs can institute internal organizational policies to manage risks from the AI systems they create. We are interested in supporting independent work on designing and evaluating such policies that are relevant to safety-critical AI systems. For example:

Understanding how AI labs training frontier models might best manage catastrophic risks (such as creating frameworks for corporate policies)
Describing how frontier AI labs might make safety cases (for example, see this report)
Identifying practices used by high-reliability organizations (such as incident reporting or having a chief risk officer role) that might be productively applied to AI labs that are creating potentially catastrophically dangerous AI systems

Domestic policy, legal issues, and international governance for safety-critical AI systems

We are especially interested in the following types of work:

Exploring legislative and regulatory means for reducing extreme risks from AI
Exploring relevant legal questions (e.g. regarding liability, model and developer licensing, antitrust considerations, windfall clauses, and national security law)
Understanding desiderata for international treaties or other forms of coordination between states (such as this work)
Identifying interventions that would help to avoid a state of permanent authoritarianism that was enabled by transformatively powerful AI technology
Exploring crucial new governance issues that might arise during the transition to a post-transformative-AI world

Developing technology for AI governance

We are interested in supporting the development of technical mechanisms to make AI governance measures more effective or less costly. For example:

Monitoring and verification methods, including on-chip mechanisms (e.g. this paper and this report)
Auditing methods that preserve privacy and intellectual property (e.g. ambitious extensions to the work outlined here)

Broader communication about AI safety

Constellation is interested in supporting people who are working to make important concepts, issues, and questions related to transformative AI more accessible and clear. This could include people working on podcasts, documentaries, books, articles, online courses, and more – the main criteria are that the work is high-quality, well-researched, and on-topic (that is, relevant to safely navigating the development of transformative AI).

Evaluating techniques for controlling AI

As described in this post by Redwood Research, we may want safety measures sufficient to prevent catastrophe even if the AI systems in use are intentionally attempting to subvert safety measures (and we at least want to know if some proposed safety measures would be insufficient in such a situation). Work that we are interested in supporting includes:

Extensions to the work in this paper, such as developing more realistic settings, better techniques, and different testbed regimes (for example, not having access to high-quality trusted labor, or having many more episodes per scenario)
Steganography evaluations
Evaluations of the ability of AIs to coordinate with each other
Evaluations of whether AIs are able to infer details about the security measures that are in place to constrain their actions, of whether they’re able to generate promising plans for subverting these security measures, and – for both of these capabilities – evaluating how much serial computation (e.g. the number of forward passes) the AIs require to succeed.

Relating to AI systems as moral patients or trading partners

Constellation is interested in supporting people investigating the following questions and related topics:

How would we know which AI systems – if any – are moral patients?
How might we actually evaluate this?
If we think certain AIs have a reasonable chance of being moral patients, how might we learn about their preferences?
Under what conditions should we compensate AI systems (perhaps holding this payment in escrow until we have adequate safety systems in place)?
How would we make such offers credible?
How much might such compensation change the likelihood of overall good outcomes?

Understanding cooperation between AI systems

If we use AI systems to monitor other AI systems, it will be easier to make this monitoring reliable if the relevant AIs are unable or disinclined to coordinate with each other. In other circumstances, we might want AI systems to be able to cooperate well with each other, such as when helping humans negotiate with each other (or in some exotic circumstances that might be important).

AI-assisted reasoning

Constellation is interested in developing AI tools for reasoning, including:

For investigating difficult-to-evaluate questions, such as making progress in philosophy or on controversial topics
For forecasting (which may soon be at human level), including AI for automatic question generation and resolution
For improving negotiations
For efficiently simulating commentary from different perspectives (eg “Here is what [Person X] might say about your proposal: …”)
For defending against adversarial persuasion attempts

Reliably eliciting AI capabilities

The integrity of many AI capability evaluations can depend upon the evaluators being able to elicit those capabilities when they are present, such that there are no potentially dangerous capabilities hidden with backdoors accessible only by, say, the AI developers or the AIs themselves. Constellation therefore seeks to support developing demonstrations of, and remedies for, sandbagging and elicitation problems that could corrupt evaluations, such as exploration hacking and gradient hacking.

Making AI systems adversarially robust

If we could remove certain particularly bad AI behaviors, this may substantially increase our ability to rely on AI systems and reduce risks from scheming AIs. Constellation wants to support work in this area, including work on

Cheaply identifying when AI models might not be robust (for example, studying how susceptibility to certain types of adversarial attacks changes with model scale),
Potentially scalable methods for preventing jailbreaks,
How to remove certain knowledge or behaviors (see for example this or this), and
Relaxed adversarial evaluation and training (such as latent adversarial training).

‍

Supervising safety-critical AI systems

Some AI systems might operate in regimes where they take actions that are not easy for humans to understand and which are therefore difficult to supervise. We want to support work on developing mitigations to this type of problem, including

Scalable oversight (such as AI debate)
Generalization (such as this or this – and see also this)
Model internals (such as off-policy probe methods or as applied to downstream tasks)
Empirical techniques for eliciting latent knowledge (especially in the case of measurement tampering)