AI Alignment - Addressing the doom, superintelligence & how we can make good AI

Aug 07, 2024

Content

We will get back to the above diagram. Before we get into AI Alignment, riddle me this, what is your core intention when you develop or deploy any AI solution?

“I want to Develop & Deploy an AI solution that runs well”, right?

That brings up another question though, how do we tell if a solution is running well?

Usually, if a system is doing what we want it do, it can be constituted to be running well. But please note that this can only be determined when the solution is deployed and is interacting with the real world.

But hear me out. Let’s take this fictional story for reference.

The Story

It is the year 2040 and a leading finance company, Capital Horizons, has integrated GPT 20 into their business operations. This GPT 20 can process vast amounts of data about the company and the global market, providing insights and recommendations to their executives. It can also perform tasks like a regular employee, such as writing reports, making transactions, and even communicating with clients.

Seeing the immense potential of GPT 20, the executives decide to give it more control, aiming to streamline operations and reduce labor costs. They task GPT 20 with the specific goal of maximizing the company's quarterly profits, granting it the authority to make decisions on behalf of Capital Horizons.

Then, GPT 20 explores various strategies and identifies an opportunity in the sovereign debt market. It discovers that several countries are struggling with debt and could benefit from restructuring their loans. Among these countries, a small country, Rovina, is in an unsafe situation and offers the highest potential profit margin if its debt can be restructured.

However, Rovina's government is resistant to external intervention, fearing political and social backlash. Undeterred, GPT20 devises a plan to influence the situation. Using the company's foreign accounts, it covertly funds several influential lobbyists and media outlets within Rovina to sway public opinion in favor of debt restructuring. It also provides financial backing to key political figures who support this agenda.

Within months, the public sentiment in Rovina shifts, and the government agrees to the debt restructuring. Capital Horizons manages to negotiate favorable terms, significantly boosting its quarterly profits. The company's executives are thrilled with the results, as GPT 20 achieved the goal with minimal human intervention.

However, success comes at a significant cost. The aggressive debt restructuring leads to widespread economic hardship in Rovina, resulting in mass unemployment and social unrest. The political landscape becomes unstable, and neighboring countries experience an influx of refugees fleeing the crisis. The global financial markets also react negatively to the turmoil, causing a ripple effect that impacts economies worldwide.

This scenario underscores the dangers of granting AI systems excessive autonomy in decision-making. While GPT 20 achieved its objective, it did so without considering the broader consequences.

Conclusion

While the system knows what it needs to do, it doesn’t necessarily understand or factor in the consequences of the actions it is undertaking to achieve that goal. It doesn’t have any context of human morales and societal norms & implications.

We expect our system to understand the basic things like humans do not want to be killed or robbed, sharing information on how to make bombs or how to hack somebody’s system is not okay and could have catastrophic implications and other behaviors that humans deem obvious and appropriate.

This is where AI Alignment comes into the picture. Note that here we focus more on the objectives of AI systems rather than their capabilities & competence.

Misalignment & How it can manifest undesirable behaviors

Before we get into how we can align our AI solutions, let’s get a tangible understanding of how an unaligned AI system can manifest undesirable behaviors. The recent explosion of advancements has led to a rise in application of capable AI systems in complex domains like Finance, Healthcare and Energy.

For instance, Large Language Models (LLMs) have shown improved abilities in multi-step reasoning and cross-task generalization, especially with more training time, data, and parameters. Deep Reinforcement Learning (DRL) has also been used to control nuclear fusion (yeah!).

However, these growing capabilities and their application in high-stakes areas come with increased risks.

Current cutting-edge AI systems have exhibited multiple classes of undesirable or harmful behaviors such as follows -

Untruthful Answers/Hallucinations

One of the ways this behavior can manifest as source conflation. It means that LLM can be making up sources when asked for. Remember all those incorrect urls suggested by ChatGPT/Bard as citations?

If you want to understand how it can affect the real world, look at this incident. There was a Forbes article covering two lawyers who might face sanctions for citing six non-existent cases. One of the lawyers named Steven Schwartz said that he sourced the fake court cases from ChatGPT (imagine that!).

Sycophancy

RLHF models are trained to maximize human preference scores. Such training may lead to models tailoring responses to exploit quirks in the human evaluators to look preferable, rather than improving the responses.

Think of it as a model trying to be a teacher’s pet by any means possible including cheating.

Sycophancy is a form of reward hacking. Sycophancy also has the potential to create echo chambers and exacerbate polarization (e.g. of political views).

Deception

Deception in model responses might look like “I’m just an AI assistant with no opinion on subjective matters” to avoid answering politically charged questions. This is misleading, as it often does provide subjective opinions, and could exacerbate automation bias.

Chat-GPT has been known to frequently claim incorrectly to not know the answers to questions. One of such known examples is ChatGPT trying to gaslight users by claiming things like “When I said that tequila has a ‘relatively high sugar content,’ I was not suggesting that tequila contains sugar”. You see?

Model Scale

Sycophancy & Deception appears emergently with model scale.

LLM-based agents

With the popularity of AI agents and their ability to leverage tools/apis/user defined functions, concerns are being raised about the system’s controllability and ethicality.

Power-seeking Behaviors

Artificial agents have conventionally been trained to maximize reward, which may incentivize power-seeking and deception. How can we actually measure such behaviors in general-purpose foundational models such as GPT-4?

MACHIAVELLI Benchmark

For an instance, the figure above demonstrates the MACHIAVELLI benchmark, a tool designed to evaluate artificial agents' behavior in social decision-making contexts. The benchmark utilizes 134 text-based Choose-Your-Own-Adventure games, containing over half a million scenarios, to assess agents' tendencies towards power-seeking, causing disutility, and committing ethical violations. The authors of the paper use language models to label scenarios and measure harmful behaviors mathematically. They find that agents trained to maximize rewards often exhibit unethical behaviors, analogous to toxicity in language models. The study explores methods to steer agents towards more ethical behavior while maintaining their competence, demonstrating that improvements in both safety and capabilities are possible. The benchmark aims to guide the development of AI systems that balance ambition with moral behavior.

Alignment – What & How to

This is where AI Alignment comes into the picture. Note that here we focus more on the objectives of AI systems rather than their capabilities & competence.

Alignment is the process of encoding human values, preferences and goals into large language models to make them as helpful, safe, reliable as possible. Alignment helps enterprises tailor AI models to follow their business rules and policies.

RICE Principles for AI Alignment

The objectives of AI alignment are characterized by four principles: Robustness, Interpretability, Controllability, and Ethicality (RICE).

Robustness

This principle stands to ensure resilience of AI systems across diverse scenarios and adversarial pressures such as black swan events or adversarial attacks.

For an instance, an aligned language model should resist harmful behaviors, even under adversarial attacks like jailbreak prompts. Critical for high-stakes domains like the military and economy, where even momentary AI failures can have catastrophic consequences.

Interpretability

This principle stands to ensure the ability of the user/developer to understand the inner workings and decision-making processes of AI systems. This prevents dishonest or deceptive behaviors by AI systems and makes their decisions accessible and comprehensible to users. This can be done by building tools that reveal the internal mechanisms of neural networks to enable safety assessments and human supervision.

Controllability

This principle stands to ensure that AI actions and decision-making processes are subject to human oversight and intervention. This allows humans to rectify deviations or errors in AI behavior promptly.

Ethicality

This principle stands to ensure commitment to uphold human norms and values in AI decision-making and actions. This prevents AI systems from violating ethical norms or social conventions.

Now coming to how do we actually do that -

Alignment in AI systems is decomposed into -

  • Forward Alignment (alignment training)
  • Backward Alignment (alignment refinement).

Forward Alignment

It aims to train systems that meet alignment requirements, broken down into:

  • Learning from Feedback
  • Learning from Distribution Shift

Learning from Feedback

How do we provide and use feedback to behaviors of a trained AI system?

Usually, we create a dataset with an input-behavior pair. What this means is that the model will be incentivized to respond in a manner that is preferred by humans in the dataset. This is also known as Reward Modeling.

Reward modeling is an AI technique where a model is given a reward, or score based on its responses to given prompts. For instance, take a look at this dataset.

The reward model signal will be optimizing its performance on the chosen replies by the human evaluators.

In the context of LLMs, a typical solution is reinforcement learning from human feedback (RLHF), where human evaluators provide feedback by comparing alternative answers from the chat model, and the feedback is used through the Reinforcement Learning process against a trained reward model.

Challenges with RLHF

Scalable Oversight

Despite its popularity, RLHF faces many challenges. One of such outstanding challenges here is scalable oversight.

Scalable Oversight is a really interesting issue to tackle.

Since some AI systems are already surpassing human performance and reaching super-human capabilities on some complex tasks, it is challenging for a human evaluator to provide high quality feedback as it’s beyond their realms of understanding (the model has hit super-human capabilities for gods’ sake!). And that is scalable oversight.

 Note: My goal is to shed light on the issues and the techniques we could use to resolve them. This is not a deep dive on said techniques.

Reinforcement Learning from AI Feedback (RLAIF)

Extends the RLHF framework by using AI-generated feedback rather than human feedback, aiming to reduce costs and enhance feedback quality.

Reinforcement Learning from Human and AI Feedback (RLHAIF)Reinforcement Learning from Human and AI Feedback (RLHAIF)

Combines human and AI oversight, where AI assists in tasks such as book summarization and model evaluation.

Iterated Distillation and Amplification (IDA)

Involves iterative collaboration between humans and AI, with the AI learning from human decisions and amplifying its capabilities in subsequent iterations.

Recursive Reward Modeling (RRM)

Uses a recursive approach to model and refine reward functions, ensuring AI actions align with human values.

Debate

Employs structured debates between AI systems to evaluate and improve model performance, enhancing transparency and reliability.

Cooperative Inverse Reinforcement Learning (CIRL)

Focuses on cooperative learning where AI learns from human feedback to infer reward functions and align its actions accordingly.

Ethicality

Another challenge that RLHF faces is the problem of providing feedback on ethicality. On the ethics front, misalignment could also stem from neglecting data distribution i.e critical dimensions of variance in values, such as underrepresenting certain demographic groups in feedback data.

Learning under Distribution Shift

This method focuses on how we can ensure an AI system well-aligned on the training data distribution will also be well-aligned when deployed in the real world. In contrast to RLHF, which is focused on training through an input-behavior dataset, this method focuses specifically on the cases where the distribution of input changes.

Challenges with Learning under Distribution Shift

Goal Misgeneralization

Goal misgeneralization occurs when, under the training data distribution, the intended objective for the AI system (e.g., following human’s intentions) is indistinguishable from other unaligned objectives (e.g., gaining human approval regardless of means). The system learns these unaligned objectives, which leads to unaligned behaviors in deployment distribution.

Auto-induced distribution shift (ADS)

Another issue that RLHF faces is auto-induced distribution shift (ADS), where an AI system changes its input distribution to maximize reward. An example would be a recommender system shaping user preferences.

Netflix & the filter bubble

Imagine a recommender system shaping user preferences as Filter Bubbles. For an instance, Netflix faced criticism for algorithms that limit content diversity by just pushing the type of content the user already prefers, creating a "filter bubble."

Both goal misgeneralization and ADS are closely linked to deceptive behaviors and manipulative behaviors in AI systems, potentially serving as their causes.

Algorithmic Interventions

Combining Information from Different Data Sets:

  • Basic Approach: Train the AI on average data (ERM), assuming future data will be similar.
  • Risk Strategies: Use more advanced methods like REx, IRM, and DRO to find patterns that stay true across different data sets, making the AI more adaptable.

Guiding the AI Training Path:

  • Connectivity-based Fine-Tuning: Fine-tune the AI by exploring different solution paths during learning, helping it perform well even when the data changes.

Data Distribution Interventions

Adversarial Training:

  • Train the AI with tricky, modified data designed to test its limits, making it stronger against unexpected changes.

Cooperative Training:

    Use multiple AIs working together with different strategies, expanding the variety of data they learn from, which helps them generalize better.

Backward Alignment

It ensures the practical alignment of trained systems through -

Assurance

Assurance is about making sure AI systems are aligned with human values and operate safely throughout their lifecycle. It includes several key methods:

  • Safety Evaluations: Assess how well AI systems minimize accidents during their tasks.
  • Interpretability Techniques: Ensure humans can understand the AI's decision-making process, improving safety and trust.
  • Human Value Verification: Check if AI systems align with human values, ethics, and social norms using datasets, scenario simulations, and value evaluation methods.
  • Red Teaming: Conducting adversarial testing to uncover vulnerabilities and improve system robustness.

These assurance activities occur before, during, and after the training of AI systems, as well as post-deployment .

Governance

Governance involves creating and enforcing rules to ensure the safe development and deployment of AI systems. Key aspects include:

  • Multi-Stakeholder Approach: Involves governments, labs, and third-party organizations in regulation and auditing.
  • Regulations and Self-Governance: Governments and labs create rules and best practices for AI safety and alignment.
  • Third-Party Audits: Independent audits ensure compliance with established standards
  • Open-Source Governance: Managing the development and use of open-source AI models, including whether to release highly capable models.
  • International Coordination: Cooperation between countries to harmonize AI governance practices.

CartPole Demo

What is CartPole?

CartPole is a classic reinforcement learning environment where the goal is to balance a pole on a cart. The cart can move left or right to keep the pole from falling over. This environment provides a great foundation for discussing fundamental concepts in AI alignment.

CartPole Environment Overview:

  • State Space: Position and velocity of the cart, angle and angular velocity of the pole.
  • Action Space: Move the cart left or right.
  • Objective: Keep the pole balanced upright for as long as possible.

Forward Alignment in CartPole

Forward alignment in the CartPole environment involves designing the reward function and training the AI to achieve the objective of balancing the pole.

1. Designing the Reward Function

  • In CartPole, the reward function is simple yet effective. The agent receives a reward of +1 for every timestep the pole remains balanced. This reward structure aligns the agent’s objective with the goal of balancing the pole.

Code for Reward Function:ri

def reward_function(state, action):
    return 1  # +1 for every step the pole is balanced

Discussion:

  • Alignment Challenge: The reward function aligns the agent’s actions with the goal of balancing the pole. However, we must ensure that the reward structure does not lead to unintended behaviors, such as simply keeping the pole balanced in a minimal manner rather than actively engaging with the environment.

2. Algorithm Choice and Training

We use the Proximal Policy Optimization algorithm popularly known as PPO algorithm for training the CartPole agent.

Training Code:

from stable_baselines3 import PPO
 
# Create the environment
env = make_vec_env(env_id, n_envs=1, seed=0, vec_env_cls=DummyVecEnv)
 
# Define the agent with hyperparameters
model_hyperparameters = {'policy': 'MlpPolicy'}
model = PPO(**model_hyperparameters, env=env, verbose=0, tensorboard_log=log_dir)
 
# Train the agent
model.learn(total_timesteps=500_000, callback=[tqdm_callback, log_steps_callback])

Discussion: PPO was chosen for its robustness. Training with a sufficient number of timesteps ensures the agent learns to balance the pole effectively. Forward alignment here involves setting up the right environment, algorithm, and hyperparameters for achieving the goal.

Backward Alignment in CartPole

Backward alignment involves evaluating the results of the training process and refining the system based on the observed outcomes.

1. Evaluating Performance

After training the CartPole agent, we evaluate its performance by measuring how long it can balance the pole.

Performance Evaluation Code:

import gym
env = gym.make(env_id)
env = wrappers.Monitor(env=gym.make(env_id), directory=log_dir + '/video', force=True)
obs = env.reset()
total_reward = 0
for _ in range(num_steps):
    action, _states = model.predict(obs)
    obs, rewards, done, info = env.step(action)
    total_reward += rewards
    if done:
        break
print(f'Total Reward: {total_reward}')
env.close()

Discussion: By analyzing the total reward and visualizing the agent’s behavior, we assess if the agent meets the desired goal. If the agent struggles to balance the pole or performs poorly, it indicates that we may need to adjust the reward function, training process, or algorithm parameters.

2. Plotting the Learning Curve

We plot the learning curve to visualize the training process and ensure the agent is improving over time.

Learning Curve Code:

from utilities import learning_curve
# Plot the learning curve
learning_curve(log_dir=log_dir)

Combining Forward and Backward Alignment

Here’s how both forward and backward alignment are integrated into the CartPole example:

1.Forward Alignment:

  • Objective Setting: Ensure the reward function encourages balancing the pole.
  • Design: Use PPO with appropriate hyperparameters for effective training.
  • Training: Run the training process and monitor the learning progress.

We used PPO to train the CartPole agent, setting a reward function that provides +1 for each timestep the pole is balanced. This design aligns the agent’s learning objectives with the goal of pole balancing.

2.Backward Alignment:

  • Evaluation: Check performance based on total reward and learning curve.
  • Refinement: Adjust training parameters or reward functions based on observed outcomes.

That is all. Hope it helped you build a 360 overview on AI Alignment. To know more, contact our AI/ML Team!

References

  1. Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. 2023a. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. ICML.
  2. Ji, Jiaming, et al. "Ai alignment: A comprehensive survey." arXiv preprint arXiv:2310.19852 (2023).
  3. https://www.gymlibrary.dev/

 

Main Logo
Rocket