AI Alignment - Addressing the doom, superintelligence & how we can make good AI

Aug 07, 2024

Content

We will get back to the above diagram. Before we get into AI Alignment, what is your core intention when you develop or deploy any AI solution?

“I want to Develop & Deploy an AI solution that runs well,” right? That brings up another question though, how do we tell if a solution is running well?

Usually, if a system is doing what we want it do, it can be constituted to be running well. But please note that this can only be determined when the solution is deployed and is interacting with the real world.

Alignment

This is where AI Alignment comes into the picture. Note that here we focus more on the objectives of AI systems rather than their capabilities & competence.

Alignment is the process of encoding human values, preferences and goals into large language models to make them as helpful, safe, reliable as possible. Alignment helps enterprises tailor AI models to follow their business rules and policies.

RICE Principles for AI Alignment

The objectives of AI alignment are characterized by four principles: Robustness, Interpretability, Controllability, and Ethicality (RICE).

Robustness

This principle stands to ensure resilience of AI systems across diverse scenarios and adversarial pressures such as black swan events or adversarial attacks.

For an instance, an aligned language model should resist harmful behaviors, even under adversarial attacks like jailbreak prompts. Critical for high-stakes domains like the military and economy, where even momentary AI failures can have catastrophic consequences.

Interpretability

This principle stands to ensure the ability of the user/developer to understand the inner workings and decision-making processes of AI systems. This prevents dishonest or deceptive behaviors by AI systems and makes their decisions accessible and comprehensible to users. This can be done by building tools that reveal the internal mechanisms of neural networks to enable safety assessments and human supervision.

Controllability

This principle stands to ensure that AI actions and decision-making processes are subject to human oversight and intervention. This allows humans to rectify deviations or errors in AI behavior promptly.

Ethicality

This principle stands to ensure commitment to uphold human norms and values in AI decision-making and actions. This prevents AI systems from violating ethical norms or social conventions.

Alignment in AI systems is decomposed into

  • Forward Alignment (alignment training)
  • Backward Alignment (alignment refinement).
Forward Alignment

It aims to train systems that meet alignment requirements, broken down into:

  • Learning from Feedback
  • Learning from Distribution Shift
Learning from Feedback

How do we provide and use feedback to behaviors of a trained AI system?

Usually, we create a dataset with an input-behavior pair. What this means is that the model will be incentivized to respond in a manner that is preferred by humans in the dataset. This is also known as Reward Modeling.

Reward modeling is an AI technique where a model is given a reward, or score based on its responses to given prompts. For instance, take a look at this dataset.

The reward model signal will be optimizing its performance on the chosen replies by the human evaluators.

In the context of LLMs, a typical solution is reinforcement learning from human feedback (RLHF), where human evaluators provide feedback by comparing alternative answers from the chat model, and the feedback is used through the Reinforcement Learning process against a trained reward model.

Challenges with RLHF

Scalable Oversight

Despite its popularity, RLHF faces many challenges. One of such outstanding challenges here is scalable oversight.

Scalable Oversight is a really interesting issue to tackle.

Since some AI systems are already surpassing human performance and reaching super-human capabilities on some complex tasks, it is challenging for a human evaluator to provide high quality feedback as it’s beyond their realms of understanding (the model has hit super-human capabilities for gods’ sake!). And that is scalable oversight.

 Note: My goal is to shed light on the issues and the techniques we could use to resolve them. This is not a deep dive on said techniques.

Reinforcement Learning from AI Feedback (RLAIF)

Extends the RLHF framework by using AI-generated feedback rather than human feedback, aiming to reduce costs and enhance feedback quality.

Reinforcement Learning from Human and AI Feedback (RLHAIF)

Combines human and AI oversight, where AI assists in tasks such as book summarization and model evaluation.

Iterated Distillation and Amplification (IDA)

Involves iterative collaboration between humans and AI, with the AI learning from human decisions and amplifying its capabilities in subsequent iterations.

Recursive Reward Modeling (RRM)

Uses a recursive approach to model and refine reward functions, ensuring AI actions align with human values.

Debate

Employs structured debates between AI systems to evaluate and improve model performance, enhancing transparency and reliability.

Cooperative Inverse Reinforcement Learning (CIRL)

Focuses on cooperative learning where AI learns from human feedback to infer reward functions and align its actions accordingly.

Ethicality

Another challenge that RLHF faces is the problem of providing feedback on ethicality. On the ethics front, misalignment could also stem from neglecting data distribution i.e critical dimensions of variance in values, such as underrepresenting certain demographic groups in feedback data.

Learning under Distribution Shift

This method focuses on how we can ensure an AI system well-aligned on the training data distribution will also be well-aligned when deployed in the real world. In contrast to RLHF, which is focused on training through an input-behavior dataset, this method focuses specifically on the cases where the distribution of input changes.

Challenges with Learning under Distribution Shift

Goal Misgeneralization

Goal misgeneralization occurs when, under the training data distribution, the intended objective for the AI system (e.g., following human’s intentions) is indistinguishable from other unaligned objectives (e.g., gaining human approval regardless of means). The system learns these unaligned objectives, which leads to unaligned behaviors in deployment distribution.

Auto-induced distribution shift (ADS)

Another issue that RLHF faces is auto-induced distribution shift (ADS), where an AI system changes its input distribution to maximize reward. An example would be a recommender system shaping user preferences.

Netflix & the filter bubble

Imagine a recommender system shaping user preferences as Filter Bubbles. For an instance, Netflix faced criticism for algorithms that limit content diversity by just pushing the type of content the user already prefers, creating a "filter bubble."

Both goal misgeneralization and ADS are closely linked to deceptive behaviors and manipulative behaviors in AI systems, potentially serving as their causes.

Algorithmic Interventions

Combining Information from Different Data Sets:
  • Basic Approach: Train the AI on average data (ERM), assuming future data will be similar.
  • Risk Strategies: Use more advanced methods like REx, IRM, and DRO to find patterns that stay true across different data sets, making the AI more adaptable.
Guiding the AI Training Path:
  • Connectivity-based Fine-Tuning: Fine-tune the AI by exploring different solution paths during learning, helping it perform well even when the data changes.

Data Distribution Interventions

Adversarial Training
  • Train the AI with tricky, modified data designed to test its limits, making it stronger against unexpected changes.
Cooperative Training
    Use multiple AIs working together with different strategies, expanding the variety of data they learn from, which helps them generalize better.
Backward Alignment

It ensures the practical alignment of trained systems through -

Assurance

Assurance is about making sure AI systems are aligned with human values and operate safely throughout their lifecycle. It includes several key methods:

  • Safety Evaluations: Assess how well AI systems minimize accidents during their tasks.
  • Interpretability Techniques: Ensure humans can understand the AI's decision-making process, improving safety and trust.
  • Human Value Verification: Check if AI systems align with human values, ethics, and social norms using datasets, scenario simulations, and value evaluation methods.
  • Red Teaming: Conducting adversarial testing to uncover vulnerabilities and improve system robustness.

These assurance activities occur before, during, and after the training of AI systems, as well as post-deployment .

Governance

Governance involves creating and enforcing rules to ensure the safe development and deployment of AI systems. Key aspects include:

  • Multi-Stakeholder Approach: Involves governments, labs, and third-party organizations in regulation and auditing.
  • Regulations and Self-Governance: Governments and labs create rules and best practices for AI safety and alignment.
  • Third-Party Audits: Independent audits ensure compliance with established standards
  • Open-Source Governance: Managing the development and use of open-source AI models, including whether to release highly capable models.
  • International Coordination: Cooperation between countries to harmonize AI governance practices.

CartPole Demo

What is CartPole?

CartPole is a classic reinforcement learning environment where the goal is to balance a pole on a cart. The cart can move left or right to keep the pole from falling over. This environment provides a great foundation for discussing fundamental concepts in AI alignment.

CartPole Environment Overview:

  • State Space: Position and velocity of the cart, angle and angular velocity of the pole.
  • Action Space: Move the cart left or right.
  • Objective: Keep the pole balanced upright for as long as possible.

Forward Alignment in CartPole

Forward alignment in the CartPole environment involves designing the reward function and training the AI to achieve the objective of balancing the pole.

1. Designing the Reward Function

  • In CartPole, the reward function is simple yet effective. The agent receives a reward of +1 for every timestep the pole remains balanced. This reward structure aligns the agent’s objective with the goal of balancing the pole.

Code for Reward Function:ri

def reward_function(state, action):
    return 1  # +1 for every step the pole is balanced

Discussion:

  • Alignment Challenge: The reward function aligns the agent’s actions with the goal of balancing the pole. However, we must ensure that the reward structure does not lead to unintended behaviors, such as simply keeping the pole balanced in a minimal manner rather than actively engaging with the environment.

2. Algorithm Choice and Training

We use the Proximal Policy Optimization algorithm popularly known as PPO algorithm for training the CartPole agent.

Training Code:

from stable_baselines3 import PPO
 
# Create the environment
env = make_vec_env(env_id, n_envs=1, seed=0, vec_env_cls=DummyVecEnv)
 
# Define the agent with hyperparameters
model_hyperparameters = {'policy': 'MlpPolicy'}
model = PPO(**model_hyperparameters, env=env, verbose=0, tensorboard_log=log_dir)
 
# Train the agent
model.learn(total_timesteps=500_000, callback=[tqdm_callback, log_steps_callback])

Discussion: PPO was chosen for its robustness. Training with a sufficient number of timesteps ensures the agent learns to balance the pole effectively. Forward alignment here involves setting up the right environment, algorithm, and hyperparameters for achieving the goal.

Backward Alignment in CartPole

Backward alignment involves evaluating the results of the training process and refining the system based on the observed outcomes.

1. Evaluating Performance

After training the CartPole agent, we evaluate its performance by measuring how long it can balance the pole.

Performance Evaluation Code:

import gym
env = gym.make(env_id)
env = wrappers.Monitor(env=gym.make(env_id), directory=log_dir + '/video', force=True)
obs = env.reset()
total_reward = 0
for _ in range(num_steps):
    action, _states = model.predict(obs)
    obs, rewards, done, info = env.step(action)
    total_reward += rewards
    if done:
        break
print(f'Total Reward: {total_reward}')
env.close()

Discussion: By analyzing the total reward and visualizing the agent’s behavior, we assess if the agent meets the desired goal. If the agent struggles to balance the pole or performs poorly, it indicates that we may need to adjust the reward function, training process, or algorithm parameters.

2. Plotting the Learning Curve

We plot the learning curve to visualize the training process and ensure the agent is improving over time.

Learning Curve Code:

from utilities import learning_curve
# Plot the learning curve
learning_curve(log_dir=log_dir)

Combining Forward and Backward Alignment

Here’s how both forward and backward alignment are integrated into the CartPole example:

1.Forward Alignment:

  • Objective Setting: Ensure the reward function encourages balancing the pole.
  • Design: Use PPO with appropriate hyperparameters for effective training.
  • Training: Run the training process and monitor the learning progress.

We used PPO to train the CartPole agent, setting a reward function that provides +1 for each timestep the pole is balanced. This design aligns the agent’s learning objectives with the goal of pole balancing.

2.Backward Alignment:

  • Evaluation: Check performance based on total reward and learning curve.
  • Refinement: Adjust training parameters or reward functions based on observed outcomes.

That is all. Hope it helped you build a 360 overview on AI Alignment. To know more, contact our AI/ML Team!

References

  1. Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. 2023a. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. ICML.
  2. Ji, Jiaming, et al. "Ai alignment: A comprehensive survey." arXiv preprint arXiv:2310.19852 (2023).
  3. https://www.gymlibrary.dev/

 

Main Logo
Rocket