VIRAL

__init__(env_type, model_actor, model_critic, hf=False, vd=False, seed=None, training_time=25000, nb_vec_envs=1, legacy_training=True, options={}, proxies=None)

Initialize the VIRAL architecture for dynamic reward function generation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `env_type` | `EnvType` | Environment type for which the reward function is generated | *required* |
| `model_actor` | `str` | LLM model for reward function generation | *required* |
| `model_critic` | `str` | LLM model for reward function evaluation | *required* |
| `hf` | `bool` | Enable human feedback. Defaults to False. | `False` |
| `vd` | `bool` | Enable video description. Defaults to False. | `False` |
| `seed` | `int` | Random seed for training. Defaults to None. | `None` |
| `training_time` | `int` | Training time in seconds. Defaults to 25000. | `25000` |
| `nb_vec_envs` | `int` | Number of vectorized environments. Defaults to 1. | `1` |
| `legacy_training` | `bool` | Use legacy training. Defaults to True. | `True` |
| `options` | `dict` | LLM generation options. Defaults to {}. | `{}` |
| `proxies` | `dict` | Proxy configuration. Defaults to None. | `None` |
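
A minimal instantiation sketch. The import path, `EnvType` member, and model names below are assumptions for illustration, not values confirmed by this page:

```python
from viral import VIRAL, EnvType  # assumed import path

# All concrete values below are illustrative placeholders.
viral = VIRAL(
    env_type=EnvType.CARTPOLE,        # hypothetical EnvType member
    model_actor="qwen2.5-coder",      # LLM that writes reward functions
    model_critic="llama3.2-vision",   # LLM that evaluates them
    hf=True,                          # enable the human feedback loop
    vd=False,                         # disable video description
    seed=42,
    training_time=25_000,             # seconds of policy training
    nb_vec_envs=4,                    # vectorized environments
    options={"temperature": 0.7},     # forwarded LLM generation options
)
```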

critical_refine_reward(idx)

Refine a reward function that has critical performance issues. Based on the evaluation results, this method uses a Language Model (LLM) to generate a new reward function that addresses the identified issues.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `idx` | `int` | Index of the reward function in the memory to be refined. Typically the worst-performing function from the previous evaluation. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `int` | Index of the newly created refined reward function in the memory. |
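
A hedged usage sketch (the index below is hypothetical):

```python
# Assume a previous evaluation identified index 1 as the worst performer.
new_idx = viral.critical_refine_reward(idx=1)
print(f"Refined reward function stored at memory index {new_idx}")
```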

generate_context()

Generate a context for the reward function generation process. This method uses a Language Model (LLM) to build a context that includes information about the environment, the task, and the goal to be achieved.
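
For example, continuing the hypothetical instance above:

```python
# Build the environment/task/goal context used by later generation calls.
viral.generate_context()
```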

generate_reward_function(n_init=2, n_refine=1, focus='')

Generate and iteratively improve a reward function using a Language Model (LLM).

This method implements a multi-stage reward function generation process involving creation, evaluation, and refinement.

Key Stages
  1. Initial Function Generation: Create two initial reward function candidates
  2. Evaluation: Compare and identify the best and worst performing functions
  3. Iterative Refinement: Progressively improve the worst-performing function

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_init` | `int` | Number of initial reward function candidates to generate. Defaults to 2. | `2` |
| `n_refine` | `int` | Number of refinement iterations to perform. Defaults to 1. | `1` |
| `focus` | `str` | Optional focus for reward function generation. Defaults to ''. | `''` |

Returns:

| Type | Description |
| --- | --- |
| `list[State]` | A list of generated and refined reward function states, containing information about each function's performance and implementation. |

Process Overview
  • Generates two initial reward functions using an LLM
  • Evaluates these functions using a policy evaluation method
  • Selects the worst-performing function for refinement
  • Iteratively refines the function through self-refinement
  • Tracks the evolution of reward functions in the memory
Detailed Workflow
  1. Generate two initial reward functions
    • Uses a predefined prompt template
    • Applies configurable LLM generation options
    • Compiles and tests each generated function
  2. Evaluate the initial functions
    • Identifies best and worst performing functions
  3. Iterative Refinement
    • Applies self-refinement to the worst-performing function
    • Re-evaluates after each refinement
    • Repeats for specified number of iterations
Note
  • Uses dynamic LLM configuration options
  • Supports flexible environment types
  • Provides a systematic approach to reward function generation
  • Logging at various stages for debugging and tracking
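
An end-to-end sketch using the keyword names from the signature above; the attribute read from the returned `State` objects is an assumption for illustration:

```python
viral.generate_context()

# Two initial candidates, then three refinement passes on the worst one.
states = viral.generate_reward_function(n_init=2, n_refine=3)

for i, state in enumerate(states):
    # `performances` is an assumed attribute name, used only to illustrate
    # that each State records how its reward function trained.
    print(i, getattr(state, "performances", None))
```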

human_feedback(prompt, idx)

Request human feedback on a reward function to refine it further.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str` | The prompt to present to the human for feedback | *required* |
| `idx` | `int` | The index of the reward function in the memory | *required* |

Returns:

| Type | Description |
| --- | --- |
| `str` | The updated prompt with human feedback included |
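
A hedged sketch; the prompt text and index are placeholders:

```python
updated_prompt = viral.human_feedback(
    prompt="Here is the current reward function and its training results ...",
    idx=0,  # reward function to collect feedback on
)
# The returned prompt now embeds the human's remarks and can be fed back
# into the next refinement call.
```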

self_refine_reward(idx)

Iteratively improve a reward function using self-refinement techniques.

This method implements an intelligent self-refinement process for reward functions by leveraging a Language Model (LLM) to analyze and improve the current function based on its previous performance.

Key Objectives
  • Analyze current reward function performance
  • Generate an improved version of the reward function
  • Maintain the core task objectives while optimizing the reward signal

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `idx` | `int` | Index of the reward function in the memory to be refined. Typically the worst-performing function from the previous evaluation. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `int` | Index of the newly created refined reward function in the memory. |

Refinement Process
  1. Construct a refinement prompt with:
    • Current reward function code
    • Performance metrics
    • Explicit refinement goals
  2. Generate a new reward function using LLM
  3. Compile and validate the new function
  4. Append the new function to memory
  5. Return the index of the new function
Refinement Goals
  • Increase success rate of the policy
  • Optimize the reward signal for better learning
  • Preserve the original task objectives
  • Improve overall performance
Notes
  • Uses the existing memory to track function evolution
  • Leverages LLM for intelligent function refinement
  • Provides a systematic approach to reward function improvement
  • Maintains a history of function iterations
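
A hedged sketch of repeated self-refinement, chaining the returned index:

```python
idx = 0  # hypothetical starting index in memory
for _ in range(3):
    # Each call appends a refined function to memory and returns its index.
    idx = viral.self_refine_reward(idx)
```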

test_reward_func(reward_func)

Test a reward function using the policy trainer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `reward_func` | `str` | The reward function to test | *required* |
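
A hedged sketch. The reward function is passed as source code; its expected signature below is an assumption, not documented on this page:

```python
reward_src = '''
def reward_func(observations, terminated, truncated):
    """Hypothetical reward: small living bonus, large penalty on failure."""
    return -10.0 if terminated else 1.0
'''

viral.test_reward_func(reward_src)
```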

video_description(prompt, idx)

Request a video description from the user to refine a reward function.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str` | The prompt to present to the user for video description | *required* |
| `idx` | `int` | The index of the reward function in the memory | *required* |

Returns:

| Type | Description |
| --- | --- |
| `str` | The updated prompt with the video description included |
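
A hedged sketch; the prompt text and index are placeholders:

```python
updated_prompt = viral.video_description(
    prompt="Describe what the agent does in the recorded rollout ...",
    idx=0,  # reward function whose policy produced the video
)
```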