VIRAL
__init__(env_type, model_actor, model_critic, hf=False, vd=False, seed=None, training_time=25000, nb_vec_envs=1, legacy_training=True, options={}, proxies=None)
Initialize VIRAL architecture for dynamic reward function generation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
env_type | EnvType | Environment type for which the reward function is generated | required |
model_actor | str | LLM model for reward function generation | required |
model_critic | str | LLM model for reward function evaluation | required |
hf | bool | Enable human feedback | False |
vd | bool | Enable video description | False |
seed | int | Random seed for training | None |
training_time | int | Training time in seconds | 25000 |
nb_vec_envs | int | Number of vectorized environments | 1 |
legacy_training | bool | Use legacy training | True |
options | dict | LLM generation options | {} |
proxies | dict | Proxy configuration | None |
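A minimal instantiation sketch; the import path, `EnvType` member, and model names below are assumptions rather than values documented here:

```python
# Hypothetical usage sketch: the import path, EnvType member, and model
# names are assumptions, not values documented on this page.
from viral import VIRAL, EnvType  # assumed import path

viral = VIRAL(
    env_type=EnvType.CARTPOLE,       # assumed environment enum member
    model_actor="qwen2.5-coder",     # assumed generation LLM
    model_critic="llava",            # assumed evaluation LLM
    seed=42,
    training_time=25_000,            # seconds of policy training
    nb_vec_envs=4,
    options={"temperature": 0.7},    # passed through to the LLM
)
```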
critical_refine_reward(idx)
Refine a reward function that has critical performance issues. Based on the evaluation results, this method uses a Language Model (LLM) to generate a new reward function that addresses the identified issues.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index of the reward function in the memory to be refined. Typically the worst-performing function from previous evaluation. | required |
Returns:
Name | Type | Description |
---|---|---|
int | int | Index of the newly created refined reward function in the memory. |
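A sketch of how this might be called once an evaluation has flagged a failing candidate; the index is a placeholder:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
# Suppose a prior evaluation flagged the candidate at index 1 as failing.
worst_idx = 1  # placeholder index from an earlier evaluation
new_idx = viral.critical_refine_reward(worst_idx)
print(f"Refined reward function stored at memory index {new_idx}")
```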
generate_context()
Generate a context for the reward function generation process. This method uses a Language Model (LLM) to build the context, which includes information about the environment, task, and goal to be achieved.
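Typically called before `generate_reward_function`; a minimal sketch, assuming the `viral` instance from the constructor example:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
# Build the environment/task/goal context before generating functions.
viral.generate_context()
```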
generate_reward_function(n_init=2, n_refine=1, focus='')
Generate and iteratively improve a reward function using a Language Model (LLM).
This method implements a sophisticated reward function generation process that involves multiple stages of creation, evaluation, and refinement.
Key Stages
- Initial Function Generation: Create two initial reward function candidates
- Evaluation: Compare and identify the best and worst performing functions
- Iterative Refinement: Progressively improve the worst-performing function
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_init | int | Number of initial reward function candidates to generate | 2 |
n_refine | int | Number of refinement iterations to perform | 1 |
focus | str | Optional focus for the generation prompt | '' |
Returns:
Type | Description |
---|---|
list[State] | A list of generated and refined reward function states, containing information about each function's performance and implementation. |
Process Overview
- Generates two initial reward functions using an LLM
- Evaluates these functions using a policy evaluation method
- Selects the worst-performing function for refinement
- Iteratively refines the function through self-refinement
- Tracks the evolution of reward functions in the memory
Detailed Workflow
- Generate two initial reward functions
  - Uses a predefined prompt template
  - Applies configurable LLM generation options
  - Compiles and tests each generated function
- Evaluate initial functions
  - Identifies best and worst performing functions
- Iterative refinement
  - Applies self-refinement to the worst-performing function
  - Re-evaluates after each refinement
  - Repeats for the specified number of iterations
Note
- Uses dynamic LLM configuration options
- Supports flexible environment types
- Provides a systematic approach to reward function generation
- Logging at various stages for debugging and tracking
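A hedged end-to-end sketch of the generation loop, assuming the `viral` instance from the constructor example:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
viral.generate_context()                       # build the LLM context first
states = viral.generate_reward_function(n_init=2, n_refine=1)

# Each State holds a generated or refined reward function and its
# evaluation results; inspect them to pick the best candidate.
for i, state in enumerate(states):
    print(f"candidate {i}: {state}")
```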
human_feedback(prompt, idx)
Request human feedback on a reward function to refine it further.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt | str | The prompt to present to the human for feedback | required |
idx | int | The index of the reward function in the memory | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The updated prompt with human feedback included |
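A sketch of using the feedback hook inside a refinement loop; the prompt string and index are placeholders:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
# Placeholder prompt; in practice this is the refinement prompt built
# around the candidate stored at index 0.
refine_prompt = "Current reward function and its training metrics: ..."
updated_prompt = viral.human_feedback(refine_prompt, idx=0)
```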
self_refine_reward(idx)
Iteratively improve a reward function using self-refinement techniques.
This method implements an intelligent self-refinement process for reward functions by leveraging a Language Model (LLM) to analyze and improve the current function based on its previous performance.
Key Objectives
- Analyze current reward function performance
- Generate an improved version of the reward function
- Maintain the core task objectives while optimizing the reward signal
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index of the reward function in the memory to be refined. Typically the worst-performing function from previous evaluation. | required |
Returns:
Name | Type | Description |
---|---|---|
int | int | Index of the newly created refined reward function in the memory. |
Refinement Process
- Construct a refinement prompt with:
  - Current reward function code
  - Performance metrics
  - Explicit refinement goals
- Generate a new reward function using the LLM
- Compile and validate the new function
- Append the new function to memory
- Return the index of the new function
Refinement Goals
- Increase success rate of the policy
- Optimize the reward signal for better learning
- Preserve the original task objectives
- Improve overall performance
Notes
- Uses the existing memory to track function evolution
- Leverages LLM for intelligent function refinement
- Provides a systematic approach to reward function improvement
- Maintains a history of function iterations
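A sketch of a single self-refinement step; the index is a placeholder:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
worst_idx = 0  # placeholder: worst-performing candidate from evaluation
refined_idx = viral.self_refine_reward(worst_idx)
# The refined function is appended to memory; refined_idx points to it.
```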
test_reward_func(reward_func)
Test a reward function using the policy trainer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reward_func | str | The reward function to test | required |
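A sketch of testing a hand-written candidate; both the expected reward-function signature and the body below are assumptions:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
# The expected signature of a generated reward function is an assumption;
# the body below is only an illustrative placeholder.
candidate = '''
def reward_func(observations, terminated, truncated):
    """Placeholder shaping: reward survival, penalize failure."""
    return -1.0 if terminated else 1.0
'''
viral.test_reward_func(candidate)
```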
video_description(prompt, idx)
Request a video description from the user to refine a reward function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt | str | The prompt to present to the user for video description | required |
idx | int | The index of the reward function in the memory | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The updated prompt with the video description included |
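A sketch of the video-description hook, parallel to `human_feedback`; the prompt and index are placeholders:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
# Placeholder prompt; the user's description of the rollout video is
# folded back into the refinement prompt for candidate 0.
vd_prompt = "Describe what the agent does in the recorded episode: ..."
updated_prompt = viral.video_description(vd_prompt, idx=0)
```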