VIRAL
    
            __init__(env_type, model_actor, model_critic, hf=False, vd=False, seed=None, training_time=25000, nb_vec_envs=1, legacy_training=True, options={}, proxies=None)
    Initialize VIRAL architecture for dynamic reward function generation
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| env_type | EnvType | Environment type for which the reward function is generated | required | 
| model_actor | str | LLM model for reward function generation | required | 
| model_critic | str | LLM model for reward function evaluation | required | 
| hf | bool | Enable human feedback. Defaults to False. | False | 
| vd | bool | Enable video description. Defaults to False. | False | 
| seed | int | Random seed for training. Defaults to None. | None | 
| training_time | int | Training time in seconds. Defaults to 25000. | 25000 | 
| nb_vec_envs | int | Number of vectorized environments. Defaults to 1. | 1 | 
| legacy_training | bool | Use legacy training. Defaults to True. | True | 
| options | dict | LLM generation options. Defaults to {}. | {} | 
| proxies | dict | Proxy configuration. Defaults to None. | None | 
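Example (illustrative): the sketch below shows how the constructor might be called. The import path, the `EnvType` member, and the model names are assumptions for illustration and may differ in your installation.

```python
# Hypothetical sketch: import path, EnvType member, and model names are
# illustrative assumptions, not guaranteed by this documentation.
from viral import VIRAL, EnvType  # assumed import path

viral = VIRAL(
    env_type=EnvType.CARTPOLE,       # assumed environment type member
    model_actor="qwen2.5-coder",     # LLM that generates reward functions (assumed name)
    model_critic="llama3.2-vision",  # LLM that evaluates reward functions (assumed name)
    hf=False,                        # disable the human feedback loop
    vd=False,                        # disable the video description loop
    seed=42,
    training_time=25000,             # training budget in seconds
    nb_vec_envs=4,                   # number of vectorized environments
    options={"temperature": 0.7},    # generation options forwarded to the LLM
)
```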
            critical_refine_reward(idx)
    Refine a reward function that has critical performance issues. Based on the evaluation results, this method uses a Language Model (LLM) to generate a new reward function that addresses the identified issues.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| idx | int | Index of the reward function in the memory to be refined. Typically the worst-performing function from previous evaluation. | required | 
Returns:
| Name | Type | Description | 
|---|---|---|
| int | int | Index of the newly created refined reward function in the memory. | 
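A hedged sketch of chaining this refinement step after an evaluation; the index handling below is an assumption based on the description above.

```python
# Sketch only: assumes a prior evaluation identified the worst-performing
# reward function at `worst_idx` in the memory.
worst_idx = 0  # illustrative index
new_idx = viral.critical_refine_reward(worst_idx)
print(f"Refined reward function stored at memory index {new_idx}")
```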
            generate_context()
    Generate a context for the reward function generation process. This method uses a Language Model (LLM) to build a context describing the environment, the task, and the goal to be achieved.
            generate_reward_function(n_init=2, n_refine=1, focus='')
    Generate and iteratively improve a reward function using a Language Model (LLM).
This method implements a multi-stage reward function generation process involving creation, evaluation, and iterative refinement.
Key Stages
- Initial Function Generation: Create two initial reward function candidates
- Evaluation: Compare and identify the best and worst performing functions
- Iterative Refinement: Progressively improve the worst-performing function
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| n_init | int | Number of initial reward function candidates to generate. Defaults to 2. | 2 | 
| n_refine | int | Number of self-refinement iterations to perform on the worst-performing candidate. Defaults to 1. | 1 | 
| focus | str | Optional focus for the reward function generation prompt. Defaults to ''. | '' | 
Returns:
| Type | Description | 
|---|---|
| list[State] | A list of generated and refined reward function states, containing information about each function's performance and implementation. | 
Process Overview
- Generates two initial reward functions using an LLM
- Evaluates these functions using a policy evaluation method
- Selects the worst-performing function for refinement
- Iteratively refines the function through self-refinement
- Tracks the evolution of reward functions in the memory
Detailed Workflow
- Generate the initial reward function candidates:
    - Uses a predefined prompt template
    - Applies configurable LLM generation options
    - Compiles and tests each generated function
- Evaluate the initial functions:
    - Identifies the best and worst performing functions
- Iterative refinement:
    - Applies self-refinement to the worst-performing function
    - Re-evaluates after each refinement
    - Repeats for the specified number of iterations
Note
- Uses dynamic LLM configuration options
- Supports flexible environment types
- Provides a systematic approach to reward function generation
- Logging at various stages for debugging and tracking
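Illustrative usage sketch for the full generation flow; calling `generate_context()` first and the meaning of `focus` are assumptions, and the contents of each `State` depend on the actual implementation.

```python
# Sketch, reusing the `viral` instance from the constructor example above.
viral.generate_context()  # build the environment/task context first (assumed to be required)

states = viral.generate_reward_function(
    n_init=2,    # number of initial candidates
    n_refine=3,  # self-refinement passes on the worst candidate
    focus="",    # optional generation focus (assumed semantics)
)

# Inspect the evolution of reward functions; State attributes are not
# documented here, so only the objects themselves are printed.
for i, state in enumerate(states):
    print(i, state)
```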
            human_feedback(prompt, idx)
    Request human feedback on a reward function to refine it further.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| prompt | str | The prompt to present to the human for feedback | required | 
| idx | int | The index of the reward function in the memory | required | 
Returns:
| Name | Type | Description | 
|---|---|---|
| str | str | The updated prompt with human feedback included | 
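A brief sketch of how the returned prompt might be reused; whether the feedback is collected interactively is an implementation detail not covered here.

```python
# Sketch only: assumes hf=True on the VIRAL instance and that `idx` points
# at the reward function being discussed.
prompt = "The agent balances the pole but drifts off-center."  # illustrative prompt
idx = 1  # illustrative memory index
updated_prompt = viral.human_feedback(prompt, idx)
# The returned string includes the collected feedback and can be passed to
# a subsequent refinement step.
```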
            self_refine_reward(idx)
    Iteratively improve a reward function using self-refinement techniques.
This method implements an intelligent self-refinement process for reward functions by leveraging a Language Model (LLM) to analyze and improve the current function based on its previous performance.
Key Objectives
- Analyze current reward function performance
- Generate an improved version of the reward function
- Maintain the core task objectives while optimizing the reward signal
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| idx | int | Index of the reward function in the memory to be refined. Typically the worst-performing function from previous evaluation. | required | 
Returns:
| Name | Type | Description | 
|---|---|---|
| int | int | Index of the newly created refined reward function in the memory. | 
Refinement Process
- Construct a refinement prompt with:
    - Current reward function code
    - Performance metrics
    - Explicit refinement goals
- Generate a new reward function using the LLM
- Compile and validate the new function
- Append the new function to memory
- Return the index of the new function
Refinement Goals
- Increase success rate of the policy
- Optimize the reward signal for better learning
- Preserve the original task objectives
- Improve overall performance
Notes
- Uses the existing memory to track function evolution
- Leverages LLM for intelligent function refinement
- Provides a systematic approach to reward function improvement
- Maintains a history of function iterations
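An explicit refinement loop is sketched below; note that generate_reward_function already performs this loop internally via `n_refine`, so the code here is purely illustrative.

```python
# Sketch: repeatedly refine the worst-performing function; each call is
# documented to append a new function to memory and return its index.
idx = worst_idx  # index from a prior evaluation (illustrative)
for _ in range(3):
    idx = viral.self_refine_reward(idx)
print(f"Latest refined reward function is at memory index {idx}")
```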
            test_reward_func(reward_func)
    Test a reward function using the policy trainer.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| reward_func | str | The reward function to test | required | 
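The parameter is documented as a string, so the sketch below passes the reward function as source text; the inner function's signature is an assumption and may not match what VIRAL actually expects.

```python
# Sketch: a CartPole-style reward function passed as Python source text.
# The (observations, terminated, truncated) signature is an assumption.
reward_src = '''
def reward_func(observations, terminated, truncated):
    # Penalize failure, otherwise reward keeping the pole near upright.
    if terminated:
        return -10.0
    pole_angle = float(observations[2])
    return 1.0 - abs(pole_angle)
'''
viral.test_reward_func(reward_src)
```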
            video_description(prompt, idx)
    Request a video description from the user to refine a reward function.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| prompt | str | The prompt to present to the user for video description | required | 
| idx | int | The index of the reward function in the memory | required | 
Returns:
| Name | Type | Description | 
|---|---|---|
| str | str | The updated prompt with the video description included |