VIRAL
__init__(env_type, model_actor, model_critic, hf=False, vd=False, seed=None, training_time=25000, nb_vec_envs=1, legacy_training=True, options={}, proxies=None)
Initialize VIRAL architecture for dynamic reward function generation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
env_type | EnvType | Environment type for which the reward function is generated | required |
model_actor | str | LLM model for reward function generation | required |
model_critic | str | LLM model for reward function evaluation | required |
hf | bool | Enable human feedback | False |
vd | bool | Enable video description | False |
seed | int | Random seed for training | None |
training_time | int | Training time in seconds | 25000 |
nb_vec_envs | int | Number of vectorized environments | 1 |
legacy_training | bool | Use legacy training | True |
options | dict | LLM generation options | {} |
proxies | dict | Proxy configuration | None |
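A minimal instantiation sketch; the import path, `EnvType` member, and model names below are assumptions rather than values documented here:

```python
# Hypothetical usage sketch: the import path, EnvType member, and model
# names are assumptions, not values documented on this page.
from viral import VIRAL, EnvType  # assumed import path

viral = VIRAL(
    env_type=EnvType.CARTPOLE,       # assumed environment enum member
    model_actor="qwen2.5-coder",     # assumed generation LLM
    model_critic="llava",            # assumed evaluation LLM
    seed=42,
    training_time=25_000,            # seconds of policy training
    nb_vec_envs=4,
    options={"temperature": 0.7},    # passed through to the LLM
)
```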
critical_refine_reward(idx)
Refine a reward function that has critical performance issues. Based on the evaluation results, this method uses a Language Model (LLM) to generate a new reward function that addresses the identified issues.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index of the reward function in the memory to be refined. Typically the worst-performing function from previous evaluation. | required |
Returns:
Name | Type | Description |
---|---|---|
int | int | Index of the newly created refined reward function in the memory. |
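A sketch of how this might be called once an evaluation has flagged a failing candidate; the index is a placeholder:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
# Suppose a prior evaluation flagged the candidate at index 1 as failing.
worst_idx = 1  # placeholder index from an earlier evaluation
new_idx = viral.critical_refine_reward(worst_idx)
print(f"Refined reward function stored at memory index {new_idx}")
```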
generate_context()
Generate a context for the reward function generation process. This method uses a Language Model (LLM) to build the context, which includes information about the environment, task, and goal to be achieved.
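Typically called before `generate_reward_function`; a minimal sketch, assuming the `viral` instance from the constructor example:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
# Build the environment/task/goal context before generating functions.
viral.generate_context()
```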
generate_reward_function(n_init=2, n_refine=1, focus='')
Generate and iteratively improve a reward function using a Language Model (LLM).
This method implements a sophisticated reward function generation process that involves multiple stages of creation, evaluation, and refinement.
Key Stages
- Initial Function Generation: Create two initial reward function candidates
- Evaluation: Compare and identify the best and worst performing functions
- Iterative Refinement: Progressively improve the worst-performing function
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_init | int | Number of initial reward function candidates to generate | 2 |
n_refine | int | Number of refinement iterations to perform | 1 |
focus | str | Optional focus for the generation prompt | '' |
Returns:
Type | Description |
---|---|
list[State] | A list of generated and refined reward function states, containing information about each function's performance and implementation. |
Process Overview
- Generates two initial reward functions using an LLM
- Evaluates these functions using a policy evaluation method
- Selects the worst-performing function for refinement
- Iteratively refines the function through self-refinement
- Tracks the evolution of reward functions in the memory
Detailed Workflow
- Generate two initial reward functions
  - Uses a predefined prompt template
  - Applies configurable LLM generation options
  - Compiles and tests each generated function
- Evaluate initial functions
  - Identifies best and worst performing functions
- Iterative refinement
  - Applies self-refinement to the worst-performing function
  - Re-evaluates after each refinement
  - Repeats for the specified number of iterations
Note
- Uses dynamic LLM configuration options
- Supports flexible environment types
- Provides a systematic approach to reward function generation
- Logging at various stages for debugging and tracking
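A hedged end-to-end sketch of the generation loop, assuming the `viral` instance from the constructor example:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
viral.generate_context()                       # build the LLM context first
states = viral.generate_reward_function(n_init=2, n_refine=1)

# Each State holds a generated or refined reward function and its
# evaluation results; inspect them to pick the best candidate.
for i, state in enumerate(states):
    print(f"candidate {i}: {state}")
```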
human_feedback(prompt, idx)
Request human feedback on a reward function to refine it further.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt | str | The prompt to present to the human for feedback | required |
idx | int | The index of the reward function in the memory | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The updated prompt with human feedback included |
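A sketch of using the feedback hook inside a refinement loop; the prompt string and index are placeholders:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
# Placeholder prompt; in practice this is the refinement prompt built
# around the candidate stored at index 0.
refine_prompt = "Current reward function and its training metrics: ..."
updated_prompt = viral.human_feedback(refine_prompt, idx=0)
```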
self_refine_reward(idx)
Iteratively improve a reward function using self-refinement techniques.
This method implements an intelligent self-refinement process for reward functions by leveraging a Language Model (LLM) to analyze and improve the current function based on its previous performance.
Key Objectives
- Analyze current reward function performance
- Generate an improved version of the reward function
- Maintain the core task objectives while optimizing the reward signal
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index of the reward function in the memory to be refined. Typically the worst-performing function from previous evaluation. | required |
Returns:
Name | Type | Description |
---|---|---|
int | int | Index of the newly created refined reward function in the memory. |
Refinement Process
- Construct a refinement prompt with:
  - Current reward function code
  - Performance metrics
  - Explicit refinement goals
- Generate a new reward function using the LLM
- Compile and validate the new function
- Append the new function to memory
- Return the index of the new function
Refinement Goals
- Increase success rate of the policy
- Optimize the reward signal for better learning
- Preserve the original task objectives
- Improve overall performance
Notes
- Uses the existing memory to track function evolution
- Leverages LLM for intelligent function refinement
- Provides a systematic approach to reward function improvement
- Maintains a history of function iterations
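A sketch of a single self-refinement step; the index is a placeholder:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
worst_idx = 0  # placeholder: worst-performing candidate from evaluation
refined_idx = viral.self_refine_reward(worst_idx)
# The refined function is appended to memory; refined_idx points to it.
```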
test_reward_func(reward_func)
Test a reward function using the policy trainer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reward_func | str | The reward function to test | required |
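A sketch of testing a hand-written candidate; both the expected reward-function signature and the body below are assumptions:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
# The expected signature of a generated reward function is an assumption;
# the body below is only an illustrative placeholder.
candidate = '''
def reward_func(observations, terminated, truncated):
    """Placeholder shaping: reward survival, penalize failure."""
    return -1.0 if terminated else 1.0
'''
viral.test_reward_func(candidate)
```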
video_description(prompt, idx)
Request a video description from the user to refine a reward function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt | str | The prompt to present to the user for video description | required |
idx | int | The index of the reward function in the memory | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | The updated prompt with the video description included |
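A sketch of the video-description hook, parallel to `human_feedback`; the prompt and index are placeholders:

```python
# `viral` is the VIRAL instance from the constructor sketch above.
# Placeholder prompt; the user's description of the rollout video is
# folded back into the refinement prompt for candidate 0.
vd_prompt = "Describe what the agent does in the recorded episode: ..."
updated_prompt = viral.video_description(vd_prompt, idx=0)
```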