Meta-Algorithms

MAML-Algorithm (Interface)

class meta_policy_search.meta_algos.MAMLAlgo(policy, inner_lr=0.1, meta_batch_size=20, num_inner_grad_steps=1, trainable_inner_step_size=False)[source]

Bases: meta_policy_search.meta_algos.base.MetaAlgo

Provides implementations shared by all MAML algorithms

Parameters:
  • policy (Policy) – policy object
  • inner_lr (float) – gradient step size used for inner step
  • meta_batch_size (int) – number of meta-learning tasks
  • num_inner_grad_steps (int) – number of gradient updates taken per MAML iteration
  • trainable_inner_step_size (bool) – whether to make the inner step size a trainable variable
build_graph()

Creates meta-learning computation graph

Pseudocode:

for task in meta_batch_size:
    make_vars
    init_dist_info_sym
for step in num_inner_grad_steps:
    for task in meta_batch_size:
        make_vars
        update_dist_info_sym
set objectives for optimizer
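
The pseudocode maps onto the construction sketch below. The MockAlgo class is purely illustrative and not part of meta_policy_search; the helper signatures are assumptions, and only the loop structure (make_vars and init_dist_info_sym per task, then update_dist_info_sym per inner step and task, then the outer objective) mirrors the pseudocode above.

# Illustrative mock of the construction loop in build_graph().
# MockAlgo is NOT part of meta_policy_search; the helper signatures are
# assumptions and only the loop structure follows the pseudocode above.

class MockAlgo:
    def __init__(self, meta_batch_size=2, num_inner_grad_steps=1):
        self.meta_batch_size = meta_batch_size
        self.num_inner_grad_steps = num_inner_grad_steps
        self.calls = []  # records the construction order for inspection

    def make_vars(self, prefix=''):
        self.calls.append('make_vars(prefix=%r)' % prefix)
        return ('obs_phs', 'action_phs', 'adv_phs')  # placeholder stand-ins

    def init_dist_info_sym(self, task, phs):
        self.calls.append('init_dist_info_sym(task=%d)' % task)

    def update_dist_info_sym(self, task, phs):
        self.calls.append('update_dist_info_sym(task=%d)' % task)

    def build_graph(self):
        # pre-update placeholders and initial policy distributions, per task
        for task in range(self.meta_batch_size):
            phs = self.make_vars(prefix='step0_task%d_' % task)
            self.init_dist_info_sym(task, phs)
        # fresh placeholders and adapted distributions per inner step and task
        for step in range(1, self.num_inner_grad_steps + 1):
            for task in range(self.meta_batch_size):
                phs = self.make_vars(prefix='step%d_task%d_' % (step, task))
                self.update_dist_info_sym(task, phs)
        # the algorithm-specific meta-objective would be handed to the
        # outer optimizer here ("set objectives for optimizer")

algo = MockAlgo()
algo.build_graph()
print('\n'.join(algo.calls))
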
make_vars(prefix='')

Parameters:
  • prefix (str) – a string to prepend to the name of each variable
Returns:
  a tuple containing lists of placeholders for each input type and meta task
Return type:
  tuple
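
As a hedged usage sketch (assuming algo is a constructed MAMLAlgo subclass such as the ProMP instance built further below, and assuming the meta_batch_size constructor argument is exposed as an attribute), the returned tuple can be inspected per input type:

# Hedged sketch: the number and names of the placeholder lists are internal,
# but each list is documented to hold one placeholder per meta task.
placeholders = algo.make_vars(prefix='step1_')
for ph_list in placeholders:
    assert len(ph_list) == algo.meta_batch_size  # assumed attribute name
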
optimize_policy(all_samples_data, log=True)

Performs MAML outer step for each task

Parameters:
  • all_samples_data (list) – list of lists of lists of samples (each is a dict) split by gradient update and meta task
  • log (bool) – whether to log statistics
Returns:

None
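
A hedged sketch of the nesting that optimize_policy expects for all_samples_data: the outer index runs over gradient-update steps (presumably num_inner_grad_steps + 1 of them, the pre-update batch plus one per adaptation step), the next index runs over meta tasks, and each leaf is a dict of processed sample arrays. The dict keys and array shapes below are assumptions for illustration, not the library's exact sample-processor output.

import numpy as np

# Assumed structure: all_samples_data[update_step][meta_task] -> dict of arrays
num_inner_grad_steps = 1
meta_batch_size = 20
batch_size, obs_dim, act_dim = 100, 8, 2

def fake_task_samples():
    # illustrative keys and shapes only
    return {
        'observations': np.zeros((batch_size, obs_dim)),
        'actions': np.zeros((batch_size, act_dim)),
        'advantages': np.zeros(batch_size),
    }

all_samples_data = [
    [fake_task_samples() for _ in range(meta_batch_size)]
    for _ in range(num_inner_grad_steps + 1)
]

# algo.optimize_policy(all_samples_data, log=True)  # would run the outer step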

ProMP-Algorithm

class meta_policy_search.meta_algos.ProMP(*args, name='ppo_maml', learning_rate=0.001, num_ppo_steps=5, num_minibatches=1, clip_eps=0.2, target_inner_step=0.01, init_inner_kl_penalty=0.01, adaptive_inner_kl_penalty=True, anneal_factor=1.0, **kwargs)[source]

Bases: meta_policy_search.meta_algos.base.MAMLAlgo

ProMP Algorithm

Parameters:
  • policy (Policy) – policy object
  • name (str) – tf variable scope
  • learning_rate (float) – learning rate for optimizing the meta-objective
  • num_ppo_steps (int) – number of ProMP steps (without re-sampling)
  • num_minibatches (int) – number of minibatches for computing the ppo gradient steps
  • clip_eps (float) – PPO clip range
  • target_inner_step (float) – target inner KL divergence; only used when adaptive_inner_kl_penalty is True
  • init_inner_kl_penalty (float) – initial penalty coefficient for the inner KL divergence
  • adaptive_inner_kl_penalty (bool) – whether to use a fixed or an adaptive KL penalty on the inner gradient update
  • anneal_factor (float) – multiplicative factor for annealing clip_eps. If anneal_factor < 1, clip_eps <- anneal_factor * clip_eps at each iteration
  • inner_lr (float) – gradient step size used for inner step
  • meta_batch_size (int) – number of meta-learning tasks
  • num_inner_grad_steps (int) – number of gradient updates taken per MAML iteration
  • trainable_inner_step_size (bool) – whether to make the inner step size a trainable variable
build_graph()[source]

Creates the computation graph

make_vars(prefix='')

Parameters:
  • prefix (str) – a string to prepend to the name of each variable
Returns:
  a tuple containing lists of placeholders for each input type and meta task
Return type:
  tuple
optimize_policy(all_samples_data, log=True)[source]

Performs MAML outer step

Parameters:
  • all_samples_data (list) – list of lists of lists of samples (each is a dict) split by gradient update and meta task
  • log (bool) – whether to log statistics
Returns:

None
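
A hedged construction sketch for ProMP using the arguments documented above. The policy import and its constructor arguments are assumptions about the companion policies module (paths and signatures may differ in your checkout); in practice the algorithm is driven by the library's training loop rather than called in isolation.

from meta_policy_search.meta_algos import ProMP
# assumed import path and constructor signature for the meta policy:
from meta_policy_search.policies.meta_gaussian_mlp_policy import MetaGaussianMLPPolicy

meta_batch_size = 20

policy = MetaGaussianMLPPolicy(
    name='meta_policy',
    obs_dim=8,                 # illustrative environment dimensions
    action_dim=2,
    meta_batch_size=meta_batch_size,
    hidden_sizes=(64, 64),
)

algo = ProMP(
    policy=policy,
    learning_rate=1e-3,
    num_ppo_steps=5,
    num_minibatches=1,
    clip_eps=0.2,
    target_inner_step=0.01,
    init_inner_kl_penalty=0.01,
    adaptive_inner_kl_penalty=True,
    inner_lr=0.1,              # MAMLAlgo arguments forwarded via **kwargs
    meta_batch_size=meta_batch_size,
    num_inner_grad_steps=1,
)

# Each training iteration then performs the outer (meta) update:
# algo.optimize_policy(all_samples_data, log=True)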

TRPO-MAML-Algorithm

class meta_policy_search.meta_algos.TRPOMAML(*args, name='trpo_maml', step_size=0.01, inner_type='likelihood_ratio', exploration=False, **kwargs)[source]

Bases: meta_policy_search.meta_algos.base.MAMLAlgo

Algorithm for TRPO MAML

Parameters:
  • policy (Policy) – policy object
  • name (str) – tf variable scope
  • step_size (float) – trust region size for the meta-policy optimization via TRPO
  • inner_type (str) – one of 'log_likelihood', 'likelihood_ratio', or 'dice'; selects which inner update objective to use
  • exploration (bool) – whether to use E-MAML (True) or plain MAML (False)
  • inner_lr (float) – gradient step size used for inner step
  • meta_batch_size (int) – number of meta-learning tasks
  • num_inner_grad_steps (int) – number of gradient updates taken per MAML iteration
  • trainable_inner_step_size (bool) – whether to make the inner step size a trainable variable
build_graph()[source]

Creates the computation graph

make_vars(prefix='')

Parameters:
  • prefix (str) – a string to prepend to the name of each variable
Returns:
  a tuple containing lists of placeholders for each input type and meta task
Return type:
  tuple
optimize_policy(all_samples_data, log=True)[source]

Performs MAML outer step

Parameters:
  • all_samples_data (list) – list of lists of lists of samples (each is a dict) split by gradient update and meta task
  • log (bool) – whether to log statistics
Returns:

None
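
A shorter hedged sketch for TRPOMAML, reusing a policy constructed as in the ProMP example above; the shared MAMLAlgo arguments are again forwarded through **kwargs.

from meta_policy_search.meta_algos import TRPOMAML

algo = TRPOMAML(
    policy=policy,               # a Policy instance, e.g. as built above
    step_size=0.01,              # trust region size for the TRPO outer step
    inner_type='dice',           # or 'log_likelihood' / 'likelihood_ratio'
    exploration=True,            # enable the E-MAML pre-update sampling term
    inner_lr=0.1,
    meta_batch_size=20,
    num_inner_grad_steps=1,
)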

VPG-MAML-Algorithm

class meta_policy_search.meta_algos.VPGMAML(*args, name='vpg_maml', learning_rate=0.001, inner_type='likelihood_ratio', exploration=False, **kwargs)[source]

Bases: meta_policy_search.meta_algos.base.MAMLAlgo

Algorithm for VPG MAML (vanilla policy gradient meta-objective)

Parameters:
  • policy (Policy) – policy object
  • name (str) – tf variable scope
  • learning_rate (float) – learning rate for the meta-objective
  • exploration (bool) – whether to use the exploration / pre-update sampling (E-MAML) term
  • inner_type (str) – inner optimization objective; either 'log_likelihood' or 'likelihood_ratio'
  • inner_lr (float) – gradient step size used for inner step
  • meta_batch_size (int) – number of meta-learning tasks
  • num_inner_grad_steps (int) – number of gradient updates taken per MAML iteration
  • trainable_inner_step_size (bool) – whether to make the inner step size a trainable variable
build_graph()[source]

Creates the computation graph

make_vars(prefix='')

Parameters:
  • prefix (str) – a string to prepend to the name of each variable
Returns:
  a tuple containing lists of placeholders for each input type and meta task
Return type:
  tuple
optimize_policy(all_samples_data, log=True)[source]

Performs MAML outer step

Parameters:
  • all_samples_data (list) – list of lists of lists of samples (each is a dict) split by gradient update and meta task
  • log (bool) – whether to log statistics
Returns:

None
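
Analogously, a hedged sketch for VPGMAML with the vanilla-policy-gradient outer step; the policy is again assumed to be an existing Policy instance.

from meta_policy_search.meta_algos import VPGMAML

algo = VPGMAML(
    policy=policy,                   # a Policy instance, e.g. as built above
    learning_rate=1e-3,
    inner_type='likelihood_ratio',   # or 'log_likelihood'
    exploration=False,               # plain MAML (no E-MAML term)
    inner_lr=0.1,
    meta_batch_size=20,
    num_inner_grad_steps=1,
)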