PPO_improvement

Final project for Berkeley CS294-112 (Deep Reinforcement Learning).

Project Description

  • Policy Gradients with Optimistic Value Functions
  • John Schulman
  • Policy gradient methods use value functions for variance reduction (e.g., see A3C or GAE). To obtain unbiased gradient estimates, the value function is chosen to approximate V^{\pi}, the value function of the current policy. There is reason to believe that we would obtain faster learning on many problems by instead using a value function that approximates V^*, the optimal value function. You can fit V^* by using Q-learning (to fit Q^*), or simply by fitting V to satisfy the inequality V(s) >= empirical return after state s, rather than the equality V(s) = empirical return after state s (see the sketch after this list).
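As a rough illustration of the inequality-fitting option, below is a minimal PyTorch sketch written for this description (ValueNet and optimistic_value_loss are illustrative names, not code from this repository). It fits V with a one-sided squared loss that penalizes only predictions falling below the empirical return, so V(s) is pushed toward an upper bound on the returns observed under the current policy.

```python
import torch
import torch.nn as nn


class ValueNet(nn.Module):
    """Small MLP state-value function V(s)."""

    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)


def optimistic_value_loss(v_pred, empirical_returns):
    # One-sided squared loss: only penalize predictions that fall BELOW the
    # empirical return, pushing the fitted V toward satisfying
    # V(s) >= empirical return after state s (an optimistic stand-in for V^*).
    # Ordinary V^pi regression would instead use the plain squared error.
    shortfall = torch.clamp(empirical_returns - v_pred, min=0.0)
    return (shortfall ** 2).mean()


if __name__ == "__main__":
    obs_dim = 4  # illustrative observation size
    value_fn = ValueNet(obs_dim)
    optimizer = torch.optim.Adam(value_fn.parameters(), lr=1e-3)

    # Placeholder batch; in practice the (state, return) pairs would come
    # from rollouts of the current policy.
    states = torch.randn(256, obs_dim)
    returns = torch.randn(256)

    for _ in range(200):
        loss = optimistic_value_loss(value_fn(states), returns)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The fitted value function would then stand in for the usual V^{\pi} baseline when computing advantages for the policy gradient.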

Resources:
