An AI agent that solves the CartPole and LunarLander environments in OpenAI Gym using the vanilla policy gradient method. The agent uses the average reward as a baseline.
Training:
- The agent learns with a Monte Carlo method: it waits until the end of an episode before updating.
- During an episode the trajectory of states, actions, and rewards is stored. At the end of the episode the neural network approximates the probability distribution over actions for the states in the trajectory.
- The loss is the sum of the products of the log probability of each action with the discounted return from that step of the trajectory (see the sketch after this list).
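A minimal sketch of one such update, assuming a PyTorch policy network whose forward pass returns action logits and a trajectory stored as lists of state tensors, action indices, and rewards. The names (`reinforce_update`, `policy`, etc.) are illustrative, not the repository's actual code:

```python
import torch

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    """One Monte Carlo policy-gradient update from a full episode trajectory."""
    # Discounted return G_t for every time step, computed backwards from the end.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Log probability of each action that was actually taken.
    logits = policy(torch.stack(states))
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(
        torch.as_tensor(actions))

    # Negative sum of log-prob * discounted return: minimizing this loss
    # is gradient ascent on the expected return.
    loss = -(log_probs * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```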
The agent with a baseline performed better than the agent without one.
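One common way to realise an average-reward baseline is to subtract the mean return before forming the loss; continuing the sketch above (the repository may compute its baseline differently):

```python
# Subtract the average return as a baseline: this centers the returns and
# reduces the variance of the gradient estimate without biasing it.
baseline = returns.mean()
advantages = returns - baseline
loss = -(log_probs * advantages).sum()
```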
| Policy Gradient CartPole-v1 | Policy Gradient Baseline vs No Baseline |
|---|---|
| ![]() | ![]() |
Command line arguments:
- --env : environment (CartPole-v1 or LunarLander-v2)
- --learn : train the agent
- --play : make the agent play in the environment
- -ep : number of episodes to train or play
- -g : discount factor gamma
- -lr : learning rate
- To train the agent, run:
  python agent.py --env LunarLander-v2 --learn -ep 1000
- To play, run:
  python agent.py --env LunarLander-v2 --play -ep 5
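For reference, a minimal argparse setup consistent with the flags listed above; this is a sketch of how agent.py might parse them, not necessarily the repository's exact code:

```python
import argparse

parser = argparse.ArgumentParser(description="Vanilla policy gradient agent")
parser.add_argument("--env", default="CartPole-v1",
                    help="environment id: CartPole-v1 or LunarLander-v2")
parser.add_argument("--learn", action="store_true", help="train the agent")
parser.add_argument("--play", action="store_true",
                    help="let the trained agent play in the environment")
parser.add_argument("-ep", type=int, default=1000,
                    help="number of episodes to train or play")
parser.add_argument("-g", type=float, default=0.99, help="discount factor gamma")
parser.add_argument("-lr", type=float, default=1e-3, help="learning rate")
args = parser.parse_args()
```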


