
Rainbow #344

Closed
wants to merge 312 commits
Changes from all commits
312 commits
4c0f535
a
seann999 Aug 21, 2018
8adb23b
wow
seann999 Aug 21, 2018
c5d5957
a
seann999 Aug 21, 2018
e2565b2
a
seann999 Aug 22, 2018
6e9ed4c
a
seann999 Aug 22, 2018
e8e5583
a
seann999 Aug 22, 2018
43b240e
a
seann999 Aug 22, 2018
201c419
a
seann999 Aug 23, 2018
11704de
a
seann999 Aug 23, 2018
b95a89b
a
seann999 Aug 23, 2018
5c678ad
a
seann999 Aug 23, 2018
aea19d2
a
seann999 Aug 23, 2018
1c8e164
a
seann999 Aug 23, 2018
420e27b
myseq
seann999 Aug 23, 2018
d4a871c
a
seann999 Aug 23, 2018
ace9796
a
seann999 Aug 23, 2018
438fcc1
a
seann999 Aug 23, 2018
3a9608d
log
seann999 Aug 23, 2018
3e481e7
spec
seann999 Aug 23, 2018
45dcd19
specup
seann999 Aug 24, 2018
bbf297c
a
seann999 Aug 24, 2018
ad4cb31
a
seann999 Aug 26, 2018
292d3b7
a
seann999 Aug 26, 2018
cb5117a
a
seann999 Aug 26, 2018
9581771
a
seann999 Aug 26, 2018
7b1908c
a
seann999 Aug 27, 2018
0f7b273
a
seann999 Aug 27, 2018
42f913b
a
seann999 Aug 27, 2018
512d31e
a
seann999 Aug 29, 2018
d3fa393
a
seann999 Aug 30, 2018
1fc953d
a
seann999 Sep 3, 2018
18555cf
a
seann999 Sep 3, 2018
264a3fd
a
seann999 Sep 3, 2018
643666c
a
seann999 Sep 3, 2018
45f7fd6
a
seann999 Sep 3, 2018
0653ce6
a
seann999 Sep 3, 2018
438a082
fix
seann999 Sep 4, 2018
48871f3
fix
seann999 Sep 4, 2018
4ea1ca1
a
seann999 Sep 4, 2018
089481e
a
seann999 Sep 10, 2018
5ce42e4
yoo
seann999 Sep 10, 2018
337da8b
yoo
seann999 Sep 10, 2018
dbec0a0
a
seann999 Sep 10, 2018
4e99b0a
a
seann999 Sep 10, 2018
f139108
a
seann999 Sep 10, 2018
7d29428
a
seann999 Sep 11, 2018
3ba3903
a
seann999 Sep 11, 2018
b46ce2e
spec
seann999 Sep 11, 2018
9987597
a
seann999 Sep 11, 2018
b59c4a9
a
seann999 Sep 11, 2018
48281c1
a
seann999 Sep 11, 2018
3f93318
a
seann999 Sep 11, 2018
94d61ed
a
seann999 Sep 11, 2018
bc71682
a
seann999 Sep 12, 2018
7260ea8
huber
seann999 Sep 12, 2018
6fc7617
adds some functionality for n step transitions
prabhatnagarajan Sep 12, 2018
0908915
adds n step sampling from replay buffer
prabhatnagarajan Sep 12, 2018
9ee7bfa
fixes minor typo
prabhatnagarajan Sep 12, 2018
c8fffc0
a
seann999 Sep 12, 2018
9096b83
a
seann999 Sep 13, 2018
d5f303f
a
seann999 Sep 13, 2018
4d18d4d
adds num step returns argument
prabhatnagarajan Sep 14, 2018
a27abde
adds parseargs argument for number of steps to use in return value, p…
prabhatnagarajan Sep 14, 2018
864f252
a
seann999 Sep 16, 2018
b9e9a00
a
seann999 Sep 16, 2018
214be59
a
seann999 Sep 16, 2018
cab9965
a
seann999 Sep 16, 2018
ce21e7d
a
seann999 Sep 17, 2018
bee3d73
a
seann999 Sep 18, 2018
839e914
a
seann999 Sep 18, 2018
4b927a3
a
seann999 Sep 18, 2018
2d12b0e
adds nstep transition clipping before adding n transitions to replay …
prabhatnagarajan Sep 21, 2018
3ddb2e4
adds n step deep Q-learning, modifies batch_experiences to enable this
prabhatnagarajan Sep 21, 2018
f0e6b8f
fix
seann999 Sep 24, 2018
61cb175
a
seann999 Sep 24, 2018
1b4a7ab
a
seann999 Sep 24, 2018
ef94840
a
seann999 Sep 24, 2018
fe08a5d
a
seann999 Sep 24, 2018
663ca2c
a
seann999 Sep 24, 2018
24e2ee0
a
seann999 Sep 24, 2018
4df0ac3
a
seann999 Sep 25, 2018
2b37dde
a
seann999 Sep 25, 2018
736bfca
a
seann999 Sep 25, 2018
f987d53
a
seann999 Sep 25, 2018
f872f22
a
seann999 Sep 25, 2018
8aa9896
a
seann999 Sep 25, 2018
11e649e
a
seann999 Sep 25, 2018
8448258
a
seann999 Sep 25, 2018
ae9b2c7
a
seann999 Sep 25, 2018
6418af3
a
seann999 Sep 25, 2018
f9a4d95
a
seann999 Sep 25, 2018
fa4ba13
Merge branch 'master' into nstep
prabhatnagarajan Sep 26, 2018
cb63e76
adds SARSA agent to DQN example
prabhatnagarajan Sep 26, 2018
c15531b
Merge branch 'sarsa' into nstep
prabhatnagarajan Sep 26, 2018
55110d3
changes sarsa agent to new n-step format
prabhatnagarajan Sep 26, 2018
be5934e
converts several other agents to new exp_batch format
prabhatnagarajan Sep 27, 2018
7046629
added plots
seann999 Sep 27, 2018
fb707ee
spec
seann999 Sep 27, 2018
e71b92b
fixed bug
seann999 Sep 27, 2018
c591956
fixed bug
seann999 Sep 27, 2018
007f4ec
fixed bug
seann999 Sep 27, 2018
1c6134e
fix
seann999 Sep 28, 2018
edcb1f3
a
seann999 Sep 28, 2018
8eba91e
fix
seann999 Sep 28, 2018
be769b6
a
seann999 Sep 28, 2018
e011a2a
a
seann999 Sep 28, 2018
b48d217
a
seann999 Sep 28, 2018
cff0fb9
car
seann999 Sep 30, 2018
2fe528a
car
seann999 Sep 30, 2018
7395d7b
sets up stop current episode and makes dpp agent with new exp_batch
prabhatnagarajan Oct 1, 2018
484fee6
a
seann999 Oct 1, 2018
934717e
adds n steps to prioritized buffer and fixes merge conflicts
prabhatnagarajan Oct 1, 2018
4fd3f33
minor fixes
prabhatnagarajan Oct 1, 2018
f9af9fa
removes num_steps from init of prioritized replay buffer
prabhatnagarajan Oct 1, 2018
74ab311
a
seann999 Oct 1, 2018
d14cf3b
a
seann999 Oct 2, 2018
f15de04
makes fixes to prioritized replay buffer to be compatible with n-step…
prabhatnagarajan Oct 3, 2018
97f2f5d
idk
seann999 Oct 3, 2018
34bda1a
a
seann999 Oct 3, 2018
129ab91
a
seann999 Oct 3, 2018
7b94734
a
seann999 Oct 3, 2018
56d9aaf
a
seann999 Oct 3, 2018
416ed2c
a
seann999 Oct 4, 2018
e4fbf24
a
seann999 Oct 5, 2018
b887578
a
seann999 Oct 5, 2018
74f49e5
a
seann999 Oct 7, 2018
1293feb
a
seann999 Oct 8, 2018
1d85c29
a
seann999 Oct 8, 2018
4716009
a
seann999 Oct 8, 2018
ff929a5
a
seann999 Oct 9, 2018
936da1d
a
seann999 Oct 9, 2018
d972420
a
seann999 Oct 11, 2018
eecb049
a
seann999 Oct 13, 2018
a433257
fix
seann999 Oct 13, 2018
7ebb1a8
scaling
seann999 Oct 14, 2018
0835e70
a
seann999 Oct 14, 2018
1d13670
a
seann999 Oct 14, 2018
a766892
a
seann999 Oct 14, 2018
b3795e4
count
seann999 Oct 18, 2018
8d8f281
a
seann999 Oct 21, 2018
5cd8002
makes some changes to the tests for compatibility with nstep
prabhatnagarajan Oct 22, 2018
d92e9ff
Merge branch 'master' into nstep
prabhatnagarajan Oct 22, 2018
0e6b1f2
modifies two replay_buffer unit tests to accommodate new replay buffer
prabhatnagarajan Oct 23, 2018
96a64aa
addresses merge conflicts
prabhatnagarajan Oct 23, 2018
7ae0fd7
makes weights with replay buffer
prabhatnagarajan Oct 23, 2018
73d9768
fixes for adding weight to batch experiences
prabhatnagarajan Oct 23, 2018
4c75f83
changes weights back to original formulation with minor mods
prabhatnagarajan Oct 23, 2018
8d5f0c7
reverts to old style of sample method for prioritized replay
prabhatnagarajan Oct 24, 2018
980c662
fixes prioritized test
prabhatnagarajan Oct 24, 2018
6d379ff
modifies all replay_buffer tests to match new specs, everything passes
prabhatnagarajan Oct 24, 2018
74891e4
a
seann999 Oct 25, 2018
5e1e0eb
addresses a few flake issues
prabhatnagarajan Oct 26, 2018
936f341
more minor flake fixes
prabhatnagarajan Oct 26, 2018
db391ec
makes fix to episodic DQN to pass tests
prabhatnagarajan Oct 26, 2018
1be4d26
attempts testing code
prabhatnagarajan Oct 29, 2018
b31d5b1
adds n step toy domain tests
prabhatnagarajan Oct 30, 2018
a45be90
Merge branch 'master' into nstep
prabhatnagarajan Oct 31, 2018
8069723
a
seann999 Oct 31, 2018
040488e
Merge branch 'master' into nstep
prabhatnagarajan Oct 31, 2018
359befd
a
seann999 Oct 31, 2018
942df84
a
seann999 Nov 1, 2018
48e5f7f
a
seann999 Nov 1, 2018
706b6cd
nstep-ifies Categorical DQN
prabhatnagarajan Nov 1, 2018
21f7253
removes set trace
prabhatnagarajan Nov 1, 2018
a354dfd
fixes batch_experience call in pcl agent
prabhatnagarajan Nov 1, 2018
274f35f
a
seann999 Nov 1, 2018
d7fe4db
puts transitions in a list to be compatible with new batch experience…
prabhatnagarajan Nov 1, 2018
8b33168
makes another pcl fix
prabhatnagarajan Nov 1, 2018
1bf8932
mods ddpg code to us new batch_experiences
prabhatnagarajan Nov 1, 2018
166e889
a
seann999 Nov 1, 2018
f8b4735
a
seann999 Nov 2, 2018
451562d
a
seann999 Nov 2, 2018
f09e7d5
a
seann999 Nov 2, 2018
b69c160
a
seann999 Nov 2, 2018
b07c970
fixes flake issues
prabhatnagarajan Nov 2, 2018
630d80d
fixes error in prioritized arising from flake fix, fixes flakes in tests
prabhatnagarajan Nov 2, 2018
14e50cc
fixes flakes in examples
prabhatnagarajan Nov 2, 2018
286ce98
a
seann999 Nov 2, 2018
c53ec0c
a
seann999 Nov 2, 2018
33b9aca
a
seann999 Nov 2, 2018
9423ea9
applies autopep
prabhatnagarajan Nov 2, 2018
b02e833
a
seann999 Nov 2, 2018
a642136
init rainbow
seann999 Nov 3, 2018
8d9d0c8
Merge branch 'master' of https://github.com/chainer/chainerrl
seann999 Nov 3, 2018
93d87b9
init impl for all except n-step-return
seann999 Nov 3, 2018
254bce4
fix conflict
seann999 Nov 3, 2018
41fbd2a
compat w n-step
seann999 Nov 3, 2018
d026939
test fix
seann999 Nov 3, 2018
60e53c4
gpu fix
seann999 Nov 3, 2018
14d0144
gpu fix
seann999 Nov 3, 2018
8e31fec
gpu fix
seann999 Nov 3, 2018
e9661b7
gpu fix
seann999 Nov 3, 2018
5821a0c
softmax
seann999 Nov 3, 2018
3a64094
applied autopep
seann999 Nov 4, 2018
be1aff9
removed redundant example
seann999 Nov 4, 2018
e500ee4
undo breaking changes
seann999 Nov 4, 2018
9e3b909
fix flake8 errors
seann999 Nov 4, 2018
9d47391
fix docs
seann999 Nov 4, 2018
a3e6923
flake8 fix
seann999 Nov 4, 2018
e9f0b4f
flake8 fix
seann999 Nov 4, 2018
e39414e
a
seann999 Nov 4, 2018
a39a15c
a
seann999 Nov 5, 2018
d11951e
a
seann999 Nov 6, 2018
756c5ff
a
seann999 Nov 7, 2018
e7a205b
a
seann999 Nov 8, 2018
23ca9ad
a
seann999 Nov 8, 2018
5d76888
a
seann999 Nov 8, 2018
d5f87b1
a
seann999 Nov 8, 2018
085b668
a
seann999 Nov 8, 2018
afc5554
a
seann999 Nov 8, 2018
c0bd137
a
seann999 Nov 8, 2018
19cb31f
spec
seann999 Nov 8, 2018
861104f
spec
seann999 Nov 8, 2018
0d96f18
a
seann999 Nov 8, 2018
b4f1c47
a
seann999 Nov 8, 2018
dc7e045
a
seann999 Nov 8, 2018
789a3fe
a
seann999 Nov 8, 2018
a34618c
a
seann999 Nov 8, 2018
b92940a
a
seann999 Nov 8, 2018
dac3169
a
seann999 Nov 14, 2018
f3bbf61
a
seann999 Nov 15, 2018
a8efa30
a
seann999 Nov 15, 2018
4224cfa
a
seann999 Nov 16, 2018
d733f02
a
seann999 Nov 26, 2018
ab75eb2
a
seann999 Dec 2, 2018
a9be6ef
a
seann999 Dec 2, 2018
115e58c
a
seann999 Dec 2, 2018
895105b
a
seann999 Dec 2, 2018
05c4d23
a
seann999 Dec 2, 2018
59d167e
a
seann999 Dec 2, 2018
b5d13ec
a
seann999 Dec 2, 2018
fcb4255
a
seann999 Dec 4, 2018
a609225
a
seann999 Dec 4, 2018
7a07ae4
a
seann999 Dec 19, 2018
dc6c674
a
seann999 Dec 19, 2018
21660ad
a
seann999 Jan 7, 2019
9f0a5eb
a
seann999 Jan 7, 2019
0677399
a
seann999 Jan 7, 2019
f54add6
a
seann999 Jan 7, 2019
a87dd71
a
seann999 Jan 8, 2019
3e53847
mv
seann999 Jan 8, 2019
0ab831b
merge
seann999 Jan 16, 2019
c35ae13
merge
seann999 Jan 16, 2019
ffbc7ba
update spec
seann999 Jan 16, 2019
9e9fffe
rm egg
seann999 Jan 17, 2019
f165005
add ignore
seann999 Jan 17, 2019
04e95f7
add ignore
seann999 Jan 17, 2019
3249542
add ignore
seann999 Jan 17, 2019
46a9f32
rm files
seann999 Jan 17, 2019
3a2c2e8
rm junk
seann999 Jan 17, 2019
12 changes: 5 additions & 7 deletions .gitignore
@@ -1,8 +1,6 @@
*.pyc
results
**/__pycache__
.ipynb_checkpoints
chainerrl.egg-info
build/
dist/
.idea/
results/
examples/gym/results/
build/lib/chainerrl
dist
*.ipynb
Empty file modified CONTRIBUTING.md
100644 → 100755
Empty file.
Empty file modified LICENSE
100644 → 100755
Empty file.
Empty file modified README.md
100644 → 100755
Empty file.
Empty file modified assets/ChainerRL.png
100644 → 100755
Empty file modified assets/breakout.gif
100644 → 100755
Empty file modified assets/humanoid.gif
100644 → 100755
Empty file modified chainerrl/__init__.py
100644 → 100755
Empty file.
Binary file added chainerrl/__pycache__/__init__.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/__init__.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/action_value.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/action_value.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/agent.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/agent.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/distribution.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/distribution.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/env.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/env.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/explorer.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/explorer.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/policy.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/policy.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/q_function.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/q_function.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/recurrent.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/recurrent.cpython-36.pyc
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added chainerrl/__pycache__/spaces.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/v_function.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/__pycache__/v_function.cpython-36.pyc
Binary file not shown.
74 changes: 74 additions & 0 deletions chainerrl/action_value.py
100644 → 100755
@@ -102,6 +102,80 @@ def __getitem__(self, i):
self.q_values[i], q_values_formatter=self.q_values_formatter)


class DiscreteActionValueWithSigma(ActionValue):
"""Q-function output for discrete action space.

Args:
q_values (ndarray or chainer.Variable):
Array of Q values whose shape is (batchsize, n_actions)
"""

def __init__(self, q_values, sigma_values, all_sigmas=None, q_values_formatter=lambda x: x):
assert isinstance(q_values, chainer.Variable)
self.xp = cuda.get_array_module(q_values.data)
self.q_values = q_values
self.sigmas = sigma_values
self.n_actions = q_values.data.shape[1]
self.q_values_formatter = q_values_formatter
self.all_sigmas = all_sigmas

@cached_property
def greedy_actions(self):
return chainer.Variable(
self.q_values.data.argmax(axis=1).astype(np.int32))

@cached_property
def sample_actions(self):
noise = self.xp.random.standard_normal(self.sigmas.shape)
sig = self.xp.sqrt(self.xp.absolute(self.sigmas.data))
vals = self.q_values.data + sig * noise
return chainer.Variable(vals.argmax(axis=1).astype(np.int32))

def sample_actions_given_sigma(self, sigma):
noise = self.xp.random.standard_normal(self.sigmas.shape)
vals = self.q_values.data + sigma * noise
return chainer.Variable(vals.argmax(axis=1).astype(np.int32))

def sample_actions_given_noise(self, sigma):
vals = self.q_values.data + sigma
return chainer.Variable(vals.argmax(axis=1).astype(np.int32))

@cached_property
def max(self):
with chainer.force_backprop_mode():
return F.select_item(self.q_values, self.greedy_actions)

@cached_property
def max_sigma(self):
with chainer.force_backprop_mode():
return F.select_item(self.sigmas, self.greedy_actions)

def evaluate_actions(self, actions):
return F.select_item(self.q_values, actions)

def evaluate_action_sigmas(self, actions):
return F.select_item(self.sigmas, actions)

def compute_advantage(self, actions):
return self.evaluate_actions(actions) - self.max

def compute_double_advantage(self, actions, argmax_actions):
return (self.evaluate_actions(actions) -
self.evaluate_actions(argmax_actions))

def compute_expectation(self, beta):
return F.sum(F.softmax(beta * self.q_values) * self.q_values, axis=1)

def __repr__(self):
return 'DiscreteActionValueWithSigma greedy_actions:{} q_values:{}'.format(
self.greedy_actions.data,
self.q_values_formatter(self.q_values.data))

@property
def params(self):
return (self.q_values,)


class DistributionalDiscreteActionValue(ActionValue):
"""distributional Q-function output for discrete action space.

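
The DiscreteActionValueWithSigma class added above drives exploration by perturbing every Q-value with Gaussian noise scaled by the square root of the absolute predicted sigma for that action, then acting greedily on the perturbed values. A minimal NumPy sketch of that sampling rule, with illustrative names and without the chainer.Variable plumbing used in the class itself:

import numpy as np

def sample_noisy_actions(q_values, sigmas, rng=np.random):
    # q_values, sigmas: float arrays of shape (batch_size, n_actions)
    noise = rng.standard_normal(sigmas.shape)
    scale = np.sqrt(np.abs(sigmas))      # same scaling as sample_actions above
    perturbed = q_values + scale * noise
    return perturbed.argmax(axis=1)      # one action index per batch element

# Example: the second state has larger sigmas, so its choice is more random.
q = np.array([[1.0, 0.5, 0.2], [0.0, 0.1, 0.3]])
sig = np.array([[0.01, 0.01, 0.01], [2.0, 2.0, 2.0]])
print(sample_noisy_actions(q, sig))
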
Empty file modified chainerrl/agent.py
100644 → 100755
Empty file.
2 changes: 2 additions & 0 deletions chainerrl/agents/__init__.py
100644 → 100755
@@ -1,6 +1,7 @@
from chainerrl.agents.a3c import A3C # NOQA
from chainerrl.agents.acer import ACER # NOQA
from chainerrl.agents.al import AL # NOQA
from chainerrl.agents.categorical_double_dqn import CategoricalDoubleDQN # NOQA
from chainerrl.agents.categorical_dqn import CategoricalDQN # NOQA
from chainerrl.agents.ddpg import DDPG # NOQA
from chainerrl.agents.double_dqn import DoubleDQN # NOQA
@@ -15,4 +16,5 @@
from chainerrl.agents.reinforce import REINFORCE # NOQA
from chainerrl.agents.residual_dqn import ResidualDQN # NOQA
from chainerrl.agents.sarsa import SARSA # NOQA
from chainerrl.agents.expected_sarsa import ExpectedSARSA # NOQA
from chainerrl.agents.trpo import TRPO # NOQA
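
With these two exports in place, the new agents are importable from the package root. A quick sanity check, assuming this branch is installed and that the expected_sarsa module added elsewhere in the PR is present:

from chainerrl.agents import CategoricalDoubleDQN, ExpectedSARSA

# CategoricalDoubleDQN subclasses CategoricalDQN (see categorical_double_dqn.py below)
print(CategoricalDoubleDQN.__bases__)
print(ExpectedSARSA.__name__)
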
Binary file not shown.
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/a3c.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/a3c.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/acer.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/acer.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/al.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/al.cpython-36.pyc
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/ddpg.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/ddpg.cpython-36.pyc
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/dpp.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/dpp.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/dqn.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/dqn.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/nsq.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/nsq.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/pal.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/pal.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/pcl.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/pcl.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/pgt.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/pgt.cpython-36.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/ppo.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/ppo.cpython-36.pyc
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/sarsa.cpython-35.pyc
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/sarsa.cpython-36.pyc
Binary file not shown.
Binary file not shown.
Binary file added chainerrl/agents/__pycache__/trpo.cpython-36.pyc
Binary file not shown.
Empty file modified chainerrl/agents/a3c.py
100644 → 100755
Empty file.
Empty file modified chainerrl/agents/acer.py
100644 → 100755
Empty file.
4 changes: 2 additions & 2 deletions chainerrl/agents/al.py
100644 → 100755
@@ -29,7 +29,7 @@ def __init__(self, *args, **kwargs):
self.alpha = kwargs.pop('alpha', 0.9)
super().__init__(*args, **kwargs)

def _compute_y_and_t(self, exp_batch, gamma):
def _compute_y_and_t(self, exp_batch):

batch_state = exp_batch['state']
batch_size = len(exp_batch['reward'])
@@ -56,7 +56,7 @@ def _compute_y_and_t(self, exp_batch, gamma):
batch_terminal = exp_batch['is_state_terminal']

# T Q: Bellman operator
t_q = batch_rewards + self.gamma * \
t_q = batch_rewards + exp_batch['discount'] * \
(1.0 - batch_terminal) * next_q_max

# T_AL Q: advantage learning operator
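
The substitution of exp_batch['discount'] for self.gamma here (and in the agents below) is what makes the Bellman backup n-step aware: the batching step now stores, for each sampled sequence of transitions, the discounted sum of its rewards and the effective discount gamma**n to apply to the bootstrap term. A rough sketch of that batching under assumed field names, not the library's exact batch_experiences implementation:

import numpy as np

def batch_n_step_experiences(experiences, gamma, phi=lambda x: x):
    # experiences: list of sequences of consecutive transition dicts
    # [t_0, ..., t_{n-1}], each holding state/action/reward/next_state/...
    return {
        'state': np.asarray([phi(seq[0]['state']) for seq in experiences]),
        'action': np.asarray([seq[0]['action'] for seq in experiences]),
        # n-step return accumulated over the sequence
        'reward': np.asarray(
            [sum(tr['reward'] * gamma ** i for i, tr in enumerate(seq))
             for seq in experiences], dtype=np.float32),
        'next_state': np.asarray(
            [phi(seq[-1]['next_state']) for seq in experiences]),
        'is_state_terminal': np.asarray(
            [seq[-1]['is_state_terminal'] for seq in experiences],
            dtype=np.float32),
        # discount applied to the bootstrap term: gamma ** n
        'discount': np.asarray(
            [gamma ** len(seq) for seq in experiences], dtype=np.float32),
    }
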
49 changes: 49 additions & 0 deletions chainerrl/agents/categorical_double_dqn.py
@@ -0,0 +1,49 @@
from __future__ import unicode_literals
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import
from future import standard_library
standard_library.install_aliases() # NOQA

import chainer

from chainerrl.agents import categorical_dqn
from chainerrl.agents.categorical_dqn import _apply_categorical_projection
from chainerrl.recurrent import state_kept


class CategoricalDoubleDQN(categorical_dqn.CategoricalDQN):
"""Categorical Double DQN.

"""

def _compute_target_values(self, exp_batch):
"""Compute a batch of target return distributions."""

batch_next_state = exp_batch['next_state']

with chainer.using_config('train', False), state_kept(self.q_function):
next_qout = self.q_function(batch_next_state)

target_next_qout = self.target_q_function(batch_next_state)

next_q_max = target_next_qout.evaluate_actions(
next_qout.greedy_actions)

batch_rewards = exp_batch['reward']
batch_terminal = exp_batch['is_state_terminal']
discount = exp_batch['discount']

batch_size = exp_batch['reward'].shape[0]
z_values = target_next_qout.z_values
n_atoms = z_values.size

# next_q_max: (batch_size, n_atoms)
next_q_max = target_next_qout.max_as_distribution.array
assert next_q_max.shape == (batch_size, n_atoms), next_q_max.shape

# Tz: (batch_size, n_atoms)
Tz = (batch_rewards[..., None]
+ (1.0 - batch_terminal[..., None]) * discount[..., None]
* z_values[None])
return _apply_categorical_projection(Tz, next_q_max, z_values)
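
The target built above follows the C51 recipe: shift the fixed support z by the n-step reward and per-example discount, then project the shifted distribution back onto the original atoms. A small NumPy sketch of that projection step (the job done by _apply_categorical_projection; the plain loops and names are illustrative, not the library's vectorized code):

import numpy as np

def project_onto_support(Tz, probs, z):
    # Tz, probs: (batch_size, n_atoms); z: (n_atoms,) fixed, evenly spaced atoms
    v_min, v_max = z[0], z[-1]
    delta_z = z[1] - z[0]
    Tz = np.clip(Tz, v_min, v_max)
    b = (Tz - v_min) / delta_z               # fractional atom index
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    out = np.zeros_like(probs)
    for i in range(Tz.shape[0]):
        for j in range(Tz.shape[1]):
            l, u = lower[i, j], upper[i, j]
            if l == u:                        # Tz landed exactly on an atom
                out[i, l] += probs[i, j]
            else:                             # split mass between neighbours
                out[i, l] += probs[i, j] * (u - b[i, j])
                out[i, u] += probs[i, j] * (b[i, j] - l)
    return out
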
78 changes: 68 additions & 10 deletions chainerrl/agents/categorical_dqn.py
100644 → 100755
@@ -6,6 +6,7 @@
standard_library.install_aliases() # NOQA

import chainer
from chainer import cuda
import chainer.functions as F
import numpy as np

@@ -72,6 +73,51 @@ def _apply_categorical_projection(y, y_probs, z):
return z_probs


def compute_value_loss(y, t, batch_accumulator='mean'):
"""Compute a loss for value prediction problem.

Args:
y (Variable or ndarray): Predicted values.
t (Variable or ndarray): Target values.
batch_accumulator (str): 'mean' or 'sum'. 'mean' will use the mean of
the loss values in a batch. 'sum' will use the sum.
Returns:
(Variable) scalar loss
"""
assert batch_accumulator in ('mean', 'sum')

eltwise_loss = -t * F.log(F.clip(y, 1e-10, 1.))

if batch_accumulator == 'sum':
loss = F.sum(eltwise_loss)
else:
loss = F.mean(F.sum(eltwise_loss, axis=1))
return loss


def compute_weighted_value_loss(y, t, weights, batch_accumulator='mean'):
"""Compute a loss for value prediction problem.

Args:
y (Variable or ndarray): Predicted values.
t (Variable or ndarray): Target values.
weights (ndarray): Weights for y, t.
batch_accumulator (str): 'mean' or 'sum'. 'mean' divides the loss by the batch size; 'sum' sums it.
Returns:
(Variable) scalar loss
"""
assert batch_accumulator in ('mean', 'sum')

eltwise_loss = -t * F.log(F.clip(y, 1e-10, 1.))

# Weight each example's loss by its prioritized-replay weight before
# accumulating over the batch
loss = F.sum(eltwise_loss, axis=1) * weights
if batch_accumulator == 'sum':
loss = F.sum(loss)
else:
loss = F.mean(loss)
return loss


class CategoricalDQN(dqn.DQN):
"""Categorical DQN.

@@ -81,7 +127,7 @@ class CategoricalDQN(dqn.DQN):
DistributionalDiscreteActionValue and clip_delta is ignored.
"""

def _compute_target_values(self, exp_batch, gamma):
def _compute_target_values(self, exp_batch):
"""Compute a batch of target return distributions."""

batch_next_state = exp_batch['next_state']
@@ -100,10 +146,12 @@ def _compute_target_values(self, exp_batch, gamma):

# Tz: (batch_size, n_atoms)
Tz = (batch_rewards[..., None]
+ (1.0 - batch_terminal[..., None]) * gamma * z_values[None])
+ (1.0 - batch_terminal[..., None])
* self.xp.expand_dims(exp_batch['discount'], 1)
* z_values[None])
return _apply_categorical_projection(Tz, next_q_max, z_values)

def _compute_y_and_t(self, exp_batch, gamma):
def _compute_y_and_t(self, exp_batch):
"""Compute a batch of predicted/target return distributions."""

batch_size = exp_batch['reward'].shape[0]
@@ -120,19 +168,29 @@ def _compute_y_and_t(self, exp_batch, gamma):
assert batch_q.shape == (batch_size, n_atoms)

with chainer.no_backprop_mode():
batch_q_target = self._compute_target_values(exp_batch, gamma)
batch_q_target = self._compute_target_values(exp_batch)
assert batch_q_target.shape == (batch_size, n_atoms)

return batch_q, batch_q_target

def _compute_loss(self, exp_batch, gamma, errors_out=None):
def _compute_loss(self, exp_batch, errors_out=None):
"""Compute a loss of categorical DQN."""
y, t = self._compute_y_and_t(exp_batch, gamma)
y, t = self._compute_y_and_t(exp_batch)
# Minimize the cross entropy
# y is clipped to avoid log(0)
eltwise_loss = -t * F.log(F.clip(y, 1e-10, 1.))
if self.batch_accumulator == 'sum':
loss = F.sum(eltwise_loss)

if errors_out is not None:
del errors_out[:]
delta = F.sum(eltwise_loss, axis=1)
delta = cuda.to_cpu(delta.array)
for e in delta:
errors_out.append(e)

if 'weights' in exp_batch:
return compute_weighted_value_loss(
y, t, exp_batch['weights'],
batch_accumulator=self.batch_accumulator)
else:
loss = F.mean(F.sum(eltwise_loss, axis=1))
return loss
return compute_value_loss(y, t,
batch_accumulator=self.batch_accumulator)
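
Relative to the old inline loss, the refactor above routes prioritized-replay importance weights into the categorical cross-entropy. In plain NumPy the intended computation is roughly the following, assuming weights holds one importance weight per batch element:

import numpy as np

def categorical_ce_loss(pred, target, weights=None, batch_accumulator='mean'):
    # pred, target: (batch_size, n_atoms) return distributions;
    # pred is clipped to avoid log(0), as in the code above.
    eltwise = -target * np.log(np.clip(pred, 1e-10, 1.0))
    per_example = eltwise.sum(axis=1)
    if weights is not None:     # importance weights from prioritized replay
        per_example = per_example * weights
    if batch_accumulator == 'sum':
        return per_example.sum()
    return per_example.mean()
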
4 changes: 2 additions & 2 deletions chainerrl/agents/ddpg.py
100644 → 100755
@@ -255,7 +255,7 @@ def compute_actor_loss(self, batch):
def update(self, experiences, errors_out=None):
"""Update the model from experiences"""

batch = batch_experiences(experiences, self.xp, self.phi)
batch = batch_experiences(experiences, self.xp, self.phi, self.gamma)
self.critic_optimizer.update(lambda: self.compute_critic_loss(batch))
self.actor_optimizer.update(lambda: self.compute_actor_loss(batch))

@@ -273,7 +273,7 @@ def update_from_episodes(self, episodes, errors_out=None):
break
transitions.append(ep[i])
batch = batch_experiences(
transitions, xp=self.xp, phi=self.phi)
transitions, xp=self.xp, phi=self.phi, gamma=self.gamma)
batches.append(batch)

with self.model.state_reset(), self.target_model.state_reset():
5 changes: 3 additions & 2 deletions chainerrl/agents/double_dqn.py
100644 → 100755
@@ -17,7 +17,7 @@ class DoubleDQN(dqn.DQN):
See: http://arxiv.org/abs/1509.06461.
"""

def _compute_target_values(self, exp_batch, gamma):
def _compute_target_values(self, exp_batch):

batch_next_state = exp_batch['next_state']

@@ -31,5 +31,6 @@ def _compute_target_values(self, exp_batch, gamma):

batch_rewards = exp_batch['reward']
batch_terminal = exp_batch['is_state_terminal']
discount = exp_batch['discount']

return batch_rewards + self.gamma * (1.0 - batch_terminal) * next_q_max
return batch_rewards + discount * (1.0 - batch_terminal) * next_q_max
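
For double_dqn.py the only functional change is swapping self.gamma for the per-example discount; the surrounding logic is the usual Double DQN decoupling, where the online network chooses the next action and the target network scores it. A compact NumPy restatement of the target computed here, with assumed argument names:

import numpy as np

def double_dqn_target(reward, terminal, discount, q_online_next, q_target_next):
    # reward, terminal, discount: shape (batch_size,)
    # q_online_next, q_target_next: shape (batch_size, n_actions)
    greedy = q_online_next.argmax(axis=1)                   # action choice: online net
    next_q = q_target_next[np.arange(len(greedy)), greedy]  # evaluation: target net
    return reward + discount * (1.0 - terminal) * next_q
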
4 changes: 2 additions & 2 deletions chainerrl/agents/double_pal.py
100644 → 100755
@@ -15,7 +15,7 @@

class DoublePAL(pal.PAL):

def _compute_y_and_t(self, exp_batch, gamma):
def _compute_y_and_t(self, exp_batch):

batch_state = exp_batch['state']
batch_size = len(exp_batch['reward'])
@@ -45,7 +45,7 @@ def _compute_y_and_t(self, exp_batch, gamma):
batch_terminal = exp_batch['is_state_terminal']

# T Q: Bellman operator
t_q = batch_rewards + self.gamma * \
t_q = batch_rewards + exp_batch['discount'] * \
(1.0 - batch_terminal) * next_q_max

# T_PAL Q: persistent advantage learning operator
8 changes: 4 additions & 4 deletions chainerrl/agents/dpp.py
100644 → 100755
@@ -26,7 +26,7 @@ class AbstractDPP(with_metaclass(ABCMeta, DQN)):
def _l_operator(self, qout):
raise NotImplementedError()

def _compute_target_values(self, exp_batch, gamma):
def _compute_target_values(self, exp_batch):

batch_next_state = exp_batch['next_state']

@@ -37,9 +37,9 @@ def _compute_target_values(self, exp_batch, gamma):
batch_terminal = exp_batch['is_state_terminal']

return (batch_rewards +
self.gamma * (1 - batch_terminal) * next_q_expect)
exp_batch['discount'] * (1 - batch_terminal) * next_q_expect)

def _compute_y_and_t(self, exp_batch, gamma):
def _compute_y_and_t(self, exp_batch):

batch_state = exp_batch['state']
batch_size = len(exp_batch['reward'])
@@ -65,7 +65,7 @@ def _compute_y_and_t(self, exp_batch, gamma):

# r + g * LQ'(s_{t+1},a)
batch_q_target = F.reshape(
self._compute_target_values(exp_batch, gamma), (batch_size, 1))
self._compute_target_values(exp_batch), (batch_size, 1))

# Q'(s_t,a_t) + r + g * LQ'(s_{t+1},a) - LQ'(s_t,a)
t = target_q + batch_q_target - target_q_expect