I have been digging into your paper and code, and noticed some potential discrepancies between the paper and the code. I would appreciate it very much if you could clarify.
- In `training.py`, line 84:

  ```python
  for i in range(batch_mol.shape[0]):
      # If molecule was modified
      if not np.all(org_mols[i] == batch_mol[i]):
          fr = evaluate_mol(batch_mol[i], e, decodings)
          frs.append(fr)
          rewards[i] += np.sum(fr * dist)
  ```

  According to Eq. 12 in the paper, shouldn't the reward be updated as `rewards[i] += np.sum(fr * 1/(1+dist))`?
- In `models.py`, the Actor and Critic have different learning rates (0.0005 and 0.0001), but the paper states that both use the same learning rate (page 19). Could you explain the reason for the different learning rates? If any other parameters differ from the paper, I would appreciate it if you could let me know.
- In `training.py`, line 112:

  ```python
  for i in range(BATCH_SIZE):
      a = int(actions[i])
      loss = -np.log(probs[i,a]) * td_error[i]
      target_actor[i,a] = td_error[i]
  ```

  The variable `loss` is not used anywhere. From https://www.freecodecamp.org/news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/ I understand that the policy update should include the logarithm. So would `target_actor[i,a] = loss` be the correct update?
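To make the reward question concrete, here is a small numpy sketch comparing the two weightings. It treats `fr` and `dist` as equal-length arrays; the helper names (`reward_as_coded`, `reward_as_eq12`) and the example values are hypothetical, chosen only for illustration, and I have not checked the shapes the repository actually uses.

```python
import numpy as np

def reward_as_coded(fr, dist):
    # Weighting used in the quoted training.py snippet: fr * dist
    return np.sum(fr * dist)

def reward_as_eq12(fr, dist):
    # Weighting as I read Eq. 12: fr * 1/(1+dist)
    return np.sum(fr * 1 / (1 + dist))

# Hypothetical values, only to show that the two forms disagree
fr = np.array([1.0, 0.0, 1.0])
dist = np.array([0.0, 2.0, 3.0])

print(reward_as_coded(fr, dist))  # 3.0  (1*0 + 0*2 + 1*3)
print(reward_as_eq12(fr, dist))   # 1.25 (1/1 + 0 + 1/4)
```

For non-negative `fr`, the two forms move in opposite directions: with `fr * dist` a larger `dist` increases the reward, while with `fr * 1/(1+dist)` it decreases it, so the difference is more than a rescaling.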
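Regarding the unused `loss` variable, one way to check whether the logarithm is already accounted for is the numpy sketch below. It assumes, without confirmation from the snippet, that the actor is trained with a categorical cross-entropy loss of the form -Σ target·log(probs); the batch values and names such as `target_with_loss` are hypothetical.

```python
import numpy as np

def categorical_cross_entropy(target, probs):
    # Per-sample loss -sum_k target[k] * log(probs[k]); assumed (not confirmed)
    # to be the loss the actor model is compiled with.
    return -np.sum(target * np.log(probs), axis=-1)

# Hypothetical batch of 2 samples over 3 actions
probs = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.1, 0.3]])
actions = np.array([1, 0])
td_error = np.array([0.8, -0.4])

# Target built as in the quoted snippet: td_error placed at the chosen action
target_actor = np.zeros_like(probs)
for i in range(len(actions)):
    target_actor[i, actions[i]] = td_error[i]

# Policy-gradient loss written out explicitly, like the unused `loss` variable
explicit = -np.log(probs[np.arange(len(actions)), actions]) * td_error

print(np.allclose(categorical_cross_entropy(target_actor, probs), explicit))  # True

# Alternative from the question: store the explicit loss as the target instead
target_with_loss = np.zeros_like(probs)
for i in range(len(actions)):
    target_with_loss[i, actions[i]] = explicit[i]

print(np.allclose(categorical_cross_entropy(target_with_loss, probs), explicit))  # False
```

Under that assumption, the cross-entropy already applies the log, so `target_actor[i,a] = td_error[i]` yields the -log π(a)·td_error term, and `target_actor[i,a] = loss` would apply the log twice. Whether this holds depends on how the actor is compiled in `models.py`, which the quoted code does not show.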