In 3(b) what does "behaviour of a converged policy" mean?

Post date: Nov 10, 2011 4:28:07 PM

For the purposes of the exercise you can assume that the policy has converged once it has been learning for 20,000,000 time-steps. If you play back the policy evaluation that occurred at the 20,000,000th time-step, you can watch how the "converged" policy behaves.

The question is asking for a qualitative description of the policy's behaviour, e.g. is it conserving energy, maximising production, or striking some compromise between the two? Also describe how that behaviour is actually realised, e.g. which production lines are running, whether they are running flat-out, whether many units are being held in the waiting area, etc.
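
If you would rather back up your description with a few numbers than rely on watching the playback alone, something like the following rough sketch may help. It is only an illustration: the `Step` record, its field names, the `summarise` helper and the 0.95 "flat-out" threshold are all assumptions made for this example, not part of the exercise software, so you would need to adapt it to whatever data your playback actually exposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    # Hypothetical record of one time-step of the evaluation playback.
    line_rates: List[float]  # speed of each production line, 0.0 (off) to 1.0 (full)
    waiting_units: int       # units held in the waiting area
    energy_used: float       # energy consumed during this time-step
    units_produced: int      # units completed during this time-step


def summarise(trace: List[Step], flat_out: float = 0.95) -> dict:
    """Reduce a playback trace to a few indicators of policy behaviour."""
    n = len(trace)
    n_lines = len(trace[0].line_rates)
    return {
        # Average speed of each line over the whole trace.
        "mean_rate_per_line": [
            sum(s.line_rates[i] for s in trace) / n for i in range(n_lines)
        ],
        # Fraction of time-steps each line spent at or above the threshold.
        "flat_out_fraction_per_line": [
            sum(s.line_rates[i] >= flat_out for s in trace) / n
            for i in range(n_lines)
        ],
        "mean_waiting_units": sum(s.waiting_units for s in trace) / n,
        "total_energy": sum(s.energy_used for s in trace),
        "total_produced": sum(s.units_produced for s in trace),
    }


# Made-up two-step trace with two production lines, purely for illustration.
trace = [
    Step(line_rates=[1.0, 0.0], waiting_units=3, energy_used=2.0, units_produced=1),
    Step(line_rates=[1.0, 0.5], waiting_units=1, energy_used=3.0, units_produced=2),
]
summary = summarise(trace)
print(summary)
```

Numbers like these only support the answer; the question still wants the behaviour described in words.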