This is a simple implementation of deterministic reinforcement learning in a Gridworld setting. The Cat is trying to get the Cake while avoiding the Bear.

**If training takes too long, try turning on `Turbo Mode` (shift + Green Flag to toggle).

----------------------GETTING STARTED----------------------
1. Press `t` to train the agent, then `p` to let the agent use what it has learned to find the goal.
2. Play around by changing the position of existing tiles or adding new tiles (see CONTROLS).
3. Play around by changing variables, e.g. board size, learning rate, exploration rate (go into the code).
------------------------------------------------------------

------------------------CONTROLS------------------------
- Green Flag: reset everything (including learned state rewards)
- t: TRAIN agent
- p: PLAY game (after training)
- r: prematurely end training or playing
- s: show state rewards
- u: reset state rewards
- mouse click: place tile during SETUP according to mouse_mode
- 1: set mouse_mode to normal walkable tile
- 2: set mouse_mode to blocked tile
- 3: set mouse_mode to start tile
- 4: set mouse_mode to win tile
- 5: set mouse_mode to lose tile
------------------------------------------------------------

-----------------SETUP INFO-----------------
- Only one win tile
- Any number of lose tiles (min 1)
- Any number of block tiles (min 0)
- You need to change a tile to normal before changing it to another type
------------------------------------------------------------

-----------------LEARNING PARAMETERS-----------------
- LEARNING_RATE [0, 1]: how much to update the state value on each step (0 means no learning)
- EXPLORE_RATE [0, 1]: how often to explore with a random move (0 means no exploration)
- ROUNDS: how many rounds to train the agent
------------------------------------------------------------

------------------------NOTES-----------------------
- The method used here is Temporal Difference Q Learning
- Grids are created by Cloning, which has a maximum limit of 300, so there can be at most 300 tiles
- If you want to see what's happening while it learns, observe the `state_reward` list. Each item in the list corresponds to how `good` going to that state (linearly indexed) would be.
------------------------------------------------------------
This model trains for 100 rounds, rather than the 10 rounds of training in @aaronlws95's code.
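
For readers who want to see the idea outside Scratch, here is a rough Python sketch of how a `state_reward` list can be learned on a small deterministic grid. It is not a translation of the project's blocks. The board layout, the values chosen for `LEARNING_RATE` and `EXPLORE_RATE`, the step cap, and the helper names (`index`, `step`, `choose`, `train`, `play`) are all assumptions made for illustration. The update is a plain temporal-difference step on state values, which fits the `state_reward` list described in NOTES but may differ from the exact rule the project uses.

```python
import random

# Hypothetical 3x4 board; the real project lets you edit the layout with the mouse.
ROWS, COLS = 3, 4
START, WIN, LOSE = (2, 0), (0, 3), (1, 3)
BLOCKED = {(1, 1)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

LEARNING_RATE = 0.2   # assumed value in [0, 1]
EXPLORE_RATE = 0.3    # assumed value in [0, 1]
ROUNDS = 100          # matches the 100 training rounds mentioned above


def index(cell):
    """Linear index of a (row, col) cell, counted row by row."""
    row, col = cell
    return row * COLS + col


def step(cell, move):
    """Deterministic move: stay in place when hitting an edge or a blocked tile."""
    nxt = (cell[0] + move[0], cell[1] + move[1])
    inside = 0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS
    return nxt if inside and nxt not in BLOCKED else cell


def choose(cell, state_reward, explore_rate, rng):
    """Random move with probability explore_rate, otherwise the best-valued next state."""
    if rng.random() < explore_rate:
        return step(cell, rng.choice(MOVES))
    return max((step(cell, m) for m in MOVES), key=lambda c: state_reward[index(c)])


def train(rounds=ROUNDS, seed=0, max_steps=200):
    """Learn one value per tile; the win tile is fixed at +1 and the lose tile at -1."""
    rng = random.Random(seed)
    state_reward = [0.0] * (ROWS * COLS)
    state_reward[index(WIN)] = 1.0
    state_reward[index(LOSE)] = -1.0
    for _ in range(rounds):
        cell = START
        for _ in range(max_steps):  # assumed cap so a round always ends
            if cell in (WIN, LOSE):
                break
            nxt = choose(cell, state_reward, EXPLORE_RATE, rng)
            i = index(cell)
            # Temporal-difference step: nudge this state's value toward the next state's.
            state_reward[i] += LEARNING_RATE * (state_reward[index(nxt)] - state_reward[i])
            cell = nxt
    return state_reward


def play(state_reward, max_steps=50):
    """Follow the learned values greedily (no exploration) and return the visited cells."""
    rng = random.Random(0)
    cell, path = START, [START]
    while cell not in (WIN, LOSE) and len(path) <= max_steps:
        cell = choose(cell, state_reward, 0.0, rng)
        path.append(cell)
    return path


if __name__ == "__main__":
    values = train()
    for r in range(ROWS):
        print(" ".join(f"{values[index((r, c))]:6.2f}" for c in range(COLS)))
    print(play(values))
```

Setting `LEARNING_RATE` to 0 in this sketch leaves every value unchanged, and setting `EXPLORE_RATE` to 0 makes the agent always follow its current best guess, which mirrors the parameter descriptions above.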