Although the loss drops quickly, the test that decides when training is "done" is pretty strict, so training might take longer than the loss alone suggests. It generally converges quickly, but sometimes it gets stuck in a local minimum or, even worse, runs into the vanishing gradient problem.
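To illustrate why a low loss does not automatically mean "done training", here is a minimal sketch. The exact condition this project uses is not spelled out here, so the check below is an assumption: `is_done_training`, its `tolerance` value, and the sample numbers are all made up for illustration.

```python
def mean_squared_error(predictions, targets):
    """Average squared error over all samples."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def is_done_training(predictions, targets, tolerance=0.05):
    """Hypothetical strict check: every single output must be within tolerance.

    This is an assumed stand-in for the project's real condition.
    """
    return all(abs(p - t) <= tolerance for p, t in zip(predictions, targets))


# Made-up example: three outputs are nearly perfect, one is still off.
targets = [0.0, 1.0, 1.0, 0.0]
predictions = [0.02, 0.97, 0.99, 0.20]

loss = mean_squared_error(predictions, targets)
done = is_done_training(predictions, targets)

print(f"loss = {loss:.5f}, done = {done}")
```

In this sketch the average loss is already small, but the strict check still fails because one output is outside the tolerance. A per-sample check like this is one way a "done" test can lag behind the loss.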
Credits to @Golden-Pixeleon. I just added some fixes and improvements that I felt were needed.