Please run this on TurboWarp: turbowarp.org/469249399/fullscreen

A very simple test for instruction-level parallelism (ILP). Parallelism (P) = 2 with Unrolling (U) = 1 seems to be the best setting in general, but results may vary by machine.

On an Intel laptop (i7-1065G7), parallelism seems to make very little difference, if any. On an AMD machine (Ryzen 5 5600X), parallelism makes a significant difference (P=2, U=1: 2975 ms vs. P=1, U=1: 3349 ms, so parallelism is ~12% faster), and unrolling seems to reduce performance slightly. On an iPhone XS, parallelism helps (P=2, U=1: 6220 ms vs. P=1, U=4: 6868 ms, so parallelism is ~10% faster), and loop unrolling only helps when going from P=1, U=1 (7576 ms) to P=1, U=4 (6868 ms, ~10% faster).

Scratch seems to be too high-level for programs to take advantage of ILP.
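
For readers unfamiliar with the two settings, here is a rough Python sketch of the loop shapes they describe. This is not the project's actual code: it assumes that P is the number of independent accumulator chains and U is the number of steps written out per loop iteration, and the function names (`sum_p1_u1`, `sum_p2_u1`, `sum_p1_u4`, `time_ms`) are made up for illustration. Python's interpreter overhead will likely hide any real ILP effect, so treat this as a picture of the structure, not a benchmark.

```python
import time


def sum_p1_u1(n):
    """P=1, U=1: one dependency chain, one step per loop iteration."""
    acc = 0
    for i in range(n):
        acc += i  # every add depends on the previous add
    return acc


def sum_p2_u1(n):
    """P=2, U=1: two independent chains (assumes n is even)."""
    acc_a = 0
    acc_b = 0
    for i in range(0, n, 2):
        acc_a += i      # chain A
        acc_b += i + 1  # chain B does not depend on chain A
    return acc_a + acc_b


def sum_p1_u4(n):
    """P=1, U=4: one chain, four steps per iteration (assumes n % 4 == 0)."""
    acc = 0
    for i in range(0, n, 4):
        # Fewer loop-counter updates, but still a single dependency chain.
        acc += i
        acc += i + 1
        acc += i + 2
        acc += i + 3
    return acc


def time_ms(fn, n):
    """Return the wall-clock time of fn(n) in milliseconds."""
    start = time.perf_counter()
    fn(n)
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    n = 4_000_000  # hypothetical workload size, divisible by 2 and 4
    for fn in (sum_p1_u1, sum_p2_u1, sum_p1_u4):
        print(fn.__name__, round(time_ms(fn, n)), "ms")
```

All three functions compute the same sum, so any timing difference comes only from how the work is arranged. With P=2 the two accumulators can in principle be updated at the same time by the CPU, while U=4 only reduces loop overhead.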