What I know from benchmarks I did:
The GCW Zero's CPU has an integer multiplication instruction, MULT/MULTU [unsigned], that takes a different number of cycles from the integer division instruction, DIV/DIVU [unsigned]. Additionally, the time taken by DIV/DIVU depends on the number of bits in the result (e.g. dividing 2,075,121,490 by 2 takes longer than dividing 2 by 5), ranging from 11 to 60 cycles (so 11 to 60 nanoseconds, given that our CPU runs at 1 GHz). MULT and MULTU always take 7 cycles.
Integer ADD and SUB are both 1 cycle.
I don't know much about the floating-point instructions, but I do know that the floating-point division instructions, DIV.S (32-bit) and DIV.D (64-bit), don't have the result-dependent timing that integer DIV/DIVU do.
Senor Quack did benchmarks on floating-point instructions, but I found that his results are dominated by memory access timings rather than by the operations themselves. I'll work with him to get the overhead in his assembly code as low as in the 1/10-overhead loop in the Original Post.