blas/gonum: Move scaling of C into the loop
Created by: btracey
The Dgemm code currently scales C by beta in a separate pass and then performs C += alpha * A * B. This makes two passes over the data in C, and the scaling pass is also serial. The beta scaling should be moved into the main loops, as it is in the reference implementation.
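A minimal sketch of the fused approach follows. It is not gonum's actual code. It assumes row-major storage (gonum's convention), covers only the no-transpose case, and uses the hypothetical helpers `dgemmRowsNN` and `dgemmNN`. Each row of C is scaled by beta and then immediately accumulated with alpha * A[i,:] * B, so C is traversed once. Because rows are independent, the scaling is parallelized along with the multiply. The simple goroutine-per-row-chunk split is only for illustration and is not gonum's blocking scheme.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// dgemmRowsNN (hypothetical helper) computes
// C[i,:] = beta*C[i,:] + alpha*A[i,:]*B for rows i in [lo, hi).
// All matrices are row-major with no transposes. The beta scaling is
// fused into the per-row loop, so each row of C is scaled and then
// accumulated while it is still in cache: one pass over C.
func dgemmRowsNN(lo, hi, n, k int, alpha float64, a []float64, lda int,
	b []float64, ldb int, beta float64, c []float64, ldc int) {
	for i := lo; i < hi; i++ {
		ci := c[i*ldc : i*ldc+n]
		// Scale row i of C by beta. beta == 0 overwrites instead of
		// multiplying, so NaN/Inf already in C are not propagated,
		// matching reference BLAS semantics.
		switch beta {
		case 0:
			for j := range ci {
				ci[j] = 0
			}
		case 1:
			// Row is already correct; nothing to scale.
		default:
			for j := range ci {
				ci[j] *= beta
			}
		}
		if alpha == 0 {
			continue
		}
		// Accumulate alpha * A[i,l] * B[l,:] into row i of C.
		for l := 0; l < k; l++ {
			tmp := alpha * a[i*lda+l]
			bl := b[l*ldb : l*ldb+n]
			for j, v := range bl {
				ci[j] += tmp * v
			}
		}
	}
}

// dgemmNN (hypothetical) splits the rows of C across goroutines. Each
// goroutine writes a disjoint set of rows, so the beta scaling now runs
// in parallel together with the multiply.
func dgemmNN(m, n, k int, alpha float64, a []float64, lda int,
	b []float64, ldb int, beta float64, c []float64, ldc int) {
	workers := runtime.GOMAXPROCS(0)
	if workers > m {
		workers = m
	}
	if workers < 1 {
		return // m == 0: nothing to do
	}
	chunk := (m + workers - 1) / workers
	var wg sync.WaitGroup
	for lo := 0; lo < m; lo += chunk {
		hi := lo + chunk
		if hi > m {
			hi = m
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			dgemmRowsNN(lo, hi, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
		}(lo, hi)
	}
	wg.Wait()
}

func main() {
	a := []float64{1, 2, 3, 4} // 2x2, row-major
	b := []float64{5, 6, 7, 8} // 2x2, row-major
	c := []float64{1, 1, 1, 1} // 2x2, row-major
	// C = 1*A*B + 2*C
	dgemmNN(2, 2, 2, 1, a, 2, b, 2, 2, c, 2)
	fmt.Println(c) // [21 24 45 52]
}
```

The reference implementation does the same thing column by column, because it is column-major. The row-wise order here is the natural transposition of that loop for row-major data.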