It relies on some SSE2 instructions, so performance gain is not that
huge (about 1.4x).
I experimented with 256-bit loads, but they turned out to be slower (at
least on Sandy Bridge).
Overall, I'm very happy with performance of this code, but not so much
with resulting image. It seems like integer approximations won't do.
I might remove this code altogether, so I didn't update comments.
Calculations are done on integer, rather than floating point numbers,
so this implementation is not as accurate (but when scale factor is
reasonable enough, no artifacs are visible).
It is, however, faster by a factor of ~3.